
<div align="center">
<img src="logo.svg" width="10%">

# YARS (Yet Another Reddit Scraper)

[![GitHub stars](https://img.shields.io/github/stars/datavorous/yars.svg?style=social&label=Stars)](https://github.com/datavorous/yars/stargazers)

</div>
YARS is a Python package that simplifies scraping Reddit for posts, comments, user data, and other media, and it ships with a few display and download utilities. It is built in **Python** and relies on the **requests** module for fetching data from Reddit's public API. The scraper uses plain `.json` requests, avoiding the need for official Reddit API keys, which keeps it lightweight and easy to use.
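Under the hood, appending `.json` to most Reddit URLs returns the page data as JSON. Here is a minimal sketch of that idea with plain `requests` (the `User-Agent` string is an arbitrary example; Reddit rejects default client agents):

```python
import requests

# Appending ".json" to a subreddit URL returns its listing as JSON.
url = "https://www.reddit.com/r/Python/top.json"
headers = {"User-Agent": "yars-demo/0.1"}  # example agent string
params = {"limit": 5, "t": "week"}  # top 5 posts from the past week

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()

# Listings nest posts under data -> children -> data.
for child in response.json()["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])
```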
## Features
- **Reddit Search**: Search Reddit for posts using a keyword query.
- **Post Scraping**: Scrape post details, including title, body, and comments.
- **User Data Scraping**: Fetch recent activity (posts and comments) of a Reddit user.
- **Subreddit Posts Fetching**: Retrieve posts from specific subreddits with flexible options for category and time filters.
- **Image Downloading**: Download images from posts.
- **Results Display**: Utilize `Pygments` for colorful display of JSON-formatted results.
> [!WARNING]
> Use rotating proxies, or Reddit may gift you an IP ban.
> I could extract at most 2,552 posts in one run from r/all using this.
> [Here](https://files.catbox.moe/zdra2i.json) is a **7.1 MB JSON** file containing the top 100 posts from r/nosleep, including post titles, body text, all comments and their replies, post scores, time of upload, etc.
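If you do route traffic through rotating proxies, the standard `requests` pattern looks like the sketch below. Whether YARS itself accepts a proxy configuration is not covered here, so treat this as a general illustration; the proxy URLs are placeholders.

```python
import random
import requests

# Hypothetical pool of rotating proxy endpoints (placeholders).
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def fetch_json(url: str) -> dict:
    """Fetch a Reddit .json URL through a randomly chosen proxy."""
    proxy = random.choice(PROXY_POOL)
    response = requests.get(
        url,
        headers={"User-Agent": "yars-demo/0.1"},
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```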
## Dependencies
- `requests`
- `Pygments`
## Installation
1. Clone the repository:
```bash
git clone https://github.com/datavorous/YARS.git
```
Then navigate into the `src` folder.
2. Install `uv` (if not already installed):
```bash
pip install uv
```
3. Run the example:
```bash
uv run example/example.py
```
This sets up the virtual environment, installs the necessary packages, and runs the `example.py` program.
## Usage
We will use the following Python script to demonstrate the functionality of the scraper. The script includes:
- Searching Reddit
- Scraping post details
- Fetching user data
- Retrieving subreddit posts
- Downloading images from posts
#### Code Overview
```python
from yars import YARS
from utils import display_results, download_image

# Create a scraper instance
miner = YARS()
```
#### Step 1: Searching Reddit
The `search_reddit` method allows you to search Reddit using a query string. Here, we search for posts containing "OpenAI" and limit the results to 3 posts. The `display_results` function is used to present the results in a formatted way.
```python
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")
```
#### Step 2: Scraping Post Details
Next, we scrape details of a specific Reddit post by passing its permalink. If the post details are successfully retrieved, they are displayed using `display_results`. Otherwise, an error message is printed.
```python
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)
if post_details:
    display_results(post_details, "POST DATA")
else:
    print("Failed to scrape post details.")
```
#### Step 3: Fetching User Data
We can also retrieve a Reddit user's recent activity (posts and comments) using the `scrape_user_data` method. Here, we fetch data for the user `iamsecb` and limit the results to 2 items.
```python
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")
```
#### Step 4: Fetching Subreddit Posts
The `fetch_subreddit_posts` method retrieves posts from a specified subreddit. In this example, we fetch 11 top posts from the "generative" subreddit from the past week.
```python
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "generative SUBREDDIT Top Posts")
```
#### Step 5: Downloading Images
For the first three posts retrieved from the subreddit, we try to download the associated images using the `download_image` function. If a post doesn't have an `image_url`, its `thumbnail_url` is used as a fallback.
```python
for z in range(3):
    try:
        image_url = subreddit_posts[z]["image_url"]
    except KeyError:
        image_url = subreddit_posts[z]["thumbnail_url"]
    download_image(image_url)
```
### Complete Code Example
```python
from yars import YARS
from utils import display_results, download_image

# Create a scraper instance
miner = YARS()

# Search for posts related to "OpenAI"
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")

# Scrape post details using its permalink
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)
if post_details:
    display_results(post_details, "POST DATA")
else:
    print("Failed to scrape post details.")

# Fetch recent activity of user "iamsecb"
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")

# Fetch top posts from the subreddit "generative" from the past week
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "generative SUBREDDIT Top Posts")

# Download images from the first three fetched posts,
# falling back to the thumbnail when no full-size image exists
for z in range(3):
    try:
        image_url = subreddit_posts[z]["image_url"]
    except KeyError:
        image_url = subreddit_posts[z]["thumbnail_url"]
    download_image(image_url)
```
You can now use these techniques to explore and scrape data from Reddit programmatically.
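To keep scraped results around (as with the r/nosleep dump linked above), you can serialize them with the standard library. A minimal sketch, assuming `subreddit_posts` is the list returned by `fetch_subreddit_posts` above and the output filename is arbitrary:

```python
import json

# Persist the scraped posts to disk as pretty-printed JSON.
with open("subreddit_posts.json", "w", encoding="utf-8") as f:
    json.dump(subreddit_posts, f, indent=2, ensure_ascii=False)
```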
## Contributing
Contributions are welcome! For feature requests, bug reports, or questions, please open an issue. If you would like to contribute code, please open a pull request with your changes.
### Our Notable Contributors
<a href="https://github.com/datavorous/yars/graphs/contributors">
<img src="https://contrib.rocks/image?repo=datavorous/yars" />
</a>