Scraping Reddit to find the most popular domains
Overview
Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). In this example, we pull the top submissions of the past year from a few subreddits and store three things for each one: the submission URL, the domain the submission links to, and the submission score. Ultimately, we want to see which domains generate the highest-scoring posts in a given subreddit.
1) Import packages, set up PRAW, select subreddits
Here we set up our PRAW credentials and choose the list of subreddits we want to analyze.
#packages
import pandas as pd
import praw
#set up praw - setup here: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')
#create list of subreddits to include, as comma-separated strings
#replace these with the subreddits you want to analyze
s_list = ['news', 'datascience']
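Hard-coding credentials in a script makes them easy to leak when the script is shared. As a minimal sketch of one alternative, the values could be read from environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are made up for this illustration; PRAW does not require them.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
ENV_NAMES = {
    'client_id': 'REDDIT_CLIENT_ID',
    'client_secret': 'REDDIT_CLIENT_SECRET',
    'user_agent': 'REDDIT_USER_AGENT',
}

def load_reddit_credentials(environ=os.environ):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}

# Usage (assumes praw is installed and the variables are set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing names tends to be easier to debug than an authentication error from the API later on.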
2) Grab the score, domain (url), and subreddit for each top yearly submission
In this section we loop through the list of subreddits from above and record the score, domain, and subreddit of each submission. Each attribute goes into its own dataframe, and the three are merged later using the submission ID.
#set up dictionaries to store submission information, keyed by submission ID
domains_sub = {}
domains = {}
domains_score = {}
#loop through our selected list of subreddits
for i in s_list:
    #pull in top submissions for the year for the subreddit specified in the list above
    subreddit = reddit.subreddit(i)
    submissions = subreddit.top(time_filter='year', limit=50)
    #one pass per subreddit; submission IDs are unique, so plain assignment is enough
    for s in submissions:
        #--Grab the score for a given submission--#
        domains_score[s.id] = s.score
        #--Grab domain for given submission ID--#
        domains[s.id] = s.domain
        #--Grab subreddit for given submission ID--#
        domains_sub[s.id] = s.subreddit.display_name
#build one dataframe per attribute, with the submission ID as the 'id' column
df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
df_score.columns = ['id','score']
df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
df_domain.columns = ['id','domain']
df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
df_subreddit.columns = ['id','subreddit']
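An alternative to keeping one dictionary per attribute is to build one row per submission and skip the merge step. The sketch below is only an illustration of that idea: `collect_rows` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the attributes read above, so no Reddit API call is made.

```python
from types import SimpleNamespace

def collect_rows(submissions):
    """Build one row (a dict) per submission, storing repeated IDs only once."""
    rows = {}
    for s in submissions:
        # Keying by ID means a submission seen twice overwrites itself.
        rows[s.id] = {
            'id': s.id,
            'subreddit': s.subreddit.display_name,
            'score': s.score,
            'domain': s.domain,
        }
    return list(rows.values())

# Stand-in submissions using values from this tutorial's sample output.
fake_submissions = [
    SimpleNamespace(id='78tulq', score=42729, domain='atlasobscura.com',
                    subreddit=SimpleNamespace(display_name='todayilearned')),
    SimpleNamespace(id='75eydj', score=64510, domain='i.redd.it',
                    subreddit=SimpleNamespace(display_name='gaming')),
]
rows = collect_rows(fake_submissions)
```

Passing `rows` to `pd.DataFrame(rows)` would give a single table with the `id`, `subreddit`, `score`, and `domain` columns.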
3) Merge dataframes
Now that we have dataframes containing the score, domain, and subreddit, we can merge the three tables together, using the submission ID as the join key.
#merge the three tables together, using submission ID as the join key
df_sub_score = df_subreddit.merge(df_score, how='left', on='id')
df_final = df_sub_score.merge(df_domain, how='left', on='id')
#add in submission URL using the 'id'; a plain string broadcasts across the whole column
df_final['url'] = 'www.reddit.com/' + df_final['id'].astype(str)
df_final.head()
| | id | subreddit | score | domain | url |
|---|---|---|---|---|---|
| 0 | 78tulq | todayilearned | 42729 | atlasobscura.com | www.reddit.com/78tulq |
| 1 | 76bn5s | science | 25024 | ns.umich.edu | www.reddit.com/76bn5s |
| 2 | 7871xy | science | 30642 | acsh.org | www.reddit.com/7871xy |
| 3 | 77pnk6 | science | 13176 | jech.bmj.com | www.reddit.com/77pnk6 |
| 4 | 75eydj | gaming | 64510 | i.redd.it | www.reddit.com/75eydj |
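A left merge keeps every row of the left table and fills in matching values from the right one. The join only behaves as intended if each submission ID appears once per table. As a small sketch using two rows copied from the output above, the optional `validate='one_to_one'` argument asks pandas to raise an error when that assumption is broken; the `sample_` names are made up for this illustration.

```python
import pandas as pd

# Two rows copied from the sample output, split into the three tables.
sample_subreddit = pd.DataFrame({'id': ['78tulq', '76bn5s'],
                                 'subreddit': ['todayilearned', 'science']})
sample_score = pd.DataFrame({'id': ['78tulq', '76bn5s'],
                             'score': [42729, 25024]})
sample_domain = pd.DataFrame({'id': ['78tulq', '76bn5s'],
                              'domain': ['atlasobscura.com', 'ns.umich.edu']})

# validate='one_to_one' raises pandas.errors.MergeError on duplicate IDs.
sample_merged = (sample_subreddit
                 .merge(sample_score, how='left', on='id', validate='one_to_one')
                 .merge(sample_domain, how='left', on='id', validate='one_to_one'))
```

Without the check, a duplicated ID on the right side would silently duplicate rows in the result.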
Done! Explore the output
We now have a clean dataframe of the top yearly posts from each chosen subreddit. Summing the score column by domain shows which domains racked up the highest total scores. I dumped the dataframe into a Google Sheet for you to explore.
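To rank domains rather than individual posts, the scores can be totalled per domain within each subreddit. The following is a minimal sketch on the five sample rows shown above; `df_sample` and `domain_totals` are names made up for this illustration, and with real data the same chain could be run on `df_final`.

```python
import pandas as pd

# Sample rows copied from the output table above.
df_sample = pd.DataFrame({
    'subreddit': ['todayilearned', 'science', 'science', 'science', 'gaming'],
    'score': [42729, 25024, 30642, 13176, 64510],
    'domain': ['atlasobscura.com', 'ns.umich.edu', 'acsh.org',
               'jech.bmj.com', 'i.redd.it'],
})

# Total score and number of posts per domain in each subreddit, highest first.
domain_totals = (df_sample
                 .groupby(['subreddit', 'domain'], as_index=False)
                 .agg(total_score=('score', 'sum'), posts=('score', 'count'))
                 .sort_values('total_score', ascending=False)
                 .reset_index(drop=True))
```

Keeping the post count next to the total helps separate domains with one viral post from domains that score well repeatedly.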