Crawling all submissions from a subreddit

I am starting a new hobby project called word of mouth to recommend music groups based on suggestions from the r/ifyoulikeblank subreddit community. One of the first problem to solve is to download the submissions of the subreddit. This post goes over one possible option.

There are three options when it come to crawling reddit:

  • Use reddit rest api directly or through a python client like praw. The main downside of this approach is that we are limited to the top 100 submissions for each or our query. This makes it a no go when we want to download the content of a whole subreddit.

  • Leverage pushshift which offers data dumps of reddit and a rest api to slice and dice it. A popular python client for this api is psaw. The data dump is huge and we won't need most of it for our use case since we only need the content of a single subreddit. The rest api on top of it is an interesting option.

  • Use a third party crawler or write our own.

Given that pushshift has done all the crawling and indexing work already, we will leverage their api. The pushshift api is simple enough that we will use it directly through python's requests library.

Exploring the pushshift api

Let's get started with the pushshift api. It only takes a couple of lines of python code to get started:

import requests

url = "https://api.pushshift.io/reddit/search/submission"
params = {"subreddit": "ifyoulikeblank"}
submissions = requests.get(url, params = params)

We get back a list of 25 submissions looking like this:

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'pakoito',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_556z4',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1564336546,
 'domain': 'self.ifyoulikeblank',
 'full_link': 'https://www.reddit.com/r/ifyoulikeblank/comments/ciyzhv/iil_madeon_porter_robinson_mystery_skulls/',
 'gildings': {},
 'id': 'ciyzhv',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'all_ads',
 'permalink': '/r/ifyoulikeblank/comments/ciyzhv/iil_madeon_porter_robinson_mystery_skulls/',
 'pinned': False,
 'pwls': 6,
 'retrieved_on': 1564336548,
 'score': 1,
 'selftext': 'what other happy electronic music will I like?',
 'send_replies': True,
 'spoiler': False,
 'stickied': False,
 'subreddit': 'ifyoulikeblank',
 'subreddit_id': 't5_2sekf',
 'subreddit_subscribers': 150398,
 'subreddit_type': 'public',
 'thumbnail': 'self',
 'title': '[IIL] Madeon, Porter Robinson, Mystery Skulls',
 'total_awards_received': 0,
 'url': 'https://www.reddit.com/r/ifyoulikeblank/comments/ciyzhv/iil_madeon_porter_robinson_mystery_skulls/',
 'whitelist_status': 'all_ads',
 'wls': 6}

We didn't find any documentation for these fields, but their interpretation is straightforward.

This query only gave us the 25 most recent posts, what we want is to download all the submissions in this subreddit. We need paging which is done via the before and after parameters. We will set the before parameter to the creation date of the last item that was fetched:

last_submission_time = submissions.json()["data"][-1]["created_utc"]
params = {"subreddit" : "ifyoulikeblank", "before" : last_submission_time}
submissions = requests.get(url, params = params)

We now know how to fetch a subreddit page by page.

Putting it all together

If we put it all together, we can fetch all the submissions in a subreddit using the following crawl_page method:

import requests

url = "https://api.pushshift.io/reddit/search/submission"

def crawl_page(subreddit: str, last_page = None):
  """Crawl a page of results from a given subreddit.

  :param subreddit: The subreddit to crawl.
  :param last_page: The last downloaded page.

  :return: A page or results.
  """
  params = {"subreddit": subreddit, "size": 500, "sort": "desc", "sort_type": "created_utc"}
  if last_page is not None:
    if len(last_page) > 0:
      # resume from where we left at the last page
      params["before"] = last_page[-1]["created_utc"]
    else:
      # the last page was empty, we are past the last page
      return []
  results = requests.get(url, params)
  if not results.ok:
    # something wrong happened
    raise Exception("Server returned status code {}".format(results.status_code))
  return results.json()["data"]

The main loop would look something like this, just be careful not to hammer pushshift api with a flood of requests:

import time

def crawl_subreddit(subreddit, max_submissions = 2000):
  """
  Crawl submissions from a subreddit.

  :param subreddit: The subreddit to crawl.
  :param max_submissions: The maximum number of submissions to download.

  :return: A list of submissions.
  """
  submissions = []
  last_page = None
  while last_page != [] and len(submissions) < max_submissions:
    last_page = crawl_page(subreddit, last_page)
    submissions += last_page
    time.sleep(3)
  return submissions[:max_submissions]

Crawling the latest submissions of a subreddit is just a matter of calling:

lastest_submissions = crawl_subreddit("ifyoulikeblank")

Wrapping up

I can't thank enough the pushshift folks for their work. Using their rest api to download reddit content was as easy as it could be. If you plan to download a complete subreddit like we intend to do, just be careful about the volume of data to expect. You can get an idea by running the following query:

requests.get(url, params = {"subreddit": "ifyoulikeblank", "size": 0, "aggs" : "subreddit"}).json()["aggs"]

For the r/ifyoulikeblank subreddit, we are looking at 105+K submissions.

comments powered by Disqus