Reddit data in a pinch

Recursive Python script for Reddit

My thesis work involved collecting user-generated discourse about robo-advisors. Given the popularity of personal finance conversations on Reddit, it seemed like the obvious platform to pull data from. But I immediately ran into a problem: how could I actually collect and preserve that data in a way that would support rigorous results?

Introduction

Social media sites have complex structures that make them difficult to use for research purposes. Unlike flat webpages, where content can easily be saved as a PDF or simply copy-pasted into a Word document, social media sites mix many different kinds of media and language embedded in posts, comments, advertisements, and notifications, to name just a few. These different formats make collecting data from social media sites complicated, yet that data is extremely valuable for both educational and commercial research.

Reddit specifically is an interesting environment for research. It is an ideal source of data – users are relatively anonymous, many posts and discussions can be viewed without an account, and there is lots of data just waiting to be collected – but the structure of Reddit conversations can make collection difficult. Conversations have a tree-like structure in which users can reply to a reply to a reply (and on and on) of a comment. One of my biggest priorities as a qualitative researcher was to preserve this structure so I could get a sense of who was talking to whom. And while Reddit's API is easily accessible, many of the scripts I found online didn't preserve this conversational structure.
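To make that tree structure concrete, here is a minimal sketch (not part of the script itself, and using a made-up nested dictionary rather than real Reddit data) of how a recursive walk can keep track of who replied to whom:

# Toy comment tree (hypothetical data, purely for illustration)
thread = {
    "body": "Has anyone tried a robo-advisor?",
    "replies": [
        {"body": "Yes, I use one for my retirement savings.", "replies": [
            {"body": "Which one, if you don't mind sharing?", "replies": []},
        ]},
        {"body": "I prefer picking my own ETFs.", "replies": []},
    ],
}

def print_tree(comment, depth=0):
    # Indent each reply by its depth so the conversation structure stays visible
    print("  " * depth + comment["body"])
    for reply in comment["replies"]:
        print_tree(reply, depth + 1)

print_tree(thread)

The real script below does the same thing, except that each node is a PRAW comment object and the replies come from the API.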

Initially, as a student with a background in the humanities and social sciences, I was intimidated by computational tools and turned to front-end options like webpage snapshotting tools. I even considered simply copy-pasting comments into a Word document. But it eventually became clear that these ‘solutions’ would be painfully inadequate. So I dove into designing a script from scratch that could do what I needed: extract posts and comments easily while preserving the recursive reply structure.

The script I wrote in Python still has some limitations, but they were ones I could live with given the size of my project. I was only aiming to collect about 100 posts and their associated discussion threads, and I expected that even with Python this would be somewhat time-consuming. Overall, though, the script provides a more reliable and scalable solution than any of the other options I looked at.

How it works

This script can be used on a link to a Reddit post to produce a plain text (.txt) document that includes the post and the discussion thread. It uses the Python Reddit API Wrapper (PRAW).

To use this script, you’ll need to register for access to Reddit’s API. Instructions for that are available here. In the script, comments mark where you’ll need to enter your authentication information in order to use the API. Make sure to look through the script and add this information where indicated.
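If you want to check that your credentials work before running the full script, a quick sanity check like the following should do it. The placeholder strings (and the example user agent) are mine; replace them with the values from your registered app:

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from your registered app
    client_secret="YOUR_CLIENT_SECRET",  # from your registered app
    user_agent="my_thesis_scraper/0.1",  # any descriptive app name
)

# With only these three values, PRAW runs in read-only mode,
# which is all this script needs.
print(reddit.read_only)  # should print True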

All you have to do is navigate to a Reddit post and copy the URL. When the script runs, it will prompt you twice in the shell: first for the URL, which you can paste in, and second for a name for the output file. Make sure to manually add .txt at the end so the script produces a plain text document. As always, I recommend deciding on a naming convention and using only letters, numbers, and underscores in your file names to avoid formatting issues. These prompts limit how scalable the script can be: you still have to navigate to each post you want and name the file by hand. For my purposes this was a minor inconvenience, since I wasn’t collecting a huge number of posts, but it is something you may want to adjust if you decide to use this script.
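If the manual file naming bothers you, one possible tweak (my own suggestion, not part of the original script) is to append the .txt extension automatically and strip out anything other than letters, numbers, and underscores:

import re

# Type the name without an extension; the script adds .txt for you
filename = input('What do you want to call this file? ')
filename = re.sub(r'[^A-Za-z0-9_]', '_', filename)  # keep only safe characters
filename = filename + '.txt'                         # add the extension automatically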

Another key limitation of the script is that on especially long discussion threads, it simply cannot collect every comment and reply without taking an absurd amount of time. There is a line of code that lets you set how many ‘load more comments’ or ‘continue this thread’ placeholders you want Python to expand. I had this set to 5 for my project and found that it kept the scraping quick (within a few seconds), but you may want to change it if you can afford a longer run or need a perfectly thorough collection procedure.
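For reference, the relevant PRAW call is replace_more(), and its limit argument is what controls this trade-off. Assuming a submission object as in the script below, a few settings you might consider:

# What I used: expand up to 5 'load more comments' placeholders (fast, but may miss deep replies)
submission.comments.replace_more(limit=5)

# Thorough: keep expanding until every placeholder is resolved (can be very slow on large threads)
# submission.comments.replace_more(limit=None)

# Fastest: drop the placeholders entirely and keep only the comments already loaded
# submission.comments.replace_more(limit=0)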

If you’re new to programming and have no idea where to start, I would recommend getting your feet wet in Jupyter Notebooks or PyCharm. These are the two applications I used to learn Python. Jupyter is great because you can run individual lines of code to see if what you’re doing is right while you’re learning. Once you kind of understand how Python works, I’d suggest transitioning to something like PyCharm. It still includes helpful hints and prompts for learners, for example detecting when you’ve probably made a mistake and highlighting where your dependencies are missing. And PyCharm creates ‘virtual environments’ for your scripts which can help you avoid any issues where you are trying to run code that relies on conflicting Python packages.

I was introduced to Python during my Master’s coursework but expanded my knowledge independently, so don’t be intimidated to dive in and see what you can come up with! When in doubt, Google your questions; there are lots of great people on Reddit and Stack Overflow asking similar ones, and these forums can often help get you unstuck.

The code

The script has two versions: one that includes some formatting meant to make qualitative analysis easier for a researcher, and another – the ‘clean’ version – with no extra formatting, which is better suited to feeding into a computational analysis. Both versions preserve the comment tree structure.

# Recursive Reddit Extractor
# Uses Python Reddit API Wrapper (PRAW)
# For extracting Reddit posts and comments and transforming them into a .txt file that preserves tree structure
# This code should prompt for a Reddit post URL and then prompt for a file name (e.g. testerfile.txt), in the shell
# Result: a .txt file will be saved in the folder where this code is

# Import libraries
import sys
import textwrap
import datetime
import praw

# Authentication

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # Your client ID
    client_secret="YOUR_CLIENT_SECRET",  # Your client secret
    user_agent="YOUR_APP_NAME",          # Your app name
)

# Main section of the code

# Prompt for submission URL
url = input('Enter URL here: ')
submission = reddit.submission(url=url)

# Prompt for file name, make sure to end with .txt manually
filename = input('What do you want to call this file? ')
sys.stdout = open(filename, 'a')

# Expand "see more comments" buttons. Set a lower limit to speed up processing time, with the trade-off that this lowers thoroughness.
submission.comments.replace_more(limit=5)

# FUNCTIONS
# Spacing function for printing
def twospaces():
    print()
    print()

# Recursive function for comment processing
def handle_comment(comment, depth=2):
    for child_comment in comment.replies:
        print()
        print(depth * ' ', '__', 'Reply: Level', depth, '__')  # This will show the comment level

        # Text-wrapping the comment body
        childcommenttext = child_comment.body
        lines = textwrap.wrap(childcommenttext, width=50)

        for line in lines:
            print(depth * ' ', line)

        # Comment data
        # If the commenter is the original poster, it will say "By submitter":
        if child_comment.is_submitter is True:
            print(depth * ' ', '///', 'By submitter.')
        # Author, comment, and parent ID, to make manual checking possible
        print(depth * ' ', '///', 'Author:', child_comment.author)
        print(depth * ' ', '///', 'Comment ID:', child_comment.id)
        print(depth * ' ', '///', 'Parent ID:', child_comment.parent_id)

        # Time, converted from a Unix timestamp to a readable date
        timestamp = child_comment.created_utc
        value = datetime.datetime.fromtimestamp(timestamp)
        print(depth * ' ', '///', 'Time:', f"{value:%Y-%m-%d %H:%M:%S}")

        # Recursive element: process this comment's replies one level deeper
        handle_comment(child_comment, depth + 1)

# Printing the submission

print('___Submission___')
print()
print('-Submission title-')

# Text-wrapping the submission title
submissiontitle = submission.title
lines = textwrap.wrap(submissiontitle, width=60)
for line in lines:
    print(line)

print()

# Text-wrapping the submission body
print('-Submission body-')
submissionbody = submission.selftext
lines = textwrap.wrap(submissionbody, width=60)
for line in lines:
    print(line)

# Submission data
# This checks whether the submission is text-only. If not, it prints a warning.
if submission.is_self is False:
    print('Note: This submission may include other media')
print()
print('-----')
# Prints submission author
print('Author:', submission.author)

# Time of submission, converted from a Unix timestamp to a readable date
timestamp = submission.created_utc
value = datetime.datetime.fromtimestamp(timestamp)
print('Time:', f"{value:%Y-%m-%d %H:%M:%S}")

# Submission ID, comment count, other info
print('ID:', submission.id)
print('# of Comments:', submission.num_comments, '(Comments printed may be fewer)')
print('Permalink:', submission.permalink)
print('Other links in submission:', submission.url, '(Duplicate of permalink if no other links)')
print('-----')
twospaces()

# Printing the comments
print('-----BEGINNING OF COMMENTS-----')

for top_level_comment in submission.comments:
    twospaces()
    print('___Comment___')

    # Text-wrapping the comment body
    toplevelcommenttext = top_level_comment.body
    lines = textwrap.wrap(toplevelcommenttext, width=60)

    for line in lines:
        print(line)

    print('-----')

    # Comment data
    if top_level_comment.is_submitter is True:
        print('///', 'By submitter.')
    print('///', 'Author:', top_level_comment.author)
    print('///', 'Parent ID (should be submission ID):', top_level_comment.parent_id)
    print('///', 'Comment ID:', top_level_comment.id)

    # Time of comment, converted from a Unix timestamp to a readable date
    timestamp = top_level_comment.created_utc
    value = datetime.datetime.fromtimestamp(timestamp)
    print('///', 'Time:', f"{value:%Y-%m-%d %H:%M:%S}")

    # Recursively handle this comment's replies
    handle_comment(top_level_comment)

# Close the file
sys.stdout.close()

# This is a version of the Reddit script that will print clean text
# Headings and notations about user names, time posted, etc. are left out
# The purpose is that this can be fed directly into computational analysis

# Import libraries
import sys
import datetime
import praw

# Authentication
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # Your client ID
    client_secret="YOUR_CLIENT_SECRET",  # Your client secret
    user_agent="YOUR_APP_NAME",          # Your app name
)

# Main section of the code

# Prompt for submission URL
url = input('Enter URL here: ')
submission = reddit.submission(url=url)

# Prompt for file name, end with .txt manually
filename = input('What do you want to call this file? ')
sys.stdout = open(filename, 'a')

# Expand "see more comments" buttons. Set a lower limit to speed up processing time, with the trade-off that this lowers thoroughness.
submission.comments.replace_more(limit=5)

# FUNCTIONS
# Spacing function for printing
def twospaces():
    print()
    print()

# Recursive function for comment processing
def handle_comment(comment, depth = 2):
    for child_comment in comment.replies:
        print()
        print(depth * '  ')
        childcommenttext = child_comment.body
        print(childcommenttext)

        # Recursive element
        handle_comment(child_comment, depth+1)

# Printing the submission

print()

# Submission title
submissiontitle = submission.title
print(submissiontitle)

print()

# Submission body
submissionbody = submission.selftext
print(submissionbody)

twospaces()

# Printing the comments

for top_level_comment in submission.comments:
    twospaces()
    toplevelcommenttext = top_level_comment.body
    print(toplevelcommenttext)

    handle_comment(top_level_comment)

# Close the file
sys.stdout.close()

All dependencies current as of January 14, 2022.

certifi==2021.5.30
charset-normalizer==2.0.3
DateTime==4.3
docopt==0.6.2
idna==3.2
pipreqs==0.4.10
praw==7.5.0
prawcore==2.2.0
pytz==2021.1
requests==2.26.0
update-checker==0.18.0
urllib3==1.26.6
websocket-client==1.1.0
yarg==0.1.9
zope.interface==5.4.0