How to be funny on the internet

Robin Zhang
5 min read · May 4, 2021

The internet is a wild place where dozens of ideas are exchanged. Forums like Reddit present a valuable platform where digital interactions can be studied to extract individual and crowd behavioral patterns.

To take a fun spin on data analytics, today we will be looking at jokes posted on the r/Jokes subreddit to see if there are any patterns behind a joke going viral.

Let us begin.

Data Scraping

To start, we must decide on what posts we want to scrape. To focus on the viral posts only, we will utilize Reddit’s filtering to look at the “Top” of “This Year”. So we must first scrape the URLs of the Reddit posts that appear when these filters are applied.

Universal Reddit Scraper (URS)

Using V3.2.0 of URS with the following command line input, we can get a CSV containing URLs to all the posts we need.

python .\urs\Urs.py -r Jokes T 25000 year --csv

RedditExtractoR

Used in conjunction with URS, RedditExtractoR's reddit_content function allows us to scrape all the comments associated with a given post URL.
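RedditExtractoR is an R package, so the glue step between the two tools is simply reading URS's output CSV and walking its URL column. Here is a minimal Python sketch of that step; the `URL` column name and the sample file contents are assumptions for illustration, not the exact URS output schema.

```python
import csv
import io

def load_post_urls(csv_lines, url_column="URL"):
    """Pull the post URLs out of the CSV produced by the URS scrape,
    ready to be fed to the comment scraper one URL at a time.
    The column name is an assumption; check your URS output header."""
    reader = csv.DictReader(csv_lines)
    return [row[url_column] for row in reader]

# Example with an in-memory CSV (the real file lives in URS's scrapes folder).
sample = io.StringIO(
    "Title,URL\n"
    "A joke,https://www.reddit.com/r/Jokes/comments/abc123/\n"
)
urls = load_post_urls(sample)
```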

Results

When investigating the scraped data, I initially used “Top” of “All Time” to look at the posts with outperforming engagement. However, these outliers skewed the analysis results greatly, and manual examination revealed that they were chronologically correlated with specific U.S. or world news events. Thus, to reduce both the number of news-related outliers and the number of scraping API calls, only posts from the year leading up to May 2021 are used.

Word Clouds

First up are some word clouds to allow us to take a quick look at the corpus of the post title and content.
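The word clouds are rendered from per-word frequency counts over the titles and bodies. As a hedged sketch of that underlying computation (the stopword list and sample jokes below are illustrative assumptions, not the article's exact pipeline):

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real run would use a full one.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "that", "was"}

def word_frequencies(texts):
    """Tokenize joke titles/bodies and count word occurrences,
    dropping stopwords and very short tokens."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts

jokes = [
    "Why did the chicken cross the road?",
    "A man walks into a bar...",
    "Why do teachers love the bar?",
]
freq = word_frequencies(jokes)
```

A word-cloud library then sizes each word proportionally to its count in `freq`.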

Word cloud of word frequency

A quick glance shows what we expected: typical setup terms for jokes (like “why”, “when”, “many”) and commonly used scenarios (like “bar”, “home”, “teacher”, “wife”, “husband”).

Word cloud of words correlated with high upvote ratio

Quick definition: upvote ratio is the proportion of upvotes among all the votes a post has received. For example, an upvote ratio of 66% indicates a 2-to-1 split between upvotes and downvotes.
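The arithmetic behind that definition is simple enough to spell out:

```python
def upvote_ratio(ups, downs):
    """Upvotes as a fraction of all votes cast on the post."""
    return ups / (ups + downs)

# A 2-to-1 split of upvotes to downvotes gives a ratio of about 66%.
ratio = upvote_ratio(200, 100)
```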

Looking at the upvote ratio gives us an idea of “safe” words to use to cater to most viewers. It looks like jokes involving conversations between couples, with a focus on age and gender norms alongside American/Christian ideologies, gather the most positive reactions.

Predicting Upvotes

Next, let us see if we can infer any specific features of a Reddit post that make the joke successful. There are two dependent variables that we care about: the upvote count and the upvote ratio. The former will tell us the magnitude of virality that the joke experienced, while the latter will tell us how controversial the joke was.

The following are the independent variables used in the linear regression model: title length, post length, hour of day, day of week, month of year, sentiment, and topic category factor. All the time-related variables are extracted from the date_created field from URS. The sentiment is calculated using get_sentiment from syuzhet. The topic category factor is calculated using topicmodels' LDA with k=10.
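The article computes sentiment and topics in R (syuzhet and topicmodels); as a hedged Python sketch of the remaining feature-engineering step, the regressors for one post can be assembled like this, with `sentiment` and `topic` passed in as already-computed numbers:

```python
from datetime import datetime, timezone

def extract_features(title, body, created_utc, sentiment, topic):
    """Build the regressors listed above from a post's raw fields.
    `sentiment` and `topic` come from upstream models (syuzhet / LDA
    in the article); they are plain numbers here."""
    dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return {
        "title_length": len(title),
        "post_length": len(body),
        "hour_of_day": dt.hour,       # 0-23
        "day_of_week": dt.weekday(),  # Monday = 0
        "month_of_year": dt.month,    # 1-12
        "sentiment": sentiment,
        "topic": topic,
    }

# May 3, 2021, 00:00 UTC
features = extract_features("Why did...", "Because...", 1620000000, 0.25, 3)
```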

The first ten words of each topic category from LDA

The result of LDA shows some distinct style/theme differences between viral posts, some of which, like Topics 3, 4, 8, and 9, are consistent with the couples theme from the word clouds.

Sidenote: posts that were edited after initial posting are filtered out of the LDA input to reduce the number of words from people giving thanks for the upvotes and awards.

Finally, here are the model summaries for upvote count and the upvote ratio.

Model summary for upvotes

So we have some surprising results here. While the month of the year can still be attributed to world events, the hour of the day is as significant as some of the topic choices for your joke (like Topics 2 and 8).

Model summary for upvote ratio

Meanwhile, if we are only concerned with the upvote ratio, then most of the categories are good (with the exception of Topics 5 and 7, which are current events and non-English word jokes).

So through training these models, we have revealed the importance of certain variables while realizing that some factors, like the length of the jokes, are unimportant. Typically, models would be tested against data from the same source, but let's use these models in a more interesting way.
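Once fitted, a linear regression model predicts by summing each feature times its coefficient plus an intercept. A minimal sketch of that scoring step; the coefficient values below are hypothetical placeholders for illustration, not the fitted values from the model summaries:

```python
def predict(coefs, intercept, features):
    """Linear-model prediction: intercept + sum of coef_i * x_i."""
    return intercept + sum(coefs[name] * value for name, value in features.items())

# Hypothetical coefficients, for illustration only -- not the fitted values.
coefs = {"title_length": 12.0, "hour_of_day": -80.0, "sentiment": 500.0}
pred = predict(coefs, 4000.0, {"title_length": 40, "hour_of_day": 9, "sentiment": 0.2})
```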

Predicting Jokes from Comments

A common saying amongst Redditors on r/Jokes is that the “real jokes are in the comments”. People frequently try to one-up the original poster by commenting a joke of a similar theme. Let's see how funny these comments are and how they would have fared as separate posts by feeding the top comments of these viral posts into the linear regression models.

Sidenote: since comments don't have a title or an exact post time, the title length will be set equal to the post length, and the hour of day will use default values.
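Concretely, the substitutions described in that sidenote can be sketched as a variant of the post feature builder; the specific default values below are assumptions for illustration:

```python
def comment_features(body, sentiment, topic,
                     default_hour=12, default_day=0, default_month=1):
    """Comments lack a title and an exact timestamp, so the title
    length mirrors the body length and the time fields fall back to
    defaults (the defaults here are illustrative assumptions)."""
    return {
        "title_length": len(body),   # no title: reuse the body length
        "post_length": len(body),
        "hour_of_day": default_hour,
        "day_of_week": default_day,
        "month_of_year": default_month,
        "sentiment": sentiment,
        "topic": topic,
    }

f = comment_features("ha ha", 0.1, 2)
```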

Table of sentiment, LDA, and linear model results from top comment and its parent post

A cursory glance might not reveal much about these comments, but let us then look at some aggregated measures of the comments relative to the post.

On average: 90% of the topics are different, a 0.21 increase in sentiment, a -0.038 difference in upvote ratio, and a -15,000 difference in upvotes.
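Those aggregates are simple means of per-pair differences between each top comment and its parent post. A sketch under assumed field names (the dictionaries and sample numbers are illustrative, not the actual scraped values):

```python
def aggregate_diffs(pairs):
    """Average per-pair differences between a top comment's metrics (c)
    and its parent post's metrics (p)."""
    n = len(pairs)
    return {
        "topic_changed": sum(c["topic"] != p["topic"] for c, p in pairs) / n,
        "sentiment_diff": sum(c["sentiment"] - p["sentiment"] for c, p in pairs) / n,
        "upvote_diff": sum(c["upvotes"] - p["upvotes"] for c, p in pairs) / n,
    }

# Two made-up (comment, post) pairs for illustration.
pairs = [
    ({"topic": 1, "sentiment": 0.5, "upvotes": 100},
     {"topic": 2, "sentiment": 0.2, "upvotes": 50000}),
    ({"topic": 3, "sentiment": 0.0, "upvotes": 200},
     {"topic": 3, "sentiment": 0.1, "upvotes": 10000}),
]
diffs = aggregate_diffs(pairs)
```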

Some inferences can be made from these results: most of the time people are not responding with a joke of the same theme; their comments carry a more positive sentiment and a mildly different upvote ratio; and they might have received far more upvotes had they posted their comments as individual posts rather than as comments.

About:

Zhixin (Robin) Zhang is a senior in the Jerome Fisher Program in Management and Technology at the University of Pennsylvania studying CIS and OIDD. This data project was conducted for Professor Prasanna Tambe’s course, OIDD245: Analytics & The Digital Economy.
