Hacker News is one of my favorite sites for catching up on technology and startup news, but navigating the minimalistic website can sometimes be tedious. In this post, I will show how this social news site can be analyzed, in as non-technical a fashion as I can, present some initial results, and share some ideas about where to take the analysis next.
To avoid dealing with SQL, I downloaded the Hacker News dataset from David Robinson’s website. It includes one million Hacker News article titles from September 2013 to June 2017.
To begin, let’s look at the visualization of the most common words in Hacker News titles.
Some initial simple exploration
Before we get into the statistical analysis, the first step is to look at the most frequent words that appeared in Hacker News titles from September 2013 to June 2017.
For the most part, this is a fairly standard list of common words, as we would expect from Hacker News titles. The top word is “hn”, because “ask hn” and “show hn” are part of the social news site’s structure. The next most frequent words, such as “google”, “data”, “app”, “web” and “startup”, are all what we would expect from a social news site like Hacker News.
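Counting words like this takes only a few lines. The following is a minimal sketch in Python, not the code behind this post: the sample titles, the tiny stop-word list and the helper names `tokenize` and `word_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample titles; the real dataset has about one million rows.
titles = [
    "Ask HN: What is the best way to learn data science?",
    "Show HN: A web app for tracking startup metrics",
    "Google open-sources a new data tool",
]

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "of", "what", "in", "and", "on"}


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", title.lower())


def word_counts(titles):
    """Count every token that is not a stop word, across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(w for w in tokenize(title) if w not in STOP_WORDS)
    return counts


counts = word_counts(titles)
print(counts.most_common(5))
```

Even in this toy sample, “hn” comes out near the top, because “Ask HN” and “Show HN” prefix so many titles.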
Simple Sentiment Analysis
Let’s address the topic of sentiment analysis. Sentiment analysis detects the sentiment of a body of text in terms of polarity (positive or negative). When used, particularly at scale, it can show you how people feel towards the topics that are important to you.
We can analyze the word counts that contribute to each sentiment. From the Hacker News titles, we found out how much each word contributed to each sentiment.
A word cloud is often a good way to identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. We can also compare the most frequent positive and negative words in a word cloud.
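To make the idea of per-word contributions concrete, here is a rough sketch in Python. It assumes a lexicon that maps words to “positive” or “negative”; the six-word `LEXICON` below is made up for illustration and is not the lexicon used for the results in this post, and the sample titles are hypothetical too.

```python
from collections import Counter

# Hypothetical, hand-made lexicon; a real analysis would load a published one.
LEXICON = {
    "free": "positive",
    "best": "positive",
    "fast": "positive",
    "bug": "negative",
    "dead": "negative",
    "hack": "negative",
}


def sentiment_contributions(titles):
    """Count how often each lexicon word appears, keyed by (sentiment, word)."""
    contributions = Counter()
    for title in titles:
        for raw in title.lower().split():
            word = raw.strip(".,:;!?\"'()")  # drop surrounding punctuation
            if word in LEXICON:
                contributions[(LEXICON[word], word)] += 1
    return contributions


# Hypothetical sample titles.
titles = [
    "The best free tools for developers",
    "Is email dead?",
    "A fast, free alternative to cron",
    "Critical bug found in popular library",
]

contributions = sentiment_contributions(titles)
for (sentiment, word), n in sorted(contributions.items()):
    print(sentiment, word, n)
```

Words that are not in the lexicon are simply ignored, so the result depends heavily on which lexicon is chosen.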
Relationship between words
We often want to understand the relationship between words in a document. What sequences of words are common across text? Given a sequence of words, what word is most likely to follow? What words have the strongest relationship with each other? Many interesting text analyses are based on these relationships. When we examine pairs of two consecutive words, the pairs are often called “bigrams”.
The winner for most common bigram in the Hacker News data is “machine learning”, and the second is “silicon valley”.
The challenge in analyzing text data, as mentioned earlier, is in understanding what the words mean. The word “deep” has a different meaning if it is paired with the word “water” as opposed to the word “learning”. As a result, a simple summary of word counts in text data will likely be confusing unless the analysis relates each word to the other words that appear alongside it, rather than assuming that each word is chosen independently.
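Bigrams can be counted by pairing each word in a title with the word that follows it. The sketch below is an illustration in Python under a few assumptions: the sample titles are invented, the helper names `bigram_counts` and `followers` are hypothetical, and stop words are left in for brevity even though a real analysis would usually filter them out first.

```python
import re
from collections import Counter

# Hypothetical sample titles.
titles = [
    "Deep learning for beginners",
    "Machine learning in Silicon Valley",
    "Why machine learning needs deep learning",
    "Deep water drilling explained",
]


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", title.lower())


def bigram_counts(titles):
    """Count pairs of consecutive words, never pairing across two titles."""
    counts = Counter()
    for title in titles:
        words = tokenize(title)
        counts.update(zip(words, words[1:]))
    return counts


def followers(counts, word):
    """Count the words that directly follow `word` in the bigram counts."""
    return Counter(
        {second: n for (first, second), n in counts.items() if first == word}
    )


counts = bigram_counts(titles)
print(counts.most_common(3))
print(followers(counts, "deep"))
```

Looking up the followers of “deep” separates “deep learning” from “deep water”, which a plain count of the word “deep” cannot do.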
Networks of words
Word network analysis is one method for encoding the relationships between words in a text by constructing a network of linked words. This technique is based on the assumption that language and knowledge can be modeled as networks of words and the relations between them.
For the Hacker News data, we can visualize some details of the text structure. For example, we can see pairs or triplets that form common short phrases (“social media network” or “neural networks”).
This type of network analysis mainly shows us the important nouns in a text and how they are related. The resulting graph can be used to get a quick visual summary of the text and to find the most relevant excerpts.
Once we have the capability to automatically derive insights from text analytics, we can then translate those insights into actions.
No structured survey data predicts customer behavior as well as actual voice-of-customer text comments and messages!
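One way to build such a network is to treat each frequent bigram as a directed edge from its first word to its second. The following Python sketch assumes bigram counts are already available; the counts shown, the `min_count` threshold and the helper names are hypothetical, and plain dictionaries stand in for a graph library.

```python
from collections import Counter, defaultdict


def build_word_network(bigram_counts, min_count=2):
    """Keep bigrams seen at least `min_count` times as directed, weighted edges."""
    network = defaultdict(dict)
    for (first, second), n in bigram_counts.items():
        if n >= min_count:
            network[first][second] = n
    return dict(network)


def triplets(network):
    """List three-word chains a -> b -> c formed by two linked edges."""
    return sorted(
        (a, b, c)
        for a, linked in network.items()
        for b in linked
        for c in network.get(b, {})
    )


# Hypothetical bigram counts, not figures from the Hacker News data.
bigram_counts = Counter({
    ("social", "media"): 5,
    ("media", "network"): 3,
    ("neural", "networks"): 4,
    ("open", "source"): 1,
})

network = build_word_network(bigram_counts)
print(network)
print(triplets(network))
```

Raising `min_count` prunes rare pairs so that only the strongest links remain, which is what keeps a plotted network readable.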
Code that created this post can be found here. I welcome your comments and questions!