Text mining has rightly received a lot of attention. It is the analysis of natural-language works (articles, books, etc.) that treats text as a form of data, often combined with numeric analysis.
So I decided to analyze Google Finance articles for four American companies: Starbucks, Kraft, Wal-Mart and Mondelez.
Google Finance Articles
This allows me to retrieve the 20 most recent articles related to each stock.
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens. I need to use “unnest_tokens” to break the text into individual tokens and transform it into tidy data, that is, one row per term per document.
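To make the one-row-per-term-per-document shape concrete, here is a minimal Python sketch rather than the R code used for this post. The function name `tokenize_tidy` and the sample sentences are made up for illustration, and the regular expression is only a simple stand-in for what “unnest_tokens” does.

```python
import re

def tokenize_tidy(docs):
    """Return one (doc_id, word) row per token, lowercased.

    `docs` maps a document id to its raw text. The regex is a simplification:
    it keeps runs of letters, digits and apostrophes and drops everything else.
    """
    rows = []
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            rows.append((doc_id, word))
    return rows

# Made-up sentences, not real article text.
docs = {
    "doc1": "Shares rise after earnings.",
    "doc2": "Retailer unveils a climate pledge",
}

rows = tokenize_tidy(docs)
for row in rows:
    print(row)
```

Each printed row is a (document, term) pair, which is the tidy layout the later counting steps rely on.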
Here we see the nouns and names that are important in these companies’ articles. None of them occurred in all of the articles.
tf-idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
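As a rough illustration of the statistic, the Python sketch below uses one common variant: term frequency is a word's count divided by the total number of words in its document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The function name and the sample rows are assumptions for this example, and the exact variant used for this post is not shown here.

```python
import math
from collections import Counter

def tf_idf(rows):
    """rows: list of (doc_id, word) pairs. Returns {(doc_id, word): tf-idf}."""
    counts = Counter(rows)                           # occurrences per (doc, word)
    totals = Counter(doc for doc, _ in rows)         # total tokens per document
    docs_with = Counter(word for _, word in counts)  # documents containing each word
    n_docs = len(totals)
    return {
        (doc, word): (n / totals[doc]) * math.log(n_docs / docs_with[word])
        for (doc, word), n in counts.items()
    }

# Made-up rows: "shares" appears in both documents, the other words in one each.
rows = [
    ("doc1", "coffee"), ("doc1", "shares"),
    ("doc2", "climate"), ("doc2", "shares"),
]

scores = tf_idf(rows)
for key, value in sorted(scores.items()):
    print(key, round(value, 4))
```

A word that appears in every document gets an inverse document frequency of ln(1) = 0, so it drops out of the ranking no matter how often it is used.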
Visualize these high tf-idf words.
Visualize the top terms for each company individually.
As expected, the top terms usually include the company names, stock symbols, and some of the companies’ products and executives, as well as the companies’ latest moves, such as Wal-Mart’s climate pledges.
To see whether the finance news coverage is positive or negative for these four companies, I opted for a simple sentiment analysis using the AFINN lexicon, which gives each word a score from -5 (most negative) to 5 (most positive).
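The scoring idea can be sketched in a few lines of Python: look each word up in a lexicon and add up the scores. The lexicon below is a toy one whose words and scores are made up for illustration, not copied from AFINN, and the company labels and word lists are hypothetical.

```python
# Toy lexicon: words and scores are made up, not taken from AFINN.
TOY_SCORES = {"strong": 2, "gain": 2, "loss": -3, "weak": -2}

def sentiment_score(words, lexicon):
    """Sum the scores of words found in the lexicon; other words count as 0."""
    return sum(lexicon.get(word, 0) for word in words)

# Hypothetical tokenized coverage for two placeholder companies.
articles = {
    "AAA": ["strong", "quarter", "gain"],
    "BBB": ["weak", "sales", "loss"],
}

scores = {
    company: sentiment_score(words, TOY_SCORES)
    for company, words in articles.items()
}
print(scores)
```

A positive total suggests mostly positive coverage and a negative total the opposite, but only as long as the lexicon's scores fit the domain.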
If I am right, then I can use the sentiment analysis to help make decisions about my investments. But am I right?
The word “gross” is considered negative by the AFINN lexicon, but in finance articles it usually appears in “gross margin”. The words “share” and “shares” are neither positive nor negative in finance articles, but here the AFINN lexicon counts them as positive.
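One way to spot this kind of problem is to rank words by how much each one contributes to the total score, that is, its number of occurrences times its lexicon score. The Python sketch below does that with a toy lexicon: the signs for “gross”, “share” and “shares” follow the text, while the magnitudes, the extra word and the sample tokens are made up.

```python
from collections import Counter

# Illustrative scores only: signs follow the text, magnitudes are made up.
TOY_SCORES = {"gross": -2, "share": 1, "shares": 1, "loss": -3}

def contributions(words, lexicon):
    """Return (word, occurrences * score) pairs, largest absolute value first."""
    counts = Counter(word for word in words if word in lexicon)
    contrib = {word: n * lexicon[word] for word, n in counts.items()}
    return sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical tokens from finance coverage.
words = ["gross", "margin", "gross", "margin",
         "shares", "share", "shares", "loss"]

top = contributions(words, TOY_SCORES)
for word, value in top:
    print(word, value)
```

If neutral finance vocabulary dominates the top of this list, the lexicon is measuring word choice in the domain rather than sentiment.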
“tidytext” includes another sentiment lexicon, “loughran”, which was developed from analyses of financial reports and intentionally avoids words like “share” and “gross” that may not have a positive or negative meaning in a financial context.
The Loughran dictionary divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous”.
This gives the most common words in the financial news articles associated with each sentiment in the Loughran lexicon. I only get five of the six sentiments here, which indicates that no word in the recent Google Finance news articles about these four companies is associated with “superfluous”.
Now the results make much more sense, and I can trust them enough to count how frequently each sentiment was associated with each company in these articles.
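The counting step amounts to joining each (company, word) row to the word's sentiment category and tallying the pairs. Here is a minimal Python sketch of that idea. The category names come from the text, but the word-to-category assignments, company labels and rows are made up for illustration and are not taken from the Loughran lexicon.

```python
from collections import Counter

# Toy mapping: category names follow the text, word assignments are made up.
TOY_LEXICON = {
    "gain": "positive",
    "loss": "negative",
    "lawsuit": "litigious",
    "may": "uncertainty",
}

def sentiment_counts(rows, lexicon):
    """rows: (company, word) pairs. Count (company, sentiment) occurrences."""
    return Counter(
        (company, lexicon[word])
        for company, word in rows
        if word in lexicon  # words missing from the lexicon are skipped
    )

# Hypothetical tidy rows for two placeholder companies.
rows = [
    ("AAA", "gain"), ("AAA", "gain"), ("AAA", "lawsuit"),
    ("BBB", "loss"), ("BBB", "may"), ("BBB", "loss"), ("BBB", "stores"),
]

counts = sentiment_counts(rows, TOY_LEXICON)
for (company, sentiment), n in sorted(counts.items()):
    print(company, sentiment, n)
```

A sentiment with no matching words simply never shows up in the tally, which is how a category can go missing from the results.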
Based on the results, I’d say that in May 2017 most of the recent coverage of Wal-Mart was strongly negative and most of the recent coverage of Mondelez was positive. A quick search of recent finance headlines suggests that I am on the right track.
The R code that produced all of this draws heavily on Julia Silge and David Robinson’s book Text Mining with R.