Stock Sentiment on Twitter
In this project, I conduct sentiment analysis on tweets to determine whether people feel positively or negatively about the stocks that are the biggest gainers and the biggest losers of the day.
About the dataset
To conduct this analysis, I first needed to see which stocks performed the best and which performed the worst. Luckily, Yahoo Finance offers various measures of stock performance. For simplicity’s sake, I chose my best and worst performing stocks based on the percent change from the previous day:
Best performing:
Medidata Solutions Inc. (MDSO) - an American technology company that develops and markets software as a service for clinical trials. Their software suite includes software for protocol development as well as software to capture patient data through web means.
Rite Aid (RAD) - the largest drugstore chain on the East Coast and the third largest in the U.S. The company ranked 94th on the 2019 Fortune 500 list in terms of total revenue.
Dropbox (DBX) - a web-based file hosting service. Users upload documents/items to Dropbox cloud servers that can be accessed by other users on personal client computers. This allows multiple users to access the most up-to-date files.
Worst performing:
GrafTech International (EAF) - a manufacturer of graphite electrodes, which are essential to the metal-making process.
Guardant Health (GH) - an oncology company that works to help conquer cancer by providing data sets for advanced analytics.
LexinFintech Holdings (LX) - a China-based online consumer finance platform.
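The percent-change ranking used to pick these stocks can be sketched in a few lines of Python. This is only an illustration: the tickers and prices below are made up, and the `percent_change` helper is a hypothetical name, not something taken from Yahoo Finance.

```python
# Hypothetical (previous close, current close) pairs; NOT real market data.
closes = {
    "AAA": (100.0, 110.0),
    "BBB": (50.0, 45.0),
    "CCC": (20.0, 21.0),
    "DDD": (80.0, 60.0),
}


def percent_change(previous: float, current: float) -> float:
    """Percent change from the previous day's close to the current close."""
    return (current - previous) / previous * 100.0


changes = {ticker: percent_change(prev, cur) for ticker, (prev, cur) in closes.items()}

# Sort tickers from largest gain to largest loss.
ranked = sorted(changes, key=changes.get, reverse=True)
top_gainers = ranked[:2]        # best performers first
top_losers = ranked[-2:][::-1]  # worst performer first

print("Gainers:", top_gainers)
print("Losers:", top_losers)
```

The same ranking extends to any number of tickers; only the slice sizes change.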
Data gathering
The data gathering process was split into four steps:
Scraping Twitter for tweets about each stock:
Search terms: hashtags of the company name, mentions of the company name, and cashtags
Converting the tweets into corpus objects (one for best performing and one for worst performing) that could be analyzed
Pre-processing the tweets to remove URLs, punctuation and stop words (as, is, the, etc.)
Transforming the corpus objects into a Document-Term Matrix (DTM), which lists every word across all of the respective tweets along with how often it appears.
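The pre-processing and DTM steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the stop-word list is a tiny hand-picked sample, the two tweets are invented, and `preprocess` and `document_term_matrix` are hypothetical helper names.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "as", "is", "the", "to", "of", "in", "on", "for", "at"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation character to a space (this also drops '#', '@' and '$').
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet: str) -> list[str]:
    """Lowercase a tweet, strip URLs and punctuation, and drop stop words."""
    text = URL_PATTERN.sub(" ", tweet.lower())
    text = text.translate(PUNCT_TO_SPACE)
    return [token for token in text.split() if token not in STOP_WORDS]


def document_term_matrix(tweets: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in tweet i."""
    counts = [Counter(preprocess(tweet)) for tweet in tweets]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Invented example tweets, for illustration only.
tweets = [
    "Dropbox is up big today! https://example.com/news",
    "Big day for the $DBX stock",
]

vocabulary, matrix = document_term_matrix(tweets)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is one tweet and each column is one term, so column sums give the overall word frequencies that a word cloud visualizes.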
The result
After scraping Twitter for 100 tweets mentioning each of the 6 companies, converting those tweets into corpus objects, pre-processing them to remove extraneous text such as URLs, punctuation and stop words, and transforming the corpora into document-term matrices, I was left with a data set of words.
The DTM for the tweets about the best performing stocks is summarized with the below wordcloud:
The DTM for the tweets about the worst performing stocks is summarized with the below wordcloud:
The sentiment
Now the fun part. Once I had the DTM for each set of tweets, I began to conduct the sentiment analysis.
Step 1: I calculated the sentiment score of each tweet using the Bing lexicon. The Bing lexicon assigns a word +1 if it is listed as positive and -1 if it is listed as negative; words that do not appear in the lexicon contribute nothing. Once each word within the tweet was assigned a score, I summed the scores to produce a final score for the tweet.
Step 2: I combined all of the scores for each set of tweets and calculated summary statistics for the set of tweets as a whole.
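The two steps can be sketched as follows. This is an illustration only: the word sets below are a toy stand-in for the Bing lexicon (hypothetical entries, not copied from the real list), the tokenized tweets are invented, and `tweet_score` and `summarize` are hypothetical helper names.

```python
from statistics import mean, median

# Toy stand-in for the Bing lexicon; hypothetical entries for illustration only.
POSITIVE = {"gain", "great", "strong", "win"}
NEGATIVE = {"loss", "weak", "crash", "bad"}


def tweet_score(tokens: list[str]) -> int:
    """Step 1: +1 per positive word, -1 per negative word, 0 for anything else."""
    return sum((token in POSITIVE) - (token in NEGATIVE) for token in tokens)


def summarize(scores: list[int]) -> dict:
    """Step 2: summary statistics over all tweet scores in one set."""
    return {
        "n": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "median": median(scores),
    }


# Invented, already pre-processed tweets.
tokenized_tweets = [
    ["strong", "gain", "today"],
    ["weak", "quarter", "bad", "guidance"],
    ["earnings", "call", "tomorrow"],
]

scores = [tweet_score(tokens) for tokens in tokenized_tweets]
summary = summarize(scores)
print(scores)
print(summary)
```

A tweet with no lexicon words, like the third one here, scores 0, so a set dominated by such tweets has a mean near zero regardless of how the stock performed.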
The result shows that although the two sets of stocks performed drastically differently (one set performed the best while the other performed the worst), the sentiment towards those companies was roughly the same. The sentiment was neither overtly positive nor overtly negative.
In summary
The sentiment analysis produced unexpected results. Before I performed the analysis, I believed that the tweets for the best performing stocks would be positive overall while the tweets for the worst performing stocks would be negative. However, my results showed that the sentiment of the tweets for both the best and worst performing stocks was fairly similar and was neither overtly positive nor overtly negative.
The surprising result could come down to the dataset itself. The best and worst performing companies were relatively obscure to most people, which could have led to non-opinionated tweets. In addition, the Bing lexicon only measures whether a word is positive or negative. Other lexicons can provide more detailed information. For example, the AFINN lexicon widens the range from Bing's -1/+1 to -5/+5. The nrc lexicon goes beyond positive and negative by tagging words with categories such as anger, fear, joy or trust.
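To make the difference between the lexicon styles concrete, here is a small sketch. All three dictionaries are toy examples with hypothetical entries; they imitate the shape of the Bing, AFINN and nrc lexicons but are not taken from them, and the real values may differ.

```python
from collections import Counter

# Toy lexicons; hypothetical values that only mimic each lexicon's format.
BING_STYLE = {"good": 1, "superb": 1, "bad": -1, "catastrophic": -1}
AFINN_STYLE = {"good": 2, "superb": 5, "bad": -2, "catastrophic": -5}
NRC_STYLE = {"superb": {"joy", "trust"}, "catastrophic": {"fear", "anger"}}


def score(tokens: list[str], lexicon: dict) -> int:
    """Sum the lexicon values of the tokens; unlisted words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


def emotion_counts(tokens: list[str]) -> Counter:
    """Count how many tokens fall into each nrc-style category."""
    counts = Counter()
    for token in tokens:
        counts.update(NRC_STYLE.get(token, ()))
    return counts


mild = ["good", "quarter"]
strong = ["superb", "quarter"]

# The binary lexicon cannot tell the two tweets apart; the graded one can.
print(score(mild, BING_STYLE), score(strong, BING_STYLE))
print(score(mild, AFINN_STYLE), score(strong, AFINN_STYLE))
print(emotion_counts(["superb", "catastrophic", "quarter"]))
```

With a binary lexicon both example tweets get the same score, while the graded lexicon separates mild from strong wording, and the category lexicon reports which emotions are present rather than a single number.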