HappyTweet So Far

Iteration one was the simplest sentiment analysis algorithm I could think of. It looked up a sentiment score for each word from a database and added the scores of all the words in a tweet. The initial results were not great.

Actual Score	Correct	Incorrect	Total	Percent Correct
Positive	124	58	182	68%
Negative	46	131	177	26%
Neutral	60	79	139	43%
Total	230	268	498	46%

It felt like maybe there weren’t that many words that were actually being scored but that was just a hunch. Collecting data proved out that the hunch was correct and only 933 of the 5759 words from the tweets matched scored words from the table.

Stemming

Stemming is a way to reduce a word form down to its root form, or stem. The hope was that by comparing the word stems, there would be more matches and the results would improve.

There were several existing choices of libraries to do this. I used clj-tokenizer to stem both the list of scored words and the words in the tweets which increased the number of words which actually contributed to the scoring from 933 up to 1636.

Overall, scoring more words had a positive impact, but it did produce some curious results.

Actual Score	Correct	Incorrect	Total	Percent Correct
Positive	147	35	182	81%
Negative	57	120	177	32%
Neutral	37	102	139	26%
Total	241	257	498	48%

The curious thing is the drop in scoring the neutral tweets. Also, the number of positive tweets scored correctly is much higher than either negative or neutral tweets. I have a guess as to why the neutral tweets went lower, but will need some data to see if it is correct.

What’s Next

At this point, there are more questions being generated than answers. I’ll probably research the drop in neutral tweet correctness.

Adding the second algorithm resulted in a lot of duplicated code in the runner program. Before adding any other algorithms, I’d like to refactor that code to remove the duplication. There is also duplicated code between the two algorithms. As new algorithms are added, I want to be able to continue to run the old ones. That means I can’t just improve the code as I go along. In an Object Oriented world, inheritance would solve that problem. I’ll have to learn how to do that in a functional world.

The next algorithm change I’d like to look at involves using word vectors to find similar words for words that are not in the scored words table.