by Flavio Villanustre on 04/16/2012
A couple of days ago, I was discussing with a few colleagues how to get a small project on sentiment analysis going. I was explaining that the HPCC Systems platform, with its ECL-ML (ECL Machine Learning) libraries, currently has a good set of tools to get a large chunk of the work done, effortlessly.
When asked how to get started, I indicated that there are a few good public labeled corpus repositories that could be used to train a classifier, such as this one, or this one, or even this one. Of course, if you feel courageous enough, you could mine a corpus yourself from, for example, Twitter, following these guidelines.
But having a good training corpus is only the beginning of the story. You'll still need to decide how you'll architect your features and which classifier you will use, and there are some tricks of the trade that always need to be considered, if you want reasonable accuracy.
I would recommend, before doing anything else, taking a look at the latest version (1.2b) of the ECL-ML guide (yes, this is the docbook source, as the PDF for this version should be coming out this week, or possibly next), and particularly at the "Using ML with documents (ML.Docs)" section. Several enhancements and new features have been added since ECL-ML version 1.0, so I strongly encourage using the latest version of the libraries.
When you get to the definition of features, there are a couple of tricks that tend to be quite useful, specifically around the treatment of negations and semantically irrelevant frequent words, and also the use of Laplacian smoothing.
When dealing with unigrams (single words) and short n-grams, the inverted meaning created by a negation can be invisible to the classifier, unless context is provided for the entire phrase. For example, a phrase like "This movie isn't really good" could appear as positive if the classifier ignores the fact that "isn't" is negating "good". For the English language, one simple trick is to parse the phrases and prepend "NOT_" to every word appearing after "not" or a word ending in "n't". The net effect is potentially doubling the dictionary size, but the classifier can now accurately differentiate a "NOT_good" from a "good" and count the sentiment appropriately.
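The negation trick can be sketched in a few lines of Python. This is only an illustration, not the ECL-ML implementation: the function name mark_negation and the tokenizing regular expressions are made up for this example, and ending the negated span at the next punctuation mark is an assumption of the sketch (the description above simply marks every following word).

```python
import re

# Tokens that open a negated span: "not" itself, or any word ending in "n't".
NEGATION = re.compile(r"^(?:not|\w*n't)$")
# Words (optionally containing one apostrophe) or single punctuation marks.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,!?;:]")
PUNCTUATION = set(".,!?;:")


def mark_negation(text):
    """Return lowercase word tokens, prefixing NOT_ to words that follow a negation.

    Assumption of this sketch: the negated span ends at the next
    punctuation mark. Punctuation itself is dropped from the output.
    """
    marked = []
    negated = False
    for token in TOKEN.findall(text.lower()):
        if token in PUNCTUATION:
            negated = False          # punctuation closes the negated span
        elif NEGATION.match(token):
            negated = True           # keep the negation word itself unmarked
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


print(mark_negation("This movie isn't really good"))
```

With this tokenization, "good" and "NOT_good" become two separate dictionary entries, so each one accumulates its own counts per sentiment class.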
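If you are planning to use Naive Bayes as the classifier (this applies to other classifiers too), keep in mind that, unfortunately, your training set will never contain every word that you will ever see. A word that never appeared under a given class gets a probability of zero, and since Naive Bayes multiplies the per-word probabilities together, that single zero wipes out the whole score (multiplying by 0 is never good :) ). It's good practice to use Laplacian smoothing, which reserves a little bucket of probability for those unseen words.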
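As a rough illustration of the smoothing idea, here is a minimal Python sketch of add-alpha (Laplace) smoothing for per-class word probabilities. It is not the ECL-ML code: the function names, the toy data layout, and the choice to reserve exactly one extra vocabulary slot for all unseen words are assumptions made for this example.

```python
import math
from collections import Counter


def train_counts(docs):
    """docs is a list of (tokens, label) pairs.

    Returns a dict mapping each label to a Counter of word counts,
    plus the set of all words seen in training.
    """
    counts = {}
    vocab = set()
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab


def smoothed_log_prob(word, label, counts, vocab, alpha=1.0):
    """Log of P(word | label) with add-alpha smoothing.

    Assumption of this sketch: the denominator uses len(vocab) + 1 slots,
    the extra one being the bucket reserved for words never seen in training.
    """
    total = sum(counts[label].values())
    numerator = counts[label][word] + alpha      # Counter returns 0 for unseen words
    denominator = total + alpha * (len(vocab) + 1)
    return math.log(numerator / denominator)


docs = [(["good", "great"], "pos"), (["bad"], "neg")]
counts, vocab = train_counts(docs)
print(math.exp(smoothed_log_prob("good", "pos", counts, vocab)))
print(math.exp(smoothed_log_prob("never_seen", "pos", counts, vocab)))
```

Working in log space avoids numeric underflow when many small probabilities are combined, and the smoothed estimate for an unseen word is small but never zero.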
And finally you'll need to decide on what classifier to use. While Naive Bayes tends to be an easy choice (and there is an excellent Naive Bayes implementation as part of the ECL-ML libraries), the strong independence assumption that Naive Bayes makes can lead to less than optimal accuracy. Other classifiers, such as Logistic Regression, Support Vector Machines and perceptrons usually make for a better choice. Good implementations of these algorithms are also part of the ECL-ML libraries, and utilizing any one of them from within HPCC is extremely simple (please refer to the documentation on ECL-ML linked above).
I hope this short guide serves as a good introduction to sentiment analysis and text classification on HPCC, and don't forget to take a look at the ECL-ML portal at: http://hpccsystems.com/ml.