Emotion Detection in Social Media in Brazil

Foreword

In this blog post, David de Hilster shares his perspective as mentor to Pedro Lima Rodrigues, who completed a project on emotion detection in social media posts and submitted it to the annual HPCC Systems poster competition, held every fall. Below, David walks you through the project from conception to delivery.

How it began

Sentiment analysis is one of the “toy examples” commonly given in school or course projects to students who are interested in learning natural language processing. “Toy” in the sense that these systems have fatal flaws that keep them from being used in the real world. This was not the case with the sentiment analyzer built by university student Pedro Lima Rodrigues at São Paulo University in Brazil. Pedro’s analyzer could become part of a real-world system with the potential to be used by soccer websites. To build it, Pedro used the NLP++ plugin for HPCC Systems, constructing the first sentiment analyzer in Portuguese built with NLP++ technology.

Having been in the area of practical NLP systems for four decades, I find it frustrating to see so many “toy systems” in NLP and Machine Learning being built by students who think that these systems are real or will serve them well in their future careers. What students are often taught in computer science are systems that do not translate to the real world. Sentiment analysis is no different. Students find ready-made systems or training sets for sentiment analysis and repeat the same computations made by everyone else in the world. And like everyone else in the world, the system they build has no chance of ever becoming a real-world system. This to me is a missed opportunity. We could be preparing our students to go out into the real world with useful, real-world software experience.

Luckily, Pedro had the chance to do just that, building a practical real-world system with HPCC Systems and its NLP++ plugin. HPCC Systems, the NLP++ plugin, and NLP++ are all real-world open source technologies that can create real-world software systems.

Objective

The objective of this project was quite interesting: map out the live emotions of a fanbase during a soccer game. You could imagine a sentiment analyzer on a soccer team’s web page that tracks emotions during a game, just as such pages already track the changing chance of winning. This would add a very interesting output for any sports website that runs live stats during a soccer match.

Data Set

Given that this was to be a real-world practical system, Pedro chose to gather tweets from fans during real Brazilian soccer matches. The criteria for selection were tweets that corresponded to a live game. In addition, the tweets were to involve the Brazilian team Sociedade Esportiva Palmeiras, or Palmeiras for short. Tweets were gathered during Palmeiras games and put into an HPCC Systems Database.

Pedro captured 100 tweets every 15 minutes during random Palmeiras soccer matches.

Sentiment Definitions

Pedro looked at various taxonomies for sentiment, but found them too generic and not applicable enough to this particular task, which is to judge the emotions of soccer fans. An important difference between using NLP++ and using an existing generic sentiment analyzer is that the NLP++ analyzer can be 100% tailored to the particular task.

Pedro analyzed the tweets and came up with an emotion taxonomy of his own, with the following sentiment definitions:

  • Funny – laughter as written in Portuguese texting, such as “kkkkkk”, “ksksks”, and “hahaha”.
  • Anger – swearing, talking about referees, and words such as “stolen” or “robbed”.
  • Support – phrases like “let’s go”, “come on”, or “go for it” or other words of encouragement. Also phrases of affection for the team like “our team” or “comeback team”.
  • Happiness – any word for “goal” or “score” and all its variations.
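
To make the taxonomy concrete, here is a minimal sketch of cue-based emotion tagging. It is written in Python purely for illustration; it is not Pedro’s NLP++ code. The laughter and “gol” cues come from the definitions above, while the other Portuguese terms (“roubado”, “juiz”, “vamos”, “nosso time”) are assumed translations of the English examples, and the names EMOTION_PATTERNS and detect_emotions are hypothetical.

```python
import re

# Illustrative cue patterns only. The Portuguese terms for anger and support
# are assumptions, not the vocabulary from Pedro's analyzer.
EMOTION_PATTERNS = {
    "funny": [r"\bk{3,}\b", r"\b(?:ks){2,}\b", r"\b(?:ha){2,}\b"],
    "anger": [r"\broubad[oa]s?\b", r"\bjuiz\b"],
    "support": [r"\bvamos\b", r"\bnosso time\b"],
    "happiness": [r"\bgo+l+\b"],
}


def detect_emotions(tweet):
    """Return the set of emotions whose cue patterns appear in the tweet."""
    text = tweet.lower()
    return {
        emotion
        for emotion, patterns in EMOTION_PATTERNS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }
```

A tweet can carry more than one emotion, so the function returns a set rather than a single label. A real analyzer would need far richer rules and context than a flat list of patterns.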

Implementation

One of the things I constantly repeat to students or programmers when programming in NLP++ is to “think like a human” when designing an NLP++ analyzer.

Besides emojis and the words and phrases related to sentiment, there were a few other considerations in understanding these tweets. One is repeating letters. We are all familiar with the long-drawn-out soccer announcer’s call of “GGGGGOOOOOOOLLLLLLLL”. It turns out that repeating letters follow a distinct pattern and normally occur with certain words. The number of repeated letters is arbitrary, so a special function was created to recognize words such as “gol” however they are stretched.
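
One way to handle stretched words can be sketched as follows. This is a Python illustration of the general idea, not the special function from Pedro’s analyzer. It assumes a small lexicon of words that fans tend to elongate; the lexicon contents and the names collapse_repeats and normalize_elongated are hypothetical.

```python
import re


def collapse_repeats(word):
    """Lowercase the word and reduce every run of a repeated character to one."""
    return re.sub(r"(.)\1+", r"\1", word.lower())


# Hypothetical mini-lexicon of words that fans tend to stretch out.
# Keys are the collapsed forms, values are the canonical spellings.
ELONGATION_LEXICON = {collapse_repeats(w): w for w in ("gol", "vamos", "porco")}


def normalize_elongated(word):
    """Return the canonical word for a stretched spelling, or None if unknown."""
    return ELONGATION_LEXICON.get(collapse_repeats(word))
```

Collapsing alone would damage words with legitimate double letters, which is why the sketch only accepts a collapsed form when it matches a lexicon entry.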

Other interesting aspects of this analyzer were all the different names that fans call their team. These are very specific to the team and fanbase. For Palmeiras, some of the names used were “Big Green” (verdao), “pig” (porco) or pig emoji 🐖, and “palestra” (lecture) which comes from the historic name Palestra Italia for the team.
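
The nickname idea can be sketched as a simple alias lookup. The Python below is illustrative only and is not taken from Pedro’s analyzer: the aliases are the ones mentioned above, accents are stripped so that an accented spelling matches “verdao”, and the names TEAM_ALIASES, strip_accents, and mentions_team are hypothetical.

```python
import re
import unicodedata

# Aliases taken from the names listed above, stored without accents.
TEAM_ALIASES = {"palmeiras", "verdao", "porco", "palestra", "🐖"}


def strip_accents(text):
    """Remove combining accent marks so accented and plain spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mentions_team(tweet):
    """Return True if any token of the tweet is a known team alias."""
    # Words become tokens; every other non-space character (emoji,
    # punctuation) becomes its own single-character token.
    tokens = re.findall(r"\w+|[^\w\s]", strip_accents(tweet.lower()))
    return any(token in TEAM_ALIASES for token in tokens)
```

A plain lookup like this cannot tell when “palestra” simply means a lecture or “porco” an actual pig, which is the kind of ambiguity that context-aware rules are meant to resolve.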

Emojis are an important part of sentiment analysis in social media. Pedro classified emojis using a set of NLP++ rules. NLP++ treats emojis just like words, which allows for developing rules, dictionaries, or knowledge bases using emojis. (Since Pedro’s implementation, emoji dictionaries and knowledge bases have been added that come standard with VisualText, the VSCode extension for NLP++.)
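
Treating emojis like words amounts to letting them appear as dictionary keys. The following Python sketch shows that idea in its simplest form; it does not reproduce NLP++ rules or the VisualText emoji dictionaries. Apart from the pig emoji, which the article ties to Palmeiras, the emoji-to-label assignments are assumptions, and EMOJI_LABELS and label_emojis are hypothetical names.

```python
# Emoji-to-label dictionary. Only the pig emoji comes from the article;
# the other assignments are assumptions made for illustration.
EMOJI_LABELS = {
    "🐖": "team",
    "😂": "funny",
    "😡": "anger",
    "💚": "support",
}


def label_emojis(tweet):
    """Return (emoji, label) pairs for each known emoji, in order of appearance."""
    return [(ch, EMOJI_LABELS[ch]) for ch in tweet if ch in EMOJI_LABELS]
```

This character-by-character scan only handles emojis that are a single code point; sequences built from several code points would need a proper tokenizer.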

Here are some NLP++ rules in Pedro’s analyzer:

Results

Given the limited time and effort spent creating this sentiment analyzer, the accuracy was quite good.

  • “support” was 93%
  • “funny” was 97%
  • “happiness” was 99%
  • “anger” was 74%
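
The article does not say exactly how these percentages were computed. One common reading is per-emotion accuracy: of the tweets hand-labeled with a given emotion, the fraction the analyzer tagged the same way. The Python sketch below works under that assumption, and per_label_accuracy is a hypothetical helper name.

```python
from collections import Counter


def per_label_accuracy(gold, predicted):
    """For each gold label, return the fraction of its tweets predicted correctly."""
    totals, correct = Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        totals[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {label: correct[label] / totals[label] for label in totals}
```

For example, with made-up labels `["anger", "anger", "funny", "funny"]` and predictions `["anger", "support", "funny", "funny"]`, the function returns 0.5 for “anger” and 1.0 for “funny”.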

Pedro mentioned that more time would have to be spent on the “anger” sentiment to bring it up to the 90% range as well. The difference between using the HPCC Systems NLP++ plugin and using the Machine Learning bundle is that NLP++ needs no training or large test set. NLP++ relies on the human programmer’s ability to write a generalized analyzer that can handle unseen tweets. When problems do arise, it also allows them to be identified and corrected in a principled and logical way, which is not the case with statistical methods such as machine learning, neural networks, or large language models.

Here are some actual tweets and the emotion detected in each.

Visualizations

I suggested to Pedro that if we graphed these emotions on a timeline and overlaid the significant occurrences of the game on top of the graph, we should see those occurrences trigger a change in emotions. After some work, Pedro succeeded, and there was a clear correlation between major events in the soccer game and shifts in sentiment among the fanbase.
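
The data behind such a timeline can be sketched as emotion counts per time window. The Python below is an illustration, not Pedro’s plotting code. It assumes each tagged tweet comes with the minute of the match it was posted in, and it defaults to 15-minute windows to mirror the sampling interval; emotion_timeline is a hypothetical name.

```python
from collections import Counter, defaultdict


def emotion_timeline(tagged_tweets, bucket_minutes=15):
    """Count emotions per time window.

    tagged_tweets is an iterable of (minute_of_match, emotion) pairs.
    The result maps each window's starting minute to a Counter of emotions.
    """
    timeline = defaultdict(Counter)
    for minute, emotion in tagged_tweets:
        bucket = (minute // bucket_minutes) * bucket_minutes
        timeline[bucket][emotion] += 1
    return dict(timeline)
```

Plotting one line per emotion over the window start times, with vertical markers at goals or disputed calls, would then show whether the counts shift after each event.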

Here are the resulting graphs showing a clear correlation between sentiment changes and important events during the game:

Conclusion

NLP++ can be deceiving. It makes Pedro’s task of building a high-performing, tailored sentiment analyzer seem “too easy”. This is because real-world NLP tasks are specific and require very specific linguistic and world knowledge, and NLP++’s linguistic and cognitive model allows for encoding this knowledge in a very compact and precise form. Using other programming languages such as Python, JavaScript, or C++ would require building an infrastructure for text, rules, and knowledge. Generic NLP technologies, meanwhile, are quickly abandoned when applied to real-world, specific tasks such as Pedro’s sentiment analyzer of tweets for Palmeiras fans.

I asked Pedro at the end of this project: “Would this sentiment analyzer work for another team without modification?” His answer was an emphatic “NO”. He said that each team has a very specific history and vocabulary known only to its fanbase. Even though a portion of the sentiment analyzer is common to all soccer matches, each team has its own slang and knowledge that would have to be coded into the sentiment analyzer.

The HPCC Systems NLP++ plugin allows for fast development of the specific vocabulary and knowledge needed to add more teams to this project. And who knows, this could eventually lead to live, real-world, highly accurate sentiment analysis of tweets for soccer teams using the power of HPCC Systems and its NLP++ plugin – a match made in soccer heaven!

Links

Here are some links to Pedro’s work including the NLP++ code for the sentiment analyzer:

About the Author

David de Hilster is a Consulting Software Engineer on the HPCC Systems Development team and has been a computer scientist in the area of artificial intelligence and natural language processing (NLP) working for research institutions and groups in private industry and aerospace for more than 30 years. He earned both a B.S. in Mathematics and an M.A. in Linguistics from The Ohio State University. David has been with LexisNexis Risk Solutions since 2015 and his top responsibility is the ECL IDE, which he has been contributing to since 2016. David is one of the co-authors of NLP++, a computer language for human language, and of its IDE, VisualText.

David can frequently be found contributing to Facebook groups on natural language processing. In his spare time, David is active in several scientific endeavors and is an accomplished author, artist and filmmaker.