JOINT_FORCES: Unite Competing Sentiment Classifiers with Random Forest

In this paper, we describe how we created a meta-classifier to detect the message-level sentiment of tweets. We participated in SemEval-2014 Task 9B by combining the results of several existing classifiers using a random forest. The results of 5 other teams from the competition as well as of 7 general-purpose commercial classifiers were used to train the algorithm. In this way, we obtained a gain of up to 3.24 F1 score points.


Introduction
The interest in sentiment analysis grows as the amount of publicly available text content grows. As one of the most widely used social media platforms, Twitter provides its users with a unique way of expressing themselves. Thus, sentiment analysis of tweets has become a hot research topic in both academia and industry. In this paper, we describe our approach of combining multiple sentiment classifiers into a meta-classifier. The introduced system participated in SemEval-2014 Task 9, "Sentiment Analysis in Twitter", Subtask B, "Message Polarity Classification" (Rosenthal et al., 2014). The goal was to classify a tweet on the message level using the three classes positive, negative, and neutral. Performance is measured using the macro-averaged F1 score of the positive and negative classes, which is simply called "F1 score" throughout the paper. An almost identical task was already run in 2013 (Nakov et al., 2013). The tweets for training and development were provided only as tweet IDs. A fraction (10-15%) of the tweets was no longer available on Twitter, which makes the results of the competition not fully comparable. For testing, in addition to last year's data (tweets and SMS), new tweets and data from a surprise domain (LiveJournal) were provided. An overview of the provided data is shown in Table 1.
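The evaluation measure described above can be illustrated with a short sketch. It averages the per-class F1 of the positive and negative classes only; neutral tweets still influence the score because they can appear as false positives or false negatives of the other two classes. The function names and the toy labels are assumptions made for illustration and are not taken from the official scorer.

```python
def f1_for_class(gold, pred, label):
    """F1 score of a single class, computed from gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when the class is never hit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def semeval_f1(gold, pred):
    """Macro-average of the F1 scores of the positive and negative classes."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2


# Hypothetical toy labels, only to show how the function is called.
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "negative", "positive"]
score = semeval_f1(gold, pred)
print(score)
```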
Using additional manually labelled data for training the algorithm was not allowed for a "constrained" submission. Submissions using additional data for training were marked as "unconstrained".
Our System. The results of 5 other teams from the competition as well as of 7 general-purpose commercial classifiers were used to train our algorithm. The scientific subsystems were s_gez (Gezici et al., 2013), s_jag (Jaggi et al., 2014), s_mar (Marchand et al., 2013), s_fil (Filho and Pardo, 2013), and s_gun (Günther and Furrer, 2013). They are all "constrained" and machine learning-based, some with hybrid rule-based approaches. Commercial subsystems were provided by Lymbix (c_lym), MLAnalyzer (c_mla; see footnote 1), Semantria (c_sem), Sentigem (c_snt), Syttle (c_sky), Text-Processing.com (c_txp), and Webknox (c_web). Subsystems c_txp and c_web are machine learning-based, c_sky is rule-based, and c_mla is a mix (the approaches of the other tools are unknown). All subsystems were designed to handle tweets and further text types. Our submission included a subset of all classifiers, including unconstrained ones, which makes it an unconstrained submission. The 2014 winning team obtained an F1 score of 70.96 on the Twitter2014 test set. Our approach was ranked 12th out of the 50 participating submissions, with an F1 score of 66.79. Our further rankings were 12th on the LiveJournal data, 12th on the SMS data, 12th on Twitter-2013, and 26th on Twitter Sarcasm.
Improvement. Although our meta-classifier did not reach a top position in the competition, we were able to beat even the best single subsystem it was based on for almost all test sets (except sarcasm). In previous research we showed the same behaviour on different systems and data sets. This shows that other systems from the competition, even the best ones, can probably be improved using our approach.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

Approach
Meta-Classifier. A meta-classifier predicts a class by combining the individual results of other classifiers. A robust classifier that can handle categorical input such as sentiments by design is the random forest classifier (Breiman, 2001). For training, the algorithm uses the outputs of the individual classifiers as features and the labels of the training data as targets. Afterwards, in the test phase, the random forest makes predictions using the outputs of the same individual classifiers. We use the random forest implementation of the R package "randomForest" and treat the three votes (negative, neutral, positive) as categorical input.
Training Data. To build a meta-classifier, first, one has to train all the subsystems with a dataset. Second, the meta-classifier has to be trained based on the output of the subsystems with a different dataset than the one used for training the subsystems. We decided to take the natural split of the data provided by the organizers (see Table 1). For the scientific subsystems we used the Training set to train on; for training the random forest classifier we used the Dev set. The commercial systems were used "as-is"; in particular, we did not train them on any of the provided data sets. Table 2 shows the performance of the individual subsystems on the different data sets.
Footnote 1: mashape.com/mlanalyzer/ml-analyzer
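The idea of training a random forest on the categorical votes of the subsystems can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the R "randomForest" implementation used in the paper: it splits each node into one branch per vote value instead of using binary splits, and the three subsystems, the toy Dev set, and all function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, features, rng, mtry):
    """Grow one tree; a leaf is a label, a node is (feature, children, majority)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    # Consider only a random subset of the remaining features at this node.
    candidates = rng.sample(features, min(mtry, len(features)))

    def split_cost(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) * gini(g) for g in groups.values())

    best = min(candidates, key=split_cost)
    remaining = [f for f in features if f != best]
    children = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = grow_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx],
                                    remaining, rng, mtry)
    return (best, children, majority)


def predict_tree(tree, row):
    """Follow the votes down the tree; unseen vote values fall back to the majority."""
    while isinstance(tree, tuple):
        feature, children, majority = tree
        tree = children.get(row[feature], majority)
    return tree


def train_forest(rows, labels, n_trees=50, seed=0):
    """Train each tree on a bootstrap sample of the subsystem votes."""
    rng = random.Random(seed)
    features = list(range(len(rows[0])))
    mtry = max(1, int(len(features) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                features, rng, mtry))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical Dev set: each row holds the votes of three made-up subsystems.
patterns = [
    (("positive", "positive", "positive"), "positive"),
    (("negative", "negative", "neutral"), "negative"),
    (("neutral", "neutral", "neutral"), "neutral"),
    (("negative", "neutral", "negative"), "negative"),
    (("positive", "positive", "neutral"), "positive"),
]
dev_votes = [votes for votes, _ in patterns] * 10
dev_labels = [label for _, label in patterns] * 10

forest = train_forest(dev_votes, dev_labels)
prediction = predict_forest(forest, ("negative", "negative", "negative"))
print(prediction)
```

In this sketch the subsystems are treated as black boxes: only their votes on the Dev set are needed, which mirrors how the commercial systems were used "as-is".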

Experiments
There exist three obvious selections of subsystems for our meta-classifier: all subsystems, only scientific subsystems, and only commercial subsystems (called All_Subsystems, All_Scientific, and All_Commercial, respectively). Table 3 shows the performance of these selections of subsystems on the data sets. For comparison, the first row of the table also shows the performance of the overall best individual subsystem. It turns out that All_Subsystems is almost always better than the best individual subsystem, while the other two meta-classifiers are inferior.
Testing All Subsets. We performed a systematic evaluation of how the performance depends on the choice of a particular selection of individual subsystems. This resembles feature selection, which is a common task in machine learning. As a general trend, we see that the performance increases with the number of classifiers; however, there exist certain subsets which perform better than using all available classifiers.
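A systematic evaluation of this kind can be sketched as an exhaustive enumeration of subsystem subsets. With the 12 subsystems listed in the introduction, there are 2^12 - 1 = 4095 non-empty subsets. The scoring function below is an arbitrary stand-in and not a real quality measure; in the setting of this paper it would be replaced by the score of a meta-classifier trained on the given subset. All function names are assumptions made for illustration.

```python
from itertools import combinations

# Subsystem identifiers as listed in the introduction.
SUBSYSTEMS = ["s_gez", "s_jag", "s_mar", "s_fil", "s_gun",
              "c_lym", "c_mla", "c_sem", "c_snt", "c_sky", "c_txp", "c_web"]


def all_subsets(names):
    """Yield every non-empty subset, ordered by increasing size."""
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            yield subset


def best_per_size(names, score):
    """Return {size: (best_score, best_subset)} for a given scoring function."""
    best = {}
    for subset in all_subsets(names):
        s = score(subset)
        size = len(subset)
        if size not in best or s > best[size][0]:
            best[size] = (s, subset)
    return best


def toy_score(subset):
    """Hypothetical placeholder score, used only to make the sketch runnable.

    It rewards mixing scientific (s_) and commercial (c_) subsystems and
    penalises large subsets; it says nothing about real classifier quality.
    """
    kinds = {name[0] for name in subset}
    return len(kinds) + 1.0 / len(subset)


n_subsets = sum(1 for _ in all_subsets(SUBSYSTEMS))
best = best_per_size(SUBSYSTEMS, toy_score)
overall_score, overall_subset = max(best.values())
print(n_subsets, overall_subset)
```

Keeping the best subset per size, as well as the overall best, makes it easy to compare small, cheap subsets against the full set of classifiers.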
Best Subset Selection. In Figure 1, we marked, for each number of subsystems, the highest out-of-bag (OOB) F1 score on the Dev set with a diamond. In addition, the subset with the overall highest OOB F1 score, consisting of 7 classifiers, is displayed as a filled diamond. We also evaluated the performance of these "best" subsets on unseen test data. In Figure 2, we show the results on the Twitter2014 test set. The scores for the subsets marked in Figure 1 are displayed in the same way there. For comparison, we marked the performance of the system with all classifiers with a straight line. We find that all subsets that are "best" on the Dev set perform very well on the Twitter2014 set. In fact, some even beat the system with all classifiers. Similar behaviour can be observed for Twitter2013 and LiveJournal2014 (data not shown), while All_Subsystems yields significantly superior results on SMS2013 (see Figure 3). No conclusive observation is possible for Sarcasm2014 (data not shown).
To address the question of whether to use the subset with the highest OOB F1 score on the Dev set (called Max_OOB_Subset) or to use all available classifiers, we show in Table 3 the performance of these systems on all test sets in rows 2 and 5, respectively. Since All_Subsystems is the best classifier in 2 out of 5 cases, and Max_OOB_Subset in 3 out of 5 cases, no decisive answer can be given. However, we find that All_Subsystems generalizes better to foreign types of data, while Max_OOB_Subset performs well on similar data (in this case, tweets).
Table 3: Performance (in F1 score) of meta-classifiers with different subsystems. The subset used in our submission is composed of s_gez, s_jag, s_mar, s_fil, s_gun, c_sma, c_sky, c_snt. "Max_OOB_Subset" is composed of s_jag, s_mar, s_gun, c_lym, c_sma, c_sky, c_txp. Bold shows best result per data set. The first row shows results of the best individual subsystem.

Conclusion
We have shown that a meta-classifier approach using a random forest can beat the performance of the individual sentiment classifiers it is based on. Typically, the more subsystems are used, the better the performance. However, there exist selections of only a few subsystems that perform comparably to using all subsystems. In fact, a good selection strategy is to select the subset which has the maximum out-of-bag F1 score on the data used to train the meta-classifier (the Dev set). This subset performs slightly better than All_Subsystems on similar data sets, and only slightly worse on new types of data. An advantage of this subset is that it requires fewer classifiers (7 instead of 12 in our case), which reduces the cost (runtime or license fees) of the meta-classifier.