Too Good to be True — NLP
“It can’t be this easy.” That’s what I said to myself when getting started on my latest NLP project. I had seen people claiming near-perfect metrics on the dataset I had chosen for Fake News classification, but I assumed they must have put in some effort. I soon found out that it was extremely simple, and that the harder thing was to do worse. At first that may sound like a strange goal, but my ultimate goal was not just to tell the difference between the two classes in this dataset; it was to have a model that generalizes well, and that may come at the price of lower accuracy, precision, and recall. What follows is the path I took.
The Initial Data
On Kaggle I found the “Fake and Real News Dataset,” put together by the University of Victoria ISOT Research Lab. It contains Fake News from websites identified by Politifact and Wikipedia as unreliable (unfortunately it doesn’t say which websites), and Truthful News from a single source, Reuters.
First, not knowing which websites the Fake News came from was disappointing. That information could have been helpful as part of the classification process, but more importantly, I had no way of knowing how representative these stories are of Fake News in general.
Second, having true news stories from only a single source, Reuters, seemed like it could lead to having an overfit model. Nevertheless, it seemed like a legitimate dataset and I wanted to check out the data and train a model.
EDA and a Simple Heuristic
During the EDA process, I looked at the length of the stories and titles, the number of capital letters in the titles, and the percentage of capital letters in the titles. What stood out right away was the distribution of the percentage of capital letters in the news titles for the two classes.
As you can see in the above visualization, there is very little overlap between the two classes. I could achieve very high classification accuracy using a heuristic based on this observation alone. Splitting the news stories at 13% capital letters, with less being Truthful and more being Fake, gave an overall accuracy of 98% with F1 scores of 0.98. You can’t get much better than that; I wouldn’t even need a machine learning model. This is where the project started to seem like it had results that were too good to be true (or at least too good to be generalizable).
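The heuristic itself is only a few lines of code. This is a sketch of the idea rather than my original implementation, using the 13% cutoff from the plot:

```python
def pct_capitals(title: str) -> float:
    """Fraction of the alphabetic characters in a title that are uppercase."""
    letters = [c for c in title if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

def classify_by_capitals(title: str, threshold: float = 0.13) -> str:
    # More than 13% capital letters -> Fake, otherwise Truthful
    return "Fake" if pct_capitals(title) > threshold else "Truthful"
```

An all-caps clickbait title sails past the threshold, while a sentence-case title stays well under it.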
While I could differentiate these two classes simply by looking at the percentage of capital letters in their titles, was this particular to this specific set of data? I first looked at the misclassified stories. The Fakes misclassified as Truthful primarily had the first letter of each word capitalized along with capitalized acronyms. The Truthful stories misclassified as Fake were all short, capitalized only the first letter of the title and proper nouns, and had a lot of capitalized acronyms. In both directions the misclassified titles shared a similar style, and it was the length of the title that pushed the percentage of capitals above or below 13%.
Next I looked at the correctly classified stories. The Truthful ones only capitalized the first letter of the title and proper nouns and had minimal acronyms. The Fake ones were full of words in all capitals and every word was capitalized.
At this point I wondered if Reuters’ style of capitalizing only the first word of the title and proper nouns was typical of what would be considered Truthful news. It didn’t take long to come up with an answer of ‘no’, the most obvious example of Truthful news that didn’t follow it being the New York Times, whose style is to capitalize the first letter of every word. That being the case, this heuristic was not generalizable and I had to look for a better way to predict the class of a news story.
Given that looking at capital letters did not generalize well outside this dataset, I needed to normalize the text as much as possible to try to find a generalizable model. I set all the letters to lowercase, removed numbers and dates, removed the references to Reuters that appeared in the text of almost every Truthful story, removed URLs, anonymized Twitter handles, and removed ’s from the Truthful stories because all the apostrophes had been removed from the Fake ones before the dataset was published.
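A rough sketch of that normalization step looks like this (the regex patterns here are illustrative, not the exact ones I used):

```python
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)                 # remove URLs
    text = re.sub(r"@\w+", "twitter-handle", text)           # anonymize Twitter handles
    text = re.sub(r"\(reuters\)|reuters", "", text)          # remove Reuters references
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "", text)  # remove simple dates
    text = re.sub(r"\d+", "", text)                          # remove remaining numbers
    text = text.replace("'s", "")                            # strip 's to match the Fake stories
    return text
```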
I started with a bag-of-words model using a Naive Bayes Classifier that looked at the title and text separately and together, with and without lemmatization, and with and without stopwords. All of these combinations returned overall accuracy and F1 scores of 0.95 to 0.98. Again I felt that these results were too good to be true and that the model would not generalize.
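The setup looked roughly like this. It is a minimal sketch on a toy stand-in corpus; the real input was the normalized title and text columns:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; 0 = Fake, 1 = Truthful
docs = [
    "shocking video you wont believe",
    "senate passes budget bill",
    "amazing shocking clip goes viral",
    "officials said talks continue",
]
labels = [0, 1, 0, 1]

# Bag of words feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
```

With the real data, swapping `CountVectorizer` options (lemmatized tokens, stopword removal) covers the combinations described above.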
I next trained a Random Forest Classifier on the same combinations as the Naive Bayes and got similar metrics. Random Forest also provides a list of the features it found most important, and I was hoping it would give me some insight into what the model was focusing on. Some of the top features were “said”, “video”, “image”, “u.s.”, “!”, and “twitter-handle”. The last of these was my placeholder from anonymizing the Twitter handles. I didn’t know how all these words worked together to classify a news story, but whether people were being quoted, whether there were videos, images, and Twitter handles, and whether exclamation points were being used were all very important. From looking at a sample of the news stories I knew that videos, Twitter handles, and exclamation points were very typical of the Fake News in the dataset. The inclusion of “u.s.” might indicate that one of the classes had more stories about the U.S. than international news, or simply that one class used “US” instead of “U.S.”
Maybe it is that easy to differentiate the two classes of news, at least in this dataset. But what if a news story was about a topic outside of this dataset or featured different people or organizations? I wanted to find a way to remove words referencing specific people, organizations, companies, or jargon. I first tried using a Named Entity Recognizer (NER) to identify the people in the dataset so that I could remove them and retrain the model. Unfortunately, running the NER proved to be beyond the time and computing power I had available. As a proxy, I decided to try training the model using only stopwords. They are often removed when doing NLP, but by focusing only on the stopwords I hoped the model would be able to use the writing style of the news to differentiate the two classes. I found a larger set of stopwords on GitHub that combined several common stopwords lists.
I trained a Random Forest Classifier on the bag of words using only the stopwords list and was still able to get an accuracy of 0.93 and a macro average F1 score of 0.93. This was a surprising but welcome result. Maybe I was onto something.
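Restricting the vectorizer’s vocabulary to a stopword list is one simple way to implement this. Here is a sketch using scikit-learn’s built-in English stopwords as a stand-in for the larger combined list from GitHub:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Only stopwords survive vectorization, so the model sees style, not topic
stopwords = sorted(ENGLISH_STOP_WORDS)
vec = CountVectorizer(vocabulary=stopwords)
X = vec.fit_transform(["The senator said he would not vote for the bill"])
```

Every content word (“senator”, “vote”, “bill”) is silently dropped; only function words like “the”, “he”, “would”, “not”, and “for” get counted.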
Using bag of words is pretty primitive, so I tried the same stopwords-only dataset using tf/idf, again with a Random Forest. The results were marginally better.
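Swapping in tf/idf is essentially a one-line change (again a sketch, with the built-in stopword list standing in for the combined one):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import make_pipeline

docs = ["you wont believe what he did to them", "the senate passed a bill on tuesday"]
labels = [0, 1]  # 0 = Fake, 1 = Truthful

model = make_pipeline(
    TfidfVectorizer(vocabulary=sorted(ENGLISH_STOP_WORDS)),  # stopwords only, tf/idf weighted
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(docs, labels)
```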
Success, or so I thought
At this point I was pretty satisfied with my model. It had respectable accuracy and F1 scores and didn’t use any people, places, events or jargon during classification, so it must generalize well.
Only having a single source for the True stories still didn’t sit well with me though, so I looked for an easily available set of Truthful news stories that I could test my model on. I found a set of news stories from The Guardian that was also hosted on Kaggle. I selected only the stories that had to do with politics and started my test. To my dismay, the results of the test were little better than random chance, as you can see below.
# 0 is Fake
# 1 is Truthful
(array([0, 1]), array([5495, 7099]))
I wasn’t expecting accuracy as high as the model achieved on its own test set, and probably a little worse given that The Guardian is based in the UK, even though it has plenty of coverage of events in the USA. But an accuracy of only 56% was unacceptable.
Back to the Drawing Board
My model didn’t do very well on the outside data, but I now had additional data that I could incorporate into my model. For the Truthful portion of my new training set I used a 50/50 split of random samples of the Reuters news and the Guardian news. The size of the Truthful class was equal to the size of the Fake class to keep things balanced.
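Building the balanced training set looked roughly like this (toy DataFrames stand in for the real cleaned articles; the names are mine, not from the original code):

```python
import pandas as pd

# Toy stand-ins for the three sources
reuters_df = pd.DataFrame({"text": [f"reuters story {i}" for i in range(500)]})
guardian_df = pd.DataFrame({"text": [f"guardian story {i}" for i in range(500)]})
fake_df = pd.DataFrame({"text": [f"fake story {i}" for i in range(400)]})

n_true = len(fake_df)  # match the Fake class size to keep things balanced
true_df = pd.concat([
    reuters_df.sample(n_true // 2, random_state=42),            # 50% Reuters
    guardian_df.sample(n_true - n_true // 2, random_state=42),  # 50% Guardian
])
train_df = pd.concat([true_df.assign(label=1), fake_df.assign(label=0)])
train_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
```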
Again I used a Random Forest Classifier with tf/idf on stopwords only. The accuracy and F1 scores dropped to 0.89, but the trade-off between accuracy and generalizability was worth it, because I wanted to be able to use the model on stories from sources not included in the original dataset.
I re-tested my model on the full set of politics stories from The Guardian. This time the accuracy was 85%.
I was satisfied with the performance of my model given the time and hardware constraints of my project. If more of both of those were available I would want to expand my dataset and attempt a deep learning solution that could be even more accurate.
If it seems too easy, it probably is, and you need to find out why. In my case, not having representative data resulted in a model that overfit the initial dataset. Luckily, some additional data was readily available so I could test and then improve my model.
For additional information and my full code and project, check out the GitHub repo.