Too Good to be True — NLP

“It can’t be this easy.” That’s what I said to myself when getting started on my latest NLP project. I had seen people claiming near-perfect metrics on the dataset I had chosen for Fake News classification, but I assumed they must have put in some real effort to get there. I soon found out that hitting those numbers was extremely simple, and that the harder thing was doing worse. At first that may sound like a strange goal, but my ultimate aim was not just to tell the two classes in this dataset apart; I wanted a model that generalizes well, and that may have to come at the price of lower accuracy, precision, and recall. What follows is the path I took.

The Initial Data

Red Flags

EDA and a Simple Heuristic

Percentage of Capital Letters in the Title

As you can see in the above visualization, there is very little overlap between the two classes. I could get very high classification accuracy from a heuristic based on this observation alone. Splitting the news stories at 13% capital letters in the title, with less being Truthful and more being Fake, gives an overall accuracy of 98% and F1 scores of 0.98. You can’t get much better than that; I wouldn’t even need a machine learning model. This is where the project started to seem like it had results that were too good to be true (or at least too good to be generalizable).
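
For concreteness, here is a minimal sketch of that heuristic, assuming the stories sit in a pandas DataFrame with hypothetical 'title' and 'label' columns (1 = Truthful, 0 = Fake, matching the labels used later in this post) and reading the percentage as the share of alphabetic characters that are uppercase:

import pandas as pd

# Hypothetical file and column names; the real dataset layout may differ.
df = pd.read_csv("news.csv")

def pct_capitals(title: str) -> float:
    # Share of alphabetic characters in the title that are uppercase.
    letters = [c for c in title if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

df["pct_caps"] = df["title"].apply(pct_capitals)

# Split at 13% capitals: at or below is called Truthful (1), above is called Fake (0).
df["heuristic_pred"] = (df["pct_caps"] <= 0.13).astype(int)
print((df["heuristic_pred"] == df["label"]).mean())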

While I could differentiate these two classes simply by looking at the percentage of capital letters in their titles, was that particular to this specific set of data? I first looked at the misclassified stories. The Fake stories misclassified as Truthful primarily capitalized the first letter of each word, plus acronyms. The Truthful stories misclassified as Fake were all short, capitalized only the first letter of the story and proper nouns, and contained a lot of capitalized acronyms. The misclassified stories in both classes had similar styles; depending on the length of the title, the percentage of capitals landed above or below 13%.

Next I looked at the correctly classified stories. The Truthful ones capitalized only the first letter of the title and proper nouns and used minimal acronyms. The Fake ones were full of words in all capitals, and every word was capitalized.

At this point I wondered whether Reuters’ style of capitalizing only the first word of the title and proper nouns was typical of what would be considered Truthful news. It didn’t take long to come up with an answer of ‘no’; the most obvious example of Truthful news that doesn’t follow that convention is the New York Times, whose style is to capitalize the first letter of every word. That being the case, the heuristic was not generalizable, and I had to look for a better way to predict the class of a news story.

Machine Learning

I started with a bag-of-words model using a Naive Bayes Classifier, looking at the title and text separately and together, with and without lemmatization, and with and without stopwords. All of these variants returned overall accuracy and F1 scores of 0.95 to 0.98. Again I felt that the results were too good to be true and that the model would not generalize.
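
As a rough sketch of one of those variants (title and text together, scikit-learn’s built-in English stopword list, no lemmatization; the column names and split parameters are my assumptions, not necessarily what the original notebook used):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Reuses the DataFrame from the earlier sketch; 'text' is assumed to hold the story body.
X = df["title"] + " " + df["text"]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
nb_model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
nb_model.fit(X_train, y_train)
print(classification_report(y_test, nb_model.predict(X_test)))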

I next trained a Random Forest Classifier on the same combinations as the Naive Bayes and got similar metrics. Random Forest also provides a list of the features it found most important, and I was hoping it would give me some insight into what the model was focusing on. Some of the top features were “said”, “video”, “image”, “u.s”, “!”, and “twitter-handle”, the last of which was my placeholder from anonymizing the Twitter handles. I didn’t know how all of these words worked together to classify a news story, but whether people were being quoted, whether there were videos, images, and Twitter handles, and where exclamation points were used clearly mattered a lot. From looking at a sample of the news stories I knew that videos, Twitter handles, and exclamation points were very typical of the Fake news in the dataset. The importance of “u.s” might indicate that one class had more stories about the U.S. versus international news, or simply that one class tended to write “US” instead of “U.S.”.
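
A sketch of how those importances can be pulled out of a fitted Random Forest, reusing the split from the previous sketch (the hyperparameters are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_bow, y_train)

# Pair each vocabulary term with its importance and print the ten largest.
terms = vectorizer.get_feature_names_out()  # recent scikit-learn versions
for idx in np.argsort(rf.feature_importances_)[::-1][:10]:
    print(terms[idx], rf.feature_importances_[idx])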

Maybe it really is that easy to differentiate the two classes of news, at least in this dataset. But what if a news story were about a topic outside this dataset, or featured different people or organizations? I wanted to find a way to remove words referencing specific people, organizations, companies, or jargon. I first tried using a Named Entity Recognizer (NER) to identify the people in the dataset so that I could remove them and retrain the model. Unfortunately, running the NER proved to be beyond the time and computing power I had available. As a proxy, I decided to try training the model using only stopwords. Stopwords are often removed when doing NLP, but by focusing only on them I hoped the model would be able to use the writing style of the news to differentiate the two classes. I found a larger set of stopwords on GitHub that combined several common stopword lists.

I trained a Random Forest Classifier on the Bag of Words using only the stopwords list and was still able to get an accuracy of 0.93 and a macro average F1 score of 0.93. This was a surprising but welcome result. Maybe I was onto something.
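
A sketch of that stopwords-only model, assuming the combined list lives in a local text file with one word per line (the file name is a placeholder) and reusing the earlier train/test split:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Hypothetical file holding the combined stopword lists found on GitHub.
with open("combined_stopwords.txt") as f:
    stop_words = sorted({line.strip().lower() for line in f if line.strip()})

# Restricting the vocabulary to stopwords means topic-specific words never
# reach the classifier; only "style" words get counted.
stopword_bow_rf = make_pipeline(
    CountVectorizer(vocabulary=stop_words),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
)
stopword_bow_rf.fit(X_train, y_train)
print(classification_report(y_test, stopword_bow_rf.predict(X_test)))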

A bag-of-words model is pretty primitive, so I tried the same stopwords-only features with tf-idf, again with a Random Forest. The results were marginally better.
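
The only change from the previous sketch is swapping the count vectorizer for a tf-idf one over the same restricted vocabulary (stop_words and the train/test split carry over from above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Same stopwords-only vocabulary, weighted by tf-idf instead of raw counts.
stopword_tfidf_rf = make_pipeline(
    TfidfVectorizer(vocabulary=stop_words),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
)
stopword_tfidf_rf.fit(X_train, y_train)
print(classification_report(y_test, stopword_tfidf_rf.predict(X_test)))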

Success, or so I thought

Only having a single source for the Truthful stories still didn’t sit well with me, so I looked for an easily available set of Truthful news stories that I could test my model on. I found a set of news stories from The Guardian that was also hosted on Kaggle. I selected only the stories that had to do with politics and ran my test. To my dismay, the results were little better than random chance, as you can see below.

# 0 is Fake
# 1 is Truthful
(array([0, 1]), array([5495, 7099]))

I wasn’t expecting accuracy as good as the model had achieved on its own test set, and probably a little worse given that The Guardian is based in the UK, even though it has plenty of coverage of events in the USA. But an accuracy of only 56% was unacceptable.
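
Since every story in the Guardian politics set is genuine news, accuracy here is simply the share of predictions that landed on the Truthful label:

# All Guardian politics stories are real news, so accuracy is the fraction
# of predictions that came out as Truthful (label 1).
fake_preds, truthful_preds = 5495, 7099
print(truthful_preds / (fake_preds + truthful_preds))  # ~0.56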

Back to the Drawing Board

Again I used a Random Forest Classifier with tf-idf on stopwords only. The accuracy and F1 scores had dropped to 0.89, but the trade-off of accuracy for generalizability was worth it, because I wanted to be able to use the model on stories from sources not included in the original dataset.

I re-tested my model on the full set of politics stories from The Guardian. This time the accuracy was 85%.

I was satisfied with the performance of my model given the time and hardware constraints of my project. If more of both were available, I would want to expand my dataset and attempt a deep learning solution that could be even more accurate.

Conclusion

For additional information and my full code and project, check out the GitHub repo.
