I recently finished a project using the dataset from Pump It Up: Mining the Water Table from DRIVENDATA. I’m going to walk through the Random Forest classifier, the one of the classifiers I tested that I found performed best after tuning its hyperparameters.
I won’t go into it here, but there is a significant amount of data cleaning and feature selection to do before the data is ready for a model. …
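The overall workflow can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the project's actual features or tuned hyperparameter values; the numbers shown are placeholders.

```python
# Minimal Random Forest workflow sketch; the data and hyperparameters
# here are illustrative stand-ins, not the project's actual values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for the cleaned Pump It Up features and labels
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Example hyperparameters of the kind one would tune (e.g. with GridSearchCV)
clf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```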
I’m finishing up my Data Science Program at the Flatiron School, looking back at the five projects I did, and considering what ethical concerns I should have addressed if the projects were going to be put into production and affect real people. While an introduction to machine learning ethics was part of the Flatiron program, integrating an ethical analysis into each project was beyond its scope.
I will be using the five principles that Floridi and Cowls put forth in their 2019 paper A Unified Framework of Five Principles for AI in Society to do a top-level review of my projects. The principles are beneficence, non-maleficence, autonomy, justice, and explicability. Beneficence means “promoting well-being, preserving dignity, and sustaining the planet”. Non-maleficence means “privacy, security and ‘capability caution’”. Autonomy refers to the “balance between the decision-making power we retain for ourselves and that which we delegate to artificial agents”. Justice means “promoting prosperity, preserving solidarity, avoiding unfairness”. …
After finishing a recent Fake News Classification Project, I wanted to build a simple webapp that used my model. Because of their ease of implementation, I chose to build the app with Streamlit and deploy it with Heroku. While there is a lot more that can be done with both platforms, what follows are step by step instructions to get started with a simple project using my app as an example.
Please note that Streamlit requires Python 3.6 or later.
pip install streamlit
After the installation completes, test it with the built-in ‘hello world’ app:

streamlit hello
“It can’t be this easy.” That’s what I said to myself when getting started on my latest NLP project. I had seen some people claiming near-perfect metrics with the dataset I had chosen for Fake News classification, but I assumed that they must have had to put in some effort. I soon found that getting those results was extremely simple; the harder thing was doing worse. At first that may sound like a strange goal, but my ultimate goal was not just to tell the difference between the two classes in this dataset, but to have a model that generalizes well, and that may have to come at the price of lower accuracy, precision, and recall. …
When doing any Natural Language Processing (NLP) you will need to pre-process your data. In the following example I will be working with a Twitter dataset that is available from CrowdFlower and hosted on data.world.
There are many things to consider when choosing how to preprocess your text data, but before you do that you will need to familiarize yourself with your data. This dataset is provided in a .csv file; I loaded it into a dataframe and proceeded to review it.
import pandas as pd

dataframe = pd.read_csv(data_file)
dataframe.head()
Just looking at the first five rows, I can notice several things:
* In these five tweets the Twitter handles have been replaced with @mention
* They all have the hashtag #SXSW or #sxsw
* There is an HTML character reference, &amp;, in place of the ampersand
* There are some abbreviations: hrs, Fri
* There are some people’s real names, in this case public…
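Observations like these translate directly into preprocessing steps. The sketch below shows one way to handle a few of them; the exact choices in the project may differ, and the sample tweet is invented for illustration.

```python
# Sketch of preprocessing steps suggested by the observations above;
# the project's actual choices may differ.
import html
import re

def preprocess_tweet(text):
    text = html.unescape(text)               # &amp; -> &
    text = text.lower()                      # normalize #SXSW / #sxsw
    text = re.sub(r"@\w+", "", text)         # drop the @mention placeholders
    text = re.sub(r"#(\w+)", r"\1", text)    # keep hashtag words, drop the '#'
    text = re.sub(r"\s+", " ", text).strip() # collapse leftover whitespace
    return text

# Invented example tweet
print(preprocess_tweet("@mention Great demo at #SXSW! Waited 3 hrs &amp; loved it"))
```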
I recently completed a project involving a multivariate linear regression to predict housing prices, and guess what…people really don’t like waiting for the bathroom.
That seems pretty obvious, but how much don’t they like to wait? It turns out that they may be willing to pay tens of thousands of dollars to avoid a daily wait for the bathroom for the next 13 years¹.
I’m going to walk through how I came to the above conclusion about how patient Americans are when it comes to the bathroom.
I wrote my project in Python using a Jupyter notebook, and stored my data in a Pandas dataframe object. After cleaning my data and splitting it into training and test sets, I proceeded to make a linear regression model. I chose to use Statsmodels because I liked the summary report it produces. …
I recently finished a project to determine what the best performing movies at the box office are and make recommendations based on my findings. The definition of “best” was open to interpretation: it could have been profits, popularity, accolades, perceived social good, etc., or any combination of those attributes. I chose to focus on profits because I believed that would be the real interest of the stakeholders.
Part of my analysis of movie profits involved the production budgets and profits of movies over the last 20 years.
While many datasets do not contain monetary data, if you find yourself analyzing a dataset with historical monetary data like I was, you will want to adjust those values for inflation. Before I show you how I adjusted the data for inflation, let’s look at why it is worth adjusting for inflation at all. …
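The mechanics of the adjustment are simple: multiply each nominal value by the ratio of the price index in the target year to the index in the value's own year. Here is a minimal sketch; the index values and movie rows below are made-up placeholders, not real CPI figures or real budgets.

```python
# Sketch of adjusting nominal dollars to a common year with a CPI-style
# index; the index values and budgets are made-up placeholders.
import pandas as pd

# Hypothetical price index by year (NOT real CPI data)
index_by_year = {2000: 80.0, 2010: 100.0, 2020: 120.0}

movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2000, 2010, 2020],
    "budget": [50_000_000, 50_000_000, 50_000_000],
})

# adjusted = nominal * (index_target_year / index_original_year)
target = index_by_year[2020]
movies["budget_2020"] = movies["budget"] * target / movies["year"].map(index_by_year)
print(movies)
```

With these placeholder numbers, a $50M budget from 2000 becomes $75M in 2020 dollars, while the 2020 budget is unchanged, which is exactly why nominal comparisons across decades are misleading.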