I recently finished a project using the dataset from Pump It Up: Mining the Water Table from DRIVENDATA. I’m going to walk through the Random Forest Classifier, which performed best of the classifiers I tested once I had tuned its hyperparameters.

I won’t go into it here, but there is a significant amount of data cleaning and feature selection to do before the data is ready for a model. …

Does Tableau fit into the exploratory data analysis portion of the data science life cycle? Or is it something to leave until the end, when your findings need to be presented? Like most things in data science, the answer is probably “it depends”.

Last year I completed a project that used machine learning to classify the functional status of water points in Tanzania. I wrote about using a Random Forest Classifier with Imbalanced Data at the time.

EDA with Folium

An important element for understanding the dataset was the geospatial data for each water point. …

A well-designed database with up-to-date documentation is the ideal, but if that is not the case and you are not in a position to improve the database design, what are some workarounds?

Consider the following scenario

You’re working as a freelance data scientist and are hired to do some analysis of a company’s data. That seems simple enough: you know how to use SQL and Python and how to conduct hypothesis tests, and you are told that you’ll be given a database schema.

The company has been offering discounts at different levels to some customers. As your first task, you’re asked…

With the release of the 2020 census data fast approaching, followed by the apportionment of congressional and state legislative districts, I decided to take the time to familiarize myself with some of the data already available, its format, how to acquire it, and some of the obstacles.

New York’s 23rd Congressional District

To make the exercise manageable, I decided to focus on my home congressional district, New York’s 23rd. The current configuration of the 23rd district went into effect in 2013 after the redistricting process following the 2010 census. …

I’m finishing up my Data Science Program at the Flatiron School, looking back at the five projects I did, and asking what ethical concerns I should have considered if the projects were going to be put into production and affect real people. While an introduction to machine learning ethics was part of the Flatiron program, integrating an ethical analysis into each project was beyond its scope.

Five Principles

I will be using the five principles that Floridi and Cowls put forth in their 2019 paper A Unified Framework of Five Principles for AI in Society to do a top-level review of…

After finishing a recent Fake News Classification Project, I wanted to build a simple web app that used my model. Because of their ease of implementation, I chose to build the app with Streamlit and deploy it with Heroku. While there is a lot more that can be done with both platforms, what follows are step-by-step instructions to get started with a simple project, using my app as an example.

Install Streamlit

Please note that Streamlit requires Python 3.6 or later.

pip install streamlit

After installation is complete, test it with the built-in ‘hello world’ app.

streamlit hello

“It can’t be this easy.” That’s what I said to myself when getting started on my latest NLP project. I had seen some people claiming near-perfect metrics with the dataset I had chosen for Fake News classification, but I assumed that they must have had to put in some effort. I soon found out that it was extremely simple, and that the harder thing was to do worse. At first that may sound like a strange goal, but my ultimate goal was not just to tell the difference between the two classes in this…

When doing any Natural Language Processing (NLP), you will need to preprocess your data. In the following example, I will be working with a Twitter dataset that is available from CrowdFlower and hosted on data.world.

Review the data

There are many things to consider when choosing how to preprocess your text data, but before you do that you will need to familiarize yourself with your data. This dataset is provided in a .csv file; I loaded it into a dataframe and proceeded to review it.

import pandas as pd
dataframe = pd.read_csv(data_file)
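As a rough illustration of what a first review of a .csv file can cover, here is a minimal sketch that uses only the Python standard library rather than a dataframe. The sample rows, the column names (tweet_text, sentiment), and the review_rows helper are all hypothetical stand-ins for illustration; they are not taken from the CrowdFlower dataset or from my project code.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file; column names are assumptions.
SAMPLE_CSV = """tweet_text,sentiment
"Loving the new phone!",positive
"Battery died again.",negative
,neutral
"Great keynote today",positive
"""


def review_rows(rows, label_column):
    """Summarize row count, column names, missing values, and label counts."""
    rows = list(rows)
    columns = list(rows[0].keys()) if rows else []
    # Count cells that are empty or whitespace-only in each column.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    # Count how many rows carry each label value.
    labels = Counter(row[label_column] for row in rows)
    return {
        "n_rows": len(rows),
        "columns": columns,
        "missing": missing,
        "labels": labels,
    }


summary = review_rows(
    csv.DictReader(io.StringIO(SAMPLE_CSV)), label_column="sentiment"
)
print(summary)
```

Checks like these (how many rows, which columns have gaps, how balanced the labels are) tend to shape the preprocessing choices that follow, whichever tool is used to run them.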

I recently completed a project involving a multivariate linear regression to predict housing prices, and guess what…people really don’t like waiting for the bathroom.

That seems pretty obvious, but how much don’t they like to wait? It turns out that they may be willing to pay tens of thousands of dollars to avoid a daily wait for the bathroom for the next 13 years¹.

I’m going to walk through how I came to the above conclusion about how patient Americans are when it comes to the bathroom.

Starting the Linear Regression

I wrote my project in Python using a Jupyter notebook, and stored my…

I recently finished a project to determine which movies perform best at the box office and to make recommendations based on my findings. The definition of best was open to interpretation: it could have been profits, popularity, accolades, perceived social good, etc., or any combination of those attributes. I chose to focus on profits because I believed that would be the real interest of the stakeholders.

Part of my analysis of movie profits involved the production budgets and profits of movies over the last 20 years.

While many datasets do not contain monetary data, if you find yourself analyzing a…

