Does Tableau fit into the exploratory data analysis portion of the data science life cycle? Or is it something to leave until the end when your findings need to be presented? Like most things in data science, the answer is probably “it depends”.
Last year I completed a project to use machine learning to classify the functional status of water points in Tanzania. I wrote about using a Random Forest Classifier with Imbalanced Data at the time.
EDA with Folium
An important element for understanding the dataset was the geospatial data for each water point. At the time, I used the Python package Folium to display the water points on a map and filtered the data using different parameters.
For example, below I am viewing a heatmap of water points that had a quantity value of dry.
import folium
from folium import plugins
import numpy as np
import pandas as pd

# filter the dataframe (data is the cleaned water point DataFrame loaded earlier)
df = data[data['quantity'] == 'dry']

# create a matrix of data points for the heat map
lat_long_matrix = df[['latitude', 'longitude']].to_numpy()

# create a folium map centered on Tanzania
map_ = folium.Map([-6.369, 34.8888], zoom_start=5)

# add a heatmap layer to the map
plugins.HeatMap(lat_long_matrix).add_to(map_)

# display the map in the notebook
map_
While this was a quick way to get a visual understanding of some of the data right within my Jupyter notebook, embedding multiple maps within a notebook used a lot of memory and slowed down the notebook considerably.
Also, given the amount of data and folium’s inability to handle large numbers of markers, I had no way to place color-coded markers on a map to show the distribution of the labels for a given segment of the data.
EDA with Tableau
What would have been nice is a map like the one above that, instead of a heatmap, had color-coded marks giving an easy-to-understand view of each label for any of the potential values of the quantity parameter. It turns out that this is fairly simple to do with Tableau, which can handle mapping larger amounts of data.
If you have access to a paid version of Tableau, great, but this can also be done with the free Tableau Public, provided you are all right with saving your visualisations publicly.
This requires a few steps, but is very straightforward. If you are completely unfamiliar with Tableau, I suggest doing some of their tutorials if you intend to follow along.
First, add a data connection, in this case a text file because my data is stored in a .csv file.
Once you’ve added the connection, your data will show up as a table in the lower right hand side.
I already cleaned and processed the data, so it’s all ready to be used in a visualization. Next add a worksheet, or use the blank one provided when you first start a project.
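If the cleaned data lives in a pandas DataFrame, writing it out as a .csv file for Tableau could look roughly like the sketch below. The handful of rows and the file name water_points_clean.csv are placeholders made up for illustration, and the label strings follow the wording used in this post rather than the exact values in the dataset.

```python
import pandas as pd

# Placeholder rows standing in for the cleaned water point data (made up for illustration).
data = pd.DataFrame({
    "latitude": [-3.10, -6.80, -9.25],
    "longitude": [37.40, 39.28, 34.77],
    "quantity": ["dry", "enough", "enough"],
    "status_group": ["non-functional", "functional", "functional needs repairs"],
})

# Keep only the columns needed for the map, so Tableau has less to load.
columns_for_tableau = ["latitude", "longitude", "quantity", "status_group"]

# index=False stops pandas from writing the row index as an extra column.
# The file name is a hypothetical placeholder.
data[columns_for_tableau].to_csv("water_points_clean.csv", index=False)
```

Dropping the index keeps an unnamed extra column from showing up in Tableau's list of fields.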
Next, put latitude on Rows and longitude on Columns, add status_group as a color mark and add quantity as a filter.
You can see all the columns from the dataset in the left column. In the above image I added several other filters besides the Quantity one that I’m focussing on now.
After setting the quantity filter to dry, which gives me the same subset of data that I used for the folium map above, the resulting map is below.
In this case red is non-functional, orange is functional needs repairs and blue is functional. As would be expected, almost all the dry water points are non-functional, but the few that are functional could use further investigation.
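What the colors suggest can also be checked numerically back in the notebook. Here is a minimal sketch, assuming the data is in a pandas DataFrame with the quantity and status_group columns used above. The rows are made up for illustration, so the shares it prints are not the real ones.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset is far larger.
data = pd.DataFrame({
    "quantity": ["dry", "dry", "dry", "dry", "enough", "enough", "enough", "enough"],
    "status_group": [
        "non-functional", "non-functional", "non-functional", "functional",
        "functional", "functional", "functional needs repairs", "non-functional",
    ],
})

# Share of each status label within each quantity value (each row sums to 1).
label_share = pd.crosstab(data["quantity"], data["status_group"], normalize="index")
print(label_share.round(2))

# The dry points that are still functional, which are the ones worth a closer look.
dry_but_functional = data[
    (data["quantity"] == "dry") & (data["status_group"] == "functional")
]
print(len(dry_but_functional))
```

The table gives the same information as the color split on the map, just without the geography.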
I could then easily switch the filter to enough to see the geographic distribution of those water points without any additional setup or resources.
Enough is 57% of the water points, so it’s hard to see any detail, but you can easily zoom in. Below I’ve zoomed into the area around Mt. Kilimanjaro where you can see that the eastern side of the mountain is dominated by non-functional wells.
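A rough pandas equivalent of the share calculation and the zoom is sketched below. The rows are made up, and the bounding box around Mt. Kilimanjaro (about half a degree either side of roughly 3.07° S, 37.35° E) is my own assumption, not something taken from the Tableau view.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "latitude": [-3.05, -3.20, -6.80, -9.25],
    "longitude": [37.45, 37.30, 39.28, 34.77],
    "quantity": ["enough", "enough", "dry", "enough"],
    "status_group": ["non-functional", "functional", "non-functional", "functional"],
})

# Fraction of all water points that take each quantity value.
quantity_share = data["quantity"].value_counts(normalize=True)
print(quantity_share)

# Assumed bounding box around Mt. Kilimanjaro, half a degree in each direction.
lat_min, lat_max = -3.57, -2.57
lon_min, lon_max = 36.85, 37.85

# Keep the "enough" points that fall inside the box, mimicking the zoomed-in view.
near_kilimanjaro = data[
    data["latitude"].between(lat_min, lat_max)
    & data["longitude"].between(lon_min, lon_max)
    & (data["quantity"] == "enough")
]
print(near_kilimanjaro["status_group"].value_counts())
```

A box like this cannot show which side of the mountain a point sits on, which is where the map view earns its keep.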
Doing some map-based EDA in Tableau allows for a greater understanding of the data than I was able to get with Python in a Jupyter notebook. The downside is that this part of the EDA is separate from your Jupyter notebooks, although references and/or screen grabs could be integrated into your notebooks to provide a cohesive story along with your analysis.
Thanks for reading this article. I hope you found it helpful.