In our last post we interpreted a data set with pandas to gain some insights from it. In this post, we will do the same, but instead of interpreting the raw data we will use visualizations to help us determine patterns in the data. But before we dive into the implementation, let’s review the benefits of visualizing data.
Why Visualize Your Data?
Visualizing data can help identify patterns and anomalies that would otherwise be challenging to spot in raw data. If you have a data set with a million rows, it will be tedious to analyze all that information line by line. Even sorting or filtering the data may not show anything out of the ordinary. But when the data is plotted, it is easier to recognize outliers.
An outlier is a value in your data that is either extremely high or low in comparison with the other data.
Why does finding outliers matter? The outliers may affect how you get insights from your data and can lead to incorrect results. Let’s look at a quick example of how this can happen.
Finding Outliers
Let’s understand the importance of finding outliers by using a fictitious, but practical example. You’re working for your new company, AwesomeCo, and your boss gives you a data set of the last one thousand people who have subscribed to the company’s product and whether that customer has churned or not.
If a customer has churned, that just means they have cancelled their membership or are no longer active with a product.
You look through the data and, being the creative data scientist that you are, wonder if there’s a relationship between whether a customer churned and that customer’s annual income. We will be using a bit of statistics.
This example assumes numpy has been imported with: import numpy as np
From our data set let’s create a subset that lists incomes:
incomes = [50000, 55000, 60000, 44000, 42500]
With numpy we can easily get the mean (the average of the items in our data) and the median (the middle value of our data after it has been sorted). (The formatting just makes the output easier to read.)
print("Mean - ${:8,.2f}".format(np.mean(incomes)))
print("Median - ${:8,.2f}".format(np.median(incomes)))
The mean ($50,300) and median ($50,000) here are roughly the same, which is what we expect. Now let’s add an outlier who earns far more than the average person and do the same calculations as above.
incomes.append(1000000)
print("Mean - ${:8,.2f}".format(np.mean(incomes)))
print("Median - ${:8,.2f}".format(np.median(incomes)))
After our outlier was introduced, the mean was greatly affected: it jumped to about $208,583. The median, however, only moved to $52,500, roughly where it was before, which gives us a better idea of an “average” income even with the outlier included.
Once outliers are found, there are several options for managing them. One option is to exclude them. This is a valid choice when there are only a few outliers and it is determined that they do not impact the results. Conversely, it may be determined that the outliers provide useful insights and must be included. The decision depends on the results that need to be expressed.
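As a rough illustration of the "find, then exclude" option, here is a minimal sketch that flags outliers with the common 1.5 × IQR rule of thumb. This rule is an assumption chosen for the example, not something the AwesomeCo scenario prescribes, and the variable names are hypothetical.

```python
import numpy as np

# The incomes from the example above, including the added outlier.
incomes = np.array([50000, 55000, 60000, 44000, 42500, 1000000])

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Rule of thumb (an assumption, not the only option): anything more than
# 1.5 * IQR outside the quartiles is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (incomes < lower) | (incomes > upper)

outliers = incomes[is_outlier]
kept = incomes[~is_outlier]

print("Outliers:", outliers)
print("Mean without outliers - ${:8,.2f}".format(np.mean(kept)))
```

Whether dropping the flagged values is appropriate still depends on the question being asked; the sketch only shows how they could be separated out.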
Anscombe’s Quartet
The previous example showed why outliers matter in your data, but it doesn’t quite answer why it matters to visualize your data. A famous data set that shows this is Anscombe’s Quartet: four data sets that look nearly identical if you compare their means or other descriptive statistics, and whose differences are hard to spot in the raw numbers. Let’s look at what each of these data sets looks like when we visualize it.
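One way to see this for yourself is the sketch below. It hard-codes the published Anscombe values, prints the mean and correlation of each data set, and then draws the four scatter plots side by side. The layout choices (a 2 × 2 grid, shared axes) and the variable names are our own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet: data sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# The descriptive statistics are almost identical across all four sets.
stats = {}
for name, (x, y) in quartet.items():
    stats[name] = (np.mean(x), np.mean(y), np.corrcoef(x, y)[0, 1])
    print("{}: mean x = {:.2f}, mean y = {:.2f}, r = {:.3f}".format(name, *stats[name]))

# The plots, however, look nothing alike.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title("Data set " + name)
fig.tight_layout()
```

In a notebook with inline plotting enabled the figure appears after the cell runs; in a plain script you would call plt.show() at the end.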
Plotting in Python
Matplotlib is one of the most used plotting packages in Python. It’s so popular that pandas uses it as the default backend for its own plotting methods. It also integrates with Jupyter Notebooks, so plots can be displayed directly when a cell is evaluated.
You may also see in other examples, including ours below, that a package called Seaborn is imported when plotting. Seaborn is a wrapper on top of Matplotlib that adds enhancements such as themes to make plots look prettier, and it offers more statistical plots.
While the plots in this post are bar charts (since we are dealing with just categorical data), Matplotlib can do much more than that, from line plots to contour plots. Matplotlib can even do animations!
Visualizing our Sales Data for Insights
Now that we know why it’s critical to visualize our data, let’s create visualizations for the sales data from our previous post.
To do that, we need to import the required libraries and load our data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
We’re importing seaborn here to improve the look of our graphs in matplotlib. (In seaborn 0.8 and later, the import alone no longer restyles plots; calling seaborn.set_theme(), or seaborn.set() in older releases, applies the seaborn look.) Also, we use %matplotlib inline to instruct our Jupyter notebook to render plots inline in the notebook output instead of opening them in a popup window. This removes the need to write plt.show() every time we want a plot to appear.
Next, data is loaded into pandas just like in the previous post.
df = pd.read_csv("./OfficeSupplies.csv")
Let’s take another look at our data using the head function (df.head()). The head function displays the first five rows of our data along with the column names, which helps in determining what data we want to use for visualizations.
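If you don’t have the CSV file at hand, the sketch below shows what head does on a small stand-in data frame. The column names match the ones used in this post, but every row (including the regions and several of the rep names) is made up for illustration and is not taken from OfficeSupplies.csv.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file has many more rows.
sample = pd.DataFrame({
    "Region": ["East", "Central", "Central", "West", "East", "Central"],
    "Rep": ["Richard", "Matthew", "Bill", "James", "Susan", "Alex"],
    "Item": ["Binder", "Pen Set", "Pencil", "Binder", "Pen", "Desk"],
    "Units": [60, 50, 36, 27, 56, 5],
    "Unit Price": [4.99, 4.99, 1.99, 19.99, 2.99, 125.00],
})

first_rows = sample.head()    # first five rows by default
first_three = sample.head(3)  # or ask for a specific number of rows

print(first_rows)
print(sample.dtypes)          # column types help decide what can be summed or plotted
```

Checking the column types alongside head is a quick way to see which columns are numeric and which are categories to group by.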
The first question from our previous post was which rep sold the most units.
df_units = df[["Rep", "Units"]]
rep_plot = df_units.groupby("Rep").sum().plot(kind='bar')
rep_plot.set_xlabel("Rep")
rep_plot.set_ylabel("Units")
Using matplotlib within pandas, we can group by “Rep” and get the sum of the values. Then, using the plot function, we indicate that we want a bar chart. The next two lines help describe what the graph is showing; they set the X-axis and Y-axis labels.
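It can help to look at the grouped table that feeds the bar chart before plotting it. The sketch below does the same group-and-sum step on a few made-up rows; the numbers are hypothetical and only the rep names come from this post.

```python
import pandas as pd

# Hypothetical orders: one row per sale.
sample = pd.DataFrame({
    "Rep": ["Richard", "Matthew", "Richard", "Susan", "Matthew"],
    "Units": [60, 50, 36, 27, 5],
})

# One row per rep, with that rep's units added together.
units_by_rep = sample.groupby("Rep").sum()
print(units_by_rep)

# The tallest bar in the chart corresponds to this index label.
top_rep = units_by_rep["Units"].idxmax()
print("Most units sold by:", top_rep)
```

Calling units_by_rep.plot(kind="bar") on this table would draw one bar per rep, which is what the chained call above does in a single line.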
As we saw from the previous post, Richard sold the most units.
I also want to answer the other question we had in our previous post: who sold the most in total price? To do that, we need to create a new column and build our visualization on the updated data frame.
df["Total Price"] = df["Units"] * df["Unit Price"]
# numeric_only=True keeps text columns such as Item and Region out of the sum
df.groupby("Rep").sum(numeric_only=True).sort_values("Total Price", ascending=False).plot(kind='bar')
Here we are doing the same query on the data, but now we’re adding the plot command after it. Because we sorted the values, we can see that Matthew earned slightly more than Susan did. We can also tell from this visualization that Matthew sold around the same number of units as Susan, but what he did sell had a greater unit price than what Susan sold.
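To make the derived column and the sort concrete, here is a minimal sketch on made-up numbers (the rep names come from this post; the units and prices are hypothetical). It also computes an average price per unit, which is our own addition and one possible way to back up the unit-price comparison with a number.

```python
import pandas as pd

# Hypothetical sales rows.
sample = pd.DataFrame({
    "Rep": ["Matthew", "Susan", "Matthew", "Susan"],
    "Units": [10, 8, 4, 6],
    "Unit Price": [20.0, 10.0, 5.0, 20.0],
})

# Row-wise revenue: pandas multiplies the two columns element by element.
sample["Total Price"] = sample["Units"] * sample["Unit Price"]

# Sum per rep, then sort so the highest earner comes first.
totals = (
    sample.groupby("Rep")
    .sum(numeric_only=True)
    .sort_values("Total Price", ascending=False)
)

# Average price per unit sold (an addition for illustration).
totals["Avg Unit Price"] = totals["Total Price"] / totals["Units"]
print(totals)
```

In this toy data both reps sell 14 units, yet Matthew ends up ahead on total price because his average price per unit is higher.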
Speaking of units, let’s take a look at which item has brought in the most in total price.
df_items = df[["Item", "Total Price"]]
df_items.groupby("Item").sum().plot(kind="bar")
Looks like binders are selling very well. Also, pen sets seem to be greatly outperforming pens and pencils. That could be an important detail to report back to the sales team to help them direct their marketing efforts towards the items that perform well or to increase their marketing on items that don’t sell well.
Now, let’s take a look at the regions. First, let’s find what region sold the most.
df_region = df[["Region", "Total Price"]]
df_region.groupby("Region").sum().plot(kind="bar")
Just as we saw before, the Central region is outperforming the others. Now let’s visualize each rep in each region. (Big thanks to StackOverflow for assisting with a solution for this visualization.)
# Sum the numeric columns for each (Region, Rep) pair, then regroup the totals by region.
group = df.groupby(["Region", "Rep"]).sum(numeric_only=True)
total_price = group["Total Price"].groupby(level=0, group_keys=False)
# Keep up to five reps per region, sorted from highest to lowest total price.
gtp = total_price.nlargest(5)
ax = gtp.plot(kind="bar")
# Draw a separator line and a centered title for each region.
count = gtp.groupby("Region").count()
cs = count.cumsum()
for i in range(len(count)):
    title = count.index.values[i]
    ax.axvline(cs.iloc[i] - .5, lw=0.8, color="k")
    ax.text(cs.iloc[i] - (count.iloc[i] + 1) / 2., 1.02, title, ha="center",
            transform=ax.get_xaxis_transform())
# Shorten the x tick labels from "(Region, Rep)" to just the rep name.
ax.set_xticklabels([l.get_text().split(", ")[1][:-1] for l in ax.get_xticklabels()])
Here we were able to group the reps into their respective regions and sort the bars by total price. We can see that the Central region has two to three times as many salespeople as the other regions. One reason for this could be that the Central region is geographically larger than the other regions. Whatever the reason, it may still be worthwhile to inform the sales team about this finding. Small insights can still produce big results.
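The grouping logic behind that chart is easier to follow on a tiny example without the plotting calls. The sketch below uses made-up regions, placeholder rep names, and hypothetical totals to show how the top reps per region are selected and where the separator lines between regions end up.

```python
import pandas as pd

# Hypothetical totals: three reps in Central, two in East, one in West.
sample = pd.DataFrame({
    "Region": ["Central", "Central", "Central", "East", "East", "West"],
    "Rep": ["A", "B", "C", "D", "E", "F"],
    "Total Price": [300.0, 100.0, 200.0, 50.0, 150.0, 75.0],
})

# One total per (Region, Rep) pair.
totals = sample.groupby(["Region", "Rep"])["Total Price"].sum()

# Within each region, keep the two largest totals, highest first.
# group_keys=False keeps the index as (Region, Rep) instead of repeating Region.
top2 = totals.groupby(level=0, group_keys=False).nlargest(2)
print(top2)

# Bars are drawn at x = 0, 1, 2, ...; a region that ends after n bars
# gets its separator line at n - 0.5.
counts = top2.groupby(level="Region").count()
boundaries = (counts.cumsum() - 0.5).tolist()
print(boundaries)
```

The cumulative count is what lets the loop in the plotting code place each vertical line just after the last bar of a region.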
At this point, we have gone over the importance of visualizing our data. By using questions and some code from our previous post, we were able to create visualizations that gave us insights into our sales data. In the next post, we’ll use our sales data again, but we will gain our insights much faster using Power BI.