I am getting a little behind in writing future posts. I got an invitation to take part in Kaggle’s 30 Days of ML daily assignments, and I have the easy stuff done: the Python, Intro to Machine Learning and Intermediate Machine Learning courses. Online courses tend to be relatively easy to complete. It’s learning enough from them that is the hard part, and I don’t think I have done that well with the last two. The really hard part, completing a machine learning competition, is yet to come. But I have spent a fair bit of time completing all the exercises, and this blog has suffered. Hopefully I won’t have to miss a weekly post or more.

Where to Now

Last time I wasn’t too sure what to look at next. I did briefly consider having a look at what help pandas could give when working with time series. After all, time series analysis has been a big part of business modelling for decades, and many a spreadsheet has been built for that purpose since computers were first put to business use. But that seemed to me like a fairly big topic. I also figured that if I ever looked into a time series problem, most of it could be covered at that time.

So I have decided to take another look at data visualization. You know, one of the first things you should do when investigating a new dataset. Not too sure how to proceed, but such is life.

I expect I will mostly use Matplotlib, the granddaddy of data visualization libraries. I may also use Seaborn. Seaborn was born out of some of the frustrations users had with the demanding requirements of Matplotlib. It provides an API on top of Matplotlib that makes life a touch easier, offering a goodly number of simple functions for common plot types, with reasonable defaults.

As to what datasets to look at? Well, we will most likely start with the Iris dataset. Perhaps look at some aspects of the Titanic dataset. Then…

The Basics

Importing

import matplotlib as mpl
import matplotlib.pyplot as plt

In a good many of the earlier posts, my scripts imported only pyplot and never matplotlib itself.

Styles

As of Matplotlib Version 1.4 (August 2014) there are a number of stylesheets available. Try plt.style.available[:5]. There are two approaches to setting styles:

at the top of the workbook set a default style for the rest of the workbook: plt.style.use('classic')
use a context manager:
with plt.style.context('fivethirtyeight'):
    make_a_plot()

I will set a default of classic, but likely use fivethirtyeight within a context manager for a nicer, bolder look when appropriate.
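
To convince myself the context manager really is temporary, here is a minimal sketch. It assumes a headless backend (Agg) so it also runs outside a notebook, and it peeks at one rcParams entry, axes.facecolor, purely as an illustration of a setting the stylesheet changes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs as a plain script
import matplotlib.pyplot as plt

# Stylesheet names known to this Matplotlib install (the list varies by version).
print(plt.style.available[:5])

# Remember one setting before entering the context.
before = plt.rcParams["axes.facecolor"]

# Inside the block the fivethirtyeight settings are active...
with plt.style.context("fivethirtyeight"):
    inside = plt.rcParams["axes.facecolor"]
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    plt.close(fig)

# ...and on exit the previous settings are restored.
after = plt.rcParams["axes.facecolor"]
print(before, inside, after)
```

If that behaves as I expect, the first and last values printed match and only the middle one differs.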

Displaying Plots

From a script, as we have seen repeatedly, plt.show() is the go-to. The function starts an event loop, looks for all currently active figure objects, and displays them in an interactive window or windows.

For a Jupyter notebook, add one of these two lines at the top of your notebook:

%matplotlib notebook, for interactive plots embedded within the notebook
%matplotlib inline, for a static image of the plot embedded in the notebook

I have to date only used the latter, so I am not sure what an interactive plot would provide. By the way, the latter generates a PNG image.

Saving a Figure to a File

Simple enough. The code below should create a file with a PNG image of the plot in the current working directory. A variety of other file formats, usually inferred from the file extension, are also available.

fig = plt.figure()
fig.savefig('plot_1.png')

Or perhaps more simply:

plt.savefig('./plot_1.png')
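
As a rough sketch of the "format inferred from the extension" behaviour, the snippet below saves the same figure twice. The temporary directory and the file names are just placeholders for this example, and the Agg backend is an assumption so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

# File formats the current backend says it can write, keyed by extension.
supported = fig.canvas.get_supported_filetypes()
print(sorted(supported))

# Placeholder output location for this sketch.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "plot_1.png")
svg_path = os.path.join(out_dir, "plot_1.svg")

fig.savefig(png_path, dpi=150)  # format inferred from ".png"
fig.savefig(svg_path)           # format inferred from ".svg"
plt.close(fig)

print(os.path.exists(png_path), os.path.exists(svg_path))
```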

Basic Line Plots

Let’s look at the simplest plot you are likely to see. We will plot the sine function — a number of times. And look at some of the various functions/parameters available to us.

Lots of stuff going on here. After the (usual) imports, we tell the notebook to include an inline static image of all plots. Then we tell Matplotlib to use the ‘classic’ stylesheet as a default.

In [1]:

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
plt.style.use('classic')

Waves #1

Now we go on to plot our waves. First we add a full-window axes to our current figure (in this case, created by default). Then, using the concise code approach, we:

set the x-axis and y-axis limits,
specify labels for the x-axis and y-axis, and
give the plot a title.

Next we create an ndarray with 250 equally spaced values between 0 and 10 inclusive. These are the x-values for our plot.
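
A quick sanity check on that array, as a minimal sketch: both endpoints are included, and 250 points means 249 gaps between them.

```python
import numpy as np

x = np.linspace(0, 10, 250)

print(x.shape)       # number of points
print(x[0], x[-1])   # both endpoints are included by default

# 250 points leave 249 equal gaps, so the spacing is 10 / 249.
step = x[1] - x[0]
print(step, np.isclose(step, 10 / 249))
```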

Finally we plot sine curves specifying different colours and linestyles for each plot. The colour specifications use most of the various value types available to us.

That is: a Matplotlib named colour, the shortcode for a named colour, a grayscale value (0-1, lower is darker), ye olde hex RGB, an RGB tuple (values 0-1 for each element) and, finally, an HTML colour name.
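
If you are curious what each of those specifications resolves to, matplotlib.colors can convert them to RGBA tuples. This is only a small sketch using the same six values that appear in the plot code.

```python
from matplotlib import colors as mcolors

# The six colour specifications used for the sine curves.
specs = ["blue", "g", "0.25", "#f86943", (0.2, 1.0, 0.3), "tomato"]

# Convert each one to an (r, g, b, a) tuple with components in 0-1.
rgba = [mcolors.to_rgba(spec) for spec in specs]

for spec, value in zip(specs, rgba):
    print(f"{spec!r:>18} -> {value}")

# And back to a hex string, for the HTML colour name.
tomato_hex = mcolors.to_hex("tomato")
print(tomato_hex)
```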

In [2]:

# line plots: let's plot a bunch of sine waves
# fig = plt.figure()
ax = plt.axes()
ax.set(xlim=(0, 10), ylim=(-1.25, 1.25),
       xlabel='x', ylabel='sin(x)',
       title='A Simple Plot')
x = np.linspace(0, 10, 250)
ax.plot(x, np.sin(x - 0), color='blue', linestyle='-')
ax.plot(x, np.sin(x - 1), color='g', linestyle='--')
ax.plot(x, np.sin(x - 2), color='0.25', linestyle='-.')
ax.plot(x, np.sin(x - 3), color='#f86943', linestyle=':')
ax.plot(x, np.sin(x - 4), color=(0.2,1.0,0.3), linestyle='-')
ax.plot(x, np.sin(x - 5), color='tomato', linestyle='--')

Out[2]:

[<matplotlib.lines.Line2D at 0x24021937d30>]

Do note that we could also have specified axis limits, labels and a title with individual definitions.

plt.xlim(0, 10)
plt.ylim(-1.25, 1.25)
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)")

Style Context

Since we mentioned it above, let’s try using a style context for the same plot. This time we’ll go with the default colours provided by the stylesheet — fivethirtyeight. Oh, yes and let’s try saving the plot to an image file (PNG).

In [3]:

# let's try that with a style context and default colours
# everything that should pick up the style is indented under the with block
with plt.style.context('fivethirtyeight'):
    ax = plt.axes()
    ax.set(xlim=(0, 10), ylim=(-1.25, 1.25),
           xlabel='X', ylabel='sin(X)',
           title='Waves')
    x = np.linspace(0, 10, 250)
    ax.plot(x, np.sin(x - 0), linestyle='-')
    ax.plot(x, np.sin(x - 1), linestyle='--')
    ax.plot(x, np.sin(x - 2), linestyle='-.')
    ax.plot(x, np.sin(x - 3), linestyle=':')
    ax.plot(x, np.sin(x - 4), linestyle='-')
    ax.plot(x, np.sin(x - 5), linestyle='--')
    plt.savefig('mpl_1_waves.png')

Out [3]:

Saved Image

Let’s confirm the image was saved by displaying it below. I had to cheat a bit, as I moved the image into a subdirectory in the page bundle for this post. So, the markdown below doesn’t match that in the post’s related notebook.

[Image: Sine Waves]

Scatter Plots

Okay, on to another commonly used plot type: the scatter plot. Rather than joining each point with a line segment, each point is displayed individually, with a variety of options for the marker used to display it. We will use the Iris dataset here.

In [4]:

# scatter plots using Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
# print(iris.data[:5])
# specify feature index in data set
x_ndx = 0
y_ndx = 1
# didn't like the look of classic stylesheet
with plt.style.context('default'):
    # colour each point by its species code (0, 1 or 2)
    scatter = plt.scatter(iris.data[:, x_ndx], iris.data[:, y_ndx], c=iris.target)
    plt.xlabel(iris.feature_names[x_ndx])
    plt.ylabel(iris.feature_names[y_ndx])
    plt.tight_layout()
    # build one legend handle per species, using the scatter's own colour map
    labels = np.unique(iris.target)
    handles = [plt.Line2D([],[],marker="o", ls="",
                          color=scatter.cmap(scatter.norm(yi))) for yi in labels]
    plt.legend(handles, iris.target_names)

There is of course always more than one way to look at pretty much anything.

In [5]:

# Now let's get more creative
# we set size of each point based on petal width
sz_ndx = 3
with plt.style.context('default'):
    # keep the return value so the legend colours come from this plot
    scatter = plt.scatter(iris.data[:, x_ndx], iris.data[:, y_ndx], alpha=0.2,
                          s=100*iris.data[:, sz_ndx], c=iris.target)
    plt.xlabel(iris.feature_names[x_ndx])
    plt.ylabel(iris.feature_names[y_ndx])
    plt.tight_layout()
    labels = np.unique(iris.target)
    handles = [plt.Line2D([],[],marker="o", ls="",
                          color=scatter.cmap(scatter.norm(yi))) for yi in labels]
    plt.legend(handles, iris.target_names)
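
Building the legend handles by hand works, but recent Matplotlib versions (3.1 and later, as far as I know) also provide legend_elements() on the object returned by scatter. Here is a minimal sketch of that alternative. To keep it self-contained it uses made-up random data and made-up group names rather than the Iris dataset, and it assumes the headless Agg backend.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in data: three groups of 20 points each.
rng = np.random.default_rng(0)
target = np.repeat([0, 1, 2], 20)
x = rng.normal(loc=target, scale=0.3)
y = rng.normal(loc=target * 0.5, scale=0.3)
names = ["group a", "group b", "group c"]  # hypothetical labels

fig, ax = plt.subplots()
scatter = ax.scatter(x, y, c=target)

# One handle per unique value in c; we supply our own label text.
handles, auto_labels = scatter.legend_elements()
legend = ax.legend(handles, names)

print(len(handles), [t.get_text() for t in legend.get_texts()])
plt.close(fig)
```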

Boxplots

Sometimes it is helpful to understand something about the variability or dispersion of your data. Boxplots are a handy way to visualize the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile and maximum.

One of their primary uses is to identify outliers in the data. You may wish to exclude those from the dataset.
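
To make the five-number summary and the outlier idea concrete, here is a small sketch with NumPy. The values are made up, and the 1.5 × IQR rule used for the fences matches Matplotlib's default whisker setting (whis=1.5).

```python
import numpy as np

# Made-up sample with one obviously large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Five-number summary: min, first quartile, median, third quartile, max.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

# Points beyond 1.5 * IQR from the quartiles get drawn as outliers ("fliers").
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```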

We’ll start with a look at boxplots for the four features over all the observations, ignoring the species values.

We will also add a “notch” for the median and include the mean (red dot in this case). We’ll also add some properties for the outlier points (e.g. green colour).

In [6]:

# print(type(iris))
# print(iris.keys())
# Now a quick look at boxplots
data = iris.data[:, 0:4]  # the values of all four feature columns
# use a large figure for the plot
fig = plt.figure(figsize=(20, 14))
labels = iris.feature_names
# ticks = range(1, len(iris.feature_names)+1)
# style the outlier ("flier") points as large green dots
flierprops = dict(marker='o', markerfacecolor='green', markersize=12, linestyle='none')
# note: newer Matplotlib releases (3.9+, I believe) rename the labels parameter to tick_labels
plt.boxplot(data, showmeans=True, flierprops=flierprops, notch=True, labels=labels);
# plt.boxplot(data, showmeans=True, flierprops=flierprops)
# plt.xticks(ticks, labels);

Now let’s try to do the same kind of thing for each feature by species. We will put our 4 plots, 1 per feature, into a 2x2 grid. And we will share the axis values across the grid.

In [7]:

# let's get a different look using Seaborn
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target']=iris.target
# print(iris.target_names)
df['species'] = df['target'].map({0:iris.target_names[0],1:iris.target_names[1],2:iris.target_names[2]})
# df.head()
fig, axes = plt.subplots(2, 2, figsize=(16,9), sharey=True, sharex=True)
sns.boxplot(x="species", y="sepal length (cm)", data=df, ax=axes[0,0])
sns.boxplot(x="species", y="sepal width (cm)", data=df, ax=axes[0,1])
sns.boxplot(x="species", y="petal length (cm)", data=df, ax=axes[1,0])
sns.boxplot(x="species", y="petal width (cm)", data=df, ax=axes[1,1]);

I wanted to see if I could recreate the above using a Seaborn FacetGrid. I eventually decided I needed to use a catplot.

“Figure-level interface for drawing categorical plots onto a FacetGrid.

This function provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations. The kind parameter selects the underlying axes-level function to use:”

(from the seaborn.catplot documentation)

In [8]:

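# print(df.head())
df_1 = df.drop('target', axis=1)
# reshape to long form: one row per (species, variable, value)
flat_iris = df_1.melt(id_vars='species')
# print(flat_iris.head())
# catplot is figure-level and creates its own figure, so a separate
# plt.figure() call would only add an empty one; size is set via height/aspect
sns.catplot(
    data=flat_iris, x='species', y='value',
    col='variable', kind='box', col_wrap=2,
    height=4, aspect=1.5
);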
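
The melt call is what makes the catplot possible, so here is a tiny sketch of what it does. The two-row frame is made up for illustration. The 'variable' and 'value' column names are the pandas defaults, which is why the catplot call uses them.

```python
import pandas as pd

# Made-up wide-form frame: one column per feature, one row per flower.
wide = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.0],
    "petal length (cm)": [1.4, 4.7],
    "species": ["setosa", "versicolor"],
})

# Long form: each feature column becomes rows of (species, variable, value).
long_form = wide.melt(id_vars="species")
print(long_form)
```

Two flowers times two features gives four rows, each carrying its species along.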

Break Time

There is so much more to look at that I think I will call this post done — for better or worse.

Next time I think I will start with a look at a Seaborn pairplot. And, maybe try to duplicate the same thing using Matplotlib directly. Then likely look at histograms and such. Maybe even check out Basemap, though I currently think that is unlikely. And, if there is time/space, maybe a look into ways to customize/add various items in Matplotlib (ticks, legends, text, etc). Though some of the latter have appeared or been used in earlier posts.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources

matplotlib.pyplot
Customizing Matplotlib with style sheets and rcParams
Matplotlib Style sheets reference
matplotlib.pyplot.savefig
List of named colors
numpy.linspace
HTML colours
Boxplots & the Five-Number Summary
Understanding Boxplots
How To Create Boxplots in Python Using Matplotlib
Creating a boxplot FacetGrid in Seaborn for python
Building structured multi-plot grids
seaborn.FacetGrid
seaborn.catplot
How to Adjust the Figure Size of a Seaborn Plot
Iris Data Visualization
Seaborn Matplotlib Iris Data Visualization Code_1
famous iris dataset visualization
Exploratory Data Analysis : Iris Dataset

Too Old To Code

Data Science Basics: Data Visualization — Intro