Last time we took a quick look at using line plots, scatter plots and boxplots to help visualize a dataset. We also had a quick review of using Matplotlib and some styling options.

I am going to start this post with a look at the Seaborn pairplot functionality. Hopefully you’ll see why by the time I am done. I may also look at how to code something similar using bare Matplotlib functionality. Though not sure about the latter just yet.

Then I will take a quick look at histograms, maybe plain bar plots. And, that may end up being it for this post. Though some optional content is under consideration.

`pairplot`

class seaborn.PairGrid(**kwargs)

Subplot grid for plotting pairwise relationships in a dataset.

This object maps each variable in a dataset onto a column and row in a grid of multiple axes. Different axes-level plotting functions can be used to draw bivariate plots in the upper and lower triangles, and the marginal distribution of each variable can be shown on the diagonal.

Several different common plots can be generated in a single line using pairplot(). Use PairGrid when you need more flexibility.

seaborn.pairplot(data, *, …)

Plot pairwise relationships in a dataset.

By default, this function will create a grid of Axes such that each numeric variable in data will be shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.

It is also possible to show a subset of variables or plot different variables on the rows and columns.

This is a high-level interface for PairGrid that is intended to make it easy to draw a few common styles. You should use PairGrid directly if you need more flexibility.

Personally, I believe this may be a case of seeing is understanding, or some such.

Iris Dataset (sklearn)

In [1]:

# let's set things up
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
%matplotlib inline
plt.style.use('default')

In [2]:

# get the iris dataset
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target
df_iris['species'] = df_iris['target'].map({0:iris.target_names[0],1:iris.target_names[1],2:iris.target_names[2]})
df_iris.drop('target', axis=1, inplace=True)
# print(type(iris))
df_iris.head()

Out[2]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

In [3]:

sns.pairplot(df_iris, hue="species", height=3, kind='scatter');
plt.savefig('iris_pairplot.png')

Note: I have not included the HTML generated from the Jupyter notebook as that produced some 27,000 lines of SVG to display the pairplot. So I saved it and am displaying the saved PNG image. And it did take some time to generate the plot.

pairplot of iris dataset

Now, what’s in that pairplot? There are a number of scatter plots (in the upper and lower triangular areas) and a set of layered kernel density estimate (KDE) plots along the diagonal. You may have noticed that each scatter plot in one triangle is a mirror image of a plot in the other triangle: the same pair of features with the x and y axes swapped.

The KDE plots show the data distribution of a single variable (feature) for each species. The scatter plots show the relationship between two features, with a complete set of feature pairings being provided across the complete pairplot.
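As for coding something similar with bare Matplotlib, the following is only a rough sketch of how it might be done, not a replacement for pairplot(). It uses histograms rather than KDE curves on the diagonal. The function name `simple_pairplot`, the stand-in data generated with NumPy, and the output file name are all my own inventions for illustration. Swap in `df_iris` and its column names to get something closer to the plot above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def simple_pairplot(df, columns, by):
    """Grid of scatter plots, with per-group histograms on the diagonal."""
    n = len(columns)
    fig, axes = plt.subplots(n, n, figsize=(2.5 * n, 2.5 * n), squeeze=False)
    for i, ycol in enumerate(columns):
        for j, xcol in enumerate(columns):
            ax = axes[i, j]
            for label, group in df.groupby(by):
                if i == j:
                    # diagonal: distribution of a single feature
                    ax.hist(group[xcol], bins=10, alpha=0.5, label=label)
                else:
                    # off-diagonal: one feature against another
                    ax.scatter(group[xcol], group[ycol], s=10, label=label)
            if i == n - 1:
                ax.set_xlabel(xcol)
            if j == 0:
                ax.set_ylabel(ycol)
    handles, labels = axes[0, 0].get_legend_handles_labels()
    fig.legend(handles, labels, loc="center right", title=by)
    fig.tight_layout(rect=(0, 0, 0.9, 1))  # leave room for the legend
    return fig, axes

# Stand-in data (hypothetical, NOT the iris measurements)
rng = np.random.default_rng(0)
frames = []
for name, centre in [("a", 0.0), ("b", 2.0), ("c", 4.0)]:
    block = pd.DataFrame(rng.normal(centre, 1.0, size=(30, 3)),
                         columns=["f1", "f2", "f3"])
    block["group"] = name
    frames.append(block)
demo = pd.concat(frames, ignore_index=True)

fig, axes = simple_pairplot(demo, ["f1", "f2", "f3"], by="group")
fig.savefig("simple_pairplot.png")  # placeholder file name
```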

Looking at the data distributions, it would seem that petal length and petal width are the most useful features for identifying each species. And looks like Setosa can be easily identified (linearly separable) by those two features. Whereas, Virginica and Versicolor have some overlap (almost linearly separable) in their values. There is considerably more overlap between the species when one looks at the sepal length and sepal width.

There are issues with the pairplot. It only plots numerical features, though you could look at encoding any categorical features you are interested in. A large number of features is going to generate a lot of individual plots (n features means an n x n grid), which may prove difficult to review. That said, in the right situations, it does have value.

Tips Dataset (Seaborn)

Let’s try that again but with a different dataset: the Seaborn tips dataset.

In [4]:

sns.set_style('whitegrid')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
# Let's have a look at a different dataset
tips = sns.load_dataset('tips')
print(type(tips))
print(tips.head())
# Let's give time a numeric equivalent

<class 'pandas.core.frame.DataFrame'>
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

There are some categorical features that I’d like to include in the pairplot. But pairplot() only works with numeric variables/features. So, let’s try adding numerical columns/features for two of the categorical features.

Using pd.Categorical codes worked well enough for the two-category time feature. But I wanted to control the number order for the day feature, so I mapped each day to its position in an explicit list. Turns out that mapping a categorical column to a new column generates a categorical column/series. And, pairplot() doesn’t like categorical columns. So, given the values were small integers, I changed the type to int8.

In [5]:

# let's add numerical equivalent for time
print(tips.time.unique())
# tips['meal'] = tips['time'].map({'Lunch': 1.0, 'Dinner': 2.0})
tips['meal'] = pd.Categorical(tips.time).codes
print(tips.day.unique())
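# first attempt: default category codes, but not in the order I wanted
# tips['dow'] = pd.Categorical(tips.day).codes
# so map each day to its position in an explicit Thur-to-Sun list instead
dsow = ['Thur', 'Fri', 'Sat', 'Sun']
tips['dow'] = tips['day'].apply(lambda x: dsow.index(x))
# apply() on a categorical column returns a categorical, so cast to a small int
tips.dow = tips.dow.astype('int8')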
print(tips.head())
print(tips.dtypes)

['Dinner', 'Lunch']
Categories (2, object): ['Dinner', 'Lunch']
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Sun', 'Sat', 'Thur', 'Fri']

   total_bill   tip     sex smoker  day    time  size  meal  dow
0       16.99  1.01  Female     No  Sun  Dinner     2     1    3
1       10.34  1.66    Male     No  Sun  Dinner     3     1    3
2       21.01  3.50    Male     No  Sun  Dinner     3     1    3
3       23.68  3.31    Male     No  Sun  Dinner     2     1    3
4       24.59  3.61  Female     No  Sun  Dinner     4     1    3

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
meal              int8
dow               int8
dtype: object
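Another way to control the numbering, sketched here on a handful of plain string values rather than the tips frame itself, is to hand pd.Categorical an explicit `categories` list. The codes then follow the order of that list. The variable names are mine, and this is just one option, not necessarily better than the `index()` lookup above.

```python
import pandas as pd

# A few day values like those in the tips data, as plain strings
days = pd.Series(["Sun", "Sat", "Thur", "Fri", "Sun"])

# Default: categories are the sorted unique values -> Fri, Sat, Sun, Thur
default_codes = pd.Categorical(days).codes

# Explicit category list: codes follow the order given here
week_order = ["Thur", "Fri", "Sat", "Sun"]
ordered_codes = pd.Categorical(days, categories=week_order, ordered=True).codes

print(default_codes.tolist())  # [2, 1, 3, 0, 2]
print(ordered_codes.tolist())  # [3, 2, 0, 1, 3]
print(ordered_codes.dtype)     # int8
```

The codes come back as a small integer array, so no separate cast is needed with this approach.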

In this case, I am specifying the columns I want included in the pairplot. I am also specifying a size for the individual plots.

In [6]:

g = sns.pairplot(tips, hue="sex", vars=['total_bill', 'tip', 'meal', 'dow'], height=2.5, aspect=1.4, kind='scatter');
g.savefig('tips_pairplot.png')

pairplot of tips dataset

Not too sure the pairplot provided us with much insight. But, perhaps it helps us figure out where to look next.

`relplot` & `jointplot`

During my reading/research I came across numerous other plot types provided by Seaborn and/or Matplotlib. For now I am sticking with Seaborn. According to the Seaborn documentation, `relplot` provides:

Figure-level interface for drawing relational plots onto a FacetGrid.

And, `jointplot`:

Draw a plot of two variables with bivariate and univariate graphs.

I decided to do something a little more involved than the basic relplot functionality. Hope you like it. Note, for the following plot images, I used the save link Seaborn provided on each of the plots in the notebook.

In [7]:

# this isn't the simplest form, but...
with sns.axes_style("darkgrid", {'grid.color': '.5'}):
    sns.relplot(data=tips, x="total_bill", y="tip", hue="sex", col="day", col_wrap=2, height=3.5, aspect=1.25)

seaborn relplot based on tips dataset

Some big tippers amongst the males on Saturday! Alcohol?

This `jointplot` is pretty much in its most basic form. I don't really understand the statistics involved well enough to try anything else. But, if you know it exists, you may be able to use it and gain an intuitive understanding of the data.

Kernel density estimation is a really useful statistical tool with an intimidating name. Often shortened to KDE, it’s a technique that lets you create a smooth curve given a set of data.

This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram. It can also be used to generate points that look like they came from a certain dataset - this behavior can power simple simulations, where simulated objects are modeled off of real data.

Kernel Density Estimation, By: Matthew Conlen
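To take a little of the mystery out of that "smooth curve", here is a minimal sketch of a one-dimensional KDE done by hand with NumPy: a Gaussian bump is centred on each data point and the bumps are averaged. The function name and the bandwidth of 0.5 are my own choices for illustration (Seaborn picks its bandwidth automatically), and the data are just the five tip values printed by tips.head() earlier.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # distance of every grid point from every sample, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# The five tip values shown in the tips.head() output above
tip_values = [1.01, 1.66, 3.50, 3.31, 3.61]
xs = np.linspace(-5.0, 10.0, 601)
density = gaussian_kde_1d(tip_values, xs, bandwidth=0.5)  # bandwidth picked by hand

# A density curve should enclose an area of roughly 1
area = np.sum(density) * (xs[1] - xs[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that hugs the individual points, while a larger one smooths them together.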

There are a number of values for kind. But reg (regression?) seemed most meaningful to me (due, of course, to my lack of knowledge). Whatever the case, looks pretty cool.

In [8]:

# kernel density estimation and regression
sns.jointplot(x="total_bill", y="tip", data=tips, kind='reg');

seaborn regression jointplot based on tips dataset
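As far as I can tell, the line in the middle of that plot is a straight-line (least squares) fit of tip against total_bill, with a shaded confidence band around it. A rough sketch of just the line-fitting part, using NumPy's `polyfit` on the five rows printed by tips.head() earlier, might look like the following. Five points are far too few to say anything about the real data. This is only to show the mechanics.

```python
import numpy as np

# total_bill and tip from the five rows shown by tips.head() above
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# degree-1 polynomial fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

fitted = slope * total_bill + intercept
residuals = tip - fitted  # how far each point sits from the line

print(f"tip ~ {slope:.3f} * total_bill + {intercept:.3f}")
```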

Histogram

Ok, we’ve seen histograms before. But, perhaps not used quite as follows. This section will be short but sweet — I hope.

We will add a column for tips as percentage of the total bill. We will then create a FacetGrid showing the distribution of tip percentage for the time of the meal and the sex of the person paying for it.

In [9]:

# okay let's have a quick look at using histograms to evaluate the data
# will use facetgrid, this will be somewhat similar to the relplot above
tips['pcnt'] = 100 * tips['tip'] / tips['total_bill']
pcnts = tips['pcnt'].unique()
gt30 = pcnts > 30
print(pcnts[gt30])
# couple of outliers, so let's just plot values from 0 - 40; 15 evenly spaced edges give 14 bins
g = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True, height=3, aspect=1.25)
g.map(plt.hist, "pcnt", bins=np.linspace(0, 40, 15));

FacetGrid of histograms based on tips dataset

So, other than total numbers, there seems to be a fair bit of similarity in the tip distribution regardless of meal or sex. Though it looks like the average percent tip is lower for lunches.
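A small aside on the `bins` argument used in the histograms: when an array is passed, its values are treated as bin edges, so `np.linspace(0, 40, 15)` yields 14 bins rather than 15. The sketch below checks that with NumPy's `histogram`, again using only the five rows printed by tips.head() as sample data.

```python
import numpy as np

edges = np.linspace(0, 40, 15)
print(len(edges))           # number of edges
print(len(edges) - 1)       # number of bins is always one fewer
print(edges[1] - edges[0])  # width of each bin, in percentage points

# tip percentage for the five rows shown by tips.head() above
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])
pcnt = 100 * tip / total_bill

# count how many values fall between each pair of neighbouring edges
counts, _ = np.histogram(pcnt, bins=edges)
print(counts)
```

Passing `np.linspace(0, 40, 16)` instead would give 15 bins, if that is what you are after.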

Now, let’s do that using the day of the week instead of the meal time.

In [10]:

# let's try that with day of week, and some other histogram options just cuz
g = sns.FacetGrid(tips, row="sex", col="day", margin_titles=True, height=3)
g.map(plt.hist, "pcnt", bins=np.linspace(0, 40, 15), histtype='stepfilled', color='powderblue', edgecolor='none');

2nd FacetGrid of histograms based on tips dataset

Done?

Didn’t cover many more possible plot types and/or visualization possibilities. Too many. All I wanted was a decent start to knowing what was out there. So, that is likely it for this one.

Not really sure what I’ve learned from the above. Certainly still don’t know how to interpret the plots in any meaningful way. But, that will hopefully come with more reading and practice.

But in some of the cases I had to manipulate the data to get what I wanted. That was definitely a good learning experience for me.

Not sure where I am going to go next. This is pretty much it for the basic tools/skills. Perhaps I should look at tackling a specific dataset or two, going through the whole process top to bottom (well not sure about the direction). Building and testing a model.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources

seaborn.PairGrid
seaborn.pairplot
seaborn.relplot
seaborn.jointplot
pandas.DataFrame.dtypes
pandas.DataFrame.astype
pandas.Categorical
pandas.Series.unique
Kernel Density Estimation, By: Matthew Conlen

Too Old To Code

Data Science Basics: Data Visualization, Part II
