Ok, time to have a look at pivot tables in pandas.
Pivot table? A pivot table is a derived table of statistics that helps summarize the data of a larger table. This pivoting is done in an attempt to make the data suitable for further analysis, i.e. insight. “Often you will use a pivot to demonstrate the relationship between two columns that can be difficult to reason about before the pivot.” (How to Use Pandas Pivot Table Example)
We’ve already seen that done with pandas’ group by functionality. But the pivot table functionality offers more customization.
We will once again play around with the Titanic dataset. And, I am not sure it will in fact provide anything really new. But we shall see.
As usual for each post in this particular series, Data Science Basics, a new Jupyter notebook will be started.
Some Display Options
I have decided to try a couple of setup options for this notebook. One sets the float output, the other allows for multiple variables in a cell to be displayed without using a print()/display() statement. I have put these in my first code cell along with my current imports.
# from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import seaborn as sns
# general options
pd.options.display.float_format = '{:,.2f}'.format
InteractiveShell.ast_node_interactivity = "all"
Titanic (Kaggle) Training Dataset
Let’s start by loading and reviewing the dataset.
# let's load our dataset, the kaggle Titanic training set
titan = pd.read_csv('./data/titanic/train.csv')
titan.head(3)
titan.tail(4)
# let's review the data
titan.describe()
Looking at the above you we know that in the training set:
- there are 891 passengers
- 38% of those survived the sinking of the Titanic
- their ages ranged from 0.4 to 80
- we are missing data in at least the 'Age' column
pd.pivot_table()
Let’s look at a pivot table or two. The first with a single feature index. The second with two features in the index.
Single Feature Index
We will build this pivot table based on the passenger’s sex. By default pivot_table()
calculates the means for the numeric columns, ignoring categorical/non-numeric columns.
# let's get the pivot to work for the 'Sex' column
by_sex = pd.pivot_table(data=titan, index=['Sex'])
by_sex
A good many pandas objects have plotting methods built in.
# pivot_table by default calculates the mean for each column, ignoring non-numeric columns
# let's plot a few of those means
by_sex[['Age', 'Fare', 'Survived']].plot.bar(figsize=(10,6))
Pivot Table with a Two Feature Index
Now let’s look at combining the passenger’s sex with their travel class.
# let's add another feature to the index
by_s_pcl = pd.pivot_table(data=titan, index=['Sex', 'Pclass'])
by_s_pcl
by_s_pcl_2 = titan.pivot_table(index=['Sex', 'Pclass'])
by_s_pcl_2
Well there you go, nothing remotely stunningly new!
Let’s look at another way to shape our two feature pivot table. We will ignore a few of the features where taking the mean is possibly not particularly meaningful.
# let's organize that table a little differently
s_pcl_3 = pd.pivot_table(titan, index = 'Sex', columns = 'Pclass', values = ['Age', 'Fare', 'Survived'], aggfunc = 'mean')
s_pcl_3
Features to Include in Model
Let’s start with a look at our dataset.
titan.info()
Ok, a fair number of “features”, with “Age”, “Cabin” and “Embarked” missing data. Much as we discovered in the previous post looking at groupby()
. Now since we will eventually be interested in predicting the probability of a passenger’s survival, let’s perhaps look at which of the above features might contribute to a higher (or lower) rate of survival.
Seems unlikely that “PassengerId”, “Name”, or “Ticket” will correlate to survival in any way. I do expect that “Sex”, “Age”, “Pclass” likely have some correlation with survival. And, if travel class does, perhaps “Fare” will also have some value as it would reflect the travel class as well as the passenger’s status in general. The latter likely having some impact on accessability to the upper decks.
But what about “Cabin” and “Embarked”. Since they are missing data do we really need to deal with that issue.
Then there is the “SibSp” and “Parch” features.
Let’s have a look. Starting with Embarked first.
The “Embarked” Feature and A Problem to Resolve
A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.
Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable.
seaborn.pointplot
We are going to use a a set of Seaborn pointplots()
to look at what the possible impact of the Embarked feature on survival rate.
fgrid = sns.FacetGrid(titan, col='Embarked', col_wrap=2, height=4.0, aspect=1.2)
fgrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue_order=None )
fgrid.add_legend();
Would appear that Embarked
is somewhat correlated with survival, depending on the passenger’s sex. Embarking at Cherbourg, France, seems to have been not so good for the females but better for the males. The other way around for the two other ports: Queenstown, Ireland (present day Cobh, Ireland) and Southampton, U.K. So we should likely keep this feature in our eventual model.
Okay, let’s generate the related pivot table.
# let's looks at a similar pivot table
sp_e = pd.pivot_table(titan, index=['Pclass', 'Sex'], columns=['Embarked'], values=['Survived'], dropna=False, aggfunc='mean')
sp_e
Trouble! The data above does not match the pointplot above for embarcation at Cherbourg. What’s happening?
# Let's try groupby()
sp_e_gb = titan.groupby(["Sex", "Pclass", "Embarked"])["Survived"].mean().unstack()
sp_e_gb
The result, though displayed differently, matches that of the pivot table above.
Let’s try running pivot_table
as a method of the dataframe.
titan.pivot_table('Survived', index=['Sex', 'Pclass'], columns=['Embarked'])
Well, still no change.
Not a clue what is going on. Will have to do some thinking and maybe some arithmetic.
Let’s start by looking at some of the numbers from the dataset.
#titan.pivot_table("Age", index=["Sex", "Survived", "Embarked"], columns = ["Pclass"], aggfunc='count')
sp_e_gb_2 = titan.groupby(["Sex", "Pclass", "Embarked", "Survived"])["Ticket"].count()
sp_e_gb_2
Looks to me like the data is consistent and the pointplot is out of sync.
But let’s try limiting it to data for those who embarked at Cherbourg. And maybe a little prettier.
sp_e_gb_2 = titan.loc[titan["Embarked"]=="C"].groupby(["Sex", "Pclass", "Survived"])["Ticket"].count()
sp_e_gb_3 = pd.pivot_table(titan.loc[titan["Embarked"]=="C"], index=['Pclass', 'Sex', 'Survived'], values=['Ticket'], aggfunc='count')
sp_e_gb_3
I know. Not really getting anywhere. Let’s try a simple cross-tabulation or two.
pd.crosstab(titan.loc[titan["Sex"]=="female"]['Survived'], titan['Embarked'], margins=True)
pd.crosstab(titan.loc[titan["Sex"]=="male"]['Survived'], titan['Embarked'], margins=True)
e_tst = titan.loc[(titan["Sex"]=="female") & (titan['Embarked']=='C')].groupby(["Pclass"])
print(e_tst["Survived"].mean())
Ok. Must be the plot. Let’s try generating a single pointplot just for the Cherbourg data.
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=titan.loc[titan["Embarked"]=="C"]);
And, look at that! The data for females and males has switched from the plot in the FacetGrid. Time to do a little research.
Took me awhile to sort the possible cause. No easy search results. Had to do plenty of reading and some guessing. I decided not to go with the default for hue_order
. When I left that parameter out of the statement, a warning was displayed.
…\envs\ds-3.9\lib\site-packages\seaborn\axisgrid.py:648: UserWarning: Using the pointplot function without specifying
hue_order
is likely to produce an incorrect plot.
So, let’s specify a hue_order
.
fgrid_2 = sns.FacetGrid(titan, col='Embarked', col_wrap=2, height=4.0, aspect=1.2, sharex=False)
# from the single chart above, clearly something not working correctly in the first set of plots
# let's specifically state the order for the sexes
fgrid_2.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue="Sex", hue_order=["female", "male"] )
fgrid_2.add_legend();
Bingo. The chart above matches the pivot table data. Finally!
Okay, now it looks like the port of embarcation is of less value in determining probability of survival than seemed to be the case after our initial attempt to determine the presence of any correlation. Likely can be dropped from the model.
But, looking at these charts it seems pretty clear that “Sex” and “PClass” are in some way correlated with survival.
Glad I decided to generate a pivot table to match the first set of pointplots.
This Post Done
Sorry, but it has taken me quite some time to sort out why the first pointplots and the dataset values, via pivot tables, etc., did not match. Though I likely barely covered pivot tables, I think I will take a break and consider this discussion on pivot tables done.
Not sure where I will go next. May quick look at time series. Maybe move onto visualization, and look at what we can do with Matplotlib and/or Seaborn.
If you wish to play with the above, feel free to download my notebook covering the contents of this post.
Resources
- pandas.pivot_table
- seaborn.FacetGrid
- seaborn.pointplot
- Reshaping and pivot tables
- Reshaping in Pandas
- Pivot, Pivot-Table, Stack and Unstack - Explore Happiness Data Using Python Pivot Tables
- Python Pivot Tables – The Ultimate Guide = How to Use Pandas Pivot Table
- Pandas Pivot Table in Python
- A Step-by-Step Guide to Pandas Pivot Tables
- Pandas Pivot: A Guide with Examples
- Pivot tables – the Swiss Army Knife of data analysis
- Pandas Pivot Table Explained
- Building structured multi-plot grids