Ok, time to have a look at pivot tables in pandas.

Pivot table? A pivot table is a derived table of statistics that helps summarize the data of a larger table. This pivoting is done in an attempt to make the data suitable for further analysis, i.e. insight. “Often you will use a pivot to demonstrate the relationship between two columns that can be difficult to reason about before the pivot.” (How to Use Pandas Pivot Table Example)

We’ve already seen that done with pandas’ group by functionality. But the pivot table functionality offers more customization.

We will once again play around with the Titanic dataset. And, I am not sure it will in fact provide anything really new. But we shall see.

As usual for each post in this particular series, Data Science Basics, a new Jupyter notebook will be started.

Some Display Options

I have decided to try a couple of setup options for this notebook. One sets the float output, the other allows for multiple variables in a cell to be displayed without using a print()/display() statement. I have put these in my first code cell along with my current imports.

In [1]:
# from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import pandas as pd
import seaborn as sns

# general options pd.options.display.float_format = '{:,.2f}'.format InteractiveShell.ast_node_interactivity = "all"

Titanic (Kaggle) Training Dataset

Let’s start by loading and reviewing the dataset.

In [2]:
# let's load our dataset, the kaggle Titanic training set
titan = pd.read_csv('./data/titanic/train.csv')
titan.head(3)
titan.tail(4)
Out[2]:
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.0010A/5 211717.25NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.0010PC 1759971.28C85C
2313Heikkinen, Miss. Lainafemale26.0000STON/O2. 31012827.92NaNS
Out[2]:
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88788811Graham, Miss. Margaret Edithfemale19.000011205330.00B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.000011136930.00C148C
89089103Dooley, Mr. Patrickmale32.00003703767.75NaNQ
In [3]:
# let's review the data
titan.describe()
Out[3]:
PassengerIdSurvivedPclassAgeSibSpParchFare
count891.00891.00891.00714.00891.00891.00891.00
mean446.000.382.3129.700.520.3832.20
std257.350.490.8414.531.100.8149.69
min1.000.001.000.420.000.000.00
25%223.500.002.0020.120.000.007.91
50%446.000.003.0028.000.000.0014.45
75%668.501.003.0038.001.000.0031.00
max891.001.003.0080.008.006.00512.33

Looking at the above you we know that in the training set:

  • there are 891 passengers
  • 38% of those survived the sinking of the Titanic
  • their ages ranged from 0.4 to 80
  • we are missing data in at least the 'Age' column

pd.pivot_table()

Let’s look at a pivot table or two. The first with a single feature index. The second with two features in the index.

Single Feature Index

We will build this pivot table based on the passenger’s sex. By default pivot_table() calculates the means for the numeric columns, ignoring categorical/non-numeric columns.

In [4]:
# let's get the pivot to work for the 'Sex' column
by_sex = pd.pivot_table(data=titan, index=['Sex'])
by_sex
Out[4]:
AgeFareParchPassengerIdPclassSibSpSurvived
Sex
female27.9244.480.65431.032.160.690.74
male30.7325.520.24454.152.390.430.19

A good many pandas objects have plotting methods built in.

In [5]:
# pivot_table by default calculates the mean for each column, ignoring non-numeric columns
# let's plot a few of those means
by_sex[['Age', 'Fare', 'Survived']].plot.bar(figsize=(10,6))
Out[5]:
<AxesSubplot:xlabel='Sex'>

Pivot Table with a Two Feature Index

Now let’s look at combining the passenger’s sex with their travel class.

In [6]:
# let's add another feature to the index
by_s_pcl = pd.pivot_table(data=titan, index=['Sex', 'Pclass'])
by_s_pcl
by_s_pcl_2 = titan.pivot_table(index=['Sex', 'Pclass'])
by_s_pcl_2
Out[6]:
AgeFareParchPassengerIdSibSpSurvived
SexPclass
female134.61106.130.46469.210.550.97
228.7221.970.61443.110.490.92
321.7516.120.80399.730.900.50
male141.2867.230.28455.730.310.37
230.7419.740.22447.960.340.16
326.5112.660.22455.520.500.14
Out[6]:
AgeFareParchPassengerIdSibSpSurvived
SexPclass
female134.61106.130.46469.210.550.97
228.7221.970.61443.110.490.92
321.7516.120.80399.730.900.50
male141.2867.230.28455.730.310.37
230.7419.740.22447.960.340.16
326.5112.660.22455.520.500.14

Well there you go, nothing remotely stunningly new!

Let’s look at another way to shape our two feature pivot table. We will ignore a few of the features where taking the mean is possibly not particularly meaningful.

In [7]:
# let's organize that table a little differently
s_pcl_3 = pd.pivot_table(titan, index = 'Sex', columns = 'Pclass', values = ['Age', 'Fare', 'Survived'], aggfunc = 'mean')
s_pcl_3
Out[7]:
AgeFareSurvived
Pclass123123123
Sex
female34.6128.7221.75106.1321.9716.120.970.920.50
male41.2830.7426.5167.2319.7412.660.370.160.14

Features to Include in Model

Let’s start with a look at our dataset.

In [9]:
titan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Ok, a fair number of “features”, with “Age”, “Cabin” and “Embarked” missing data. Much as we discovered in the previous post looking at groupby(). Now since we will eventually be interested in predicting the probability of a passenger’s survival, let’s perhaps look at which of the above features might contribute to a higher (or lower) rate of survival.

Seems unlikely that “PassengerId”, “Name”, or “Ticket” will correlate to survival in any way. I do expect that “Sex”, “Age”, “Pclass” likely have some correlation with survival. And, if travel class does, perhaps “Fare” will also have some value as it would reflect the travel class as well as the passenger’s status in general. The latter likely having some impact on accessability to the upper decks.

But what about “Cabin” and “Embarked”. Since they are missing data do we really need to deal with that issue.

Then there is the “SibSp” and “Parch” features.

Let’s have a look. Starting with Embarked first.

The “Embarked” Feature and A Problem to Resolve

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable.

seaborn.pointplot

We are going to use a a set of Seaborn pointplots() to look at what the possible impact of the Embarked feature on survival rate.

In [10]:
fgrid = sns.FacetGrid(titan, col='Embarked', col_wrap=2, height=4.0, aspect=1.2)
fgrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None,  order=None, hue_order=None )
fgrid.add_legend();

Would appear that Embarked is somewhat correlated with survival, depending on the passenger’s sex. Embarking at Cherbourg, France, seems to have been not so good for the females but better for the males. The other way around for the two other ports: Queenstown, Ireland (present day Cobh, Ireland) and Southampton, U.K. So we should likely keep this feature in our eventual model.

Okay, let’s generate the related pivot table.

In [11]:
# let's looks at a similar pivot table
sp_e = pd.pivot_table(titan, index=['Pclass', 'Sex'], columns=['Embarked'], values=['Survived'], dropna=False, aggfunc='mean')
sp_e
Out[11]:
Survived
EmbarkedCQS
PclassSex
1female0.981.000.96
male0.400.000.35
2female1.001.000.91
male0.200.000.15
3female0.650.730.38
male0.230.080.13

Trouble! The data above does not match the pointplot above for embarcation at Cherbourg. What’s happening?

In [12]:
# Let's try groupby()
sp_e_gb = titan.groupby(["Sex", "Pclass", "Embarked"])["Survived"].mean().unstack()
sp_e_gb
Out[12]:
EmbarkedCQS
SexPclass
female10.981.000.96
21.001.000.91
30.650.730.38
male10.400.000.35
20.200.000.15
30.230.080.13

The result, though displayed differently, matches that of the pivot table above.

Let’s try running pivot_table as a method of the dataframe.

In [13]:
titan.pivot_table('Survived', index=['Sex', 'Pclass'], columns=['Embarked'])
Out[13]:
EmbarkedCQS
SexPclass
female10.981.000.96
21.001.000.91
30.650.730.38
male10.400.000.35
20.200.000.15
30.230.080.13

Well, still no change.

Not a clue what is going on. Will have to do some thinking and maybe some arithmetic.

Let’s start by looking at some of the numbers from the dataset.

In [14]:
#titan.pivot_table("Age", index=["Sex", "Survived", "Embarked"], columns = ["Pclass"], aggfunc='count')
sp_e_gb_2 = titan.groupby(["Sex", "Pclass", "Embarked", "Survived"])["Ticket"].count()
sp_e_gb_2
Out[14]:
Sex     Pclass  Embarked  Survived
female  1       C         0             1
                          1            42
                Q         1             1
                S         0             2
                          1            46
        2       C         1             7
                Q         1             2
                S         0             6
                          1            61
        3       C         0             8
                          1            15
                Q         0             9
                          1            24
                S         0            55
                          1            33
male    1       C         0            25
                          1            17
                Q         0             1
                S         0            51
                          1            28
        2       C         0             8
                          1             2
                Q         0             1
                S         0            82
                          1            15
        3       C         0            33
                          1            10
                Q         0            36
                          1             3
                S         0           231
                          1            34
Name: Ticket, dtype: int64

Looks to me like the data is consistent and the pointplot is out of sync.

But let’s try limiting it to data for those who embarked at Cherbourg. And maybe a little prettier.

In [15]:
sp_e_gb_2 = titan.loc[titan["Embarked"]=="C"].groupby(["Sex", "Pclass", "Survived"])["Ticket"].count()
sp_e_gb_3 = pd.pivot_table(titan.loc[titan["Embarked"]=="C"], index=['Pclass', 'Sex', 'Survived'], values=['Ticket'], aggfunc='count')
sp_e_gb_3
Out[15]:
Ticket
PclassSexSurvived
1female01
142
male025
117
2female17
male08
12
3female08
115
male033
110

I know. Not really getting anywhere. Let’s try a simple cross-tabulation or two.

In [16]:
pd.crosstab(titan.loc[titan["Sex"]=="female"]['Survived'], titan['Embarked'], margins=True)
pd.crosstab(titan.loc[titan["Sex"]=="male"]['Survived'], titan['Embarked'], margins=True)
Out[16]:
EmbarkedCQSAll
Survived
0996381
16427140231
All7336203312
Out[16]:
EmbarkedCQSAll
Survived
06638364468
129377109
All9541441577
In [17]:
e_tst = titan.loc[(titan["Sex"]=="female") & (titan['Embarked']=='C')].groupby(["Pclass"])
print(e_tst["Survived"].mean())
Pclass
1   0.98
2   1.00
3   0.65
Name: Survived, dtype: float64

Ok. Must be the plot. Let’s try generating a single pointplot just for the Cherbourg data.

In [18]:
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=titan.loc[titan["Embarked"]=="C"]);

And, look at that! The data for females and males has switched from the plot in the FacetGrid. Time to do a little research.

Took me awhile to sort the possible cause. No easy search results. Had to do plenty of reading and some guessing. I decided not to go with the default for hue_order. When I left that parameter out of the statement, a warning was displayed.

…\envs\ds-3.9\lib\site-packages\seaborn\axisgrid.py:648: UserWarning: Using the pointplot function without specifying hue_order is likely to produce an incorrect plot.

So, let’s specify a hue_order.

In [19]:
fgrid_2 = sns.FacetGrid(titan, col='Embarked', col_wrap=2, height=4.0, aspect=1.2, sharex=False)
# from the single chart above, clearly something not working correctly in the first set of plots
# let's specifically state the order for the sexes
fgrid_2.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue="Sex", hue_order=["female", "male"] )
fgrid_2.add_legend();

Bingo. The chart above matches the pivot table data. Finally!

Okay, now it looks like the port of embarcation is of less value in determining probability of survival than seemed to be the case after our initial attempt to determine the presence of any correlation. Likely can be dropped from the model.

But, looking at these charts it seems pretty clear that “Sex” and “PClass” are in some way correlated with survival.

Glad I decided to generate a pivot table to match the first set of pointplots.

This Post Done

Sorry, but it has taken me quite some time to sort out why the first pointplots and the dataset values, via pivot tables, etc., did not match. Though I likely barely covered pivot tables, I think I will take a break and consider this discussion on pivot tables done.

Not sure where I will go next. May quick look at time series. Maybe move onto visualization, and look at what we can do with Matplotlib and/or Seaborn.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources