This is likely going to be an inadequate post. I am going to try to cover groupby and pivot in a single post, something that should likely take 3 or 4 posts to cover adequately. But there are tons of articles/posts out there covering these topics, and I really just want to get an idea of the basics, not the intricacies. There will always be time for the latter when I get working on real life projects.

That said, as near as I can tell, pivot tables are groupby on steroids. Let's find out. Though I used Excel in a variety of ways personally and at work, I somehow never got into using pivot tables. Didn't need them, so why spend (waste?) the time.

To be clear we are talking about re-shaping our data and about efficient summarization (aggregation) of that data whatever shape it is in currently.

Aggregation? You know: sum(), mean(), median(), min(), max(), std() and the like. And perhaps one more that is well worth knowing about: describe().

Simple Aggregation

Let’s have a look at some simple examples. Very similar to what we previously covered (NumPy and pandas series and dataframes). We’ll start with series, then look at dataframes.

In [1]:
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
# aggregation on a series returns a single value
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
display(ser)
print(f"\nseries sum: {ser.sum()}")
print(f"series mean: {ser.mean()}")
print(f"number of items in series: {ser.count()}")
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64
series sum: 2.811925491708157
series mean: 0.5623850983416314
number of items in series: 5
In [3]:
# for a dataframe you get, by default, a value for each column; this can be altered
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
display(df)
print(f"\ndataframe sum:\n{df.sum()}")
print(f"\ndataframe mean:\n{df.mean()}")
print(f"\nnumber of items in columns:\n{df.count()}")
          A         B
0  0.155995  0.020584
1  0.058084  0.969910
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825
dataframe sum:
A    2.389442
B    2.217101
dtype: float64

dataframe mean:
A    0.477888
B    0.443420
dtype: float64

number of items in columns:
A    5
B    5
dtype: int64

In [4]:
# but we can get it to operate on the rows by specifying an axis
print(f"\ndataframe sum by row:\n{df.sum(axis='columns')}")
print(f"\ndataframe mean by row:\n{df.mean(axis='columns')}")
dataframe sum by row:
0    0.176579
1    1.027993
2    1.698619
3    0.813454
4    0.889898
dtype: float64

dataframe mean by row:
0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

Nuff said!

A Real Life Dataset

Let’s look at something a touch more realistic. We’ll use one of the more common tutorial datasets: the Titanic dataset. I am going to use the training set from Kaggle — “the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works”. I have downloaded it to a local directory.

There are also other sources should you not wish to sign up with Kaggle; e.g. Seaborn has a built-in version, though with a slightly different set of features.

In [5]:
# Let's try a more complicated dataset: Titanic
titan = pd.read_csv('./data/titanic/train.csv')
display(titan.head(5))
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

I mentioned the describe() method above. Let's have a look. Do note that categorical features are not included in the result. The second view drops any rows containing n/a (null) values, of which there are apparently quite a few.

In [6]:
# Let's try that describe() method
display(titan.describe())
print()
display(titan.dropna().describe())
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Looking at the above we know that in the training set:

  • there are 891 passengers
  • 38% of those survived the sinking of the Titanic
  • their ages ranged from 0.4 to 80
  • we are missing data in at least the 'Age' column
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   183.000000  183.000000  183.000000  183.000000  183.000000  183.000000  183.000000
mean    455.366120    0.672131    1.191257   35.674426    0.464481    0.475410   78.682469
std     247.052476    0.470725    0.515187   15.643866    0.644159    0.754617   76.347843
min       2.000000    0.000000    1.000000    0.920000    0.000000    0.000000    0.000000
25%     263.500000    0.000000    1.000000   24.000000    0.000000    0.000000   29.700000
50%     457.000000    1.000000    1.000000   36.000000    0.000000    0.000000   57.000000
75%     676.000000    1.000000    1.000000   47.500000    1.000000    1.000000   90.000000
max     890.000000    1.000000    3.000000   80.000000    3.000000    4.000000  512.329200
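The same facts can also be pulled straight from the frame rather than read off describe()'s output. A minimal sketch on a tiny made-up frame (only the column names mirror the Titanic set; the values are invented for illustration):

```python
import pandas as pd
import numpy as np

# tiny stand-in frame; values are made up
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Age': [22.0, 38.0, np.nan, 35.0, 0.4],
})

n_passengers = len(df)                  # number of rows
survival_rate = df['Survived'].mean()   # mean of a 0/1 column is a rate
age_range = (df['Age'].min(), df['Age'].max())
missing_ages = df['Age'].isnull().sum()

print(n_passengers, survival_rate, age_range, missing_ages)
```

On the real training set these expressions return the 891 passengers, 38% survival rate, 0.42 to 80 age range, and 177 missing ages noted above.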

Groupby()

Now let’s try our first groupby. We’ll group the dataset on the sex of the passengers (in those days they apparently only had two).

In [7]:
by_sex = titan.groupby("Sex")
display(by_sex.describe())
display(by_sex['Survived'].describe())
display(by_sex[['Survived', 'Fare', 'SibSp', 'Parch']].mean())
       PassengerId                                                            Survived            ...  Parch        Fare                                                                
             count        mean         std  min     25%    50%     75%    max    count      mean  ...    75%  max  count       mean        std   min        25%    50%    75%       max
Sex                                                                                                                                                                                    
female       314.0  431.028662  256.846324  2.0  231.75  414.5  641.25  889.0    314.0  0.742038  ...    1.0  6.0  314.0  44.479818  57.997698  6.75  12.071875  23.00  55.00  512.3292
male         577.0  454.147314  257.486139  1.0  222.00  464.0  680.00  891.0    577.0  0.188908  ...    0.0  5.0  577.0  25.523893  43.138263  0.00   7.895800  10.50  26.55  512.3292

2 rows × 56 columns

        count      mean       std  min  25%  50%  75%  max
Sex                                                       
female  314.0  0.742038  0.438211  0.0  0.0  1.0  1.0  1.0
male    577.0  0.188908  0.391775  0.0  0.0  0.0  0.0  1.0

        Survived       Fare     SibSp     Parch
Sex                                            
female  0.742038  44.479818  0.694268  0.649682
male    0.188908  25.523893  0.429809  0.235702

Okay, how about a simple count of the number of people, by sex. (Note that count() tallies non-null rows of the 'Survived' column, so this is a head count per group, not the number who survived the sinking.)

In [6]:
display(titan.groupby("Sex")["Survived"].count())
Sex
female    314
male      577
Name: Survived, dtype: int64
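Because 'Survived' is a 0/1 column, sum() gives the number of survivors per group and mean() gives the survival rate, while count() just tallies non-null rows. A sketch on a made-up miniature frame:

```python
import pandas as pd

# made-up miniature of the Titanic frame
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 0, 0, 1, 0],
})

grp = df.groupby('Sex')['Survived']
heads = grp.count()    # non-null rows per group: female 2, male 3
survivors = grp.sum()  # number of 1s per group: female 1, male 1
rate = grp.mean()      # survivors / heads

print(heads, survivors, rate, sep='\n')
```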

Now, let’s add the person’s age to our grouping of survival counts.

In [7]:
display(titan.groupby(["Sex", "Age"])["Survived"].count())
Sex     Age  
female  0.75     2
        1.00     2
        2.00     6
        3.00     2
        4.00     5
                ..
male    70.00    2
        70.50    1
        71.00    2
        74.00    1
        80.00    1
Name: Survived, Length: 145, dtype: int64

Not so helpful, but we will eventually fix that. First, though, let’s do something about the missing data (where appropriate). Just so you know, I am not currently going to worry about using the cabin information, nor do anything about the missing cabin values. Another day perhaps.

In [8]:
# Let's look at missing data
display(titan.isnull().sum())
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Dealing with Missing Data

We’ll start by dealing with the 2 missing embarkation locations. Turns out that with a bit of hunting, we don’t have to guess.

In [9]:
# Ok let's just sort out missing embarked first
display(titan.loc[titan['Embarked'].isnull()])
# searching suitable sources, it turns out both embarked at Southampton
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html
titan.loc[titan['Embarked'].isnull(), 'Embarked'] = 'S'
     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN
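As an aside, the same fill can be written with fillna(), and when no outside source is available a common fallback is the column's most frequent value (its mode). A small sketch with made-up values:

```python
import pandas as pd
import numpy as np

# made-up embarkation column with two missing entries
ser = pd.Series(['S', 'C', np.nan, 'S', 'Q', np.nan])

# fill with a value we found from research...
known = ser.fillna('S')

# ...or, lacking outside information, with the most frequent value
by_mode = ser.fillna(ser.mode()[0])

print(known.tolist())
print(by_mode.tolist())
```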

Because of the large number of missing ages, I am not going to try searching the web for answers. Instead I am going to use some sort of average age. Looking at sex and passenger class, it looks like the median of those groups should work reasonably well.

In [10]:
# Age is a bit harder, but I am going to use the median age of the data grouped on sex and passenger class
print('Median age:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].median())
print('\nCount by group:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].count())
# let's fill the missing values; transform() keeps the result aligned
# with the original index, which groupby().apply() may not on newer pandas
titan['Age'] = titan['Age'].fillna(
    titan.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
print()
display(titan.isnull().sum())
Median age:
Pclass  Sex   
1       female    35.0
        male      40.0
2       female    28.0
        male      30.0
3       female    21.5
        male      25.0
Name: Age, dtype: float64
Count by group:
Pclass  Sex   
1       female     85
        male      101
2       female     74
        male       99
3       female    102
        male      253
Name: Age, dtype: int64
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
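Filling by group median is a common pattern, worth a standalone sketch. The key piece is transform(), which returns a Series aligned to the original index so it can be fed straight into fillna(). The frame here is made up for illustration:

```python
import pandas as pd
import numpy as np

# made-up frame: two passenger classes, two missing ages
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# transform() returns a Series aligned to df's index,
# holding each row's group median
group_median = df.groupby('Pclass')['Age'].transform('median')
df['Age'] = df['Age'].fillna(group_median)

print(df['Age'].tolist())
```

The class 1 median is 45.0 and the class 3 median is 25.0, so those are the values that land in the two missing slots.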

Binning Data

Now, if you remember, trying to group on age above didn’t really work: there are just too many distinct ages. The same will be true when we try to group on fare. So we will look at creating groups based on suitable ranges for each. And pandas has some functions to help us out.

We’ll start with a quick look at a box-and-whisker plot for the ‘Age’ data. There are a number of outliers. The picture is similar for fare, but the range of fare values is much larger.

In [11]:
# we still need to do some work with Age and Fare
# I propose we create some more meaningful groups to more easily visualize possibilities
titan.boxplot(column=['Age'], figsize=(15,7))
Out[11]:
<AxesSubplot:>

For fare we will split the cases into buckets containing similar numbers of passengers using pd.qcut(). But for ages it seems more reasonable to use buckets covering equal age ranges, since those more likely reflect the odds of survival than ranges based on the number of members, as we are doing with fares. For age we will use pd.cut().

In [12]:
# fair number of outliers, similar story for Fare
# so let's cut our data into blocks so the outliers are less of an issue
print(type(titan['Age'].iloc[0]))
# ~! rerunning this cell caused errors,
# so guard the 'cut's with a check that Age is still numeric
if pd.api.types.is_numeric_dtype(titan['Age']):
    titan['Age'] = pd.cut(titan['Age'].astype(int), 5)
    titan['Fare'] = pd.qcut(titan['Fare'], 5)
print(type(titan['Age'].iloc[0]))
print("for age, each category has a different number of cases")
print("but each category is of a similar range")
display(titan['Age'].value_counts())
print("for fare, each category has almost the same number of cases")
display(titan['Fare'].value_counts())
<class 'numpy.float64'>
<class 'pandas._libs.interval.Interval'>
for age, each category has a different number of cases
but each category is of a similar range
(16.0, 32.0]     495
(32.0, 48.0]     216
(-0.08, 16.0]    100
(48.0, 64.0]      69
(64.0, 80.0]      11
Name: Age, dtype: int64
for fare, each category has almost the same number of cases
(7.854, 10.5]        184
(21.679, 39.688]     180
(-0.001, 7.854]      179
(39.688, 512.329]    176
(10.5, 21.679]       172
Name: Fare, dtype: int64
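One refinement worth knowing about: both pd.cut() and pd.qcut() accept a labels argument, so the resulting categories can carry readable names instead of interval notation. A sketch with made-up ages and labels of my own choosing:

```python
import pandas as pd

# made-up ages; bin edges mirror the equal-width ranges above
ages = pd.Series([4, 15, 25, 40, 70])

bands = pd.cut(ages, bins=[0, 16, 32, 48, 64, 80],
               labels=['child', 'young adult', 'adult',
                       'middle age', 'senior'])

print(bands.tolist())
```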

Survival Rate for Different Groupings

Now let’s look at the survival rate when grouping our data on different columns.