This is likely going to be an inadequate post. I am going to try to cover groupby and pivot in a single post, something that should probably take three or four posts to cover adequately. But there are tons of articles/posts out there covering these topics, and I really just want to get an idea of the basics, not the intricacies. Always time for the latter when I get working on real-life projects.
That said, as near as I can tell, pivot tables are groupby on steroids. Let’s find out. Though I used Excel in a variety of ways personally and at work, I somehow never got into using pivot tables. Didn’t need them, so didn’t spend (waste?) the time.
To be clear, we are talking about re-shaping our data, and about efficient summarization (aggregation) of that data whatever shape it is currently in.
Aggregation? You know: sum(), mean(), median(), min(), max(), std() and the like. And perhaps one especially well worth knowing about: describe().
Simple Aggregation
Let’s have a look at some simple examples, very similar to what we previously covered (NumPy and pandas series and dataframes). We’ll start with series, then look at dataframes.
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns
# aggregation on a series returns a single value
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
display(ser)
print(f"\nseries sum: {ser.sum()}")
print(f"series mean: {ser.mean()}")
print(f"number of items in series: {ser.count()}")
# for a dataframe, by default, you get a value for each column; this can be altered
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
display(df)
print(f"\ndataframe sum:\n{df.sum()}")
print(f"\ndataframe mean:\n{df.mean()}")
print(f"\nnumber of items in columns:\n{df.count()}")
# but we can get it to operate on the rows by specifying an axis
print(f"\ndataframe sum by row:\n{df.sum(axis='columns')}")
print(f"\ndataframe mean by row:\n{df.mean(axis='columns')}")
Nuff said!
A Real Life Dataset
Let’s look at something a touch more realistic. We’ll use one of the more common tutorial datasets: the Titanic dataset. I am going to use the training set from Kaggle — “the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works”. I have downloaded it to a local directory.
There are also other sources should you not wish to sign up with Kaggle. E.g. Seaborn has a built-in version, though with a slightly different set of features.
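If you’d rather skip the Kaggle download, here is a minimal sketch using the Seaborn copy instead. Do note its column names differ from the Kaggle file (all lowercase, e.g. 'survived' and 'pclass'); the rest of this post assumes the Kaggle version.
# alternative: Seaborn's built-in copy of the Titanic data
# column names differ from the Kaggle csv (e.g. 'survived', 'pclass', 'sex')
titanic_sns = sns.load_dataset('titanic')
display(titanic_sns.head())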
# Let's try a more complicated dataset: Titanic
titan = pd.read_csv('./data/titanic/train.csv')
display(titan.head(5))
I mentioned the describe() method above. Let’s have a look. Do note, non-numeric (categorical) features are not included in the result by default. The second view drops any rows containing NaN (null) values, of which there are apparently quite a few.
# Let's try that describe() method
display(titan.describe())
print()
display(titan.dropna().describe())
Looking at the above, we know that in the training set:
- there are 891 passengers
- 38% of those survived the sinking of the Titanic
- their ages ranged from 0.42 to 80
- we are missing data in at least the 'Age' column
Groupby()
Now let’s try our first groupby. We’ll group the dataset on the sex of the passengers (in those days they apparently only had two).
by_sex = titan.groupby("Sex")
display(by_sex.describe())
display(by_sex['Survived'].describe())
display(by_sex[['Survived', 'Fare', 'SibSp', 'Parch']].mean())
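Beyond a single method like mean(), a grouped object can apply several aggregations at once via agg(). A quick sketch; the column and function choices here are just for illustration:
# several aggregations in one shot
display(by_sex[['Survived', 'Fare']].agg(['mean', 'median', 'std']))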
Okay, how about a simple count of the number of people, by sex, who survived the sinking. Since Survived is coded 0/1, summing the column gives the number of survivors (count() would simply tally all the rows in each group).
display(titan.groupby("Sex")["Survived"].sum())
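As an aside, value_counts() on the grouped column shows survivors and non-survivors side by side:
# counts of both outcomes (0 = died, 1 = survived) per sex
display(titan.groupby('Sex')['Survived'].value_counts())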
Now, let’s add the person’s age into our grouping of survival counts.
display(titan.groupby(["Sex", "Age"])["Survived"].sum())
Not so helpful, but we will eventually fix that. First, though, let’s do something about the missing data (where appropriate). Just so you know, I am not currently going to worry about using the cabin information, nor about its missing values. Another day perhaps.
# Let's look at missing data
display(titan.isnull().sum())
Dealing with Missing Data
We’ll start by dealing with the 2 missing embarkation locations. Turns out that with a bit of hunting, we don’t have to guess.
# Ok let's just sort out missing embarked first
display(titan.loc[titan['Embarked'].isnull()])
# searching suitable sources, turns out both embarked at Southampton
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html
titan.loc[titan['Embarked'].isnull(), 'Embarked'] = 'S'
Because of the large number of missing ages, I was not going to try to search the web for answers. Instead, I am going to use some sort of average age. Looking at sex and passenger class, it looks like the median of those groups should work reasonably well.
# Age is a bit harder, but I am going to use the median age of the data grouped on sex and passenger class
print('Median age:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].median())
print('\nCount by group:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].count())
# let's fill the missing values with each group's median age;
# transform() returns a value for every original row, so alignment is preserved
titan['Age'] = titan['Age'].fillna(
    titan.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
print()
display(titan.isnull().sum())
Binning Data
Now, if you remember from above, trying to group on age doesn’t really work, as there are just too many distinct ages. The same will be true when we try to group on fare. So, we will look at creating groups based on suitable ranges for each. And pandas has some functions to help us out.
We’ll start with a quick look at a box-and-whisker plot for the 'Age' data. There are a number of outliers. The picture is similar for fare, though there the range of values is much larger.
# we still need to do some work with Age and Fare
# I propose we create some more meaningful groups to more easily visualize possibilities
titan.boxplot(column=['Age'], figsize=(15,7))
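To back up that claim about fare, here’s the same call against the 'Fare' column:
# the equivalent plot for 'Fare' shows an even wider spread of outliers
titan.boxplot(column=['Fare'], figsize=(15,7))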
For fare, we will split the cases into buckets with a similar number of members using pd.qcut(). This may cause a touch of overlap, but should provide a reasonable binning. For ages, though, it seems more reasonable to use buckets of equal age ranges, since those would likely better reflect the odds of survival than ranges based on the number of members, as we are doing with fares. For age we will use pd.cut().
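If the difference between the two is hazy, here is a minimal sketch on some made-up numbers (the data is invented purely for illustration): cut() splits the value range into equal-width intervals, while qcut() splits it so each bucket gets roughly the same number of members.
# toy data, purely illustrative
toy = pd.Series([1, 2, 3, 4, 5, 90, 95, 100])
print(pd.cut(toy, 2).value_counts())   # equal-width bins: 5 values low, 3 high
print(pd.qcut(toy, 2).value_counts())  # equal-count bins: 4 values in each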
# fair number of outliers, similar story for Fare
# so let's cut our data into blocks so the outliers are less of an issue
print(type(titan['Age'][0]))
# ~! rerunning this cell caused errors,
# so had to put the 'cut's in a suitable conditional (only bin if Age is still numeric)
if isinstance(titan['Age'][0], np.float64):
    titan['Age'] = pd.cut(titan['Age'].astype(int), 5)
    titan['Fare'] = pd.qcut(titan['Fare'], 5)
print(type(titan['Age'][0]))
print("for age, each category has a different number of cases")
print("but each category is of a similar range")
display(titan['Age'].value_counts())
print("for fare, each category has almost the same number of cases")
display(titan['Fare'].value_counts())
Survival Rate for Different Groupings
Now let’s look at the survival rate when grouping our data on different columns.
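Since Survived is coded 0/1, the mean of that column within a group is the group’s survival rate. A minimal sketch of the kind of grouping we’ll be using:
# survival rate = mean of the 0/1 'Survived' column within each group
display(titan.groupby('Sex')['Survived'].mean())
display(titan.groupby(['Sex', 'Pclass'])['Survived'].mean())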