This is likely going to be an inadequate post. I am going to try to cover groupby and pivot in a single post, something that should probably take three or four posts to cover adequately. But there are tons of articles/posts out there covering these topics, and I really just want to get an idea of the basics, not the intricacies. Always time for the latter when I get working on real-life projects.
That said, as near as I can tell, pivot tables are groupby on steroids. Let’s find out. Though I used Excel in a variety of ways personally and at work, I somehow never got into using pivot tables. Didn’t need them, so didn’t spend (waste?) the time.
To be clear, we are talking about re-shaping our data, and about efficient summarization (aggregation) of that data whatever shape it is currently in.
Aggregation? You know: sum(), mean(), median(), min(), max(), std() and the like. And perhaps one especially well worth knowing about: describe().
Simple Aggregation
Let’s have a look at some simple examples, very similar to what we previously covered (NumPy and pandas series and dataframes). We’ll start with series, then look at dataframes.
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns
# aggregation on a series returns a single value
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
display(ser)
print(f"\nseries sum: {ser.sum()}")
print(f"series mean: {ser.mean()}")
print(f"number of items in series: {ser.count()}")
# for a dataframe, by default, you get a value for each column; this can be altered
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
display(df)
print(f"\ndataframe sum:\n{df.sum()}")
print(f"\ndataframe mean:\n{df.mean()}")
print(f"\nnumber of items in columns:\n{df.count()}")
# but we can get it to operate on the rows by specifying an axis
print(f"\ndataframe sum by row:\n{df.sum(axis='columns')}")
print(f"\ndataframe mean by row:\n{df.mean(axis='columns')}")
Nuff said!
A Real Life Dataset
Let’s look at something a touch more realistic. We’ll use one of the more common tutorial datasets: the Titanic dataset. I am going to use the training set from Kaggle — “the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works”. I have downloaded it to a local directory.
There are also other sources should you not wish to sign up with Kaggle. E.g. Seaborn has a built-in version, though with a slightly different set of features.
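If you’d rather skip the Kaggle download, here is a minimal sketch using the Seaborn copy instead. Do note its column names differ from the Kaggle file (all lowercase, e.g. 'survived' and 'pclass'); the rest of this post assumes the Kaggle version.
# alternative: Seaborn's built-in copy of the Titanic data
# column names differ from the Kaggle csv (e.g. 'survived', 'pclass', 'sex')
titanic_sns = sns.load_dataset('titanic')
display(titanic_sns.head())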
# Let's try a more complicated dataset: Titanic
titan = pd.read_csv('./data/titanic/train.csv')
display(titan.head(5))
I mentioned the describe() method above. Let’s have a look. Do note, non-numeric (categorical) features are not included in the result by default. The second view drops any rows containing NaN (null) values, of which there are apparently quite a few.
# Let's try that describe() method
display(titan.describe())
print()
display(titan.dropna().describe())
Looking at the above, we know that in the training set:
- there are 891 passengers
- 38% of those survived the sinking of the Titanic
- their ages ranged from 0.42 to 80
- we are missing data in at least the 'Age' column
Groupby()
Now let’s try our first groupby. We’ll group the dataset on the sex of the passengers (in those days they apparently only had two).
by_sex = titan.groupby("Sex")
display(by_sex.describe())
display(by_sex['Survived'].describe())
display(by_sex[['Survived', 'Fare', 'SibSp', 'Parch']].mean())
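Beyond a single method like mean(), a grouped object can apply several aggregations at once via agg(). A quick sketch; the column and function choices here are just for illustration:
# several aggregations in one shot
display(by_sex[['Survived', 'Fare']].agg(['mean', 'median', 'std']))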
Okay, how about a simple count of the number of people, by sex, who survived the sinking. Since Survived is coded 0/1, summing the column gives the number of survivors (count() would simply tally all the rows in each group).
display(titan.groupby("Sex")["Survived"].sum())
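As an aside, value_counts() on the grouped column shows survivors and non-survivors side by side:
# counts of both outcomes (0 = died, 1 = survived) per sex
display(titan.groupby('Sex')['Survived'].value_counts())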
Now, let’s add the person’s age into our grouping of survival counts.
display(titan.groupby(["Sex", "Age"])["Survived"].sum())
Not so helpful, but we will eventually fix that. First, though, let’s do something about the missing data (where appropriate). Just so you know, I am not currently going to worry about using the cabin information, nor about its missing values. Another day perhaps.
# Let's look at missing data
display(titan.isnull().sum())
Dealing with Missing Data
We’ll start by dealing with the 2 missing embarkation locations. Turns out that with a bit of hunting, we don’t have to guess.
# Ok let's just sort out missing embarked first
display(titan.loc[titan['Embarked'].isnull()])
# searching suitable sources, turns out both embarked at Southampton
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html
titan.loc[titan['Embarked'].isnull(), 'Embarked'] = 'S'
Because of the large number of missing ages, I was not going to try to search the web for answers. Instead, I am going to use some sort of average age. Looking at sex and passenger class, it looks like the median of those groups should work reasonably well.
# Age is a bit harder, but I am going to use the median age of the data grouped on sex and passenger class
print('Median age:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].median())
print('\nCount by group:')
display(titan.groupby(['Pclass', 'Sex'])['Age'].count())
# let's fill the missing values with each group's median age;
# transform() returns a value for every original row, so alignment is preserved
titan['Age'] = titan['Age'].fillna(
    titan.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
print()
display(titan.isnull().sum())
Binning Data
Now, if you remember from above, trying to group on age doesn’t really work, as there are just too many distinct ages. The same will be true when we try to group on fare. So, we will look at creating groups based on suitable ranges for each. And pandas has some functions to help us out.
We’ll start with a quick look at a box-and-whisker plot for the 'Age' data. There are a number of outliers. The picture is similar for fare, though there the range of values is much larger.
# we still need to do some work with Age and Fare
# I propose we create some more meaningful groups to more easily visualize possibilities
titan.boxplot(column=['Age'], figsize=(15,7))
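To back up that claim about fare, here’s the same call against the 'Fare' column:
# the equivalent plot for 'Fare' shows an even wider spread of outliers
titan.boxplot(column=['Fare'], figsize=(15,7))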
For fare, we will split the cases into buckets with a similar number of members using pd.qcut(). This may cause a touch of overlap, but should provide a reasonable binning. For ages, though, it seems more reasonable to use buckets of equal age ranges, since those would likely better reflect the odds of survival than ranges based on the number of members, as we are doing with fares. For age we will use pd.cut().
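If the difference between the two is hazy, here is a minimal sketch on some made-up numbers (the data is invented purely for illustration): cut() splits the value range into equal-width intervals, while qcut() splits it so each bucket gets roughly the same number of members.
# toy data, purely illustrative
toy = pd.Series([1, 2, 3, 4, 5, 90, 95, 100])
print(pd.cut(toy, 2).value_counts())   # equal-width bins: 5 values low, 3 high
print(pd.qcut(toy, 2).value_counts())  # equal-count bins: 4 values in each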
# fair number of outliers, similar story for Fare
# so let's cut our data into blocks so the outliers are less of an issue
print(type(titan['Age'][0]))
# ~! rerunning this cell caused errors,
# so had to put the 'cut's in a suitable conditional (only bin if Age is still numeric)
if isinstance(titan['Age'][0], np.float64):
    titan['Age'] = pd.cut(titan['Age'].astype(int), 5)
    titan['Fare'] = pd.qcut(titan['Fare'], 5)
print(type(titan['Age'][0]))
print("for age, each category has a different number of cases")
print("but each category is of a similar range")
display(titan['Age'].value_counts())
print("for fare, each category has almost the same number of cases")
display(titan['Fare'].value_counts())
Survival Rate for Different Groupings
Now let’s look at the survival rate when grouping our data on different columns.
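Since Survived is coded 0/1, the mean of that column within a group is the group’s survival rate. A minimal sketch of the kind of grouping we’ll be using:
# survival rate = mean of the 0/1 'Survived' column within each group
display(titan.groupby('Sex')['Survived'].mean())
display(titan.groupby(['Sex', 'Pclass'])['Survived'].mean())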