Data Science Basics: pandas — Hierarchical Indexing

Published Mon, Jul 12, 2021 by Rick Kochanski

I am a little embarrassed to say that I have been quite enjoying this last stretch of posts. Nothing really productive and a long way from my stated goal of learning about data science and machine learning. But, it has been fun just learning something about these Python packages. And, we aren’t quite done yet.

Hierarchical Indexing

AKA multi-indexing.

I gather this will have some value in future. Time will tell, but it is sort of interesting. And, believe it or not, we had a brief glimpse of multi-indexing in the last post. You may recall we used .stack() on one of the dice roll DataFrames to convert the DataFrame into a Series. We were then able to use .median() to get the median of all the dice rolls in that DataFrame. The resulting Series looked like this:

0 green 3 red 6 blue 1 1 green 5 red 2 blue 1 2 green 4 red 6 blue 5 dtype: int64

You will note that each row has two labels. Essentially we are getting mutli-dimensional data into a more familiar one or two dimensional array. (Note: pandas does provide Panel for a 3D data structure. But using hierarchical indexing is apparently a more common approach in real world practice.)

In the above case we in fact have a two dimensional DataFrame in a Series via multi-indexing. The first label is the original row index and the second is the original column index. Quoting from the pandas documentation.

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

MultiIndex / advanced indexing

Let’s use that documents approach to create a multi-indexed Series from some raw data. We will use some population data from our old CSV. This could be a little muddled and a bit of work.

Multi-Indexed Series

In [1]:

import numpy as np
import pandas as pd

In [2]:

# let's start with some population data
pop_src = {
    'Australia': {'2015': 23932.498999999993, '2016': 24262.71, '2017': 24584.619, '2018': 24898.153000000002, '2019': 25203.199999999997, '2020': 25499.881}, 
    'Brazil': {'2015': 204471.759, '2016': 206163.05599999995, '2017': 207833.825, '2018': 209469.31999999995, '2019': 211049.51900000003, '2020': 212559.40899999999}, 
    'China': {'2015': 1406847.8679999998, '2016': 1414049.353, '2017': 1421021.7940000002, '2018': 1427647.7889999999, '2019': 1433783.6919999998, '2020': 1439323.7739999997}, 
    'Japan': {'2015': 127985.13900000001, '2016': 127763.267, '2017': 127502.72799999999, '2018': 127202.19, '2019': 126860.29899999998, '2020': 126476.458}, 
    'Russia': {'2015': 144985.059, '2016': 145275.37400000004, '2017': 145530.09099999996, '2018': 145734.03399999999, '2019': 145872.26, '2020': 145934.46}, 
    'USA': {'2015': 320878.3120000001, '2016': 323015.99199999997, '2017': 325084.75800000003, '2018': 327096.263, '2019': 329064.917, '2020': 331002.64699999994}
}
# we need to create our multi-index, using the product approach
iterables = [list(pop_src.keys()), list(pop_src['Australia'].keys())]
print(iterables, '\n')
m_ndx = pd.MultiIndex.from_product(iterables, names=["country", "year"])
print(m_ndx[:13], '\n')
pops = []
for cc in pop_src.keys():
    for pop in pop_src[cc].values():
       pops.append(pop)
#print(pops, '\n')
# and finally
s1 = pd.Series(pops, index=m_ndx)
print(s1)

[['Australia', 'Brazil', 'China', 'Japan', 'Russia', 'USA'], ['2015', '2016', '2017', '2018', '2019', '2020']] 
MultiIndex([('Australia', '2015'),
('Australia', '2016'),
('Australia', '2017'),
('Australia', '2018'),
('Australia', '2019'),
('Australia', '2020'),
(   'Brazil', '2015'),
(   'Brazil', '2016'),
(   'Brazil', '2017'),
(   'Brazil', '2018'),
(   'Brazil', '2019'),
(   'Brazil', '2020'),
(    'China', '2015')],
names=['country', 'year'])
country    year
Australia  2015      23932.499
2016      24262.710
2017      24584.619
2018      24898.153
2019      25203.200
2020      25499.881
Brazil     2015     204471.759
2016     206163.056
2017     207833.825
2018     209469.320
2019     211049.519
2020     212559.409
China      2015    1406847.868
2016    1414049.353
2017    1421021.794
2018    1427647.789
2019    1433783.692
2020    1439323.774
Japan      2015     127985.139
2016     127763.267
2017     127502.728
2018     127202.190
2019     126860.299
2020     126476.458
Russia     2015     144985.059
2016     145275.374
2017     145530.091
2018     145734.034
2019     145872.260
2020     145934.460
USA        2015     320878.312
2016     323015.992
2017     325084.758
2018     327096.263
2019     329064.917
2020     331002.647
dtype: float64

Damn, rubber ring on my mouse’s scroll wheel just bit the dust. Wheel still seems to work. But given how many of the labels have worn off my keyboard and this latest incident, probably time for a new keyboard and mouse. They have been sitting in the closet for the best part of a year and a half.

And, yes we could just have created a DataFrame, but what’s the fun in that. And, we are currently only looking at concepts and techniques in a general way. We may eventually use them in a real world endeavour.

In [29]:

# now let's get the pop for each country for 2020
print(s1[:, '2020'], '\n')
# all countries for the years 2015-2017
print(s1.loc[:, '2015':'2017'], '\n')
# just Brazil, China and Japan for 2015-2017
print(s1.loc['Brazil':'Japan', '2015':'2017'], '\n')

country Australia 25499.881 Brazil 212559.409 China 1439323.774 Japan 126476.458 Russia 145934.460 USA 331002.647 dtype: float64

country year Australia 2015 23932.499 2016 24262.710 2017 24584.619 Brazil 2015 204471.759 2016 206163.056 2017 207833.825 China 2015 1406847.868 2016 1414049.353 2017 1421021.794 Japan 2015 127985.139 2016 127763.267 2017 127502.728 Russia 2015 144985.059 2016 145275.374 2017 145530.091 USA 2015 320878.312 2016 323015.992 2017 325084.758 dtype: float64

country year Brazil 2015 204471.759 2016 206163.056 2017 207833.825 China 2015 1406847.868 2016 1414049.353 2017 1421021.794 Japan 2015 127985.139 2016 127763.267 2017 127502.728 dtype: float64

.stack() & .unstack()

Having said we could have simply used a DataFrame for the above data, it should be said that pandas keeps this equivalence in mind. The .unstack() method will turn our multi-indexed Series into a conventional DataFrame.

In [4]:

# before looking at a mutli-indexed DataFrame, let's look at pandas built in conversion methods
# multi-indexed Series to standard DataFrame
df_pop = s1.unstack()
display(df_pop)
print()
# and back to a multi-indexed Series (we've saw this in the previous notebook)
s2 = df_pop.stack()
display(s2)

year	2015	2016	2017	2018	2019	2020
country
Australia	23932.499	24262.710	24584.619	24898.153	25203.200	25499.881
Brazil	204471.759	206163.056	207833.825	209469.320	211049.519	212559.409
China	1406847.868	1414049.353	1421021.794	1427647.789	1433783.692	1439323.774
Japan	127985.139	127763.267	127502.728	127202.190	126860.299	126476.458
Russia	144985.059	145275.374	145530.091	145734.034	145872.260	145934.460
USA	320878.312	323015.992	325084.758	327096.263	329064.917	331002.647

country    year
Australia  2015      23932.499
           2016      24262.710
           2017      24584.619
           2018      24898.153
           2019      25203.200
           2020      25499.881
Brazil     2015     204471.759
           2016     206163.056
           2017     207833.825
           2018     209469.320
           2019     211049.519
           2020     212559.409
China      2015    1406847.868
           2016    1414049.353
           2017    1421021.794
           2018    1427647.789
           2019    1433783.692
           2020    1439323.774
Japan      2015     127985.139
           2016     127763.267
           2017     127502.728
           2018     127202.190
           2019     126860.299
           2020     126476.458
Russia     2015     144985.059
           2016     145275.374
           2017     145530.091
           2018     145734.034
           2019     145872.260
           2020     145934.460
USA        2015     320878.312
           2016     323015.992
           2017     325084.758
           2018     327096.263
           2019     329064.917
           2020     331002.647
dtype: float64

Multi-Indexed DataFrames

Let’s give a multi-index DataFrame a shot. I am going to have columns for years, but for each country I will have two rows: one for total population and one for population 60+. For convenience’s sake, you can pass a list of arrays into Series() or DataFrame() to build a MultiIndex automatically. But, that approach looks a bit awkward to me. So will stick with the previous method.

In [5]:

# let's give a multi-indexed DataFrame a shot
# population 60+ for our selection of countries and years
p_ge60 = {'Australia': {'2015': 4831.658, '2016': 4974.6140000000005, '2017': 5116.552, '2018': 5258.649000000001, '2019': 5403.465999999999, '2020': 5552.834}, 'Brazil': {'2015': 24442.259, '2016': 25446.526, '2017': 26483.384, '2018': 27558.567, '2019': 28681.155, '2020': 29857.440000000006}, 'China': {'2015': 211366.01499999998, '2016': 219452.97199999995, '2017': 226841.16900000002, '2018': 233956.86699999994, '2019': 241461.77400000003, '2020': 249776.313}, 'Japan': {'2015': 41976.661, '2016': 42336.231999999996, '2017': 42638.05700000001, '2018': 42900.87200000001, '2019': 43151.279, 
'2020': 43411.986999999994}, 'Russia': {'2015': 29022.146999999997, '2016': 29807.817, '2017': 30574.95, '2018': 31317.27, '2019': 32029.122000000003, '2020': 32706.492999999995}, 'USA': {'2015': 65920.97600000001, '2016': 67896.672, '2017': 69861.137, '2018': 
71817.44799999999, '2019': 73769.74099999998, '2020': 75717.94900000001}}
iterables_2 = [list(pop_src.keys()), ['total', '60+']]
cols = list(pop_src['Australia'].keys())
print(iterables_2, '\n')
m_ndx = pd.MultiIndex.from_product(iterables_2, names=["country", "group"])
print(m_ndx[:13], '\n')
pops_2 = []
for cc in iterables[0]:
    tmp_pops = []
    for cy in cols:
       tmp_pops.append(pop_src[cc][cy])
    pops_2.append(tmp_pops)
    tmp_pops = []
    for cy in cols:
       tmp_pops.append(p_ge60[cc][cy])
    pops_2.append(tmp_pops)
#print(pops, '\n')
df_pop = pd.DataFrame(pops_2, index=m_ndx, columns=cols)
display(df_pop)

[['Australia', 'Brazil', 'China', 'Japan', 'Russia', 'USA'], ['total', '60+']] 
MultiIndex([('Australia', 'total'),
('Australia',   '60+'),
(   'Brazil', 'total'),
(   'Brazil',   '60+'),
(    'China', 'total'),
(    'China',   '60+'),
(    'Japan', 'total'),
(    'Japan',   '60+'),
(   'Russia', 'total'),
(   'Russia',   '60+'),
(      'USA', 'total'),
(      'USA',   '60+')],
names=['country', 'group'])

		2015	2016	2017	2018	2019	2020
country	group
Australia	total	23932.499	24262.710	24584.619	24898.153	25203.200	25499.881
Australia	60+	4831.658	4974.614	5116.552	5258.649	5403.466	5552.834
Brazil	total	204471.759	206163.056	207833.825	209469.320	211049.519	212559.409
Brazil	60+	24442.259	25446.526	26483.384	27558.567	28681.155	29857.440
China	total	1406847.868	1414049.353	1421021.794	1427647.789	1433783.692	1439323.774
China	60+	211366.015	219452.972	226841.169	233956.867	241461.774	249776.313
Japan	total	127985.139	127763.267	127502.728	127202.190	126860.299	126476.458
Japan	60+	41976.661	42336.232	42638.057	42900.872	43151.279	43411.987
Russia	total	144985.059	145275.374	145530.091	145734.034	145872.260	145934.460
Russia	60+	29022.147	29807.817	30574.950	31317.270	32029.122	32706.493
USA	total	320878.312	323015.992	325084.758	327096.263	329064.917	331002.647
USA	60+	65920.976	67896.672	69861.137	71817.448	73769.741	75717.949

And, we can build DataFrames incorporating hierarchical indexes on both the rows and columns. Let’s give that a try.

To be fair, collecting the data for this DataFrame was the hardest part of the whole process. I will really need to come up with a better way to do this in future. And, do keep in mind the data must be of the correct shape for the resulting DataFrame (in this case 4 x 6). Not all four of these players had results for the 2021 Waste Management Phoenix Open, so I made up the differnces.

In [6]:

index = pd.MultiIndex.from_product([[2020, 2021], ['pga', 'wmpo']],
                                   names=['year', 'tourney'])
columns = pd.MultiIndex.from_product([['Finau', 'Morikowa', 'Rahm', 'Woodland'], ['Avg Drive', '% Greens']],
                                     names=['player', 'stat'])
print(index)
print()
print(columns)
# hardest part was getting a sample of stats
# pga = PGA Championship, wmpo = Waste Management Phoenix Open
stats = [[305.3, 72.22, 290.5, 70.83, 301.9, 68.06, 293.0, 65.28],
         [301.8, 61.11, 302.3, 68.06, 315.3, 59.72, 311.1, 58.33],
         [315.1, 76.39, 308.1, 73.61, 313.5, 77.78, 323.8, 70.83],
         [311.8, 62.11, 312.3, 67.06, 314.1, 77.78, 321.1, 68.33]]
np_stats = np.array(stats)
golf = pd.DataFrame(np_stats, index=index, columns=columns)
display(golf)

MultiIndex([(2020,  'pga'),
            (2020, 'wmpo'),
            (2021,  'pga'),
            (2021, 'wmpo')],
           names=['year', 'tourney'])
MultiIndex([(   'Finau', 'Avg Drive'),
(   'Finau',  '% Greens'),
('Morikowa', 'Avg Drive'),
('Morikowa',  '% Greens'),
(    'Rahm', 'Avg Drive'),
(    'Rahm',  '% Greens'),
('Woodland', 'Avg Drive'),
('Woodland',  '% Greens')],
names=['player', 'stat'])

	player	Finau		Morikowa		Rahm		Woodland
	stat	Avg Drive	% Greens	Avg Drive	% Greens	Avg Drive	% Greens	Avg Drive	% Greens
year	tourney
2020	pga	305.3	72.22	290.5	70.83	301.9	68.06	293.0	65.28
2020	wmpo	301.8	61.11	302.3	68.06	315.3	59.72	311.1	58.33
2021	pga	315.1	76.39	308.1	73.61	313.5	77.78	323.8	70.83
2021	wmpo	311.8	62.11	312.3	67.06	314.1	77.78	321.1	68.33

And, a quick look at some indexing/slicing of the multi-mutli-indexed DataFrame.

In [7]:

# let's access some bits and pieces
print("Finau's Stats ->")
display(golf['Finau'])
print("\nFinau's and Rahm's Stats ->")
display(golf[['Finau', 'Rahm']])
print("\nFinau's and Rahm's Avg Drive Only ->")
display(golf[['Finau', 'Rahm']].iloc[:, ::2])
print("\nStats for 2021 PGA in Series format ->")
display(golf.iloc[2, :])
print("\nStats for 2021 PGA in DataFrame format ->")
display(golf.iloc[2, :].unstack())

Finau's Stats ->

	stat	Avg Drive	% Greens
year	tourney
2020	pga	305.3	72.22
2020	wmpo	301.8	61.11
2021	pga	315.1	76.39
2021	wmpo	311.8	62.11

Finau's and Rahm's Stats ->

	player	Finau		Rahm
	stat	Avg Drive	% Greens	Avg Drive	% Greens
year	tourney
2020	pga	305.3	72.22	301.9	68.06
2020	wmpo	301.8	61.11	315.3	59.72
2021	pga	315.1	76.39	313.5	77.78
2021	wmpo	311.8	62.11	314.1	77.78

Finau's and Rahm's Avg Drive Only ->

	player	Finau	Rahm
	stat	Avg Drive	Avg Drive
year	tourney
2020	pga	305.3	301.9
2020	wmpo	301.8	315.3
2021	pga	315.1	313.5
2021	wmpo	311.8	314.1

Stats for 2021 PGA in Series format ->

player    stat     
Finau     Avg Drive    315.10
          % Greens      76.39
Morikowa  Avg Drive    308.10
          % Greens      73.61
Rahm      Avg Drive    313.50
          % Greens      77.78
Woodland  Avg Drive    323.80
          % Greens      70.83
Name: (2021, pga), dtype: float64

Stats for 2021 PGA in DataFrame format ->

stat	% Greens	Avg Drive
player
Finau	76.39	315.1
Morikowa	73.61	308.1
Rahm	77.78	313.5
Woodland	70.83	323.8

And, one final approach. Let’s say we want one tournament, all years, all players, with only the % Greens stat. Give it a try with the previous methods. Then have a look at the following.

In [8]:

# for those complex ones pandas provides the IndexSlice object
idx = pd.IndexSlice
display(golf.loc[idx[:, 'wmpo'], idx[:, '% Greens']])

	player	Finau	Morikowa	Rahm	Woodland
	stat	% Greens	% Greens	% Greens	% Greens
year	tourney
2020	wmpo	61.11	68.06	59.72	58.33
2021	wmpo	62.11	67.06	77.78	68.33

Done!

There were other things I had thought I’d get to. But this is getting to be rather long — time to call it a day. Feel free to download my notebook covering the above and play around.

Too Old To Code