I am a little embarrassed to say that I have been quite enjoying this last stretch of posts. Nothing really productive and a long way from my stated goal of learning about data science and machine learning. But, it has been fun just learning something about these Python packages. And, we aren’t quite done yet.
Hierarchical Indexing
AKA multi-indexing.
I gather this will have some value in future. Time will tell, but it is sort of interesting. And, believe it or not, we had a brief glimpse of multi-indexing in the last post. You may recall we used .stack()
on one of the dice roll DataFrames to convert the DataFrame into a Series. We were then able to use .median()
to get the median of all the dice rolls in that DataFrame. The resulting Series looked like this:
0 green 3
red 6
blue 1
1 green 5
red 2
blue 1
2 green 4
red 6
blue 5
dtype: int64
You will note that each row has two labels. Essentially we are getting mutli-dimensional data into a more familiar one or two dimensional array. (Note: pandas does provide Panel
for a 3D data structure. But using hierarchical indexing is apparently a more common approach in real world practice.)
In the above case we in fact have a two dimensional DataFrame in a Series via multi-indexing. The first label is the original row index and the second is the original column index. Quoting from the pandas documentation.
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
MultiIndex / advanced indexing
Let’s use that documents approach to create a multi-indexed Series from some raw data. We will use some population data from our old CSV. This could be a little muddled and a bit of work.
Multi-Indexed Series
[['Australia', 'Brazil', 'China', 'Japan', 'Russia', 'USA'], ['2015', '2016', '2017', '2018', '2019', '2020']]
MultiIndex([('Australia', '2015'),
('Australia', '2016'),
('Australia', '2017'),
('Australia', '2018'),
('Australia', '2019'),
('Australia', '2020'),
( 'Brazil', '2015'),
( 'Brazil', '2016'),
( 'Brazil', '2017'),
( 'Brazil', '2018'),
( 'Brazil', '2019'),
( 'Brazil', '2020'),
( 'China', '2015')],
names=['country', 'year'])
country year
Australia 2015 23932.499
2016 24262.710
2017 24584.619
2018 24898.153
2019 25203.200
2020 25499.881
Brazil 2015 204471.759
2016 206163.056
2017 207833.825
2018 209469.320
2019 211049.519
2020 212559.409
China 2015 1406847.868
2016 1414049.353
2017 1421021.794
2018 1427647.789
2019 1433783.692
2020 1439323.774
Japan 2015 127985.139
2016 127763.267
2017 127502.728
2018 127202.190
2019 126860.299
2020 126476.458
Russia 2015 144985.059
2016 145275.374
2017 145530.091
2018 145734.034
2019 145872.260
2020 145934.460
USA 2015 320878.312
2016 323015.992
2017 325084.758
2018 327096.263
2019 329064.917
2020 331002.647
dtype: float64
Damn, rubber ring on my mouse’s scroll wheel just bit the dust. Wheel still seems to work. But given how many of the labels have worn off my keyboard and this latest incident, probably time for a new keyboard and mouse. They have been sitting in the closet for the best part of a year and a half.
And, yes we could just have created a DataFrame, but what’s the fun in that. And, we are currently only looking at concepts and techniques in a general way. We may eventually use them in a real world endeavour.
country
Australia 25499.881
Brazil 212559.409
China 1439323.774
Japan 126476.458
Russia 145934.460
USA 331002.647
dtype: float64
country year
Australia 2015 23932.499
2016 24262.710
2017 24584.619
Brazil 2015 204471.759
2016 206163.056
2017 207833.825
China 2015 1406847.868
2016 1414049.353
2017 1421021.794
Japan 2015 127985.139
2016 127763.267
2017 127502.728
Russia 2015 144985.059
2016 145275.374
2017 145530.091
USA 2015 320878.312
2016 323015.992
2017 325084.758
dtype: float64
country year
Brazil 2015 204471.759
2016 206163.056
2017 207833.825
China 2015 1406847.868
2016 1414049.353
2017 1421021.794
Japan 2015 127985.139
2016 127763.267
2017 127502.728
dtype: float64
.stack() & .unstack()
Having said we could have simply used a DataFrame for the above data, it should be said that pandas keeps this equivalence in mind. The .unstack()
method will turn our multi-indexed Series into a conventional DataFrame.
year | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
---|
country | | | | | | |
---|
Australia | 23932.499 | 24262.710 | 24584.619 | 24898.153 | 25203.200 | 25499.881 |
---|
Brazil | 204471.759 | 206163.056 | 207833.825 | 209469.320 | 211049.519 | 212559.409 |
---|
China | 1406847.868 | 1414049.353 | 1421021.794 | 1427647.789 | 1433783.692 | 1439323.774 |
---|
Japan | 127985.139 | 127763.267 | 127502.728 | 127202.190 | 126860.299 | 126476.458 |
---|
Russia | 144985.059 | 145275.374 | 145530.091 | 145734.034 | 145872.260 | 145934.460 |
---|
USA | 320878.312 | 323015.992 | 325084.758 | 327096.263 | 329064.917 | 331002.647 |
---|
country year
Australia 2015 23932.499
2016 24262.710
2017 24584.619
2018 24898.153
2019 25203.200
2020 25499.881
Brazil 2015 204471.759
2016 206163.056
2017 207833.825
2018 209469.320
2019 211049.519
2020 212559.409
China 2015 1406847.868
2016 1414049.353
2017 1421021.794
2018 1427647.789
2019 1433783.692
2020 1439323.774
Japan 2015 127985.139
2016 127763.267
2017 127502.728
2018 127202.190
2019 126860.299
2020 126476.458
Russia 2015 144985.059
2016 145275.374
2017 145530.091
2018 145734.034
2019 145872.260
2020 145934.460
USA 2015 320878.312
2016 323015.992
2017 325084.758
2018 327096.263
2019 329064.917
2020 331002.647
dtype: float64
Multi-Indexed DataFrames
Let’s give a multi-index DataFrame a shot. I am going to have columns for years, but for each country I will have two rows: one for total population and one for population 60+. For convenience’s sake, you can pass a list of arrays into Series()
or DataFrame()
to build a MultiIndex automatically. But, that approach looks a bit awkward to me. So will stick with the previous method.
[['Australia', 'Brazil', 'China', 'Japan', 'Russia', 'USA'], ['total', '60+']]
MultiIndex([('Australia', 'total'),
('Australia', '60+'),
( 'Brazil', 'total'),
( 'Brazil', '60+'),
( 'China', 'total'),
( 'China', '60+'),
( 'Japan', 'total'),
( 'Japan', '60+'),
( 'Russia', 'total'),
( 'Russia', '60+'),
( 'USA', 'total'),
( 'USA', '60+')],
names=['country', 'group'])
| | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
---|
country | group | | | | | | |
---|
Australia | total | 23932.499 | 24262.710 | 24584.619 | 24898.153 | 25203.200 | 25499.881 |
---|
60+ | 4831.658 | 4974.614 | 5116.552 | 5258.649 | 5403.466 | 5552.834 |
---|
Brazil | total | 204471.759 | 206163.056 | 207833.825 | 209469.320 | 211049.519 | 212559.409 |
---|
60+ | 24442.259 | 25446.526 | 26483.384 | 27558.567 | 28681.155 | 29857.440 |
---|
China | total | 1406847.868 | 1414049.353 | 1421021.794 | 1427647.789 | 1433783.692 | 1439323.774 |
---|
60+ | 211366.015 | 219452.972 | 226841.169 | 233956.867 | 241461.774 | 249776.313 |
---|
Japan | total | 127985.139 | 127763.267 | 127502.728 | 127202.190 | 126860.299 | 126476.458 |
---|
60+ | 41976.661 | 42336.232 | 42638.057 | 42900.872 | 43151.279 | 43411.987 |
---|
Russia | total | 144985.059 | 145275.374 | 145530.091 | 145734.034 | 145872.260 | 145934.460 |
---|
60+ | 29022.147 | 29807.817 | 30574.950 | 31317.270 | 32029.122 | 32706.493 |
---|
USA | total | 320878.312 | 323015.992 | 325084.758 | 327096.263 | 329064.917 | 331002.647 |
---|
60+ | 65920.976 | 67896.672 | 69861.137 | 71817.448 | 73769.741 | 75717.949 |
---|
And, we can build DataFrames incorporating hierarchical indexes on both the rows and columns. Let’s give that a try.
To be fair, collecting the data for this DataFrame was the hardest part of the whole process. I will really need to come up with a better way to do this in future. And, do keep in mind the data must be of the correct shape for the resulting DataFrame (in this case 4 x 6). Not all four of these players had results for the 2021 Waste Management Phoenix Open, so I made up the differnces.
MultiIndex([(2020, 'pga'),
(2020, 'wmpo'),
(2021, 'pga'),
(2021, 'wmpo')],
names=['year', 'tourney'])
MultiIndex([( 'Finau', 'Avg Drive'),
( 'Finau', '% Greens'),
('Morikowa', 'Avg Drive'),
('Morikowa', '% Greens'),
( 'Rahm', 'Avg Drive'),
( 'Rahm', '% Greens'),
('Woodland', 'Avg Drive'),
('Woodland', '% Greens')],
names=['player', 'stat'])
| player | Finau | Morikowa | Rahm | Woodland |
---|
| stat | Avg Drive | % Greens | Avg Drive | % Greens | Avg Drive | % Greens | Avg Drive | % Greens |
---|
year | tourney | | | | | | | | |
---|
2020 | pga | 305.3 | 72.22 | 290.5 | 70.83 | 301.9 | 68.06 | 293.0 | 65.28 |
---|
wmpo | 301.8 | 61.11 | 302.3 | 68.06 | 315.3 | 59.72 | 311.1 | 58.33 |
---|
2021 | pga | 315.1 | 76.39 | 308.1 | 73.61 | 313.5 | 77.78 | 323.8 | 70.83 |
---|
wmpo | 311.8 | 62.11 | 312.3 | 67.06 | 314.1 | 77.78 | 321.1 | 68.33 |
---|
And, a quick look at some indexing/slicing of the multi-mutli-indexed DataFrame.
| stat | Avg Drive | % Greens |
---|
year | tourney | | |
---|
2020 | pga | 305.3 | 72.22 |
---|
wmpo | 301.8 | 61.11 |
---|
2021 | pga | 315.1 | 76.39 |
---|
wmpo | 311.8 | 62.11 |
---|
Finau's and Rahm's Stats ->
| player | Finau | Rahm |
---|
| stat | Avg Drive | % Greens | Avg Drive | % Greens |
---|
year | tourney | | | | |
---|
2020 | pga | 305.3 | 72.22 | 301.9 | 68.06 |
---|
wmpo | 301.8 | 61.11 | 315.3 | 59.72 |
---|
2021 | pga | 315.1 | 76.39 | 313.5 | 77.78 |
---|
wmpo | 311.8 | 62.11 | 314.1 | 77.78 |
---|
Finau's and Rahm's Avg Drive Only ->
| player | Finau | Rahm |
---|
| stat | Avg Drive | Avg Drive |
---|
year | tourney | | |
---|
2020 | pga | 305.3 | 301.9 |
---|
wmpo | 301.8 | 315.3 |
---|
2021 | pga | 315.1 | 313.5 |
---|
wmpo | 311.8 | 314.1 |
---|
Stats for 2021 PGA in Series format ->
player stat
Finau Avg Drive 315.10
% Greens 76.39
Morikowa Avg Drive 308.10
% Greens 73.61
Rahm Avg Drive 313.50
% Greens 77.78
Woodland Avg Drive 323.80
% Greens 70.83
Name: (2021, pga), dtype: float64
Stats for 2021 PGA in DataFrame format ->
stat | % Greens | Avg Drive |
---|
player | | |
---|
Finau | 76.39 | 315.1 |
---|
Morikowa | 73.61 | 308.1 |
---|
Rahm | 77.78 | 313.5 |
---|
Woodland | 70.83 | 323.8 |
---|
And, one final approach. Let’s say we want one tournament, all years, all players, with only the % Greens stat. Give it a try with the previous methods. Then have a look at the following.
| player | Finau | Morikowa | Rahm | Woodland |
---|
| stat | % Greens | % Greens | % Greens | % Greens |
---|
year | tourney | | | | |
---|
2020 | wmpo | 61.11 | 68.06 | 59.72 | 58.33 |
---|
2021 | wmpo | 62.11 | 67.06 | 77.78 | 68.33 |
---|
Done!
There were other things I had thought I’d get to. But this is getting to be rather long — time to call it a day. Feel free to download my notebook covering the above and play around.
Resources