As I mentioned in the last post:
Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.
Data Science Basics: pandas — Getting Golf Stats from Web Site
But, here we go. Starting new notebook. Adding the necessary imports (at least that I know of so far). I will also bring over some of the variables from the previous notebook. Then I am going to load two CSV files into DataFrames or Series and see if I can combine them into one DataFrame somehow.
Basic Notebook Setup
Okay, let’s get the basic setup in place.
import numpy as np
import pandas as pd
import re
events = {'wmpo': 't003', 'tpc': 'to11', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
}
Read in Data and Attempt to Convert to Desired Shape
I started by defining a variable for the data directory and a list of the CSV files I was going to process. I then read the first two files (PGA Championships, 2020, drive and gir stats). Printing some info and a few rows of each to see what we have.
d_dir = "./data/"
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df1 = pd.read_csv(f"{d_dir}{csvs[0]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df2 = pd.read_csv(f"{d_dir}{csvs[1]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
print(f"df1: {len(df1)}, df2: {len(df2)}\n")
print(df1.head())
print()
print(df2.head())
print()
print(type(df2))
Next I tried concatenating the two DataFrames. Once again printing stuff out to see what was produced.
combined = pd.concat([df1, df2])
print(combined.head())
print()
print(combined.tail())
print()
print(len(combined))
pd.notna(combined["AVG."])
Definitely not what was wanted. A little looking about led me to the axis=
parameter of the concat()
method. So gave that a try.
combined = pd.concat([df1, df2], axis=1)
print(combined.head())
print()
print(combined.tail())
Now that looks a bit better. But I would like to use the player name and the stat type as a multi-index. Then use the year and tournament as a multi-index on the columns.
Add Multi-index
This took a bit of research and trial & error. Not to mention that my solution seems pretty convoluted. But it also seems to work. I converted each DataFrame (read from CSV) into a Series. Then back to a DataFrame specifying a column mutli-index. Then I concatenated the new DataFrames. Then a quick index sort to get things looking about right. Also tried getting stats for single player, DeChambeau.
s1 = df1.stack()
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a ->')
display(df1a)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
print('\ndf_tst.head(10) ->')
display(df_tst.head(10))
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst, after sorting index column ->')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->")
print(df_tst.loc['Bryson DeChambeau', :])
Relabel Indices
That’s looking better. But really need to do something about the stat type labels. And maybe get rid of the all caps for the index label.
df1.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df1.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
print('\ndf1 with columns renamed ->\n')
display(df1.head())
#cols = pd.MultiIndex.from_tuples([("lang",), ("2018","height"), ("2018","weight"), ("2019","height"), ("2019","weight")])
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
s1 = df1.stack()
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a (multi-indexed) ->\n')
display(df1a)
df2.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df2.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst (concat of df1a / df2a, sorted?) ->\n')
display(df_tst)
Add Index Label
We don’t seem to have a label for the stat type column in the multi-index. So let’s fix that.
# Add label for stat column of multi-index
df_tst.rename_axis(['player', 'stat'], inplace=True)
print('\ndf_tst (added index column name) ->\n')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->\n")
display(df_tst.loc['Bryson DeChambeau', :])
Add Next Tournament for 2020
Okay, now let’s do that all over for the 2nd tournament of 2020, RBC Heritage. Then combine the two tournaments in one DataFrame and see what we get. Given the change in the order the CSVs will be processed, I also reworked the appropriate list.
# change order of csv files to reflect order I want in dataframe
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df3 = pd.read_csv(f"{d_dir}{csvs[2]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df4 = pd.read_csv(f"{d_dir}{csvs[3]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
cols1 = pd.MultiIndex.from_tuples([('2020', 'rbch')])
df3.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df3.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s3 = df3.stack()
df3a = pd.DataFrame(s3, columns=cols1)
df4.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df4.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s4 = df4.stack()
df4a = pd.DataFrame(s4, columns=cols1)
df_tst2 = pd.concat([df3a, df4a])
df_tst2.rename_axis(['player', 'stat'], inplace=True)
ndx_sort2 = sorted(df_tst2.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_tst2 = df_tst2.reindex(ndx_sort2)
display(df_tst2)
# now lets concatenate our two stats dfs
df_comb = pd.concat([df_tst, df_tst2])
print('\ndf_comb.head(10) ->\n')
display(df_comb.head(10))
print(f"\nlen(df_comb): {len(df_comb)}")
print('\ndf_comb.dropna() ->\n')
print(df_comb.dropna())
Looks like no one playing in the 2020 RBC Heritage also played in the 2020 PGA Champsionships. Guess I didn’t make the best choice of tournaments to use. But on closer examination things are not what they seem!
Not Quite Right
print("\ndf_comb.dropna(subset=[('2020', 'pga')]) ->\n")
display(df_comb.dropna(how='any', subset=[('2020', 'pga')]))
print("\ndf_comb.dropna(subset=[('2020', 'rbch')]) ->\n")
cols= [(c0, c1) for (c0, c1) in df_comb.columns if c1 in ['rbch']]
display(df_comb.dropna(how='any', subset=cols))
# what's going on
print("\ndf_comb.loc['Abraham Ancer'] ->\n")
print(df_comb.loc['Abraham Ancer'])
Definitely not what we want.
concat(), join() or merge()
Guess I should have looked in tutorials on combining datasets before I undertook this experiment. After some more messing around, I managed to get pandas.merge()
to produce something like what I wanted. At least I think so.
df_comb = pd.merge(df_tst, df_tst2, how='outer', on=['player', 'stat'])
ndx_sort3 = sorted(df_comb.index, key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort3)
display(df_comb)
display(df_comb.dropna())
Timeout Needed
I have spent a number of hours on this today. So, I think I will call it quits on this post.
I will regroup, start another notebook/post to try and finish this with all four tournament stats in one DataFrame. May look at using some functions to reduce the code I am generating, etc.
If you wish to play with the above, feel free to download my notebook covering the contents of this post.
Resources
- pandas: Rename columns / index names (labels) of DataFrame *
- Pandas Index Explained
- pandas.DataFrame.rename_axis
- pandas.DataFrame.reindex
- Python Pandas: How to set the name of multiindex? *
- pandas sort rows by index using a custom function *
- re — Regular expression operations *
- Pandas Sort: Your Guide to Sorting Data in Python
- Filter out rows with missing data (NaN, None, NaT)
- Pandas, how to dropna values using subset with multiindex dataframe?
- pandas.DataFrame.dropna
- Merge, join, and concatenate *
I had a look at these while troubleshooting my problems/bugs, but not sure how useful they may be for everyone else.
- Multi-Index Sorting in Pandas
- Sort pandas DataFrame with MultiIndex according to column value
- Multi Index Sorting in Pandas
- How to import CSV file with multi-level columns (Python Basics)