As I mentioned in the last post:

Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.

Data Science Basics: pandas — Getting Golf Stats from Web Site

But, here we go. Starting new notebook. Adding the necessary imports (at least that I know of so far). I will also bring over some of the variables from the previous notebook. Then I am going to load two CSV files into DataFrames or Series and see if I can combine them into one DataFrame somehow.

Basic Notebook Setup

Okay, let’s get the basic setup in place.

In [1]:
import numpy as np
import pandas as pd
import re
In [2]:
events = {'wmpo': 't003', 'tpc': 'to11', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
}

Read in Data and Attempt to Convert to Desired Shape

I started by defining a variable for the data directory and a list of the CSV files I was going to process. I then read the first two files (PGA Championships, 2020, drive and gir stats). Printing some info and a few rows of each to see what we have.

In [3]:
d_dir = "./data/"
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
        'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
        'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
        'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df1 = pd.read_csv(f"{d_dir}{csvs[0]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df2 = pd.read_csv(f"{d_dir}{csvs[1]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
print(f"df1: {len(df1)}, df2: {len(df2)}\n")
print(df1.head())
print()
print(df2.head())
print()
print(type(df2))
df1: 79, df2: 79
                AVG.

PLAYER NAME
Cameron Champ 321.1 Bryson DeChambeau 318.1 Rory McIlroy 312.5 Sepp Straka 305.8 Tommy Fleetwood 305.5

                  %

PLAYER NAME
Matthew Wolff 77.78 Paul Casey 76.39 Jason Day 76.39 Louis Oosthuizen 73.61 Cameron Champ 73.61

<class 'pandas.core.frame.DataFrame'>

Next I tried concatenating the two DataFrames. Once again printing stuff out to see what was produced.

In [4]:
combined = pd.concat([df1, df2])
print(combined.head())
print()
print(combined.tail())
print()
print(len(combined))
                    AVG.   %
PLAYER NAME                 
Cameron Champ      321.1 NaN
Bryson DeChambeau  318.1 NaN
Rory McIlroy       312.5 NaN
Sepp Straka        305.8 NaN
Tommy Fleetwood    305.5 NaN
             AVG.      %

PLAYER NAME
Denny McCarthy NaN 54.17 Harris English NaN 52.78 Brian Harman NaN 51.39 Brandt Snedeker NaN 51.39 J.T. Poston NaN 50.00

158

In [5]:
pd.notna(combined["AVG."])
Out[5]:
PLAYER NAME
Cameron Champ         True
Bryson DeChambeau     True
Rory McIlroy          True
Sepp Straka           True
Tommy Fleetwood       True
                     ...  
Denny McCarthy       False
Harris English       False
Brian Harman         False
Brandt Snedeker      False
J.T. Poston          False
Name: AVG., Length: 158, dtype: bool

Definitely not what was wanted. A little looking about led me to the axis= parameter of the concat() method. So gave that a try.

In [6]:
combined = pd.concat([df1, df2], axis=1)
print(combined.head())
print()
print(combined.tail())
                    AVG.      %
PLAYER NAME                    
Cameron Champ      321.1  73.61
Bryson DeChambeau  318.1  66.67
Rory McIlroy       312.5  65.28
Sepp Straka        305.8  58.33
Tommy Fleetwood    305.5  63.89
               AVG.      %

PLAYER NAME
Charl Schwartzel 276.0 59.72 Chez Reavie 274.5 55.56 Brendon Todd 272.3 62.50 Patrick Reed 271.6 68.06 Mark Hubbard 268.8 61.11

Now that looks a bit better. But I would like to use the player name and the stat type as a multi-index. Then use the year and tournament as a multi-index on the columns.

Add Multi-index

This took a bit of research and trial & error. Not to mention that my solution seems pretty convoluted. But it also seems to work. I converted each DataFrame (read from CSV) into a Series. Then back to a DataFrame specifying a column mutli-index. Then I concatenated the new DataFrames. Then a quick index sort to get things looking about right. Also tried getting stats for single player, DeChambeau.

In [7]:
s1 = df1.stack()
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a ->')
display(df1a)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
print('\ndf_tst.head(10) ->')
display(df_tst.head(10))
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst, after sorting index column ->')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->")
print(df_tst.loc['Bryson DeChambeau', :])

df1a ->

2020
pga
PLAYER NAME
Cameron ChampAVG.321.1
Bryson DeChambeauAVG.318.1
Rory McIlroyAVG.312.5
Sepp StrakaAVG.305.8
Tommy FleetwoodAVG.305.5
.........
Charl SchwartzelAVG.276.0
Chez ReavieAVG.274.5
Brendon ToddAVG.272.3
Patrick ReedAVG.271.6
Mark HubbardAVG.268.8

79 rows × 1 columns

df_tst.head(10) ->
2020
pga
PLAYER NAME
Cameron ChampAVG.321.1
Bryson DeChambeauAVG.318.1
Rory McIlroyAVG.312.5
Sepp StrakaAVG.305.8
Tommy FleetwoodAVG.305.5
Tony FinauAVG.305.3
Dustin JohnsonAVG.305.0
Victor PerezAVG.304.4
Haotong LiAVG.304.1
Tiger WoodsAVG.304.0

df_tst, after sorting index column ->

2020
pga
PLAYER NAME
Abraham Ancer%63.89
AVG.295.60
Adam Hadwin%54.17
AVG.282.90
Adam Long%59.72
.........
Viktor HovlandAVG.281.90
Webb Simpson%62.50
AVG.281.10
Xander Schauffele%68.06
AVG.302.30

158 rows × 1 columns

df_tst.loc['Bryson DeChambeau', :] ->
    2020
     pga

% 66.67 AVG. 318.10

Relabel Indices

That’s looking better. But really need to do something about the stat type labels. And maybe get rid of the all caps for the index label.

In [8]:
df1.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df1.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
print('\ndf1 with columns renamed ->\n')
display(df1.head())
#cols = pd.MultiIndex.from_tuples([("lang",), ("2018","height"), ("2018","weight"), ("2019","height"), ("2019","weight")])
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
s1 = df1.stack()
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a (multi-indexed) ->\n')
display(df1a)
df2.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df2.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst (concat of df1a / df2a, sorted?) ->\n')
display(df_tst)

df1 with columns renamed ->

drv
player
Cameron Champ321.1
Bryson DeChambeau318.1
Rory McIlroy312.5
Sepp Straka305.8
Tommy Fleetwood305.5
df1a (multi-indexed) ->
2020
pga
player
Cameron Champdrv321.1
Bryson DeChambeaudrv318.1
Rory McIlroydrv312.5
Sepp Strakadrv305.8
Tommy Fleetwooddrv305.5
.........
Charl Schwartzeldrv276.0
Chez Reaviedrv274.5
Brendon Todddrv272.3
Patrick Reeddrv271.6
Mark Hubbarddrv268.8

79 rows × 1 columns

df_tst (concat of df1a / df2a, sorted?) ->
2020
pga
player
Abraham Ancerdrv295.60
gir63.89
Adam Hadwindrv282.90
gir54.17
Adam Longdrv285.40
.........
Viktor Hovlandgir65.28
Webb Simpsondrv281.10
gir62.50
Xander Schauffeledrv302.30
gir68.06

158 rows × 1 columns

Add Index Label

We don’t seem to have a label for the stat type column in the multi-index. So let’s fix that.

In [9]:
# Add label for stat column of multi-index
df_tst.rename_axis(['player', 'stat'], inplace=True)
print('\ndf_tst (added index column name) ->\n')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->\n")
display(df_tst.loc['Bryson DeChambeau', :])

df_tst (added index column name) ->

2020
pga
playerstat
Abraham Ancerdrv295.60
gir63.89
Adam Hadwindrv282.90
gir54.17
Adam Longdrv285.40
.........
Viktor Hovlandgir65.28
Webb Simpsondrv281.10
gir62.50
Xander Schauffeledrv302.30
gir68.06

158 rows × 1 columns

df_tst.loc['Bryson DeChambeau', :] ->

2020
pga
stat
drv318.10
gir66.67

Add Next Tournament for 2020

Okay, now let’s do that all over for the 2nd tournament of 2020, RBC Heritage. Then combine the two tournaments in one DataFrame and see what we get. Given the change in the order the CSVs will be processed, I also reworked the appropriate list.

In [11]:
# change order of csv files to reflect order I want in dataframe
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
        'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
        'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
        'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df3 = pd.read_csv(f"{d_dir}{csvs[2]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df4 = pd.read_csv(f"{d_dir}{csvs[3]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
cols1 = pd.MultiIndex.from_tuples([('2020', 'rbch')])

df3.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True) df3.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True) s3 = df3.stack() df3a = pd.DataFrame(s3, columns=cols1)

df4.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True) df4.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True) s4 = df4.stack() df4a = pd.DataFrame(s4, columns=cols1)

df_tst2 = pd.concat([df3a, df4a]) df_tst2.rename_axis(['player', 'stat'], inplace=True) ndx_sort2 = sorted(df_tst2.index,key=lambda x: re.split(r'\W+', x[0])[-1]) df_tst2 = df_tst2.reindex(ndx_sort2) display(df_tst2)

2020
rbch
playerstat
Abraham Ancerdrv278.10
gir90.28
Ryan Armourdrv275.00
gir56.94
Daniel Bergerdrv273.60
.........
Matt Wallacegir65.28
Bubba Watsondrv277.50
gir65.28
Gary Woodlanddrv287.00
gir73.61

150 rows × 1 columns

In [12]:
# now lets concatenate our two stats dfs
df_comb = pd.concat([df_tst, df_tst2])
print('\ndf_comb.head(10) ->\n')
display(df_comb.head(10))
print(f"\nlen(df_comb): {len(df_comb)}")
print('\ndf_comb.dropna() ->\n')
print(df_comb.dropna())
df_comb.head(10) ->

2020
pgarbch
playerstat
Byeong Hun Andrv286.60NaN
gir62.50NaN
Abraham Ancerdrv295.60NaN
gir63.89NaN
Daniel Bergerdrv291.90NaN
gir65.28NaN
Patrick Cantlaydrv295.30NaN
gir59.72NaN
Paul Caseydrv295.80NaN
gir76.39NaN
len(df_comb): 308

df_comb.dropna() ->

Empty DataFrame Columns: [(2020, pga), (2020, rbch)] Index: []

Looks like no one playing in the 2020 RBC Heritage also played in the 2020 PGA Champsionships. Guess I didn’t make the best choice of tournaments to use. But on closer examination things are not what they seem!

Not Quite Right

In [13]:
print("\ndf_comb.dropna(subset=[('2020', 'pga')]) ->\n")
display(df_comb.dropna(how='any', subset=[('2020', 'pga')]))

print("\ndf_comb.dropna(subset=[('2020', 'rbch')]) ->\n") cols= [(c0, c1) for (c0, c1) in df_comb.columns if c1 in ['rbch']] display(df_comb.dropna(how='any', subset=cols)) # what's going on print("\ndf_comb.loc['Abraham Ancer'] ->\n") print(df_comb.loc['Abraham Ancer'])

df_comb.dropna(subset=[('2020', 'pga')]) ->

2020
pgarbch
playerstat
Byeong Hun Andrv286.60NaN
gir62.50NaN
Abraham Ancerdrv295.60NaN
gir63.89NaN
Daniel Bergerdrv291.90NaN
............
Matthew Wolffgir77.78NaN
Gary Woodlanddrv293.00NaN
gir65.28NaN
Tiger Woodsdrv304.00NaN
gir62.50NaN

158 rows × 2 columns

df_comb.dropna(subset=[('2020', 'rbch')]) ->

2020
pgarbch
playerstat
Abraham AncerdrvNaN278.10
girNaN90.28
Ryan ArmourdrvNaN275.00
girNaN56.94
Daniel BergerdrvNaN273.60
............
Matt WallacegirNaN65.28
Bubba WatsondrvNaN277.50
girNaN65.28
Gary WoodlanddrvNaN287.00
girNaN73.61

150 rows × 2 columns

df_comb.loc['Abraham Ancer'] ->
    2020        
     pga    rbch

stat
drv 295.60 NaN gir 63.89 NaN drv NaN 278.10 gir NaN 90.28

Definitely not what we want.

concat(), join() or merge()

Guess I should have looked in tutorials on combining datasets before I undertook this experiment. After some more messing around, I managed to get pandas.merge() to produce something like what I wanted. At least I think so.

In [14]:
df_comb = pd.merge(df_tst, df_tst2, how='outer', on=['player', 'stat'])
ndx_sort3 = sorted(df_comb.index, key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort3)

display(df_comb)

2020
pgarbch
playerstat
Byeong Hun Andrv286.60NaN
gir62.50NaN
Abraham Ancerdrv295.60278.10
gir63.8990.28
Ryan ArmourdrvNaN275.00
............
Matthew Wolffgir77.78NaN
Gary Woodlanddrv293.00287.00
gir65.2873.61
Tiger Woodsdrv304.00NaN
gir62.50NaN

234 rows × 2 columns

In [15]:
display(df_comb.dropna())
2020
pgarbch
playerstat
Abraham Ancerdrv295.60278.10
gir63.8990.28
Daniel Bergerdrv291.90273.60
gir65.2869.44
Joel Dahmendrv289.50270.30
............
Matt Wallacegir65.2865.28
Bubba Watsondrv290.10277.50
gir70.8365.28
Gary Woodlanddrv293.00287.00
gir65.2873.61

74 rows × 2 columns

Timeout Needed

I have spent a number of hours on this today. So, I think I will call it quits on this post.

I will regroup, start another notebook/post to try and finish this with all four tournament stats in one DataFrame. May look at using some functions to reduce the code I am generating, etc.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources

I had a look at these while troubleshooting my problems/bugs, but not sure how useful they may be for everyone else.