As I mentioned in the last post:

Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.

Data Science Basics: pandas — Getting Golf Stats from Web Site

But, here we go. Starting new notebook. Adding the necessary imports (at least that I know of so far). I will also bring over some of the variables from the previous notebook. Then I am going to load two CSV files into DataFrames or Series and see if I can combine them into one DataFrame somehow.

Basic Notebook Setup

Okay, let’s get the basic setup in place.

In [1]:

import numpy as np
import pandas as pd
import re

In [2]:

events = {'wmpo': 't003', 'tpc': 'to11', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
}

Read in Data and Attempt to Convert to Desired Shape

I started by defining a variable for the data directory and a list of the CSV files I was going to process. I then read the first two files (PGA Championships, 2020, drive and gir stats). Printing some info and a few rows of each to see what we have.

In [3]:

d_dir = "./data/"
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
        'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
        'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
        'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df1 = pd.read_csv(f"{d_dir}{csvs[0]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df2 = pd.read_csv(f"{d_dir}{csvs[1]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
print(f"df1: {len(df1)}, df2: {len(df2)}\n")
print(df1.head())
print()
print(df2.head())
print()
print(type(df2))

df1: 79, df2: 79
                AVG.
PLAYER NAME
Cameron Champ 321.1
Bryson DeChambeau 318.1
Rory McIlroy 312.5
Sepp Straka 305.8
Tommy Fleetwood 305.5
                  %
PLAYER NAME
Matthew Wolff 77.78
Paul Casey 76.39
Jason Day 76.39
Louis Oosthuizen 73.61
Cameron Champ 73.61
<class 'pandas.core.frame.DataFrame'>

Next I tried concatenating the two DataFrames. Once again printing stuff out to see what was produced.

In [4]:

combined = pd.concat([df1, df2])
print(combined.head())
print()
print(combined.tail())
print()
print(len(combined))

AVG. % PLAYER NAME Cameron Champ 321.1 NaN Bryson DeChambeau 318.1 NaN Rory McIlroy 312.5 NaN Sepp Straka 305.8 NaN Tommy Fleetwood 305.5 NaN

             AVG.      %

PLAYER NAME
Denny McCarthy NaN 54.17 Harris English NaN 52.78 Brian Harman NaN 51.39 Brandt Snedeker NaN 51.39 J.T. Poston NaN 50.00

158

In [5]:

pd.notna(combined["AVG."])

Out[5]:

PLAYER NAME
Cameron Champ         True
Bryson DeChambeau     True
Rory McIlroy          True
Sepp Straka           True
Tommy Fleetwood       True
                     ...  
Denny McCarthy       False
Harris English       False
Brian Harman         False
Brandt Snedeker      False
J.T. Poston          False
Name: AVG., Length: 158, dtype: bool

Definitely not what was wanted. A little looking about led me to the axis= parameter of the concat() method. So gave that a try.

In [6]:

combined = pd.concat([df1, df2], axis=1)
print(combined.head())
print()
print(combined.tail())

AVG. % PLAYER NAME Cameron Champ 321.1 73.61 Bryson DeChambeau 318.1 66.67 Rory McIlroy 312.5 65.28 Sepp Straka 305.8 58.33 Tommy Fleetwood 305.5 63.89

               AVG.      %

PLAYER NAME
Charl Schwartzel 276.0 59.72 Chez Reavie 274.5 55.56 Brendon Todd 272.3 62.50 Patrick Reed 271.6 68.06 Mark Hubbard 268.8 61.11

Now that looks a bit better. But I would like to use the player name and the stat type as a multi-index. Then use the year and tournament as a multi-index on the columns.

Add Multi-index

This took a bit of research and trial & error. Not to mention that my solution seems pretty convoluted. But it also seems to work. I converted each DataFrame (read from CSV) into a Series. Then back to a DataFrame specifying a column mutli-index. Then I concatenated the new DataFrames. Then a quick index sort to get things looking about right. Also tried getting stats for single player, DeChambeau.

In [7]:

s1 = df1.stack()
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a ->')
display(df1a)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
print('\ndf_tst.head(10) ->')
display(df_tst.head(10))
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst, after sorting index column ->')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->")
print(df_tst.loc['Bryson DeChambeau', :])

df1a ->

		2020
		pga
PLAYER NAME
Cameron Champ	AVG.	321.1
Bryson DeChambeau	AVG.	318.1
Rory McIlroy	AVG.	312.5
Sepp Straka	AVG.	305.8
Tommy Fleetwood	AVG.	305.5
...	...	...
Charl Schwartzel	AVG.	276.0
Chez Reavie	AVG.	274.5
Brendon Todd	AVG.	272.3
Patrick Reed	AVG.	271.6
Mark Hubbard	AVG.	268.8

79 rows × 1 columns

df_tst.head(10) ->

		2020
		pga
PLAYER NAME
Cameron Champ	AVG.	321.1
Bryson DeChambeau	AVG.	318.1
Rory McIlroy	AVG.	312.5
Sepp Straka	AVG.	305.8
Tommy Fleetwood	AVG.	305.5
Tony Finau	AVG.	305.3
Dustin Johnson	AVG.	305.0
Victor Perez	AVG.	304.4
Haotong Li	AVG.	304.1
Tiger Woods	AVG.	304.0

df_tst, after sorting index column ->

		2020
		pga
PLAYER NAME
Abraham Ancer	%	63.89
Abraham Ancer	AVG.	295.60
Adam Hadwin	%	54.17
Adam Hadwin	AVG.	282.90
Adam Long	%	59.72
...	...	...
Viktor Hovland	AVG.	281.90
Webb Simpson	%	62.50
Webb Simpson	AVG.	281.10
Xander Schauffele	%	68.06
Xander Schauffele	AVG.	302.30

158 rows × 1 columns

df_tst.loc['Bryson DeChambeau', :] ->
    2020
     pga
% 66.67
AVG. 318.10

Relabel Indices

That’s looking better. But really need to do something about the stat type labels. And maybe get rid of the all caps for the index label.

In [8]:

df1.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df1.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
print('\ndf1 with columns renamed ->\n')
display(df1.head())
#cols = pd.MultiIndex.from_tuples([("lang",), ("2018","height"), ("2018","weight"), ("2019","height"), ("2019","weight")])
cols1 = pd.MultiIndex.from_tuples([('2020', 'pga')])
s1 = df1.stack()
#print(s1.index)
#print(s1.columns)
df1a = pd.DataFrame(s1, columns=cols1)
print('\ndf1a (multi-indexed) ->\n')
display(df1a)
df2.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df2.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s2 = df2.stack()
df2a = pd.DataFrame(s2, columns=cols1)
df_tst = pd.concat([df1a, df2a])
df_tst = df_tst.sort_index()
#print(f"\ntype(df_tst): {type(df_tst)}\n")
print('\ndf_tst (concat of df1a / df2a, sorted?) ->\n')
display(df_tst)

df1 with columns renamed ->

	drv
player
Cameron Champ	321.1
Bryson DeChambeau	318.1
Rory McIlroy	312.5
Sepp Straka	305.8
Tommy Fleetwood	305.5

df1a (multi-indexed) ->

		2020
		pga
player
Cameron Champ	drv	321.1
Bryson DeChambeau	drv	318.1
Rory McIlroy	drv	312.5
Sepp Straka	drv	305.8
Tommy Fleetwood	drv	305.5
...	...	...
Charl Schwartzel	drv	276.0
Chez Reavie	drv	274.5
Brendon Todd	drv	272.3
Patrick Reed	drv	271.6
Mark Hubbard	drv	268.8

79 rows × 1 columns

df_tst (concat of df1a / df2a, sorted?) ->

		2020
		pga
player
Abraham Ancer	drv	295.60
Abraham Ancer	gir	63.89
Adam Hadwin	drv	282.90
Adam Hadwin	gir	54.17
Adam Long	drv	285.40
...	...	...
Viktor Hovland	gir	65.28
Webb Simpson	drv	281.10
Webb Simpson	gir	62.50
Xander Schauffele	drv	302.30
Xander Schauffele	gir	68.06

158 rows × 1 columns

Add Index Label

We don’t seem to have a label for the stat type column in the multi-index. So let’s fix that.

In [9]:

# Add label for stat column of multi-index
df_tst.rename_axis(['player', 'stat'], inplace=True)
print('\ndf_tst (added index column name) ->\n')
display(df_tst)
print("\ndf_tst.loc['Bryson DeChambeau', :] ->\n")
display(df_tst.loc['Bryson DeChambeau', :])

df_tst (added index column name) ->

		2020
		pga
player	stat
Abraham Ancer	drv	295.60
Abraham Ancer	gir	63.89
Adam Hadwin	drv	282.90
Adam Hadwin	gir	54.17
Adam Long	drv	285.40
...	...	...
Viktor Hovland	gir	65.28
Webb Simpson	drv	281.10
Webb Simpson	gir	62.50
Xander Schauffele	drv	302.30
Xander Schauffele	gir	68.06

158 rows × 1 columns

df_tst.loc['Bryson DeChambeau', :] ->

	2020
	pga
stat
drv	318.10
gir	66.67

Add Next Tournament for 2020

Okay, now let’s do that all over for the 2nd tournament of 2020, RBC Heritage. Then combine the two tournaments in one DataFrame and see what we get. Given the change in the order the CSVs will be processed, I also reworked the appropriate list.

In [11]:

# change order of csv files to reflect order I want in dataframe
csvs = ['pga_2020_drv_2.test.csv', 'pga_2020_gir_2.test.csv',
        'rbch_2020_drv_2.test.csv', 'rbch_2020_gir_2.test.csv',
        'pga_2021_drv_2.test.csv', 'pga_2021_gir_2.test.csv',
        'rbch_2021_drv_2.test.csv', 'rbch_2021_gir_2.test.csv'
]
df3 = pd.read_csv(f"{d_dir}{csvs[2]}", index_col=['PLAYER NAME'], usecols=['AVG.', 'PLAYER NAME'])
df4 = pd.read_csv(f"{d_dir}{csvs[3]}", index_col=['PLAYER NAME'], usecols=['%', 'PLAYER NAME'])
cols1 = pd.MultiIndex.from_tuples([('2020', 'rbch')])
df3.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df3.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s3 = df3.stack()
df3a = pd.DataFrame(s3, columns=cols1)
df4.rename(columns={'AVG.': 'drv', '%': 'gir'}, inplace=True)
df4.rename_axis(index={'PLAYER NAME': 'player'}, inplace=True)
s4 = df4.stack()
df4a = pd.DataFrame(s4, columns=cols1)
df_tst2 = pd.concat([df3a, df4a])
df_tst2.rename_axis(['player', 'stat'], inplace=True)
ndx_sort2 = sorted(df_tst2.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_tst2 = df_tst2.reindex(ndx_sort2)
display(df_tst2)

		2020
		rbch
player	stat
Abraham Ancer	drv	278.10
Abraham Ancer	gir	90.28
Ryan Armour	drv	275.00
Ryan Armour	gir	56.94
Daniel Berger	drv	273.60
...	...	...
Matt Wallace	gir	65.28
Bubba Watson	drv	277.50
Bubba Watson	gir	65.28
Gary Woodland	drv	287.00
Gary Woodland	gir	73.61

150 rows × 1 columns

In [12]:

# now lets concatenate our two stats dfs
df_comb = pd.concat([df_tst, df_tst2])
print('\ndf_comb.head(10) ->\n')
display(df_comb.head(10))
print(f"\nlen(df_comb): {len(df_comb)}")
print('\ndf_comb.dropna() ->\n')
print(df_comb.dropna())

df_comb.head(10) ->

		2020
		pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN
Byeong Hun An	gir	62.50	NaN
Abraham Ancer	drv	295.60	NaN
Abraham Ancer	gir	63.89	NaN
Daniel Berger	drv	291.90	NaN
Daniel Berger	gir	65.28	NaN
Patrick Cantlay	drv	295.30	NaN
Patrick Cantlay	gir	59.72	NaN
Paul Casey	drv	295.80	NaN
Paul Casey	gir	76.39	NaN

len(df_comb): 308
df_comb.dropna() ->
Empty DataFrame
Columns: [(2020, pga), (2020, rbch)]
Index: []

Looks like no one playing in the 2020 RBC Heritage also played in the 2020 PGA Champsionships. Guess I didn’t make the best choice of tournaments to use. But on closer examination things are not what they seem!

Not Quite Right

In [13]:

print("\ndf_comb.dropna(subset=[('2020', 'pga')]) ->\n")
display(df_comb.dropna(how='any', subset=[('2020', 'pga')]))
print("\ndf_comb.dropna(subset=[('2020', 'rbch')]) ->\n")
cols= [(c0, c1) for (c0, c1) in df_comb.columns if c1 in ['rbch']]
display(df_comb.dropna(how='any', subset=cols))
# what's going on
print("\ndf_comb.loc['Abraham Ancer'] ->\n")
print(df_comb.loc['Abraham Ancer'])

df_comb.dropna(subset=[('2020', 'pga')]) ->

		2020
		pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN
Byeong Hun An	gir	62.50	NaN
Abraham Ancer	drv	295.60	NaN
Abraham Ancer	gir	63.89	NaN
Daniel Berger	drv	291.90	NaN
...	...	...	...
Matthew Wolff	gir	77.78	NaN
Gary Woodland	drv	293.00	NaN
Gary Woodland	gir	65.28	NaN
Tiger Woods	drv	304.00	NaN
Tiger Woods	gir	62.50	NaN

158 rows × 2 columns

df_comb.dropna(subset=[('2020', 'rbch')]) ->

		2020
		pga	rbch
player	stat
Abraham Ancer	drv	NaN	278.10
Abraham Ancer	gir	NaN	90.28
Ryan Armour	drv	NaN	275.00
Ryan Armour	gir	NaN	56.94
Daniel Berger	drv	NaN	273.60
...	...	...	...
Matt Wallace	gir	NaN	65.28
Bubba Watson	drv	NaN	277.50
Bubba Watson	gir	NaN	65.28
Gary Woodland	drv	NaN	287.00
Gary Woodland	gir	NaN	73.61

150 rows × 2 columns

df_comb.loc['Abraham Ancer'] ->

    2020        
     pga    rbch

stat
drv 295.60 NaN gir 63.89 NaN drv NaN 278.10 gir NaN 90.28

Definitely not what we want.

concat(), join() or merge()

Guess I should have looked in tutorials on combining datasets before I undertook this experiment. After some more messing around, I managed to get pandas.merge() to produce something like what I wanted. At least I think so.

In [14]:

df_comb = pd.merge(df_tst, df_tst2, how='outer', on=['player', 'stat'])
ndx_sort3 = sorted(df_comb.index, key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort3)
display(df_comb)

		2020
		pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN
Byeong Hun An	gir	62.50	NaN
Abraham Ancer	drv	295.60	278.10
Abraham Ancer	gir	63.89	90.28
Ryan Armour	drv	NaN	275.00
...	...	...	...
Matthew Wolff	gir	77.78	NaN
Gary Woodland	drv	293.00	287.00
Gary Woodland	gir	65.28	73.61
Tiger Woods	drv	304.00	NaN
Tiger Woods	gir	62.50	NaN

234 rows × 2 columns

In [15]:

display(df_comb.dropna())

		2020
		pga	rbch
player	stat
Abraham Ancer	drv	295.60	278.10
Abraham Ancer	gir	63.89	90.28
Daniel Berger	drv	291.90	273.60
Daniel Berger	gir	65.28	69.44
Joel Dahmen	drv	289.50	270.30
...	...	...	...
Matt Wallace	gir	65.28	65.28
Bubba Watson	drv	290.10	277.50
Bubba Watson	gir	70.83	65.28
Gary Woodland	drv	293.00	287.00
Gary Woodland	gir	65.28	73.61

74 rows × 2 columns

Timeout Needed

I have spent a number of hours on this today. So, I think I will call it quits on this post.

I will regroup, start another notebook/post to try and finish this with all four tournament stats in one DataFrame. May look at using some functions to reduce the code I am generating, etc.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources

pandas: Rename columns / index names (labels) of DataFrame *
Pandas Index Explained
pandas.DataFrame.rename_axis
pandas.DataFrame.reindex
Python Pandas: How to set the name of multiindex? *
pandas sort rows by index using a custom function *
re — Regular expression operations *
Pandas Sort: Your Guide to Sorting Data in Python
Filter out rows with missing data (NaN, None, NaT)
Pandas, how to dropna values using subset with multiindex dataframe?
pandas.DataFrame.dropna
Merge, join, and concatenate *

I had a look at these while troubleshooting my problems/bugs, but not sure how useful they may be for everyone else.

Multi-Index Sorting in Pandas
Sort pandas DataFrame with MultiIndex according to column value
Multi Index Sorting in Pandas
How to import CSV file with multi-level columns (Python Basics)

Too Old To Code

Data Science Basics: Create Golf Stats DataFrame

Basic Notebook Setup

Read in Data and Attempt to Convert to Desired Shape

Add Multi-index

Relabel Indices

Add Index Label

Add Next Tournament for 2020

Not Quite Right

concat(), join() or merge()

Timeout Needed

Resources