Okay, as mentioned last post, time to refactor my multiple CSVs to single DataFrame experiment. So, new notebook. Remove the junk test code. Maybe some sensible functions (reduce repetition). Etc.
Approach
At the moment, what I am planning to do is create two DataFrames, one for each year (2020, 2021). Those will contain the player stats for 2 tournaments (PGA, RBC Heritage) and two stats (driving distance, greens in regulation %). Pretty sure I can get that to work, as we did just that for 2020 in the previous post. Then I will do my best to merge those two DataFrames into the single one I really want.
Refactor
As usual I will start with the imports and some potentially useful variable/list declarations.
import numpy as np
import pandas as pd
import re
# define some needed variables, lists, and the like
events = {'wmpo': 't003', 'api': 't009', 'tpc': 'to11', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674', 'scramble': '130'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG'],
'scramble': ['PLAYER NAME', '%', 'PAR OR BETTER', 'MISSED GIR']
}
# data directory and stats, tournaments and years to process
d_dir = "./data/"
p_sids = ['drv', 'gir']
p_tids = ['pga', 'rbch']
p_yrs = ['2020', '2021']
get_csv_nm() & csv_2_df()
As mentioned, I intend to put a lot of the code into functions and perhaps modify, slightly and where appropriate, the steps taken in the previous post. I am going to start with a couple of fairly straightforward functions. The first takes the year, tournament id/label and stat type id/label and returns the appropriate file name. This will also allow me to change the naming convention in one place should I choose to do so.
The second takes a year, tournament label and stat label, reads the CSV file, creates a suitable multi-indexed DataFrame and returns it. I am also relabelling the rows and columns here rather than in later steps as I did last post. Makes more sense to me to just do it here.
And, of course, a quick test or two to make sure the function works as expected.
# function: take year, tounament id and stat id, return CSV file path (relative)
def get_csv_nm(t_yr, t_id, p_st):
global d_dir
return f"{d_dir}{t_id}_{t_yr}_{p_st}_2.test.csv"
# function: will take a year, tournament and stat, read csv, return suitable DataFrame
def csv_2_df(t_yr, t_id, p_st):
""" Read appropriate CSV file into DataFrame. Modify Dataframe to employ multi-indices.
Return modified DataFrame.
Useage: csv_2_df(t_yr, t_id, p_st)
where t_yr = tournament year
t_id = tournament id (e.g. 'pga')
p_st = player stat (e.g. 'drv', 'gir')
"""
global st_cols
col_nms = {'drv': {'AVG.': 'drv'},
'gir': {'%': 'gir'},
'scramble': {'%': 'scramble'}
}
csv_fl = get_csv_nm(t_yr, t_id, p_st)
ty_m_idx = pd.MultiIndex.from_tuples([(t_yr, t_id)])
s_col = st_cols[p_st][1]
df_stats = pd.read_csv(csv_fl, index_col=['PLAYER NAME'], usecols=[s_col, 'PLAYER NAME'])
df_stats.rename(columns=col_nms[p_st], inplace=True)
s_tmp = df_stats.stack()
ts_df = pd.DataFrame(s_tmp, columns=ty_m_idx)
ts_df.rename_axis(['player', 'stat'], inplace=True)
return ts_df
# quick test
df1 = csv_2_df('2020', 'pga', 'drv')
display(df1)
df2 = csv_2_df('2020', 'pga', 'gir')
display(df2)
Okay, that seemed to work. Now, the next function.
tourney_2_df()
This function will take a single year, a tournament identifier and a list of player stat identifiers. It will use csv_2_df()
to get the DataFrames for each stat. Combine the DataFrames appropriately, sort the combined DataFrame and return the result. This function will be used to get the datasets (as DataFrames) for all the desired tournaments and stats for a given year. These will eventually be combined to give us a dataset for all the desired stats for all the desired tournaments in a single year.
A quick and simple test as well.
# function: will take a year, a tournament and a list of stats, generate a DataFrame for that tournament and year
def tourney_2_df(t_yr, t_id, p_sts):
""" Combine all requested stats for a given tournament and year into a single DataFrame.
Return DataFrame.
Useage: tourney_2_df(t_yr, t_id, p_sts)
where t_yr = tournament year
t_id = tournament id (e.g. 'pga')
p_sts = list of player stat (e.g. ['drv', 'gir'])
"""
df1 = csv_2_df(t_yr, t_id, p_sts[0])
if len(p_sts) > 1:
df2 = csv_2_df(t_yr, t_id, p_sts[1])
df_tourney = pd.concat([df1, df2])
else:
return df1
if len(p_sts) > 2:
passs
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_tourney</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\W+'</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_tourney</span> <span class="o">=</span> <span class="n">df_tourney</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_tourney</span>
# test
pga_2020 = tourney_2_df('2020', 'pga', ['drv', 'gir'])
display(pga_2020)
That looks to work as desired.
year_2_df()
Now, let’s write that function to put together a dataset for all the desired tournaments and stats for a single year. Once again it will take a year, a list of tournaments and a list of stats. It will return ta multi-indexed DataFrame containing all the selected data for that specific year.
As usual a quick, simple test to check our function’s function.
# function: will take a year, list of tournaments and list of stats and generate a DataFrame for that year
def year_2_df(t_yr, t_ids, p_sts):
df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
if len(t_ids) == 1:
return df1
df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
if len(t_ids) > 2:
pass
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
<span class="k">return</span> <span class="n">df_comb</span>
# test year_2_df
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2020)
That also seems to work.
I am repeatedly sorting the dataframe in each function before returning it. I should likely write a separate function to take care of that, but for now will just repeat the code.
Test Complete Refactor
Okay, now let’s take the two DataFrames for the two years of data that we have and combine them to give us the final result. Hopefully at least.
# test merging two years
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
That looks to have worked. Though I must admit I haven’t yet compared the result from the previous post with the one above. Guess I should look at doing so. We’ll see.
Basic Validation
Okay, I modified the last two notebooks to each write a CSV file of the dataframe produced for the year 2020. Then I used VSCode to compare the two files. They are indentical.
Enhance
I have decided to further rework the code/notebook to allow for the processing of at least 3 stats, 3 tournaments and 3 years of the preceding. I will again take this one step at a time.
Get More Data
I started by using a separate Python script to get the extra stats and tournament for the existing two years, 2020 and 2021.
But, when I decided to add a third year, I added a function to download the new data from the PGA web site. At first the script just downloaded the data whenever called. But, as I was often restarting the notebook, it was always taking the extra time to download all the data each time. So, I modified the function to only download data not already in the data
directory.
The first version looked and functioned as follows.
# decided to test using an additional year, 2019
# so need to get some more stats into csv files
def stats_2_csv(tyrs, tids, psts):
global d_dir, events, stats, st_cols
for tid in tids:
eid = events[tid]
for tyr in tyrs:
for pst in psts:
stid = stats[pst]
tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
tmp_stats = pd.read_html(tlnk)
if len(tmp_stats) <= 1:
break
print(f"\n{tid}, {tyr}, {pst}")
df_stats = tmp_stats[1][st_cols[pst]]
<span class="n">f_out</span> <span class="o">=</span> <span class="n">get_csv_nm</span><span class="p">(</span><span class="n">tyr</span><span class="p">,</span> <span class="n">tid</span><span class="p">,</span> <span class="n">pst</span><span class="p">)</span>
<span class="n">df_stats</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">f_out</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
p_sts = ['drv', 'gir', 'scramble']
t_ids = ['api', 'pga', 'rbch']
t_yrs = ['2019']
stats_2_csv(t_yrs, t_ids, p_sts)
The modified version looks like this:
def stats_2_csv(tyrs, tids, psts):
global d_dir, events, stats, st_cols
for tid in tids:
eid = events[tid]
for tyr in tyrs:
for pst in psts:
print(f"\n{tid}, {tyr}, {pst} ->", end='')
stid = stats[pst]
tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
f_out = get_csv_nm(tyr, tid, pst)
if Path(f_out).is_file():
print(" already exists, not downloaded again")
else:
tmp_stats = pd.read_html(tlnk)
if len(tmp_stats) <= 1:
print(" not found on site")
break
df_stats = tmp_stats[1][st_cols[pst]]
df_stats.to_csv(f_out, index=False)
print(" downloaded and saved to CSV")
tourney_2_df()
Now let’s modify the tourney to DataFrame function to allow for more than 2 stats. And give it a test or two.
# let's try that with 3 different stats
# will need to redefine tourney_2_df
# you may have noticed the extra condition with only a 'pass'
def tourney_2_df(t_yr, t_id, p_sts):
""" Combine all requested stats for a given tournament and year into a single DataFrame.
Return DataFrame.
Useage: tourney_2_df(t_yr, t_id, p_sts)
where t_yr = tournament year
t_id = tournament id (e.g. 'pga')
p_sts = list of player stat (e.g. ['drv', 'gir'])
"""
df1 = csv_2_df(t_yr, t_id, p_sts[0])
if len(p_sts) > 1:
df2 = csv_2_df(t_yr, t_id, p_sts[1])
df_tourney = pd.concat([df1, df2])
else:
return df1
if len(p_sts) > 2:
all_dfs = [df_tourney]
for p_st in p_sts[2:]:
df_tmp = csv_2_df(t_yr, t_id, p_st)
all_dfs.append(df_tmp)
df_tourney = pd.concat(all_dfs)
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_tourney</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\W+'</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_tourney</span> <span class="o">=</span> <span class="n">df_tourney</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_tourney</span>
# test time
p_sts = ['drv', 'gir', 'scramble']
pga_2020 = tourney_2_df('2020', 'pga', p_sts)
display(pga_2020)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_play_8.test.csv'
pga_2020.to_csv(golf_csv)
# that seemed to work, so let's do a second tourney and merge
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)
# okay, and now a 2nd year
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
year_2_df()
That looked to work as desired. Now, I am going to allow for a 3rd tournament. So, let’s refactor year_2_df()
and test.
# now let's add a 3rd tournament
# and of course we need to redefine year_2_df
def year_2_df(t_yr, t_ids, p_sts):
df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
if len(t_ids) == 1:
return df1
df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
if len(t_ids) > 2:
for t_id in t_ids[2:]:
df_tmp = tourney_2_df(t_yr, t_id, p_sts)
df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_comb</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\W+'</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_comb</span> <span class="o">=</span> <span class="n">df_comb</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_comb</span>
</div>
# testing 1 2 3
df_2020 = year_2_df('2020', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)
Combine 2 Years
Finally, we’ll generate the DataFrame for 2021 and combine the two.
# add the second year
df_2021 = year_2_df('2021', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
Enhance Further
Well no sense stopping there. Let’s add another function and generate a dataset covering 3 years (2019-2021), 3 tournaments (Arnold Palmer, PGA and RBC Heritage) and 3 stat types (drive length, greens in regulation and scrambling). The new function, golf_stats_2_df()
, will take the usual three lists, use the other functions to get/format the requested data (which is presumed to exist in appropriately named CSV files) and return the final DataFrame.
And of course a quick test. Bit of a sorting klitch, but just fixed that manually, didn’t even try to rework the sorting regex. I am also going to save the final DataFrame to a CSV. Might try to use it in future.
# now a new function to build the final df by combining the ones for each year
def golf_stats_2_df(tyrs, tids, psts):
if len(tyrs) < 1:
return None
else:
df1 = year_2_df(tyrs[0], tids, psts)
if len(tyrs) == 1:
return df1
else:
df2 = year_2_df(tyrs[1], tids, psts)
df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
if len(tyrs) > 2:
for tyr in tyrs[2:]:
df_tmp = year_2_df(tyr, tids, psts)
df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_comb</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">'\W+'</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_comb</span> <span class="o">=</span> <span class="n">df_comb</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_comb</span>
# test time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)
# that appears to have worked, except for sorting the names
# manually edidted the appropriate csv files and removed the ', Jr.' for the Ted Potter row
# one more time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)
Done With This One
That all seems to have worked. Not sure what it got me, but I had fun, so all good.
And, sadly the fun has come to an end. If you wish to play with the above, feel free to download my notebook covering the contents of this post.
Until next time…
Resources
- pandas: Rename columns / index names (labels) of DataFrame *
- Pandas Index Explained
- pandas.DataFrame.rename_axis
- pandas.DataFrame.reindex
- pandas sort rows by index using a custom function *
- pandas.DataFrame.dropna
- Merge, join, and concatenate *
- Visual Studio Code - is there a Compare feature like that plugin for Notepad ++?