Okay, as mentioned last post, time to refactor my multiple CSVs to single DataFrame experiment. So, new notebook. Remove the junk test code. Maybe some sensible functions (reduce repetition). Etc.

Approach

At the moment, what I am planning to do is create two DataFrames, one for each year (2020, 2021). Those will contain the player stats for 2 tournaments (PGA, RBC Heritage) and two stats (driving distance, greens in regulation %). Pretty sure I can get that to work, as we did just that for 2020 in the previous post. Then I will do my best to merge those two DataFrames into the single one I really want.

Refactor

As usual I will start with the imports and some potentially useful variable/list declarations.

In [1]:

import numpy as np
import pandas as pd
import re

In [2]:

# define some needed variables, lists, and the like
events = {'wmpo': 't003', 'api': 't009', 'tpc': 'to11', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674', 'scramble': '130'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG'],
           'scramble': ['PLAYER NAME', '%', 'PAR OR BETTER', 'MISSED GIR']
}
# data directory and stats, tournaments and years to process
d_dir = "./data/"
p_sids = ['drv', 'gir']
p_tids = ['pga', 'rbch']
p_yrs = ['2020', '2021']

get_csv_nm() & csv_2_df()

As mentioned, I intend to put a lot of the code into functions and perhaps modify, slightly and where appropriate, the steps taken in the previous post. I am going to start with a couple of fairly straightforward functions. The first takes the year, tournament id/label and stat type id/label and returns the appropriate file name. This will also allow me to change the naming convention in one place should I choose to do so.

The second takes a year, tournament label and stat label, reads the CSV file, creates a suitable multi-indexed DataFrame and returns it. I am also relabelling the rows and columns here rather than in later steps as I did last post. Makes more sense to me to just do it here.

And, of course, a quick test or two to make sure the function works as expected.

In [3]:

# function: take year, tounament id and stat id, return CSV file path (relative) 
def get_csv_nm(t_yr, t_id, p_st):
    global d_dir
    return f"{d_dir}{t_id}_{t_yr}_{p_st}_2.test.csv"
# function: will take a year, tournament and stat, read csv, return suitable DataFrame
def csv_2_df(t_yr, t_id, p_st):
""" Read appropriate CSV file into DataFrame. Modify Dataframe to employ multi-indices.
        Return modified DataFrame.
        Useage: csv_2_df(t_yr, t_id, p_st)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_st = player stat (e.g. 'drv', 'gir')
    """
global st_cols
col_nms = {'drv': {'AVG.': 'drv'},
'gir': {'%': 'gir'},
'scramble': {'%': 'scramble'}
}
csv_fl = get_csv_nm(t_yr, t_id, p_st)
ty_m_idx = pd.MultiIndex.from_tuples([(t_yr, t_id)])
s_col = st_cols[p_st][1]
df_stats = pd.read_csv(csv_fl, index_col=['PLAYER NAME'], usecols=[s_col, 'PLAYER NAME'])
df_stats.rename(columns=col_nms[p_st], inplace=True)
s_tmp = df_stats.stack()
ts_df = pd.DataFrame(s_tmp, columns=ty_m_idx)
ts_df.rename_axis(['player', 'stat'], inplace=True)
return ts_df

In [4]:

# quick test
df1 = csv_2_df('2020', 'pga', 'drv')
display(df1)
df2 = csv_2_df('2020', 'pga', 'gir')
display(df2)

		2020
		pga
player	stat
Cameron Champ	drv	321.1
Bryson DeChambeau	drv	318.1
Rory McIlroy	drv	312.5
Sepp Straka	drv	305.8
Tommy Fleetwood	drv	305.5
...	...	...
Charl Schwartzel	drv	276.0
Chez Reavie	drv	274.5
Brendon Todd	drv	272.3
Patrick Reed	drv	271.6
Mark Hubbard	drv	268.8

79 rows × 1 columns

		2020
		pga
player	stat
Matthew Wolff	gir	77.78
Paul Casey	gir	76.39
Jason Day	gir	76.39
Louis Oosthuizen	gir	73.61
Cameron Champ	gir	73.61
...	...	...
Denny McCarthy	gir	54.17
Harris English	gir	52.78
Brian Harman	gir	51.39
Brandt Snedeker	gir	51.39
J.T. Poston	gir	50.00

79 rows × 1 columns

Okay, that seemed to work. Now, the next function.

tourney_2_df()

This function will take a single year, a tournament identifier and a list of player stat identifiers. It will use csv_2_df() to get the DataFrames for each stat. Combine the DataFrames appropriately, sort the combined DataFrame and return the result. This function will be used to get the datasets (as DataFrames) for all the desired tournaments and stats for a given year. These will eventually be combined to give us a dataset for all the desired stats for all the desired tournaments in a single year.

A quick and simple test as well.

In [5]:

# function: will take a year, a tournament and a list of stats, generate a DataFrame for that tournament and year
def tourney_2_df(t_yr, t_id, p_sts):
    """ Combine all requested stats for a given tournament and year into a single DataFrame. 
        Return DataFrame.
        Useage: tourney_2_df(t_yr, t_id, p_sts)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_sts = list of player stat (e.g. ['drv', 'gir'])
    """
    df1 = csv_2_df(t_yr, t_id, p_sts[0])
    if len(p_sts) > 1:
        df2 = csv_2_df(t_yr, t_id, p_sts[1])
        df_tourney = pd.concat([df1, df2])
    else:
        return df1
    if len(p_sts) > 2:
        passs
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_tourney</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\W+&#39;</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_tourney</span> <span class="o">=</span> <span class="n">df_tourney</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_tourney</span>

In [6]:

# test
pga_2020 = tourney_2_df('2020', 'pga', ['drv', 'gir'])
display(pga_2020)

		2020
		pga
player	stat
Byeong Hun An	drv	286.60
Byeong Hun An	gir	62.50
Abraham Ancer	drv	295.60
Abraham Ancer	gir	63.89
Daniel Berger	drv	291.90
...	...	...
Matthew Wolff	gir	77.78
Gary Woodland	drv	293.00
Gary Woodland	gir	65.28
Tiger Woods	drv	304.00
Tiger Woods	gir	62.50

158 rows × 1 columns

That looks to work as desired.

year_2_df()

Now, let’s write that function to put together a dataset for all the desired tournaments and stats for a single year. Once again it will take a year, a list of tournaments and a list of stats. It will return ta multi-indexed DataFrame containing all the selected data for that specific year.

As usual a quick, simple test to check our function’s function.

In [7]:

# function: will take a year, list of tournaments and list of stats and generate a DataFrame for that year
def year_2_df(t_yr, t_ids, p_sts):
    df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
    if len(t_ids) == 1:
        return df1
    df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
    df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
    if len(t_ids) > 2:
        pass
    ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_comb = df_comb.reindex(ndx_sort2)
<span class="k">return</span> <span class="n">df_comb</span>

In [8]:

# test year_2_df
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2020)

		2020
		pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN
Byeong Hun An	gir	62.50	NaN
Abraham Ancer	drv	295.60	278.10
Abraham Ancer	gir	63.89	90.28
Ryan Armour	drv	NaN	275.00
...	...	...	...
Matthew Wolff	gir	77.78	NaN
Gary Woodland	drv	293.00	287.00
Gary Woodland	gir	65.28	73.61
Tiger Woods	drv	304.00	NaN
Tiger Woods	gir	62.50	NaN

234 rows × 2 columns

That also seems to work.

I am repeatedly sorting the dataframe in each function before returning it. I should likely write a separate function to take care of that, but for now will just repeat the code.

Test Complete Refactor

Okay, now let’s take the two DataFrames for the two years of data that we have and combine them to give us the final result. Hopefully at least.

In [9]:

# test merging two years
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)

		2021
		pga	rbch
player	stat
Byeong Hun An	drv	306.50	NaN
Byeong Hun An	gir	48.61	NaN
Abraham Ancer	drv	301.80	299.90
Abraham Ancer	gir	61.11	70.83
Daniel Berger	drv	298.40	299.60
...	...	...	...
Aaron Wise	gir	54.17	NaN
Gary Woodland	drv	311.10	NaN
Gary Woodland	gir	58.33	NaN
Will Zalatoris	drv	306.50	312.00
Will Zalatoris	gir	61.11	58.33

226 rows × 2 columns

		2020		2021
		pga	rbch	pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN	306.50	NaN
Byeong Hun An	gir	62.50	NaN	48.61	NaN
Abraham Ancer	drv	295.60	278.10	301.80	299.90
Abraham Ancer	gir	63.89	90.28	61.11	70.83
Ryan Armour	drv	NaN	275.00	NaN	NaN
...	...	...	...	...	...
Gary Woodland	gir	65.28	73.61	58.33	NaN
Tiger Woods	drv	304.00	NaN	NaN	NaN
Tiger Woods	gir	62.50	NaN	NaN	NaN
Will Zalatoris	drv	NaN	NaN	306.50	312.00
Will Zalatoris	gir	NaN	NaN	61.11	58.33

318 rows × 4 columns

That looks to have worked. Though I must admit I haven’t yet compared the result from the previous post with the one above. Guess I should look at doing so. We’ll see.

Basic Validation

Okay, I modified the last two notebooks to each write a CSV file of the dataframe produced for the year 2020. Then I used VSCode to compare the two files. They are indentical.

Enhance

I have decided to further rework the code/notebook to allow for the processing of at least 3 stats, 3 tournaments and 3 years of the preceding. I will again take this one step at a time.

Get More Data

I started by using a separate Python script to get the extra stats and tournament for the existing two years, 2020 and 2021. But, when I decided to add a third year, I added a function to download the new data from the PGA web site. At first the script just downloaded the data whenever called. But, as I was often restarting the notebook, it was always taking the extra time to download all the data each time. So, I modified the function to only download data not already in the data directory.

The first version looked and functioned as follows.

In [18]:

# decided to test using an additional year, 2019
# so need to get some more stats into csv files
def stats_2_csv(tyrs, tids, psts):
    global d_dir, events, stats, st_cols
    for tid in tids:
        eid = events[tid]
        for tyr in tyrs:
            for pst in psts:
                stid = stats[pst]
                tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
                tmp_stats = pd.read_html(tlnk)
                if len(tmp_stats) <= 1:
                    break
                print(f"\n{tid}, {tyr}, {pst}")
                df_stats = tmp_stats[1][st_cols[pst]]
            <span class="n">f_out</span> <span class="o">=</span> <span class="n">get_csv_nm</span><span class="p">(</span><span class="n">tyr</span><span class="p">,</span> <span class="n">tid</span><span class="p">,</span> <span class="n">pst</span><span class="p">)</span>
            <span class="n">df_stats</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">f_out</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

In [19]:

p_sts = ['drv', 'gir', 'scramble']
t_ids = ['api', 'pga', 'rbch']
t_yrs = ['2019']
stats_2_csv(t_yrs, t_ids, p_sts)

api, 2019, drv

api, 2019, gir

api, 2019, scramble

pga, 2019, drv

pga, 2019, gir

pga, 2019, scramble

rbch, 2019, drv

rbch, 2019, gir

rbch, 2019, scramble

The modified version looks like this:

def stats_2_csv(tyrs, tids, psts):
    global d_dir, events, stats, st_cols
    for tid in tids:
        eid = events[tid]
        for tyr in tyrs:
            for pst in psts:
                print(f"\n{tid}, {tyr}, {pst} ->", end='')
                stid = stats[pst]
                tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
                f_out = get_csv_nm(tyr, tid, pst)
                if Path(f_out).is_file():
                    print(" already exists, not downloaded again")
                else:
                    tmp_stats = pd.read_html(tlnk)
                    if len(tmp_stats) <= 1:
                        print(" not found on site")
                        break
                    df_stats = tmp_stats[1][st_cols[pst]]
                    df_stats.to_csv(f_out, index=False)
                    print(" downloaded and saved to CSV")

tourney_2_df()

Now let’s modify the tourney to DataFrame function to allow for more than 2 stats. And give it a test or two.

In [11]:

# let's try that with 3 different stats
# will need to redefine tourney_2_df
# you may have noticed the extra condition with only a 'pass'
def tourney_2_df(t_yr, t_id, p_sts):
    """ Combine all requested stats for a given tournament and year into a single DataFrame. 
        Return DataFrame.
        Useage: tourney_2_df(t_yr, t_id, p_sts)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_sts = list of player stat (e.g. ['drv', 'gir'])
    """
    df1 = csv_2_df(t_yr, t_id, p_sts[0])
    if len(p_sts) > 1:
        df2 = csv_2_df(t_yr, t_id, p_sts[1])
        df_tourney = pd.concat([df1, df2])
    else:
        return df1
    if len(p_sts) > 2:
        all_dfs = [df_tourney]
        for p_st in p_sts[2:]:
            df_tmp = csv_2_df(t_yr, t_id, p_st)
            all_dfs.append(df_tmp)
        df_tourney = pd.concat(all_dfs)
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_tourney</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\W+&#39;</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_tourney</span> <span class="o">=</span> <span class="n">df_tourney</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df_tourney</span>

In [12]:

# test time
p_sts = ['drv', 'gir', 'scramble']
pga_2020 = tourney_2_df('2020', 'pga', p_sts)
display(pga_2020)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_play_8.test.csv'
pga_2020.to_csv(golf_csv)

		2020
		pga
player	stat
Byeong Hun An	drv	286.60
	gir	62.50
	scramble	62.96
Abraham Ancer	drv	295.60
Abraham Ancer	gir	63.89
...	...	...
Gary Woodland	gir	65.28
Gary Woodland	scramble	64.00
Tiger Woods	drv	304.00
	gir	62.50
	scramble	59.26

237 rows × 1 columns

In [13]:

# that seemed to work, so let's do a second tourney and merge
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)

		2020
		pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN
	gir	62.50	NaN
	scramble	62.96	NaN
Abraham Ancer	drv	295.60	278.10
Abraham Ancer	gir	63.89	90.28
...	...	...	...
Gary Woodland	gir	65.28	73.61
Gary Woodland	scramble	64.00	42.11
Tiger Woods	drv	304.00	NaN
	gir	62.50	NaN
	scramble	59.26	NaN

351 rows × 2 columns

In [14]:

# okay, and now a 2nd year
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)

		2021
		pga	rbch
player	stat
Byeong Hun An	drv	306.50	NaN
	gir	48.61	NaN
	scramble	67.57	NaN
Abraham Ancer	drv	301.80	299.90
Abraham Ancer	gir	61.11	70.83
...	...	...	...
Gary Woodland	gir	58.33	NaN
Gary Woodland	scramble	60.00	NaN
Will Zalatoris	drv	306.50	312.00
	gir	61.11	58.33
	scramble	64.29	56.67

339 rows × 2 columns

		2020		2021
		pga	rbch	pga	rbch
player	stat
Byeong Hun An	drv	286.60	NaN	306.50	NaN
	gir	62.50	NaN	48.61	NaN
	scramble	62.96	NaN	67.57	NaN
Abraham Ancer	drv	295.60	278.10	301.80	299.90
Abraham Ancer	gir	63.89	90.28	61.11	70.83
...	...	...	...	...	...
Tiger Woods	gir	62.50	NaN	NaN	NaN
Tiger Woods	scramble	59.26	NaN	NaN	NaN
Will Zalatoris	drv	NaN	NaN	306.50	312.00
	gir	NaN	NaN	61.11	58.33
	scramble	NaN	NaN	64.29	56.67

477 rows × 4 columns

year_2_df()

That looked to work as desired. Now, I am going to allow for a 3rd tournament. So, let’s refactor year_2_df() and test.

In [15]:

# now let's add a 3rd tournament
# and of course we need to redefine year_2_df
def year_2_df(t_yr, t_ids, p_sts):
    df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
    if len(t_ids) == 1:
        return df1
    df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
    df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
    if len(t_ids) > 2:
        for t_id in t_ids[2:]:
            df_tmp = tourney_2_df(t_yr, t_id, p_sts)
            df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
<span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_comb</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\W+&#39;</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">df_comb</span> <span class="o">=</span> <span class="n">df_comb</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>

<span class="k">return</span> <span class="n">df_comb</span>

</div>

In [16]:

# testing 1 2 3
df_2020 = year_2_df('2020', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)

		2020
		api	pga	rbch
player	stat
Byeong Hun An	drv	291.80	286.60	NaN
	gir	66.67	62.50	NaN
	scramble	54.17	62.96	NaN
Abraham Ancer	drv	281.90	295.60	278.10
Abraham Ancer	gir	48.61	63.89	90.28
...	...	...	...	...
Tiger Woods	gir	NaN	62.50	NaN
Tiger Woods	scramble	NaN	59.26	NaN
Xinjun Zhang	drv	288.80	NaN	NaN
	gir	50.00	NaN	NaN
	scramble	52.78	NaN	NaN

441 rows × 3 columns

Combine 2 Years

Finally, we’ll generate the DataFrame for 2021 and combine the two.

In [17]:

# add the second year
df_2021 = year_2_df('2021', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)

		2020			2021
		api	pga	rbch	api	pga	rbch
player	stat
Byeong Hun An	drv	291.80	286.60	NaN	290.60	306.50	NaN
	gir	66.67	62.50	NaN	56.94	48.61	NaN
	scramble	54.17	62.96	NaN	64.52	67.57	NaN
Abraham Ancer	drv	281.90	295.60	278.10	NaN	301.80	299.90
Abraham Ancer	gir	48.61	63.89	90.28	NaN	61.11	70.83
...	...	...	...	...	...	...	...
Will Zalatoris	gir	NaN	NaN	NaN	69.44	61.11	58.33
Will Zalatoris	scramble	NaN	NaN	NaN	59.09	64.29	56.67
Xinjun Zhang	drv	288.80	NaN	NaN	NaN	NaN	NaN
	gir	50.00	NaN	NaN	NaN	NaN	NaN
	scramble	52.78	NaN	NaN	NaN	NaN	NaN

558 rows × 6 columns

Enhance Further

Well no sense stopping there. Let’s add another function and generate a dataset covering 3 years (2019-2021), 3 tournaments (Arnold Palmer, PGA and RBC Heritage) and 3 stat types (drive length, greens in regulation and scrambling). The new function, golf_stats_2_df(), will take the usual three lists, use the other functions to get/format the requested data (which is presumed to exist in appropriately named CSV files) and return the final DataFrame.

And of course a quick test. Bit of a sorting klitch, but just fixed that manually, didn’t even try to rework the sorting regex. I am also going to save the final DataFrame to a CSV. Might try to use it in future.

In [20]:

# now a new function to build the final df by combining the ones for each year
def golf_stats_2_df(tyrs, tids, psts):
    if len(tyrs) < 1:
        return None
    else:    
        df1 = year_2_df(tyrs[0], tids, psts)
        if len(tyrs) == 1:
            return df1
        else:
            df2 = year_2_df(tyrs[1], tids, psts)
            df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
            if len(tyrs) > 2:
                for tyr in tyrs[2:]:
                    df_tmp = year_2_df(tyr, tids, psts)
                    df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
    <span class="n">ndx_sort2</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df_comb</span><span class="o">.</span><span class="n">index</span><span class="p">,</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\W+&#39;</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">df_comb</span> <span class="o">=</span> <span class="n">df_comb</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">ndx_sort2</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">df_comb</span>

In [21]:

# test time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)

		2019			2020			2021
		api	pga	rbch	api	pga	rbch	api	pga	rbch
player	stat
Ted Potter, Jr.	drv	NaN	NaN	274.50	NaN	NaN	NaN	NaN	NaN	NaN
	gir	NaN	NaN	51.39	NaN	NaN	NaN	NaN	NaN	NaN
	scramble	NaN	NaN	51.43	NaN	NaN	NaN	NaN	NaN	NaN
Byeong Hun An	drv	312.90	NaN	NaN	291.80	286.6	NaN	290.60	306.50	NaN
Byeong Hun An	gir	63.89	NaN	NaN	66.67	62.5	NaN	56.94	48.61	NaN
...	...	...	...	...	...	...	...	...	...	...
Will Zalatoris	gir	NaN	NaN	NaN	NaN	NaN	NaN	69.44	61.11	58.33
Will Zalatoris	scramble	NaN	NaN	NaN	NaN	NaN	NaN	59.09	64.29	56.67
Xinjun Zhang	drv	NaN	NaN	NaN	288.80	NaN	NaN	NaN	NaN	NaN
	gir	NaN	NaN	NaN	50.00	NaN	NaN	NaN	NaN	NaN
	scramble	NaN	NaN	NaN	52.78	NaN	NaN	NaN	NaN	NaN

678 rows × 9 columns

In [29]:

# that appears to have worked, except for sorting the names
# manually edidted the appropriate csv files and removed the ', Jr.' for the Ted Potter row
# one more time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)

		2019			2020			2021
		api	pga	rbch	api	pga	rbch	api	pga	rbch
player	stat
Byeong Hun An	drv	312.90	NaN	NaN	291.80	286.60	NaN	290.60	306.50	NaN
	gir	63.89	NaN	NaN	66.67	62.50	NaN	56.94	48.61	NaN
	scramble	69.23	NaN	NaN	54.17	62.96	NaN	64.52	67.57	NaN
Abraham Ancer	drv	NaN	285.60	NaN	281.90	295.60	278.10	NaN	301.80	299.90
Abraham Ancer	gir	NaN	56.94	NaN	48.61	63.89	90.28	NaN	61.11	70.83
...	...	...	...	...	...	...	...	...	...	...
Will Zalatoris	gir	NaN	NaN	NaN	NaN	NaN	NaN	69.44	61.11	58.33
Will Zalatoris	scramble	NaN	NaN	NaN	NaN	NaN	NaN	59.09	64.29	56.67
Xinjun Zhang	drv	NaN	NaN	NaN	288.80	NaN	NaN	NaN	NaN	NaN
	gir	NaN	NaN	NaN	50.00	NaN	NaN	NaN	NaN	NaN
	scramble	NaN	NaN	NaN	52.78	NaN	NaN	NaN	NaN	NaN

678 rows × 9 columns

Done With This One

That all seems to have worked. Not sure what it got me, but I had fun, so all good.

And, sadly the fun has come to an end. If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Until next time…

Resources

pandas: Rename columns / index names (labels) of DataFrame *
Pandas Index Explained
pandas.DataFrame.rename_axis
pandas.DataFrame.reindex
pandas sort rows by index using a custom function *
pandas.DataFrame.dropna
Merge, join, and concatenate *
Visual Studio Code - is there a Compare feature like that plugin for Notepad ++?

Too Old To Code

Data Science Basics: Create Golf Stats DataFrame II

Approach

Refactor

get_csv_nm() & csv_2_df()

tourney_2_df()

year_2_df()

Test Complete Refactor

Basic Validation

Enhance

Get More Data

tourney_2_df()

year_2_df()

Combine 2 Years

Enhance Further

Done With This One

Resources