Okay, as mentioned in the last post, time to refactor my multiple-CSVs-to-single-DataFrame experiment. So, new notebook. Remove the junk test code. Add some sensible functions (to reduce repetition). Etc.

Approach

At the moment, what I am planning to do is create two DataFrames, one for each year (2020, 2021). Those will contain the player stats for 2 tournaments (PGA, RBC Heritage) and two stats (driving distance, greens in regulation %). Pretty sure I can get that to work, as we did just that for 2020 in the previous post. Then I will do my best to merge those two DataFrames into the single one I really want.

Refactor

As usual I will start with the imports and some potentially useful variable/list declarations.

In [1]:
import numpy as np
import pandas as pd
import re
In [2]:
# define some needed variables, lists, and the like
events = {'wmpo': 't003', 'api': 't009', 'tpc': 't011', 'rbch': 't012', 'masters': 't014', 'pga': 't033'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674', 'scramble': '130'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG'],
           'scramble': ['PLAYER NAME', '%', 'PAR OR BETTER', 'MISSED GIR']
}
# data directory and stats, tournaments and years to process
d_dir = "./data/"
p_sids = ['drv', 'gir']
p_tids = ['pga', 'rbch']
p_yrs = ['2020', '2021']

get_csv_nm() & csv_2_df()

As mentioned, I intend to put a lot of the code into functions and perhaps modify, slightly and where appropriate, the steps taken in the previous post. I am going to start with a couple of fairly straightforward functions. The first takes the year, tournament id/label and stat type id/label and returns the appropriate file name. This will also allow me to change the naming convention in one place should I choose to do so.

The second takes a year, tournament label and stat label, reads the CSV file, creates a suitable multi-indexed DataFrame and returns it. I am also relabelling the rows and columns here rather than in later steps as I did last post. Makes more sense to me to just do it here.

And, of course, a quick test or two to make sure the function works as expected.

In [3]:
# function: take year, tournament id and stat id, return CSV file path (relative)
def get_csv_nm(t_yr, t_id, p_st):
    global d_dir
    return f"{d_dir}{t_id}_{t_yr}_{p_st}_2.test.csv"

# function: will take a year, tournament and stat, read csv, return suitable DataFrame
def csv_2_df(t_yr, t_id, p_st):
    """ Read appropriate CSV file into DataFrame. Modify DataFrame to employ multi-indices.
        Return modified DataFrame.
        Usage: csv_2_df(t_yr, t_id, p_st)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_st = player stat (e.g. 'drv', 'gir')
    """
    global st_cols
    col_nms = {'drv': {'AVG.': 'drv'},
               'gir': {'%': 'gir'},
               'scramble': {'%': 'scramble'}
    }
    csv_fl = get_csv_nm(t_yr, t_id, p_st)
    ty_m_idx = pd.MultiIndex.from_tuples([(t_yr, t_id)])
    s_col = st_cols[p_st][1]
    df_stats = pd.read_csv(csv_fl, index_col=['PLAYER NAME'], usecols=[s_col, 'PLAYER NAME'])
    df_stats.rename(columns=col_nms[p_st], inplace=True)
    s_tmp = df_stats.stack()
    ts_df = pd.DataFrame(s_tmp, columns=ty_m_idx)
    ts_df.rename_axis(['player', 'stat'], inplace=True)
    return ts_df

In [4]:
# quick test
df1 = csv_2_df('2020', 'pga', 'drv')
display(df1)
df2 = csv_2_df('2020', 'pga', 'gir')
display(df2)
                                2020
                                 pga
player             stat
Cameron Champ      drv         321.1
Bryson DeChambeau  drv         318.1
Rory McIlroy       drv         312.5
Sepp Straka        drv         305.8
Tommy Fleetwood    drv         305.5
...                ...           ...
Charl Schwartzel   drv         276.0
Chez Reavie        drv         274.5
Brendon Todd       drv         272.3
Patrick Reed       drv         271.6
Mark Hubbard       drv         268.8

79 rows × 1 columns

                                2020
                                 pga
player             stat
Matthew Wolff      gir         77.78
Paul Casey         gir         76.39
Jason Day          gir         76.39
Louis Oosthuizen   gir         73.61
Cameron Champ      gir         73.61
...                ...           ...
Denny McCarthy     gir         54.17
Harris English     gir         52.78
Brian Harman       gir         51.39
Brandt Snedeker    gir         51.39
J.T. Poston        gir         50.00

79 rows × 1 columns

Okay, that seemed to work. Now, the next function.
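As an aside, the reshaping inside csv_2_df() can be tried without any CSV files. The following is only a sketch on two made-up rows (the player names and values are placeholders, not real stats). It uses to_frame() plus a column assignment in place of the pd.DataFrame(s_tmp, columns=...) call, which should give the same shape.

```python
import pandas as pd

# made-up rows standing in for one of the CSV files (not real stats)
raw = pd.DataFrame({'PLAYER NAME': ['Player A', 'Player B'],
                    'AVG.': [300.1, 290.2]}).set_index('PLAYER NAME')

# relabel the stat column, then move that label into the row index
raw = raw.rename(columns={'AVG.': 'drv'})
stacked = raw.stack()    # Series indexed by (player name, stat label)

# turn the Series into a one-column frame labelled (year, tournament)
demo = stacked.to_frame()
demo.columns = pd.MultiIndex.from_tuples([('2020', 'pga')])
demo = demo.rename_axis(['player', 'stat'])
print(demo)
```

Each row ends up keyed by (player, stat) and each column by (year, tournament), which is what lets the later steps line frames up side by side.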

tourney_2_df()

This function will take a single year, a tournament identifier and a list of player stat identifiers. It will use csv_2_df() to get the DataFrame for each stat, combine them, sort the combined DataFrame and return the result. Called once per tournament, it gives us the datasets (as DataFrames) for all the desired tournaments and stats for a given year. These will eventually be combined to give us a dataset for all the desired stats for all the desired tournaments in a single year.

A quick and simple test as well.

In [5]:
# function: will take a year, a tournament and a list of stats, generate a DataFrame for that tournament and year
def tourney_2_df(t_yr, t_id, p_sts):
    """ Combine all requested stats for a given tournament and year into a single DataFrame. 
        Return DataFrame.
        Usage: tourney_2_df(t_yr, t_id, p_sts)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_sts = list of player stat (e.g. ['drv', 'gir'])
    """
    df1 = csv_2_df(t_yr, t_id, p_sts[0])
    if len(p_sts) > 1:
        df2 = csv_2_df(t_yr, t_id, p_sts[1])
        df_tourney = pd.concat([df1, df2])
    else:
        return df1
    if len(p_sts) > 2:
        pass
    ndx_sort2 = sorted(df_tourney.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_tourney = df_tourney.reindex(ndx_sort2)
    return df_tourney

In [6]:
# test
pga_2020 = tourney_2_df('2020', 'pga', ['drv', 'gir'])
display(pga_2020)
                                2020
                                 pga
player             stat
Byeong Hun An      drv        286.60
                   gir         62.50
Abraham Ancer      drv        295.60
                   gir         63.89
Daniel Berger      drv        291.90
...                ...           ...
Matthew Wolff      gir         77.78
Gary Woodland      drv        293.00
                   gir         65.28
Tiger Woods        drv        304.00
                   gir         62.50

158 rows × 1 columns

That looks to work as desired.

year_2_df()

Now, let’s write that function to put together a dataset for all the desired tournaments and stats for a single year. Once again it will take a year, a list of tournaments and a list of stats. It will return a multi-indexed DataFrame containing all the selected data for that specific year.

As usual a quick, simple test to check our function’s function.

In [7]:
# function: will take a year, list of tournaments and list of stats and generate a DataFrame for that year
def year_2_df(t_yr, t_ids, p_sts):
    df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
    if len(t_ids) == 1:
        return df1
    df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
    df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
    if len(t_ids) > 2:
        pass
    ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_comb = df_comb.reindex(ndx_sort2)
    return df_comb

In [8]:
# test year_2_df
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2020)
                                2020
                                 pga    rbch
player             stat
Byeong Hun An      drv        286.60     NaN
                   gir         62.50     NaN
Abraham Ancer      drv        295.60  278.10
                   gir         63.89   90.28
Ryan Armour        drv           NaN  275.00
...                ...           ...     ...
Matthew Wolff      gir         77.78     NaN
Gary Woodland      drv        293.00  287.00
                   gir         65.28   73.61
Tiger Woods        drv        304.00     NaN
                   gir         62.50     NaN

234 rows × 2 columns

That also seems to work.

I am repeatedly sorting the dataframe in each function before returning it. I should likely write a separate function to take care of that, but for now will just repeat the code.
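A helper along those lines might look like the sketch below. The name sort_by_last_name() is hypothetical (it is not in the notebook); the body is just the two sorting lines repeated above, wrapped in a function and tried on a tiny made-up frame.

```python
import re
import pandas as pd

def sort_by_last_name(df):
    """Reorder rows by the last word of the player name (index level 0)."""
    ordered = sorted(df.index, key=lambda x: re.split(r'\W+', x[0])[-1])
    return df.reindex(ordered)

# tiny frame in the same (player, stat) x (year, tournament) layout
idx = pd.MultiIndex.from_tuples(
    [('Tiger Woods', 'drv'), ('Abraham Ancer', 'drv'), ('Daniel Berger', 'drv')],
    names=['player', 'stat'])
demo = pd.DataFrame({('2020', 'pga'): [304.0, 295.6, 291.9]}, index=idx)
print(sort_by_last_name(demo))
```

Each of the functions could then end with a single return sort_by_last_name(...) call, and any change to the sort rule would only need to be made in one place.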

Test Complete Refactor

Okay, now let’s take the two DataFrames for the two years of data that we have and combine them to give us the final result. Hopefully at least.

In [9]:
# test merging two years
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
                                2021
                                 pga    rbch
player             stat
Byeong Hun An      drv        306.50     NaN
                   gir         48.61     NaN
Abraham Ancer      drv        301.80  299.90
                   gir         61.11   70.83
Daniel Berger      drv        298.40  299.60
...                ...           ...     ...
Aaron Wise         gir         54.17     NaN
Gary Woodland      drv        311.10     NaN
                   gir         58.33     NaN
Will Zalatoris     drv        306.50  312.00
                   gir         61.11   58.33

226 rows × 2 columns

                                2020            2021
                                 pga    rbch     pga    rbch
player             stat
Byeong Hun An      drv        286.60     NaN  306.50     NaN
                   gir         62.50     NaN   48.61     NaN
Abraham Ancer      drv        295.60  278.10  301.80  299.90
                   gir         63.89   90.28   61.11   70.83
Ryan Armour        drv           NaN  275.00     NaN     NaN
...                ...           ...     ...     ...     ...
Gary Woodland      gir         65.28   73.61   58.33     NaN
Tiger Woods        drv        304.00     NaN     NaN     NaN
                   gir         62.50     NaN     NaN     NaN
Will Zalatoris     drv           NaN     NaN  306.50  312.00
                   gir           NaN     NaN   61.11   58.33

318 rows × 4 columns

That looks to have worked. Though I must admit I haven’t yet compared the result from the previous post with the one above. Guess I should look at doing so. We’ll see.

Basic Validation

Okay, I modified the last two notebooks to each write a CSV file of the DataFrame produced for the year 2020. Then I used VSCode to compare the two files. They are identical.
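The same check could probably be done in code instead of in an editor. A minimal sketch, assuming both notebooks wrote their CSVs with the same layout; csvs_match() is a hypothetical helper, and the in-memory strings below merely stand in for the two files (made-up data).

```python
import io
import pandas as pd

def csvs_match(path_a, path_b):
    """Return True if both CSV sources load into equal DataFrames."""
    df_a = pd.read_csv(path_a)
    df_b = pd.read_csv(path_b)
    # equals() treats NaNs in the same position as equal
    return df_a.equals(df_b)

# in-memory stand-ins for the two files; real use would pass file paths
same = "player,stat,pga\nPlayer A,drv,300.1\nPlayer B,drv,\n"
changed = same.replace('300.1', '300.2')
print(csvs_match(io.StringIO(same), io.StringIO(same)))
print(csvs_match(io.StringIO(same), io.StringIO(changed)))
```

Comparing parsed frames rather than raw text ignores purely cosmetic differences such as line endings, which may or may not be what is wanted.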

Enhance

I have decided to further rework the code/notebook to allow for processing at least 3 stats, 3 tournaments and 3 years of data. I will again take this one step at a time.

Get More Data

I started by using a separate Python script to get the extra stats and tournament for the existing two years, 2020 and 2021. But, when I decided to add a third year, I added a function to download the new data from the PGA web site. At first the script just downloaded the data whenever called. But, as I was often restarting the notebook, it was always taking the extra time to download all the data each time. So, I modified the function to only download data not already in the data directory.

The first version looked and functioned as follows.

In [18]:
# decided to test using an additional year, 2019
# so need to get some more stats into csv files
def stats_2_csv(tyrs, tids, psts):
    global d_dir, events, stats, st_cols
    for tid in tids:
        eid = events[tid]
        for tyr in tyrs:
            for pst in psts:
                stid = stats[pst]
                tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
                tmp_stats = pd.read_html(tlnk)
                if len(tmp_stats) <= 1:
                    break
                print(f"\n{tid}, {tyr}, {pst}")
                df_stats = tmp_stats[1][st_cols[pst]]
                f_out = get_csv_nm(tyr, tid, pst)
                df_stats.to_csv(f_out, index=False)

In [19]:
p_sts = ['drv', 'gir', 'scramble']
t_ids = ['api', 'pga', 'rbch']
t_yrs = ['2019']
stats_2_csv(t_yrs, t_ids, p_sts)
api, 2019, drv

api, 2019, gir

api, 2019, scramble

pga, 2019, drv

pga, 2019, gir

pga, 2019, scramble

rbch, 2019, drv

rbch, 2019, gir

rbch, 2019, scramble

The modified version looks like this:

from pathlib import Path

def stats_2_csv(tyrs, tids, psts):
    global d_dir, events, stats, st_cols
    for tid in tids:
        eid = events[tid]
        for tyr in tyrs:
            for pst in psts:
                print(f"\n{tid}, {tyr}, {pst} ->", end='')
                stid = stats[pst]
                tlnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{stid}.y{tyr}.eon.{eid}.html'
                f_out = get_csv_nm(tyr, tid, pst)
                if Path(f_out).is_file():
                    print(" already exists, not downloaded again")
                else:
                    tmp_stats = pd.read_html(tlnk)
                    if len(tmp_stats) <= 1:
                        print(" not found on site")
                        break
                    df_stats = tmp_stats[1][st_cols[pst]]
                    df_stats.to_csv(f_out, index=False)
                    print(" downloaded and saved to CSV")

tourney_2_df()

Now let’s modify the tourney to DataFrame function to allow for more than 2 stats. And give it a test or two.

In [11]:
# let's try that with 3 different stats
# will need to redefine tourney_2_df
# you may have noticed the extra condition with only a 'pass'
def tourney_2_df(t_yr, t_id, p_sts):
    """ Combine all requested stats for a given tournament and year into a single DataFrame. 
        Return DataFrame.
        Usage: tourney_2_df(t_yr, t_id, p_sts)
          where t_yr = tournament year
                t_id = tournament id (e.g. 'pga')
                p_sts = list of player stat (e.g. ['drv', 'gir'])
    """
    df1 = csv_2_df(t_yr, t_id, p_sts[0])
    if len(p_sts) > 1:
        df2 = csv_2_df(t_yr, t_id, p_sts[1])
        df_tourney = pd.concat([df1, df2])
    else:
        return df1
    if len(p_sts) > 2:
        all_dfs = [df_tourney]
        for p_st in p_sts[2:]:
            df_tmp = csv_2_df(t_yr, t_id, p_st)
            all_dfs.append(df_tmp)
        df_tourney = pd.concat(all_dfs)
    ndx_sort2 = sorted(df_tourney.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_tourney = df_tourney.reindex(ndx_sort2)
    return df_tourney

In [12]:
# test time
p_sts = ['drv', 'gir', 'scramble']
pga_2020 = tourney_2_df('2020', 'pga', p_sts)
display(pga_2020)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_play_8.test.csv'
pga_2020.to_csv(golf_csv)
                                2020
                                 pga
player             stat
Byeong Hun An      drv        286.60
                   gir         62.50
                   scramble    62.96
Abraham Ancer      drv        295.60
                   gir         63.89
...                ...           ...
Gary Woodland      gir         65.28
                   scramble    64.00
Tiger Woods        drv        304.00
                   gir         62.50
                   scramble    59.26

237 rows × 1 columns

In [13]:
# that seemed to work, so let's do a second tourney and merge
df_2020 = year_2_df('2020', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)
                                2020
                                 pga    rbch
player             stat
Byeong Hun An      drv        286.60     NaN
                   gir         62.50     NaN
                   scramble    62.96     NaN
Abraham Ancer      drv        295.60  278.10
                   gir         63.89   90.28
...                ...           ...     ...
Gary Woodland      gir         65.28   73.61
                   scramble    64.00   42.11
Tiger Woods        drv        304.00     NaN
                   gir         62.50     NaN
                   scramble    59.26     NaN

351 rows × 2 columns

In [14]:
# okay, and now a 2nd year
df_2021 = year_2_df('2021', ['pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2021)
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
                                2021
                                 pga    rbch
player             stat
Byeong Hun An      drv        306.50     NaN
                   gir         48.61     NaN
                   scramble    67.57     NaN
Abraham Ancer      drv        301.80  299.90
                   gir         61.11   70.83
...                ...           ...     ...
Gary Woodland      gir         58.33     NaN
                   scramble    60.00     NaN
Will Zalatoris     drv        306.50  312.00
                   gir         61.11   58.33
                   scramble    64.29   56.67

339 rows × 2 columns

                                2020            2021
                                 pga    rbch     pga    rbch
player             stat
Byeong Hun An      drv        286.60     NaN  306.50     NaN
                   gir         62.50     NaN   48.61     NaN
                   scramble    62.96     NaN   67.57     NaN
Abraham Ancer      drv        295.60  278.10  301.80  299.90
                   gir         63.89   90.28   61.11   70.83
...                ...           ...     ...     ...     ...
Tiger Woods        gir         62.50     NaN     NaN     NaN
                   scramble    59.26     NaN     NaN     NaN
Will Zalatoris     drv           NaN     NaN  306.50  312.00
                   gir           NaN     NaN   61.11   58.33
                   scramble      NaN     NaN   64.29   56.67

477 rows × 4 columns

year_2_df()

That looked to work as desired. Now, I am going to allow for a 3rd tournament. So, let’s refactor year_2_df() and test.

In [15]:
# now let's add a 3rd tournament
# and of course we need to redefine year_2_df
def year_2_df(t_yr, t_ids, p_sts):
    df1 = tourney_2_df(t_yr, t_ids[0], p_sts)
    if len(t_ids) == 1:
        return df1
    df2 = tourney_2_df(t_yr, t_ids[1], p_sts)
    df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
    if len(t_ids) > 2:
        for t_id in t_ids[2:]:
            df_tmp = tourney_2_df(t_yr, t_id, p_sts)
            df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
    ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_comb = df_comb.reindex(ndx_sort2)

    return df_comb

In [16]:
# testing 1 2 3
df_2020 = year_2_df('2020', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
display(df_2020)
                                2020
                                 api     pga    rbch
player             stat
Byeong Hun An      drv        291.80  286.60     NaN
                   gir         66.67   62.50     NaN
                   scramble    54.17   62.96     NaN
Abraham Ancer      drv        281.90  295.60  278.10
                   gir         48.61   63.89   90.28
...                ...           ...     ...     ...
Tiger Woods        gir           NaN   62.50     NaN
                   scramble      NaN   59.26     NaN
Xinjun Zhang       drv        288.80     NaN     NaN
                   gir         50.00     NaN     NaN
                   scramble    52.78     NaN     NaN

441 rows × 3 columns
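The special-casing of the first two tournaments could likely be dropped. The sketch below swaps the chained pd.merge() calls for a single pd.concat() along the column axis, which also does an outer join on the row index. The helper names and the tiny frames are made up for illustration, and this has not been checked against the real data, so treat it as an idea rather than a drop-in replacement.

```python
import pandas as pd

def make_frame(yr, tid, rows):
    """Build a tiny frame shaped like the output of tourney_2_df() (made-up data)."""
    idx = pd.MultiIndex.from_tuples(list(rows), names=['player', 'stat'])
    # a dict with a tuple key yields a (year, tournament) MultiIndex column
    return pd.DataFrame({(yr, tid): list(rows.values())}, index=idx)

def combine_frames(frames):
    """Outer-join any number of frames on their shared (player, stat) index."""
    return pd.concat(frames, axis=1)

api = make_frame('2020', 'api', {('Player A', 'drv'): 291.8})
pga = make_frame('2020', 'pga', {('Player A', 'drv'): 286.6,
                                 ('Player B', 'drv'): 295.6})
rbch = make_frame('2020', 'rbch', {('Player B', 'drv'): 278.1})

combined = combine_frames([api, pga, rbch])
print(combined)
```

Because the list can hold one frame or many, the same call would cover the one, two and three-or-more tournament cases; the surname sort would still need to be applied afterwards.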

Combine 2 Years

Finally, we’ll generate the DataFrame for 2021 and combine the two.

In [17]:
# add the second year
df_2021 = year_2_df('2021', ['api', 'pga', 'rbch'], ['drv', 'gir', 'scramble'])
df_comb = pd.merge(df_2020, df_2021, how='outer', on=['player', 'stat'])
ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
df_comb = df_comb.reindex(ndx_sort2)
display(df_comb)
                                2020                    2021
                                 api     pga    rbch     api     pga    rbch
player             stat
Byeong Hun An      drv        291.80  286.60     NaN  290.60  306.50     NaN
                   gir         66.67   62.50     NaN   56.94   48.61     NaN
                   scramble    54.17   62.96     NaN   64.52   67.57     NaN
Abraham Ancer      drv        281.90  295.60  278.10     NaN  301.80  299.90
                   gir         48.61   63.89   90.28     NaN   61.11   70.83
...                ...           ...     ...     ...     ...     ...     ...
Will Zalatoris     gir           NaN     NaN     NaN   69.44   61.11   58.33
                   scramble      NaN     NaN     NaN   59.09   64.29   56.67
Xinjun Zhang       drv        288.80     NaN     NaN     NaN     NaN     NaN
                   gir         50.00     NaN     NaN     NaN     NaN     NaN
                   scramble    52.78     NaN     NaN     NaN     NaN     NaN

558 rows × 6 columns

Enhance Further

Well no sense stopping there. Let’s add another function and generate a dataset covering 3 years (2019-2021), 3 tournaments (Arnold Palmer, PGA and RBC Heritage) and 3 stat types (drive length, greens in regulation and scrambling). The new function, golf_stats_2_df(), will take the usual three lists, use the other functions to get/format the requested data (which is presumed to exist in appropriately named CSV files) and return the final DataFrame.

And of course a quick test. There was a bit of a sorting glitch, but I just fixed that manually; I didn’t even try to rework the sorting regex. I am also going to save the final DataFrame to a CSV. Might try to use it in future.

In [20]:
# now a new function to build the final df by combining the ones for each year
def golf_stats_2_df(tyrs, tids, psts):
    if len(tyrs) < 1:
        return None
    else:    
        df1 = year_2_df(tyrs[0], tids, psts)
        if len(tyrs) == 1:
            return df1
        else:
            df2 = year_2_df(tyrs[1], tids, psts)
            df_comb = pd.merge(df1, df2, how='outer', on=['player', 'stat'])
            if len(tyrs) > 2:
                for tyr in tyrs[2:]:
                    df_tmp = year_2_df(tyr, tids, psts)
                    df_comb = pd.merge(df_comb, df_tmp, how='outer', on=['player', 'stat'])
    ndx_sort2 = sorted(df_comb.index,key=lambda x: re.split(r'\W+', x[0])[-1])
    df_comb = df_comb.reindex(ndx_sort2)

    return df_comb

In [21]:
# test time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)
                                2019                    2020                    2021
                                 api     pga    rbch     api     pga    rbch     api     pga    rbch
player             stat
Ted Potter, Jr.    drv           NaN     NaN  274.50     NaN     NaN     NaN     NaN     NaN     NaN
                   gir           NaN     NaN   51.39     NaN     NaN     NaN     NaN     NaN     NaN
                   scramble      NaN     NaN   51.43     NaN     NaN     NaN     NaN     NaN     NaN
Byeong Hun An      drv        312.90     NaN     NaN  291.80   286.6     NaN  290.60  306.50     NaN
                   gir         63.89     NaN     NaN   66.67    62.5     NaN   56.94   48.61     NaN
...                ...           ...     ...     ...     ...     ...     ...     ...     ...     ...
Will Zalatoris     gir           NaN     NaN     NaN     NaN     NaN     NaN   69.44   61.11   58.33
                   scramble      NaN     NaN     NaN     NaN     NaN     NaN   59.09   64.29   56.67
Xinjun Zhang       drv           NaN     NaN     NaN  288.80     NaN     NaN     NaN     NaN     NaN
                   gir           NaN     NaN     NaN   50.00     NaN     NaN     NaN     NaN     NaN
                   scramble      NaN     NaN     NaN   52.78     NaN     NaN     NaN     NaN     NaN

678 rows × 9 columns

In [29]:
# that appears to have worked, except for sorting the names
# manually edited the appropriate csv files and removed the ', Jr.' for the Ted Potter row
# one more time
all_yrs = ['2019', '2020', '2021']
golf_stats = golf_stats_2_df(all_yrs, t_ids, p_sts)
display(golf_stats)
# want to see what the csv looks like
golf_csv = f'{d_dir}golf_stats.test.csv'
golf_stats.to_csv(golf_csv)
                                2019                    2020                    2021
                                 api     pga    rbch     api     pga    rbch     api     pga    rbch
player             stat
Byeong Hun An      drv        312.90     NaN     NaN  291.80  286.60     NaN  290.60  306.50     NaN
                   gir         63.89     NaN     NaN   66.67   62.50     NaN   56.94   48.61     NaN
                   scramble    69.23     NaN     NaN   54.17   62.96     NaN   64.52   67.57     NaN
Abraham Ancer      drv           NaN  285.60     NaN  281.90  295.60  278.10     NaN  301.80  299.90
                   gir           NaN   56.94     NaN   48.61   63.89   90.28     NaN   61.11   70.83
...                ...           ...     ...     ...     ...     ...     ...     ...     ...     ...
Will Zalatoris     gir           NaN     NaN     NaN     NaN     NaN     NaN   69.44   61.11   58.33
                   scramble      NaN     NaN     NaN     NaN     NaN     NaN   59.09   64.29   56.67
Xinjun Zhang       drv           NaN     NaN     NaN  288.80     NaN     NaN     NaN     NaN     NaN
                   gir           NaN     NaN     NaN   50.00     NaN     NaN     NaN     NaN     NaN
                   scramble      NaN     NaN     NaN   52.78     NaN     NaN     NaN     NaN     NaN

678 rows × 9 columns
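For what it is worth, the glitch appears to come from the trailing period: re.split(r'\W+', 'Ted Potter, Jr.') ends with an empty string, and an empty key sorts ahead of every surname. A sort key that skips such suffixes might avoid editing the CSV files. The sketch below is one possible approach; surname_key() and the SUFFIXES set are assumptions, not part of the notebook.

```python
import re

# suffixes to ignore when looking for the surname (an assumed, incomplete list)
SUFFIXES = {'jr', 'sr', 'ii', 'iii', 'iv'}

def surname_key(name):
    """Return the last word of a name, skipping trailing suffixes such as 'Jr.'."""
    # drop the empty strings that re.split leaves around punctuation
    words = [w for w in re.split(r'\W+', name) if w]
    while len(words) > 1 and words[-1].lower() in SUFFIXES:
        words.pop()
    return words[-1] if words else ''

names = ['Tiger Woods', 'Ted Potter, Jr.', 'Byeong Hun An']
print(sorted(names, key=surname_key))
```

In the notebook's sort lines this would presumably be used as key=lambda x: surname_key(x[0]), since each index entry is a (player, stat) tuple.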

Done With This One

That all seems to have worked. Not sure what it got me, but I had fun, so all good.

And, sadly the fun has come to an end. If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Until next time…

Resources