I have decided to take a little side trip. As I mentioned in the previous post, getting the golf stats was the hardest part of one of the DataFrame exercises using hierarchical indices. So, I am going to play with automating that process using whatever tools pandas, NumPy and Python can provide. This is partly due to a tutorial I saw using pandas.read_html(). And, since I was getting the golf stats from a web page displaying an HTML table, I thought I’d give it a try. Don’t know how well it will work or what issues I will encounter trying to build a suitable multi-indexed DataFrame. But, I will never know if I don’t give it a try. So, here we go.

Some New Packages Required

Pandas provides the read_html() method, but that method depends on some additional packages. Firstly, it expects that we have BeautifulSoup4 installed. Then we need either lxml or html5lib, or both, installed. These may in turn require additional packages; html5lib certainly did. Fortunately, I had no problems installing them using conda. Don’t forget to activate the proper conda environment.

I did read that in some coding/working setups, people had trouble installing lxml. And, there are a few considerations when parsing HTML tables/pages with read_html(). Have a look at HTML Table Parsing Gotchas. That said,

(base) PS R:\learn\ds_intro> conda activate ds-3.9

(ds-3.9) PS R:\learn\ds_intro> conda install html5lib
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
html5lib-1.1         | 93 KB     | ############################################################################ | 100%
ca-certificates-2021 | 115 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(ds-3.9) PS R:\learn\ds_intro> conda install beautifulsoup4
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
soupsieve-2.2.1      | 32 KB     | ############################################################################ | 100%
beautifulsoup4-4.9.3 | 86 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(ds-3.9) PS R:\learn\ds_intro> conda install lxml
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
libxml2-2.9.10       | 1.5 MB    | ############################################################################ | 100%
libiconv-1.15        | 626 KB    | ############################################################################ | 100%
lxml-4.6.3           | 978 KB    | ############################################################################ | 100%
libxslt-1.1.34       | 399 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
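
If you want to confirm those parsers are actually visible to Python before firing up a notebook, a quick check along the following lines will do. This is just my own sanity-check snippet, not anything from the pandas docs; and, should you want to force a specific parser, read_html() also accepts a flavor= parameter (e.g. 'bs4', 'lxml' or 'html5lib').

# quick check that the parser packages read_html() can use are importable
# in the active conda environment
for pkg in ("bs4", "lxml", "html5lib"):
    try:
        __import__(pkg)
        print(f"{pkg}: available")
    except ImportError:
        print(f"{pkg}: NOT installed")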

Test Run

Now, let’s crack open a Jupyter notebook and give this a go. We’ll start with our base imports for NumPy and pandas. Then see if we can get the stats from one of the pages on the PGA Tour site. Be prepared to wait several seconds for the output to display.

In [1]:
import numpy as np
import pandas as pd
In [2]:
# test with known url: 2021 PGA driving distance
tst_url = "https://www.pgatour.com/stats/stat.101.y2021.eon.t033.html"
stat = pd.read_html(tst_url)
print(f"type(stat): {type(stat)}, len(stat): {len(stat)}")
type(stat): <class 'list'>, len(stat): 2

Now let’s have a look at what we have in that stat list.

In [3]:
print(f"type(stat[0]): {type(stat[0])}, type(stat[1]): {type(stat[1])}\n")
print(stat[0])
print()
stat[1].head()
type(stat[0]): <class 'pandas.core.frame.DataFrame'>, type(stat[1]): <class 'pandas.core.frame.DataFrame'>
     0                                                  1    2
0  NaN  It appears your browser may be outdated. For t…  NaN

Out[3]:
   RANK THIS WEEK  RANK LAST WEEK        PLAYER NAME  ROUNDS   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0               1               2  Bryson DeChambeau       4  327.6            2621             8
1               2             NaN     Dean Burmester       4  324.1            2593             8
2               3             NaN       Rory McIlroy       4  320.8            2566             8
3               4             NaN    Joaquin Niemann       4  320.1            2561             8
4               5             NaN      Garrick Higgo       4  318.6            2549             8

Well, we don’t really need all those columns. So, let’s pick a few we might actually be interested in.

In [4]:
# don't care about a number of those columns
d_cols = ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES']
t_stats = stat[1][d_cols]
display(t_stats)
          PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Bryson DeChambeau  327.6            2621             8
1      Dean Burmester  324.1            2593             8
2        Rory McIlroy  320.8            2566             8
3     Joaquin Niemann  320.1            2561             8
4       Garrick Higgo  318.6            2549             8
..                ...    ...             ...           ...
76        Joel Dahmen  287.5            2300             8
77     Henrik Stenson  286.4            2291             8
78     Steve Stricker  285.1            2281             8
79           Tom Hoge  283.4            2267             8
80     Russell Henley  282.6            2261             8

81 rows × 4 columns

That seemed to work rather well. Now let’s look at saving the data to a local file. Once I have enough data to play with, I don’t want to be going out to the internet every time I decide to change something in the code that builds the desired data set. That data set will be something similar to the multi-indexed row and column DataFrame we looked at in the previous post.

I have *.test.* in my .gitignore so that I can save stuff like this without Git getting all excited. So, my CSV file name will have that string in it just before the csv extension. We will use pandas.DataFrame.to_csv() to write the data to the local file. I am also going to use the index= parameter, set to False, to stop the row index from being written to the CSV file; we won’t really be needing it in future. Once the file is written, I will use pandas.read_csv() to read it back and print some of the contents to make sure everything worked as expected.

In [5]:
# now let's save that data to a local csv file, don't want to keep retrieving it
tst_csv = './data/pga_2021_drv.test.csv'
# using the columns= parameter didn't seem to work
t_stats.to_csv(tst_csv, columns=d_cols, index=False)
# let's read it back
ts_back = pd.read_csv(tst_csv)
display(ts_back)
          PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Bryson DeChambeau  327.6            2621             8
1      Dean Burmester  324.1            2593             8
2        Rory McIlroy  320.8            2566             8
3     Joaquin Niemann  320.1            2561             8
4       Garrick Higgo  318.6            2549             8
..                ...    ...             ...           ...
76        Joel Dahmen  287.5            2300             8
77     Henrik Stenson  286.4            2291             8
78     Steve Stricker  285.1            2281             8
79           Tom Hoge  283.4            2267             8
80     Russell Henley  282.6            2261             8

81 rows × 4 columns

Okay, that appeared to work as desired. So, now let’s rework everything to “scrape” and save the stats for a couple or three tournaments, years and/or statistics. I think I will start a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.

Get A Larger Set of Stats

I will start, after the usual imports, by defining some dictionaries. They will cover two of the basic values needed to get the correct page on the PGA Tour site, plus a set of column labels for each stat type (the columns of interest differ by stat type).

  • tournament labels to PGA Tour event ids
  • stat type labels to PGA Tour stat ids
  • desired column labels for each stat type

I basically loaded different stats pages on the site and pulled this information from the page URLs in the browser’s address bar.
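
Just to make the URL pattern explicit, here is roughly how those ids combine into a stats page address. The helper function below is purely illustrative (my own naming, not something used later); the scraping loop further down builds the same string inline.

def stat_url(stat_id, year, event_id):
    # pattern observed in the browser's address bar on the PGA Tour stats pages
    return f'https://www.pgatour.com/content/pgatour/stats/stat.{stat_id}.y{year}.eon.{event_id}.html'

# e.g. driving distance (stat 101) at the 2021 PGA Championship (event t033)
print(stat_url('101', '2021', 't033'))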

There will also be lists for the specific tournaments, years and stats I wish to scrape from the site. These lists will likely be shorter than the dictionaries mentioned above. This is, after all, just a coding experiment.

In [2]:
events = {'pga': 't033', 'wmpo': 't003', 'tpc': 't011', 'rbch': 't012', 'masters': 't014'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
}
In [3]:
# define the tournaments, stats and years to collect
t_ids = ['pga', 'rbch']
t_yrs = ['2020', '2021']
p_sts = ['drv', 'gir']

Okay, now let’s write some loops and get 8 sets of stats. That will require 8 calls to read_html() with the appropriate links.

In [5]:
# import time here in case it wasn't included with the other imports
import time

s_tm = time.time()
for t_id in t_ids:
    e_id = events[t_id]
    for t_yr in t_yrs:
        for p_st in p_sts:
            st_id = stats[p_st]
            # build the stats page URL for this tournament/year/stat combination
            t_lnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{st_id}.y{t_yr}.eon.{e_id}.html'
            stat_1 = pd.read_html(t_lnk)
            # if the page didn't include a stats table, skip the remaining stats for this year
            if len(stat_1) <= 1:
                break
            print(f"\n{t_id}, {t_yr}, {p_st} ->")
            # keep only the columns of interest for this stat type
            df_stats = stat_1[1][st_cols[p_st]]
            print(df_stats.head())
            # save to a local CSV so we don't have to scrape the site again
            f_out = f'./data/{t_id}_{t_yr}_{p_st}_2.test.csv'
            df_stats.to_csv(f_out, index=False)
f_tm = time.time()
t_diff = f_tm - s_tm
print(f"\ntime diff: {f_tm} - {s_tm} = {t_diff:.7f}")
pga, 2020, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0      Cameron Champ  321.1            2569             8
1  Bryson DeChambeau  318.1            2545             8
2       Rory McIlroy  312.5            2500             8
3        Sepp Straka  305.8            2446             8
4    Tommy Fleetwood  305.5            2444             8

pga, 2020, gir ->
        PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0     Matthew Wolff  77.78          56       72         -0.25
1        Paul Casey  76.39          55       72         -0.27
2         Jason Day  76.39          55       72         -0.27
3  Louis Oosthuizen  73.61          53       72         -0.26
4     Cameron Champ  73.61          53       72         -0.30

pga, 2021, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0  Bryson DeChambeau  327.6            2621             8
1     Dean Burmester  324.1            2593             8
2       Rory McIlroy  320.8            2566             8
3    Joaquin Niemann  320.1            2561             8
4      Garrick Higgo  318.6            2549             8

pga, 2021, gir ->
        PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0   Charley Hoffman  70.83          51       72         -0.27
1  Louis Oosthuizen  69.44          50       72         -0.28
2      Martin Laird  69.44          50       72         -0.22
3    Keegan Bradley  68.06          49       72         -0.20
4        Paul Casey  68.06          49       72         -0.27

rbch, 2020, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0    Erik van Rooyen  306.1            2449             8
1      Justin Thomas  304.8            2438             8
2      Brooks Koepka  303.4            2427             8
3        Sepp Straka  300.5            2404             8
4  Xander Schauffele  297.8            2382             8

rbch, 2020, gir ->
         PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0      Abraham Ancer  90.28          65       72         -0.34
1      Brice Garnett  80.56          58       72         -0.34
2      Corey Conners  79.17          57       72         -0.37
3    Joaquin Niemann  79.17          57       72         -0.37
4  Bryson DeChambeau  79.17          57       72         -0.37

rbch, 2021, drv ->
      PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Wyndham Clark  312.3            2498             8
1    Stewart Cink  312.0            2496             8
2  Will Zalatoris  312.0            2496             8
3       Luke List  311.9            2495             8
4       Sam Burns  310.9            2487             8

rbch, 2021, gir ->
       PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0     Stewart Cink  77.78          56       72         -0.39
1  Collin Morikawa  77.78          56       72         -0.29
2  Matthew NeSmith  76.39          55       72         -0.27
3    Corey Conners  76.39          55       72         -0.33
4  Emiliano Grillo  75.00          54       72         -0.30

time diff: 1622147466.4122176 - 1622147435.031193 = 31.3810246

Took a bit of time, but all in all much faster than when I was doing it by hand for the previous post.

Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.
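
Purely as a sketch of that idea, and under assumptions of my own (the file names match the CSVs written above, and I am only using the driving distance files), something like pandas.concat() with tuple keys and the names= parameter might get the hierarchical index started. Whether this is the approach I end up using remains to be seen.

import pandas as pd

frames = {}
for t_id in ['pga', 'rbch']:
    for t_yr in ['2020', '2021']:
        # driving distance stats only, indexed by player name
        df = pd.read_csv(f'./data/{t_id}_{t_yr}_drv_2.test.csv', index_col='PLAYER NAME')
        frames[(t_id, t_yr)] = df

# the tuple keys become the outer levels of the resulting MultiIndex
drv_all = pd.concat(frames, names=['event', 'year'])
print(drv_all.head())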

But, I want to use the CSVs we created so that I don’t have to scrape web pages each time I run my development code. So, a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.

Done M’Thinks

This is plenty long enough. I believe it is probably time to start another post. So, that’s what I am going to do. See you then.

Resources