I have decided to take a little side trip. As I mentioned in the previous post, getting the golf stats was the hardest part of one of the DataFrame exercises using hierarchical indices. So, I am going to play with automating that process using whatever tools pandas, NumPy and Python can provide. This is partly due to a tutorial I saw using pandas.read_html(). And, since I was getting the golf stats from a web page displaying an HTML table, I thought I’d give it a try. Don’t know how well it will work or what issues I will encounter trying to build a suitable multi-indexed DataFrame. But, I will never know if I don’t give it a try. So, here we go.

Some New Packages Required

Pandas provides the read_html() method, but that method depends on some additional packages. Firstly, it expects that we have BeautifulSoup4 installed. Then we need either lxml or html5lib, or both, installed. These may in turn require additional packages; html5lib certainly did. Fortunately, I had no problems installing them using conda. Don’t forget to activate the proper conda environment.

I did read that in some coding/working setups, people had trouble installing lxml. And, there are a few considerations when parsing HTML tables/pages with read_html(). Have a look at HTML Table Parsing Gotchas. That said,

(base) PS R:\learn\ds_intro> conda activate ds-3.9

(ds-3.9) PS R:\learn\ds_intro> conda install html5lib
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
html5lib-1.1         | 93 KB     | ############################################################################ | 100%
ca-certificates-2021 | 115 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(ds-3.9) PS R:\learn\ds_intro> conda install beautifulsoup4
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
soupsieve-2.2.1      | 32 KB     | ############################################################################ | 100%
beautifulsoup4-4.9.3 | 86 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(ds-3.9) PS R:\learn\ds_intro> conda install lxml
Collecting package metadata (current_repodata.json): done
Solving environment: done
  ...
  ...
Downloading and Extracting Packages
libxml2-2.9.10       | 1.5 MB    | ############################################################################ | 100%
libiconv-1.15        | 626 KB    | ############################################################################ | 100%
lxml-4.6.3           | 978 KB    | ############################################################################ | 100%
libxslt-1.1.34       | 399 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
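
If you want to confirm those parsers are actually visible to Python before firing up a notebook, a quick check along the following lines will do. This is just my own sanity-check snippet, not anything from the pandas docs; and, should you want to force a specific parser, read_html() also accepts a flavor= parameter (e.g. 'bs4', 'lxml' or 'html5lib').

# quick check that the parser packages read_html() can use are importable
# in the active conda environment
for pkg in ("bs4", "lxml", "html5lib"):
    try:
        __import__(pkg)
        print(f"{pkg}: available")
    except ImportError:
        print(f"{pkg}: NOT installed")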

Test Run

Now, let’s crack open a Jupyter notebook and give this a go. We’ll start with our base imports for NumPy and pandas. Then see if we can get the stats from one of the pages on the PGA Tour site. Be prepared to wait several seconds for the output to display.

In [1]:
import numpy as np
import pandas as pd
In [2]:
# test with known url: 2021 PGA driving distance
tst_url = "https://www.pgatour.com/stats/stat.101.y2021.eon.t033.html"
stat = pd.read_html(tst_url)
print(f"type(stat): {type(stat)}, len(stat): {len(stat)}")
type(stat): <class 'list'>, len(stat): 2

Now let’s have a look at what we have in that stat list.

In [3]:
print(f"type(stat[0]): {type(stat[0])}, type(stat[1]): {type(stat[1])}\n")
print(stat[0])
print()
stat[1].head()
type(stat[0]): <class 'pandas.core.frame.DataFrame'>, type(stat[1]): <class 'pandas.core.frame.DataFrame'>
     0                                                  1    2
0  NaN  It appears your browser may be outdated. For t…  NaN

Out[3]:
   RANK THIS WEEK  RANK LAST WEEK        PLAYER NAME  ROUNDS   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0               1               2  Bryson DeChambeau       4  327.6            2621             8
1               2             NaN     Dean Burmester       4  324.1            2593             8
2               3             NaN       Rory McIlroy       4  320.8            2566             8
3               4             NaN    Joaquin Niemann       4  320.1            2561             8
4               5             NaN      Garrick Higgo       4  318.6            2549             8

Well, we don’t really need all those columns. So, let’s pick a few we might actually be interested in.

In [4]:
# don't care about a number of those columns
d_cols = ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES']
t_stats = stat[1][d_cols]
display(t_stats)
          PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Bryson DeChambeau  327.6            2621             8
1      Dean Burmester  324.1            2593             8
2        Rory McIlroy  320.8            2566             8
3     Joaquin Niemann  320.1            2561             8
4       Garrick Higgo  318.6            2549             8
..                ...    ...             ...           ...
76        Joel Dahmen  287.5            2300             8
77     Henrik Stenson  286.4            2291             8
78     Steve Stricker  285.1            2281             8
79           Tom Hoge  283.4            2267             8
80     Russell Henley  282.6            2261             8

81 rows × 4 columns

That seemed to work rather well. Now let’s look at saving the data to a local file. Once I have enough data to play with, I don’t want to be going out to the internet every time I decide to change something in the code that builds the desired data set. That data set will be something similar to the multi-indexed row and column DataFrame we looked at in the previous post.

I have *.test.* in my .gitignore so that I can save stuff like this without Git getting all excited. So, my CSV file name will have that string in it just before the csv extension. We will use pandas.DataFrame.to_csv() to write the data to the local file. I am also going to use the index= parameter, set to False, to stop the row index from being written to the CSV file; we won’t really be needing it in future. Once the file is written, I will use pandas.read_csv() to read it back and print some of the contents to make sure everything worked as expected.

In [5]:
# now let's save that data to a local csv file, don't want to keep retrieving it
tst_csv = './data/pga_2021_drv.test.csv'
# using the columns= parameter didn't seem to work
t_stats.to_csv(tst_csv, columns=d_cols, index=False)
# let's read it back
ts_back = pd.read_csv(tst_csv)
display(ts_back)
          PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Bryson DeChambeau  327.6            2621             8
1      Dean Burmester  324.1            2593             8
2        Rory McIlroy  320.8            2566             8
3     Joaquin Niemann  320.1            2561             8
4       Garrick Higgo  318.6            2549             8
..                ...    ...             ...           ...
76        Joel Dahmen  287.5            2300             8
77     Henrik Stenson  286.4            2291             8
78     Steve Stricker  285.1            2281             8
79           Tom Hoge  283.4            2267             8
80     Russell Henley  282.6            2261             8

81 rows × 4 columns

Okay, that appeared to work as desired. So, now let’s rework everything to “scrape” and save the stats for a couple or three tournaments, years and/or statistics. I think I will start a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.

Get A Larger Set of Stats

I will start, after the usual imports, by defining some dictionaries. They will cover two of the basic values needed to get the correct page on the PGA Tour site, plus a set of column labels for each stat type (the columns of interest differ by stat type).

  • tournament labels to PGA Tour event ids
  • stat type labels to PGA Tour stat ids
  • desired column labels for each stat type

I basically loaded different stats pages on the site and pulled this information from the page URLs in the browser’s address bar.
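
Just to make the URL pattern explicit, here is roughly how those ids combine into a stats page address. The helper function below is purely illustrative (my own naming, not something used later); the scraping loop further down builds the same string inline.

def stat_url(stat_id, year, event_id):
    # pattern observed in the browser's address bar on the PGA Tour stats pages
    return f'https://www.pgatour.com/content/pgatour/stats/stat.{stat_id}.y{year}.eon.{event_id}.html'

# e.g. driving distance (stat 101) at the 2021 PGA Championship (event t033)
print(stat_url('101', '2021', 't033'))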

There will also be lists for the specific tournaments, years and stats I wish to scrape from the site. These lists will likely be shorter than the dictionaries mentioned above. This is, after all, just a coding experiment.

In [2]:
events = {'pga': 't033', 'wmpo': 't003', 'tpc': 't011', 'rbch': 't012', 'masters': 't014'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
}
In [3]:
# define the tournaments, stats and years to collect
t_ids = ['pga', 'rbch']
t_yrs = ['2020', '2021']
p_sts = ['drv', 'gir']

Okay, now let’s write some loops and get 8 sets of stats. That will require 8 calls to read_html() with the appropriate links.

In [5]:
# import time here in case it wasn't included with the other imports
import time

s_tm = time.time()
for t_id in t_ids:
    e_id = events[t_id]
    for t_yr in t_yrs:
        for p_st in p_sts:
            st_id = stats[p_st]
            # build the stats page URL for this tournament/year/stat combination
            t_lnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{st_id}.y{t_yr}.eon.{e_id}.html'
            stat_1 = pd.read_html(t_lnk)
            # if the page didn't include a stats table, skip the remaining stats for this year
            if len(stat_1) <= 1:
                break
            print(f"\n{t_id}, {t_yr}, {p_st} ->")
            # keep only the columns of interest for this stat type
            df_stats = stat_1[1][st_cols[p_st]]
            print(df_stats.head())
            # save to a local CSV so we don't have to scrape the site again
            f_out = f'./data/{t_id}_{t_yr}_{p_st}_2.test.csv'
            df_stats.to_csv(f_out, index=False)
f_tm = time.time()
t_diff = f_tm - s_tm
print(f"\ntime diff: {f_tm} - {s_tm} = {t_diff:.7f}")
pga, 2020, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0      Cameron Champ  321.1            2569             8
1  Bryson DeChambeau  318.1            2545             8
2       Rory McIlroy  312.5            2500             8
3        Sepp Straka  305.8            2446             8
4    Tommy Fleetwood  305.5            2444             8

pga, 2020, gir ->
        PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0     Matthew Wolff  77.78          56       72         -0.25
1        Paul Casey  76.39          55       72         -0.27
2         Jason Day  76.39          55       72         -0.27
3  Louis Oosthuizen  73.61          53       72         -0.26
4     Cameron Champ  73.61          53       72         -0.30

pga, 2021, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0  Bryson DeChambeau  327.6            2621             8
1     Dean Burmester  324.1            2593             8
2       Rory McIlroy  320.8            2566             8
3    Joaquin Niemann  320.1            2561             8
4      Garrick Higgo  318.6            2549             8

pga, 2021, gir ->
        PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0   Charley Hoffman  70.83          51       72         -0.27
1  Louis Oosthuizen  69.44          50       72         -0.28
2      Martin Laird  69.44          50       72         -0.22
3    Keegan Bradley  68.06          49       72         -0.20
4        Paul Casey  68.06          49       72         -0.27

rbch, 2020, drv ->
         PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0    Erik van Rooyen  306.1            2449             8
1      Justin Thomas  304.8            2438             8
2      Brooks Koepka  303.4            2427             8
3        Sepp Straka  300.5            2404             8
4  Xander Schauffele  297.8            2382             8

rbch, 2020, gir ->
         PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0      Abraham Ancer  90.28          65       72         -0.34
1      Brice Garnett  80.56          58       72         -0.34
2      Corey Conners  79.17          57       72         -0.37
3    Joaquin Niemann  79.17          57       72         -0.37
4  Bryson DeChambeau  79.17          57       72         -0.37

rbch, 2021, drv ->
      PLAYER NAME   AVG.  TOTAL DISTANCE  TOTAL DRIVES
0   Wyndham Clark  312.3            2498             8
1    Stewart Cink  312.0            2496             8
2  Will Zalatoris  312.0            2496             8
3       Luke List  311.9            2495             8
4       Sam Burns  310.9            2487             8

rbch, 2021, gir ->
       PLAYER NAME      %  GREENS HIT  # HOLES  RELATIVE/PAR
0     Stewart Cink  77.78          56       72         -0.39
1  Collin Morikawa  77.78          56       72         -0.29
2  Matthew NeSmith  76.39          55       72         -0.27
3    Corey Conners  76.39          55       72         -0.33
4  Emiliano Grillo  75.00          54       72         -0.30

time diff: 1622147466.4122176 - 1622147435.031193 = 31.3810246

Took a bit of time, but all in all much faster than when I was doing it by hand for the previous post.

Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.
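
Purely as a sketch of that idea, and under assumptions of my own (the file names match the CSVs written above, and I am only using the driving distance files), something like pandas.concat() with tuple keys and the names= parameter might get the hierarchical index started. Whether this is the approach I end up using remains to be seen.

import pandas as pd

frames = {}
for t_id in ['pga', 'rbch']:
    for t_yr in ['2020', '2021']:
        # driving distance stats only, indexed by player name
        df = pd.read_csv(f'./data/{t_id}_{t_yr}_drv_2.test.csv', index_col='PLAYER NAME')
        frames[(t_id, t_yr)] = df

# the tuple keys become the outer levels of the resulting MultiIndex
drv_all = pd.concat(frames, names=['event', 'year'])
print(drv_all.head())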

But, I want to use the CSVs we created so that I don’t have to scrape web pages each time I run my development code. So, a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.

Done M’Thinks

This is plenty long enough. I believe it is probably time to start another post. So, that’s what I am going to do. See you then.

Resources