I have decided to take a little side trip. As I mentioned in the previous post, getting the golf stats was the hardest part of one of the DataFrame exercises using hierarchical indices. So, I am going to play with automating that process using whatever tools pandas, NumPy and Python can provide. This is partly due to a tutorial I saw using pandas.read_html(). And, since I was getting the golf stats from a web page displaying an HTML table, I thought I’d give it a try. Don’t know how well it will work nor what issues I will encounter trying to build a suitable multi-indexed DataFrame. But, I will never know if I don’t give it a try. So, here we go.
Some New Packages Required
Pandas provides the read_html() method, but that method depends on some additional packages. Firstly, it expects that we have BeautifulSoup4 installed. Then we need either lxml or html5lib, or both, installed. These may in turn require further packages of their own; html5lib definitely did. Fortunately, I had no problems installing them using conda. Don’t forget to activate the proper conda environment.
I did read that in some coding/working setups, people had trouble installing lxml. And, there are a few considerations to keep in mind when parsing HTML tables/pages with read_html(). Have a look at HTML Table Parsing Gotchas. That said,
(base) PS R:\learn\ds_intro> conda activate ds-3.9
(ds-3.9) PS R:\learn\ds_intro> conda install html5lib
Collecting package metadata (current_repodata.json): done
Solving environment: done
...
...
Downloading and Extracting Packages
html5lib-1.1 | 93 KB | ############################################################################ | 100%
ca-certificates-2021 | 115 KB | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(ds-3.9) PS R:\learn\ds_intro> conda install beautifulsoup4
Collecting package metadata (current_repodata.json): done
Solving environment: done
...
...
Downloading and Extracting Packages
soupsieve-2.2.1 | 32 KB | ############################################################################ | 100%
beautifulsoup4-4.9.3 | 86 KB | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(ds-3.9) PS R:\learn\ds_intro> conda install lxml
Collecting package metadata (current_repodata.json): done
Solving environment: done
...
...
Downloading and Extracting Packages
libxml2-2.9.10 | 1.5 MB | ############################################################################ | 100%
libiconv-1.15 | 626 KB | ############################################################################ | 100%
lxml-4.6.3 | 978 KB | ############################################################################ | 100%
libxslt-1.1.34 | 399 KB | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
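With the installs done, a quick sanity check that the three packages import cleanly seemed worthwhile. This is just my own reassurance, not anything pandas requires; the version attributes below are the ones each package exposes.

# confirm the parser packages read_html() relies on are importable
import bs4
import html5lib
from lxml import etree

print(f"beautifulsoup4: {bs4.__version__}")
print(f"html5lib: {html5lib.__version__}")
print(f"lxml: {etree.__version__}")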
Test Run
Now, let’s crack open a Jupyter notebook and give this a go. We’ll start with our base imports for NumPy and pandas. Then see if we can get the stats from one of the pages on the PGA Tour site. Be prepared to wait several seconds for the output to display.
import numpy as np
import pandas as pd
# test with known url: 2021 PGA driving distance
tst_url = "https://www.pgatour.com/stats/stat.101.y2021.eon.t033.html"
stat = pd.read_html(tst_url)
print(f"type(stat): {type(stat)}, len(stat): {len(stat)}")
Now let’s have a look at what we have in that stat list.
print(f"type(stat[0]): {type(stat[0])}, type(stat[1]): {type(stat[1])}\n")
print(stat[0])
print()
stat[1].head()
Well, we don’t really need all those columns. So, let’s pick a few we might actually be interested in.
# don't care about a number of those columns
d_cols = ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES']
t_stats = stat[1][d_cols]
display(t_stats)
That seemed to work rather well. Now let’s look at saving the data to a local file. Once I have enough data to play with, I don’t want to be going out to the internet every time I decide to change something in the code that builds the desired data set. That data set will be something similar to the multi-indexed row and column DataFrame we looked at in the previous post.
I have *.test.* in my .gitignore so that I can save stuff like this without Git getting all excited. So, my CSV file name will have that string in it just before the csv extension. We will use pandas.DataFrame.to_csv() to write the data to the local file. And, I am also going to use the index= parameter, set to False, to stop the row index from being written to the CSV file; we won’t really need it in future. Once the file is written, I will use pandas.read_csv() to read it back and print out some of the contents to make sure everything worked as expected.
# now let's save that data to a local csv file, don't want to keep retrieving it
tst_csv = './data/pga_2021_drv.test.csv'
# using the columns= parameter didn't seem to work
t_stats.to_csv(tst_csv, columns=d_cols, index=False)
# let's read it back
ts_back = pd.read_csv(tst_csv)
display(ts_back)
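As an extra check, rather than just eyeballing the output, pandas can compare the two frames directly with DataFrame.equals(). One caveat (my assumption, I haven’t dug into it): dtypes can shift in a CSV round trip, so this may report False even when the values look identical.

# compare the round-tripped frame to the original
# note: dtype changes in the CSV round trip could make this False
print(t_stats.equals(ts_back))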
Okay, that appeared to work as desired. So, now let’s rework everything to “scrape” and save the stats for a couple or three tournaments, years and/or statistics. I think I will start a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.
Get A Larger Set of Stats
I will start, after the usual imports, by defining some dictionaries. They will cover two of the basic values needed to get the correct page on the PGA Tour site, plus a set of columns for each stat type (the columns differ by stat type).
- tournament labels to PGA Tour event ids
- stat type labels to PGA Tour stat ids
- desired column labels for each stat type
I basically loaded different stats pages on the site and pulled this information from the page URLs in the browser’s address bar.
There will also be lists for the specific tournaments, years and stats I wish to scrape from the site. These lists will likely be shorter than the dictionaries mentioned above. This is, after all, just a coding experiment.
events = {'pga': 't033', 'wmpo': 't003', 'tpc': 't011', 'rbch': 't012', 'masters': 't014'}
stats = {'drv': '101', 'gir': '103', 't2g': '02674'}
st_cols = {'drv': ['PLAYER NAME', 'AVG.', 'TOTAL DISTANCE', 'TOTAL DRIVES'],
           'gir': ['PLAYER NAME', '%', 'GREENS HIT', '# HOLES', 'RELATIVE/PAR'],
           't2g': ['PLAYER NAME', 'AVERAGE', 'SG:OTT', 'SG:APR', 'SG:ARG']
          }
# define the tournaments, stats and years to collect
t_ids = ['pga', 'rbch']
t_yrs = ['2020', '2021']
p_sts = ['drv', 'gir']
Okay, now let’s write some loops and get 8 sets of stats. That will require 8 calls to read_html() with the appropriate links.
# time module needed for the simple timing below (it wasn't in the base imports)
import time

s_tm = time.time()
for t_id in t_ids:
    e_id = events[t_id]
    for t_yr in t_yrs:
        for p_st in p_sts:
            st_id = stats[p_st]
            t_lnk = f'https://www.pgatour.com/content/pgatour/stats/stat.{st_id}.y{t_yr}.eon.{e_id}.html'
            stat_1 = pd.read_html(t_lnk)
            # if the page didn't include the stats table, skip the remaining stats for this year
            if len(stat_1) <= 1:
                break
            print(f"\n{t_id}, {t_yr}, {p_st} ->")
            df_stats = stat_1[1][st_cols[p_st]]
            print(df_stats.head())
            f_out = f'./data/{t_id}_{t_yr}_{p_st}_2.test.csv'
            df_stats.to_csv(f_out, index=False)
f_tm = time.time()
t_diff = f_tm - s_tm
print(f"\ntime diff: {f_tm} - {s_tm} = {t_diff:.7f}")
Took a bit of time, but all in all much faster than when I was doing it by hand for the previous post.
Now, the problem is I have no real idea how to build that DataFrame I am after. Am hoping that pandas can somehow help me get there without me having to build some convoluted dictionary or list to load into a DataFrame. Am currently thinking I might be able to build individual series and somehow join them all together. Then perhaps somehow modify the indexes to get what I want.
But, I want to use the CSVs we created so that I don’t have to scrape web pages each time I run my development code. So, a new notebook. If you wish to play with the above, feel free to download my notebook covering the above.
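Before signing off, here is a minimal sketch of the direction I am currently thinking of taking: read the saved CSVs back in and let pd.concat() build the row MultiIndex via its keys handling. The file names and level names below are assumptions based on the CSVs written above, and only the driving stats are used; this is a sketch of the idea, not the final code.

# a minimal sketch: build a multi-indexed DataFrame from the saved CSVs
# assumes the '{t_id}_{t_yr}_{p_st}_2.test.csv' naming used above
import pandas as pd

pieces = {}
for t_id in ['pga', 'rbch']:
    for t_yr in ['2020', '2021']:
        f_in = f'./data/{t_id}_{t_yr}_drv_2.test.csv'
        # index on player name so rows line up across files
        pieces[(t_id, t_yr)] = pd.read_csv(f_in).set_index('PLAYER NAME')

# the tuple keys become the outer levels of a row MultiIndex
mi_df = pd.concat(pieces, names=['event', 'year'])
print(mi_df.head())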
Done M’Thinks
This is plenty long enough. I believe it is probably time to start another post. So, that’s what I am going to do. See you then.
Resources
- PGA Tour Stats
- HTML Table Parsing Gotchas
- pandas.read_html
- pandas.DataFrame.to_csv
- html5lib / html5lib-python
- Beautiful Soup
- lxml - XML and HTML with Python
- pandas.to_csv() using columns parameter