Combining data from different sources is a requirement in many, if not most, data science projects. Pandas provides a couple of ways of combining the data in different objects (i.e. Series and DataFrames). We have seen both of them used in the series of posts on Golf Stats: pandas.concat() and pandas.merge().

The former, pd.concat(), provides fairly straightforward concatenation of two or more datasets. The latter, pd.merge(), provides more complicated database-like joins and merges. I expect knowing how to combine different datasets is a key data science skill.

Though I have already used them, I felt a bit of discussion/coverage was in order, if for no other reason than to help me reinforce what little I actually know about them and their strengths and weaknesses.

Python Package

I am going to try to use the golf stats dataframes in this discussion of joining datasets using pandas. So, I decided to put the code from the previous post into a package I could import into the notebook I will be using to generate the code for this post. It took a bit of time and testing, but all in all it seems to do the job. I did add a couple of extra functions not in the previous post.

One, sort_player_name(df_in), sorts the golf stats on the player’s last name. I kept it simple and didn’t involve the player’s first name, for better or worse. The other, csv_2_df_base(t_yr, t_id, p_st), loads the specified CSV file into a DataFrame without any of the multi-indexing I was doing in csv_2_df(t_yr, t_id, p_st).
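For what it’s worth, the last-name sort can be done without adding any helper columns by sorting the index with a key function. This is just a minimal sketch of the idea, not the package’s actual implementation:

```python
import pandas as pd

# hypothetical stand-in for the package's sort_player_name(df_in):
# sort rows by the last whitespace-separated token of the player's name
def sort_player_name(df_in):
    return df_in.reindex(sorted(df_in.index, key=lambda nm: nm.split()[-1]))

df = pd.DataFrame({'drv': [335.6, 319.0, 281.9]},
                  index=['Dustin Johnson', 'Luke List', 'Rich Beem'])
print(sort_player_name(df).index.to_list())
# ['Rich Beem', 'Dustin Johnson', 'Luke List']
```

Names with suffixes, e.g. "Harold Varner III", would defeat the simple last-token approach, which is one reason keeping it simple is "for better or worse".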

There are no tests in the package file. I did that in a notebook, which I have not kept.

The code is provided in a separate post. The post is not listed in the site’s post history.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
import re
import golf_stats_lib as gs
In [2]:
gs?
Type:        module
String form: <module 'golf_stats_lib' from 'r:\\learn\\ds_intro\\golf_stats_lib.py'>
File:        r:\learn\ds_intro\golf_stats_lib.py
Docstring:  
Module to provide functions related to obtaining statistics from PGA web site and saving to local CSV files.
   As well as functions to retrieve the data in the CSVs and build DataFrames to hold a specified dataset.

File: ./ds_intro/golf_stats_lib.py

  • tied to various Too Old To Code blog posts
  • function(s) to obtain data from PGA web site and save to local CSV files
  • function(s) to process the CSV files into DataFrames in various ways
  • built on Jupyter notebook pandas_play_8.ipynb

Module level and/or global variables:

These may require updating if you add events or stat types.

events  - dictionary of currently available events providing PGA Tour ids
stats   - dictionary of currently available stat types providing PGA Tour ids
st_cols - for each stat type, a list of table columns to save to csv
d_dir   - directory to which to save stat csv files

Functions:

get_csv_nm(t_yr, t_id, p_st)
stats_2_csv(tyrs, tids, psts)
csv_2_df_base(t_yr, t_id, p_st)
csv_2_df(t_yr, t_id, p_st)
tourney_2_df(t_yr, t_id, psts)
year_2_df(t_yr, tids, psts)
golf_stats_2_df(tyrs, tids, psts)
sort_player_name(df_in)

In [3]:
gs.tourney_2_df?
Signature: gs.tourney_2_df(t_yr, t_id, psts)
Docstring:
Combine all requested stats for a given tournament and year into a single DataFrame. 
Return DataFrame.
Usage: tourney_2_df(t_yr, t_id, psts)
  where t_yr = tournament year
        t_id = tournament id (e.g. 'pga')
        psts = list of player stats (e.g. ['drv', 'gir'])
File:      r:\learn\ds_intro\golf_stats_lib.py
Type:      function

pd.concat()

Concatenating Series and DataFrames is similar to concatenating NumPy arrays, something we briefly looked at previously.

Okay, let’s load a couple of CSV files into DataFrames without messing with the indexing, concatenate them with pd.concat(), and display the result.

In [4]:
df1 = gs.csv_2_df_base('2019', 'pga', 'drv')
display(df1)
                       drv
PLAYER NAME
Dustin Johnson       335.6
Luke List            319.0
Phil Mickelson       318.6
Rory McIlroy         317.4
Lucas Bjerregaard    316.6
...                    ...
Marty Jertson        281.9
Rich Beem            281.9
Henrik Stenson       281.0
Brandt Snedeker      279.8
Andrew Putnam        274.1

82 rows × 1 columns

In [5]:
df2 = gs.csv_2_df_base('2019', 'pga', 'gir')
df3 = pd.concat([df1, df2])
display(df3)
                       drv    gir
PLAYER NAME
Dustin Johnson       335.6    NaN
Luke List            319.0    NaN
Phil Mickelson       318.6    NaN
Rory McIlroy         317.4    NaN
Lucas Bjerregaard    316.6    NaN
...                    ...    ...
Rafa Cabrera Bello     NaN  47.22
Ryan Vermeer           NaN  47.22
Pat Perez              NaN  47.22
Rich Beem              NaN  47.22
Marty Jertson          NaN  41.67

164 rows × 2 columns

Definitely not what I wanted. And the first time I did the above, somewhat unexpected.

concat() Defaults

A couple of things are going on here.

  1. By default, concatenation takes place along the row axis, i.e. the rows of each DataFrame being concatenated are appended to the result. This can be altered with the axis= named parameter.

  2. Pandas generally preserves indices, and concat() does so even if the result ends up with duplicate index entries. Which the above does (though it is not really obvious). Look at the row counts in the above DataFrames.

  3. concat() uses a union-like join on the other axis by default. This can be altered, somewhat, using the join= named parameter.

There are some options for dealing with item 2. But at the moment that is not really a problem for me. I am also currently fine with the default of item 3. So, I will look at changing the default axis.
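A quick sketch of those three defaults and the knobs that change them, using toy frames rather than the golf data:

```python
import pandas as pd

a = pd.DataFrame({'drv': [335.6, 319.0]}, index=['Dustin Johnson', 'Luke List'])
b = pd.DataFrame({'gir': [65.28, 47.22]}, index=['Rory McIlroy', 'Rich Beem'])

# 1. default axis=0: rows are stacked, columns are unioned, gaps NaN-padded
r1 = pd.concat([a, b])
print(r1.shape)                      # (4, 2)

# 2. indices are preserved, duplicates and all; ignore_index=True builds a fresh one
r2 = pd.concat([a, a], ignore_index=True)
print(r2.index.to_list())            # [0, 1, 2, 3]
# verify_integrity=True would instead raise an error on duplicate indices

# 3. join='inner' keeps only the columns common to all inputs (none here)
r3 = pd.concat([a, b], join='inner')
print(r3.shape)                      # (4, 0)
```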

axis=1

In [9]:
df3 = pd.concat([df1, df2], axis=1)
df_sort = gs.sort_player_name(df3)
display(df_sort)
                        drv    gir
PLAYER NAME
Abraham Ancer         285.6  56.94
Kiradech Aphibarnrat  284.0  54.17
Rich Beem             281.9  47.22
Rafa Cabrera Bello    292.3  47.22
Daniel Berger         301.5  55.56
...                     ...    ...
Jimmy Walker          304.6  69.44
Matt Wallace          300.3  56.94
Danny Willett         307.0  69.44
Aaron Wise            301.0  59.72
Gary Woodland         311.4  70.83

82 rows × 2 columns

Check the row count. That appears to have worked, more or less as desired. Okay, let’s add another stat from a different tournament.

In [10]:
df4 = gs.csv_2_df_base('2019','rbch','gir')
#display(df4)
df5 = pd.concat([df3, df4], axis=1)
#display(df5)
df_sort = gs.sort_player_name(df5)
display(df_sort)
                        drv    gir    gir
PLAYER NAME
Abraham Ancer         285.6  56.94    NaN
Kiradech Aphibarnrat  284.0  54.17    NaN
Ryan Armour             NaN    NaN  55.56
Rich Beem             281.9  47.22    NaN
Rafa Cabrera Bello    292.3  47.22  58.33
...                     ...    ...    ...
Boo Weekley             NaN    NaN  59.72
Richy Werenski          NaN    NaN  47.22
Danny Willett         307.0  69.44    NaN
Aaron Wise            301.0  59.72    NaN
Gary Woodland         311.4  70.83    NaN

124 rows × 3 columns

Well, the stats are there, but there is no indication as to which tournament they are from. And that is not really something we can ignore or would want to keep track of separately. After all, if we did, what would be the point in using pandas? So let’s add some multi-indexing on the columns of the individual stat DataFrames before we concatenate.

In [11]:
# unfortunately the above does not tell us to which tournament and year the stat belongs
# and likely not something we want to keep track of separately
# So let's try adding that info to each csv dataframe
df6 = gs.csv_2_df_base('2019', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'drv')])
#midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']], labels=[[1,1,0,],[1,0,1,]])
df6.columns = cols
df6 = gs.sort_player_name(df6)
display(df6)
                       2019
                        pga
                        drv
PLAYER NAME
Abraham Ancer         285.6
Kiradech Aphibarnrat  284.0
Rich Beem             281.9
Rafa Cabrera Bello    292.3
Daniel Berger         301.5
...                     ...
Jimmy Walker          304.6
Matt Wallace          300.3
Danny Willett         307.0
Aaron Wise            301.0
Gary Woodland         311.4

82 rows × 1 columns

In [12]:
# now another stat same year and tourney
df7 = gs.csv_2_df_base('2019', 'pga', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'gir')])
df7.columns = cols
df7 = gs.sort_player_name(df7)
df_comb1 = pd.concat([df6, df7], axis=1)
display(df_comb1.head())
display(df_comb1.loc['Rory McIlroy'])
                       2019
                        pga
                        drv    gir
PLAYER NAME
Abraham Ancer         285.6  56.94
Kiradech Aphibarnrat  284.0  54.17
Rich Beem             281.9  47.22
Rafa Cabrera Bello    292.3  47.22
Daniel Berger         301.5  55.56
2019  pga  drv    317.40
           gir     65.28
Name: Rory McIlroy, dtype: float64
In [13]:
# and a different tournament
df8 = gs.csv_2_df_base('2019', 'rbch', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'rbch', 'gir')])
df8.columns = cols
df8 = gs.sort_player_name(df8)
df_comb2 = pd.concat([df6, df7, df8], axis=1)
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
                       2019
                        pga          rbch
                        drv    gir    gir
PLAYER NAME
Abraham Ancer         285.6  56.94    NaN
Kiradech Aphibarnrat  284.0  54.17    NaN
Ryan Armour             NaN    NaN  55.56
Rich Beem             281.9  47.22    NaN
Rafa Cabrera Bello    292.3  47.22  58.33
...                     ...    ...    ...
Boo Weekley             NaN    NaN  59.72
Richy Werenski          NaN    NaN  47.22
Danny Willett         307.0  69.44    NaN
Aaron Wise            301.0  59.72    NaN
Gary Woodland         311.4  70.83    NaN

124 rows × 3 columns

.columns.names

To make things a touch nicer, let’s give our column indices labels. Then get the stats just for gir, a simple example of filtering on multi-indexed columns.

In [14]:
#df_comb2.columns
df_comb2.columns.names = ['year', 'event', 'stat']
filter = df_comb2.columns.get_level_values('stat') == 'gir'
df_comb2.iloc[:, filter].dropna(how='all')
Out[14]:
year                   2019
event                   pga   rbch
stat                    gir    gir
PLAYER NAME
Abraham Ancer         56.94    NaN
Kiradech Aphibarnrat  54.17    NaN
Ryan Armour             NaN  55.56
Rich Beem             47.22    NaN
Rafa Cabrera Bello    47.22  58.33
...                     ...    ...
Boo Weekley             NaN  59.72
Richy Werenski          NaN  47.22
Danny Willett         69.44    NaN
Aaron Wise            59.72    NaN
Gary Woodland         70.83    NaN

124 rows × 2 columns

Let’s add a stat from a different year.

In [17]:
# add another year
# was going to try the append() method, but...
df9 = gs.csv_2_df_base('2020', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2020', 'pga', 'drv')])
#midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']], labels=[[1,1,0,],[1,0,1,]])
df9.columns = cols
df_comb2 = pd.concat([df6, df7, df8, df9], axis=1)
df_comb2.columns.names = ['year', 'event', 'stat']
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
year                   2019                2020
event                   pga          rbch   pga
stat                    drv    gir    gir   drv
PLAYER NAME
Byeong Hun An           NaN    NaN    NaN  286.6
Abraham Ancer         285.6  56.94    NaN  295.6
Kiradech Aphibarnrat  284.0  54.17    NaN    NaN
Ryan Armour             NaN    NaN  55.56    NaN
Rich Beem             281.9  47.22    NaN    NaN
...                     ...    ...    ...    ...
Danny Willett         307.0  69.44    NaN    NaN
Aaron Wise            301.0  59.72    NaN    NaN
Matthew Wolff           NaN    NaN    NaN  303.8
Gary Woodland         311.4  70.83    NaN  293.0
Tiger Woods             NaN    NaN    NaN  304.0

153 rows × 4 columns

Finally, let’s get only the drv stat for players without any missing data.

In [20]:
filter = df_comb2.columns.get_level_values('stat') == 'drv'
df_comb2.iloc[:, filter].dropna(how='any')
Out[20]:
year                   2019   2020
event                   pga    pga
stat                    drv    drv
PLAYER NAME
Abraham Ancer         285.6  295.6
Daniel Berger         301.5  291.9
Patrick Cantlay       314.6  295.3
Paul Casey            305.6  295.8
Cameron Champ         313.8  321.1
Joel Dahmen           295.8  289.5
Jason Day             310.0  295.0
Tony Finau            314.3  305.3
Tommy Fleetwood       301.3  305.5
Emiliano Grillo       302.9  287.4
Adam Hadwin           285.8  282.9
Billy Horschel        296.1  284.4
Harold Varner III     308.3  286.8
Dustin Johnson        335.6  305.0
Sung Kang             308.1  292.9
Kurt Kitayama         315.3  302.9
Brooks Koepka         313.0  295.8
Danny Lee             306.5  291.0
Haotong Li            293.9  304.1
Luke List             319.0  299.6
Adam Long             291.5  285.4
Shane Lowry           300.0  287.6
Joost Luiten          284.9  286.0
Hideki Matsuyama      303.5  285.3
Rory McIlroy          317.4  312.5
Phil Mickelson        318.6  293.1
Alex Noren            298.0  289.0
Louis Oosthuizen      306.0  284.6
J.T. Poston           291.5  291.6
Chez Reavie           289.1  274.5
Erik van Rooyen       305.3  295.3
Justin Rose           309.6  294.4
Xander Schauffele     307.0  302.3
Adam Scott            301.5  299.1
Webb Simpson          295.3  281.1
Cameron Smith         302.5  286.3
Brandt Snedeker       279.8  280.9
Jordan Spieth         303.3  295.9
Michael Lorenzo-Vera  302.0  287.6
Matt Wallace          300.3  286.6
Gary Woodland         311.4  293.0

And, I think that’s enough for pd.concat(). There is also a method, .append(), available on DataFrames and Series. But it is inefficient and has since been deprecated in favour of pd.concat(). So, I haven’t bothered to cover it, but it is linked in the resources if you are interested.
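For the record, .append() was deprecated in pandas 1.4 and removed in 2.0, so concat() is now the only option. Adding a row the old way versus the current way looks something like this:

```python
import pandas as pd

df = pd.DataFrame({'drv': [335.6]}, index=['Dustin Johnson'])
row = pd.DataFrame({'drv': [319.0]}, index=['Luke List'])

# old pandas: df = df.append(row)  -- removed in pandas 2.0
# today, the equivalent is:
df = pd.concat([df, row])
print(df.index.to_list())  # ['Dustin Johnson', 'Luke List']
```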

pd.merge()

If you have ever worked with a relational database, what pd.merge() provides will be immediately clear to you. If not, it shouldn’t take much effort to understand. Relational databases, e.g. MySQL or PostgreSQL, are based on tables, just like a pandas DataFrame. Albeit there are some rules in building those tables that you may not have used when building your own tables in other applications.

Typically each row in a database table has one or more columns that uniquely identify that row, usually referred to as the key. You can think of this like the index, single or multi-level, of a DataFrame. Databases include the concept and/or operation of a join (something built on relational algebra). Pandas implements a few of the fundamental building blocks of relational algebra. These are provided via merge() and the related DataFrame method join().

Let’s look at a few approaches to joining datasets. We’ll start with a simple one-to-one join.

Simple Join

Let’s build a couple of small datasets so that things will be easier to see. I added a function to generate the dataframes I want for this example. It took a bit of work, but it’s all about learning. I also got a function from GitHub to display DataFrames side by side in the output area of the notebook. Not sure how that will translate to the output in the post. (The two functions are in the notebook related to this post.)

In this case, the matching column (key) in each dataframe is the same size, with no duplicates.

In [17]:
df1 = get_small_df('2020', 'pga', 'drv')
df2 = get_small_df('2020', 'pga', 'gir')
df3 = pd.merge(df1, df2)
display_side_by_side(df1, df2, df3)
            PLAYER NAME    drv
28            Jason Day  295.0
1     Bryson DeChambeau  318.1
19       Harris English  297.1
5            Tony Finau  305.3
4       Tommy Fleetwood  305.5

            PLAYER NAME    gir
2             Jason Day  76.39
23    Bryson DeChambeau  66.67
75       Harris English  52.78
5            Tony Finau  72.22
40      Tommy Fleetwood  63.89

            PLAYER NAME    drv    gir
0             Jason Day  295.0  76.39
1     Bryson DeChambeau  318.1  66.67
2        Harris English  297.1  52.78
3            Tony Finau  305.3  72.22
4       Tommy Fleetwood  305.5  63.89

This is a simple example, but merge() recognized that the two DataFrames had the PLAYER NAME column in common and joined them using that column as the key, reindexing the rows in the process (i.e. discarding the original indices). (Sorry, I don’t quite have the display sorted out.)

We could also have specified the column(s) to use, e.g. df4 = pd.merge(df1, df2, on='PLAYER NAME', sort=False). The notebook shows that the output is identical.

Something A Little More Complicated

Let’s look at joining on multiple columns, then at how the merges/joins can be done.

In [20]:
# let's try using multiple columns
left = pd.DataFrame(
    {
        'firstname': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
        'surname': ['Smith', 'Smith', 'Brown', 'White', 'Black'],
        'drv1': df1.drv.to_list(),
        'gir1': df2.gir.to_list()
    }
)
right = pd.DataFrame(
    {
        'firstname': ['John', 'Harry', 'Joseph', 'Morris', 'Arthur'],
        'surname': ['Smith', 'Brown', 'Black', 'Brown', 'White'],
        'drv2': df1.drv.to_list(),
        'gir2': df2.gir.to_list()
    }
)
df_comb = pd.merge(left, right, on=['firstname', 'surname'])
# default join is "inner", i.e. intersection
display_side_by_side(left, right)
display(df_comb)
  firstname surname   drv1   gir1
0      John   Smith  295.0  76.39
1     Frank   Smith  318.1  66.67
2     Harry   Brown  297.1  52.78
3    Morris   White  305.3  72.22
4    Joseph   Black  305.5  63.89

  firstname surname   drv2   gir2
0      John   Smith  295.0  76.39
1     Harry   Brown  318.1  66.67
2    Joseph   Black  297.1  52.78
3    Morris   Brown  305.3  72.22
4    Arthur   White  305.5  63.89

  firstname surname   drv1   gir1   drv2   gir2
0      John   Smith  295.0  76.39  295.0  76.39
1     Harry   Brown  297.1  52.78  318.1  66.67
2    Joseph   Black  305.5  63.89  297.1  52.78

Join Types

You will notice that the resulting DataFrame contains only three rows. There are different “ways” in which pd.merge() determines which keys to include in the result. These are similar to the join types used with databases. In the above case, the default is how='inner', which says to include only keys appearing in both left and right. The other options are how='left', how='right', and how='outer'.

And a couple of examples. Do note: where a key combination appears in only one of the left or right frames, the values in the combined frame will be NA for any missing data.

In [21]:
# specify 'outer' to get a union
df_union = df_comb = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='outer')
display_side_by_side(left, right)
display(df_union)
  firstname surname   drv1   gir1
0      John   Smith  295.0  76.39
1     Frank   Smith  318.1  66.67
2     Harry   Brown  297.1  52.78
3    Morris   White  305.3  72.22
4    Joseph   Black  305.5  63.89

  firstname surname   drv2   gir2
0      John   Smith  295.0  76.39
1     Harry   Brown  318.1  66.67
2    Joseph   Black  297.1  52.78
3    Morris   Brown  305.3  72.22
4    Arthur   White  305.5  63.89

  firstname surname   drv1   gir1   drv2   gir2
0      John   Smith  295.0  76.39  295.0  76.39
1     Frank   Smith  318.1  66.67    NaN    NaN
2     Harry   Brown  297.1  52.78  318.1  66.67
3    Morris   White  305.3  72.22    NaN    NaN
4    Joseph   Black  305.5  63.89  297.1  52.78
5    Morris   Brown    NaN    NaN  305.3  72.22
6    Arthur   White    NaN    NaN  305.5  63.89
In [22]:
# or we can use the keys from one frame or the other
# let's try the left
df_left_join = df_comb = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='left')
display_side_by_side(left, right)
display(df_left_join)
  firstname surname   drv1   gir1
0      John   Smith  295.0  76.39
1     Frank   Smith  318.1  66.67
2     Harry   Brown  297.1  52.78
3    Morris   White  305.3  72.22
4    Joseph   Black  305.5  63.89

  firstname surname   drv2   gir2
0      John   Smith  295.0  76.39
1     Harry   Brown  318.1  66.67
2    Joseph   Black  297.1  52.78
3    Morris   Brown  305.3  72.22
4    Arthur   White  305.5  63.89

  firstname surname   drv1   gir1   drv2   gir2
0      John   Smith  295.0  76.39  295.0  76.39
1     Frank   Smith  318.1  66.67    NaN    NaN
2     Harry   Brown  297.1  52.78  318.1  66.67
3    Morris   White  305.3  72.22    NaN    NaN
4    Joseph   Black  305.5  63.89  297.1  52.78

One Last Example

If you have not worked with database tables, you may not anticipate the result of the following merge. This time the columns we want to merge on do not have the same name. So, we have to use the left_on= and right_on= named parameters to specify the columns to use.

In [24]:
left = pd.DataFrame(
    {
        'name': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
        'department': ['finance', 'sales', 'engineering', 'sales', 'engineering'],
        'location': ['Vancouver', 'Vancouver', 'Vancouver', 'Calgary', 'Calgary'],
    }
)
right = pd.DataFrame(
    {
        'dept': ['engineering', 'finance', 'sales'],
        'supervisor': ['George', 'Helen', 'Barbara']
    }
)
display_side_by_side(left, right)
display(pd.merge(left, right, left_on='department', right_on='dept'))
     name   department   location
0    John      finance  Vancouver
1   Frank        sales  Vancouver
2   Harry  engineering  Vancouver
3  Morris        sales    Calgary
4  Joseph  engineering    Calgary

          dept supervisor
0  engineering     George
1      finance      Helen
2        sales    Barbara

     name   department   location         dept supervisor
0    John      finance  Vancouver      finance      Helen
1   Frank        sales  Vancouver        sales    Barbara
2  Morris        sales    Calgary        sales    Barbara
3   Harry  engineering  Vancouver  engineering     George
4  Joseph  engineering    Calgary  engineering     George

Done For Now

There are all sorts of further uses for pd.merge(). And, I never looked at DataFrame.join(). But the docs in the resources section will take you a lot further than this post was ever meant to do.

If you wish to play with the above, feel free to download my notebook covering the contents of this post.

Resources