Combining data from different sources is a requirement in many, if not most, data science projects. Pandas provides a couple of ways of combining the data in different objects (i.e. Series and DataFrames): `pandas.concat()` and `pandas.merge()`. We have seen both of them used in the series of posts on golf stats.
The former, `pd.concat()`, provides fairly straightforward concatenation of two or more datasets. The latter, `pd.merge()`, provides more complicated database-like joins and merges. Knowing how to combine different datasets is, I expect, a key data science skill.
Though I have already used both, I felt a bit of discussion was in order, if for no other reason than to help me reinforce what little I actually know about them and their strengths and weaknesses.
Python Package
I am going to use the golf stats DataFrames in this discussion of joining datasets with pandas. So, I decided to put the code from the previous post into a package I could import in the notebook I will be using to generate the code for this post. It took a bit of time and testing, but all in all it seems to do the job. I did add a couple of extra functions not in the previous post. One, `sort_player_name(df_in)`, sorts the golf stats on the player's last name (kept it simple; it doesn't involve the player's first name, for better or worse). The other, `csv_2_df_base(t_yr, t_id, p_st)`, loads the specified CSV file into a DataFrame without any of the multi-indexing I was doing in `csv_2_df(t_yr, t_id, p_st)`.
There are no tests in the package file; I did my testing in a separate notebook, which I have not kept.
The code is provided in a separate post. The post is not listed in the site’s post history.
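For context, here is roughly what those two helpers look like. This is a simplified sketch only: the data directory and file naming scheme shown here are assumptions on my part, and the real implementations are in the code post mentioned above.

```python
from pathlib import Path

import pandas as pd


def csv_2_df_base(t_yr, t_id, p_st):
    # Simplified sketch: load one stat CSV into a DataFrame keyed on player name.
    # The './data/{year}_{event}_{stat}.csv' layout is an assumption on my part.
    csv_fl = Path(f'./data/{t_yr}_{t_id}_{p_st}.csv')
    return pd.read_csv(csv_fl, index_col='PLAYER NAME')


def sort_player_name(df_in):
    # Sort the rows on the last word of the player-name index (i.e. the surname only).
    return df_in.reindex(sorted(df_in.index, key=lambda nm: nm.split()[-1]))
```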
import numpy as np
import pandas as pd
from pathlib import Path
import re

import golf_stats_lib as gs

# quick check that the package imported okay (IPython's help operator)
gs?
gs.tourney_2_df?
pd.concat()
Concatenating Series and DataFrames is similar to concatenating NumPy arrays, something we briefly looked at previously.
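As a quick reminder, the NumPy version looks like this:

```python
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
np.concatenate([a1, a2])
# -> array([1, 2, 3, 4, 5, 6])
```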
Okay, let’s load a couple of CSV files into DataFrames without messing with the indexing. Then concatenate them with pd.concat()
. And display the result.
df1 = gs.csv_2_df_base('2019', 'pga', 'drv')
display(df1)
df2 = gs.csv_2_df_base('2019', 'pga', 'gir')
df3 = pd.concat([df1, df2])
display(df3)
Definitely not what I wanted. And, the first time I ran the above, it was somewhat unexpected.
`concat()` Defaults
A couple of things are going on here:

1. By default, concatenation takes place along the row axis, i.e. the rows of each DataFrame being concatenated are appended to the result. This can be altered with the `axis=` named parameter.
2. Pandas preserves indices, as does `concat()`, even if the result ends up with duplicate index labels. The above does (though it's not really obvious); compare the row counts in the DataFrames above.
3. `concat()` uses a union-like join by default. This can be altered, somewhat, using the `join=` named parameter.
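A quick toy example (nothing to do with the golf data) may make items 2 and 3 more concrete:

```python
t1 = pd.DataFrame({'A': [1, 2]}, index=['w', 'x'])
t2 = pd.DataFrame({'A': [3, 4], 'B': [5, 6]}, index=['x', 'y'])

pd.concat([t1, t2])                     # index label 'x' appears twice; t1 rows get NaN in 'B'
pd.concat([t1, t2], join='inner')       # intersection: keep only the shared column 'A'
pd.concat([t1, t2], ignore_index=True)  # one way to avoid the duplicate index labels
```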
There are some options for dealing with item 2. But at the moment that is not really a problem for me. I am also currently fine with the default of item 3. So, I will look at changing the default axis.
axis=1
# this time concatenate along the column axis
df3 = pd.concat([df1, df2], axis=1)
df_sort = gs.sort_player_name(df3)
display(df_sort)
Check the row count. That appears to have worked, more or less as desired. Okay, let's add another stat, from a different tournament.
df4 = gs.csv_2_df_base('2019', 'rbch', 'gir')
#display(df4)
df5 = pd.concat([df3, df4], axis=1)
#display(df5)
df_sort = gs.sort_player_name(df5)
display(df_sort)
Well, the stats are there, but there is no indication as to which tournament they are from. And that is not really something we can ignore or would want to track separately. After all, if we did, what would be the point of using pandas? So, let's add some multi-indexing on the columns of the individual stat DataFrames before we concatenate.
# unfortunately the above does not tell us to which tournament and year the stat belongs
# and likely not something we want to keep track of separately
# So let's try adding that info to each csv dataframe
df6 = gs.csv_2_df_base('2019', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'drv')])
df6.columns = cols
df6 = gs.sort_player_name(df6)
display(df6)
# now another stat same year and tourney
df7 = gs.csv_2_df_base('2019', 'pga', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'gir')])
df7.columns = cols
df7 = gs.sort_player_name(df7)
df_comb1 = pd.concat([df6, df7], axis=1)
display(df_comb1.head())
display(df_comb1.loc['Rory McIlroy'])
# and a different tournament
df8 = gs.csv_2_df_base('2019', 'rbch', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'rbch', 'gir')])
df8.columns = cols
df8 = gs.sort_player_name(df8)
df_comb2 = pd.concat([df6, df7, df8], axis=1)
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
.columns.names
To make things a touch nicer, let's give our column index levels names. Then get just the gir stats: a simple example of filtering on multi-indexed columns.
df_comb2.columns.names = ['year', 'event', 'stat']
# boolean mask: columns whose 'stat' level equals 'gir'
filter = df_comb2.columns.get_level_values('stat') == 'gir'
df_comb2.iloc[:, filter].dropna(how='all')
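Incidentally, the same selection can be made without building the boolean mask by hand, using `DataFrame.xs()` or `pd.IndexSlice` (the latter is covered in the resources):

```python
# cross-section on the 'stat' level (drops that level from the result)
df_comb2.xs('gir', axis=1, level='stat').dropna(how='all')

# or slice the columns with pd.IndexSlice, keeping all three levels
df_comb2.loc[:, pd.IndexSlice[:, :, 'gir']].dropna(how='all')
```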
Let’s add a stat from a different year.
# add another year
# was going to try the append() method, but...
df9 = gs.csv_2_df_base('2020', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2020', 'pga', 'drv')])
df9.columns = cols
df_comb2 = pd.concat([df6, df7, df8, df9], axis=1)
df_comb2.columns.names = ['year', 'event', 'stat']
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
Finally, let’s get only the drv stat for players without any missing data.
filter = df_comb2.columns.get_level_values('stat') == 'drv'
df_comb2.iloc[:, filter].dropna(how='any')
And I think that's enough for `pd.concat()`. There is also an `.append()` method on DataFrames and Series, but it is inefficient and has since been deprecated (and removed as of pandas 2.0) in favour of `concat()`. So, I haven't bothered to cover it, but there is a link in the resources if you are interested.
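For the record, the old `append()` pattern translates directly to `concat()`:

```python
df_a = pd.DataFrame({'A': [1, 2]})
df_b = pd.DataFrame({'A': [3]})

# the deprecated pattern: df_a = df_a.append(df_b, ignore_index=True)
# becomes:
df_a = pd.concat([df_a, df_b], ignore_index=True)
```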
pd.merge()
If you have ever worked with a relational database, what `pd.merge()` provides will be immediately clear to you. If not, it shouldn't take much effort to understand. Relational databases, e.g. MySQL or PostgreSQL, are based on tables, much like a pandas DataFrame, albeit with some rules for building those tables that you may not have followed when building your own tables in other applications.

Typically, each row in a database table has one or more columns that uniquely identify that row, usually referred to as the key. You can think of this like the index, single or multi-level, of a DataFrame. Databases include the concept and/or operation of a join (something built on relational algebra). Pandas implements a few of the fundamental building blocks of relational algebra. These are provided via `merge()` and the related DataFrame method `join()`.
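To give a rough sense of the relationship between the two (a toy example, not the golf data): `join()` is essentially a merge on the row index.

```python
lf = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
rt = pd.DataFrame({'B': [3, 4]}, index=['x', 'z'])

lf.join(rt)  # joins on the index; how='left' by default
# the equivalent spelled out with merge():
pd.merge(lf, rt, left_index=True, right_index=True, how='left')
```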
Let’s look at few approaches to joining datasets. We’ll start with a simple one-to-one join.
Simple Join
Let’s build a couple of small datasets so that things will be easier to see. I have added a function to generate the dataframes I want for this example. Took a bit of work but it’s all about learning. Also got a function to display DataFrames side by side in the output frame of the notebook from GitHub. Not sure how that will translate to the output in the post. (The two functions are in the notebook related to this post.)
In this case, the matching column (the key) in each DataFrame will be the same size, with no duplicate values.
df1 = get_small_df('2020', 'pga', 'drv')
df2 = get_small_df('2020', 'pga', 'gir')
df3 = pd.merge(df1, df2)
display_side_by_side(df1, df2, df3)
This is a simple example, but `merge()` recognized that the two DataFrames had the PLAYER NAME column in common and joined them using that column as the key, reindexing the rows in the process (i.e. discarding the original indices). (Sorry, I don't quite have the display sorted out.)

We could also have specified the column(s) to use, e.g. `df4 = pd.merge(df1, df2, on='PLAYER NAME', sort=False)`. The notebook shows that the output is identical.
Something A Little More Complicated
Let’s look at joining on multipe columns, then at how
the merges/joins can be done.
# let's try using multiple columns
left = pd.DataFrame(
{
'firstname': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
'surname': ['Smith', 'Smith', 'Brown', 'White', 'Black'],
'drv1': df1.drv.to_list(),
'gir1': df2.gir.to_list()
}
)
right = pd.DataFrame(
{
'firstname': ['John', 'Harry', 'Joseph', 'Morris', 'Arthur'],
'surname': ['Smith', 'Brown', 'Black', 'Brown', 'White'],
'drv2': df1.drv.to_list(),
'gir2': df2.gir.to_list()
}
)
df_comb = pd.merge(left, right, on=['firstname', 'surname'])
# default join is "inner", i.e. intersection
display_side_by_side(left, right)
display(df_comb)
Join Types
You will notice that the resulting DataFrame contains only three rows. There are different "ways" in which `pd.merge()` determines which keys to include in the result. These are similar to the join types used with databases. In the above case, the default is `how='inner'`, which says to only include keys appearing in both `left` and `right`. Once again quoting the pandas docs:

| Merge method | SQL Join Name | Description |
| --- | --- | --- |
| `left` | LEFT OUTER JOIN | Use keys from left frame only |
| `right` | RIGHT OUTER JOIN | Use keys from right frame only |
| `outer` | FULL OUTER JOIN | Use union of keys from both frames |
| `inner` | INNER JOIN | Use intersection of keys from both frames |
And now a couple of examples. Do note: where a key combination appears in only one of the two frames, the values from the other frame will be NA in the combined result.
# specify 'outer' to get a union of the keys
df_union = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='outer')
display_side_by_side(left, right)
display(df_union)
# or we can use the keys from one frame or the other
# let's try the left frame
df_left_join = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='left')
display_side_by_side(left, right)
display(df_left_join)
One Last Example
If you have not worked with database tables, you may not anticipate the result of the following merge. Here the columns we want to merge on do not have the same name in the two DataFrames, so we have to use the `left_on=` and `right_on=` named parameters to specify which columns to use.
left = pd.DataFrame(
{
'name': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
'department': ['finance', 'sales', 'engineering', 'sales', 'engineering'],
'location': ['Vancouver', 'Vancouver', 'Vancouver', 'Calgary', 'Calgary'],
}
)
right = pd.DataFrame(
{
'dept': ['engineering', 'finance', 'sales'],
'supervisor': ['George', 'Helen', 'Barbara']
}
)
display_side_by_side(left, right)
display(pd.merge(left, right, left_on='department', right_on='dept'))
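Note that both department and dept show up in the result. When the key columns have different names, `merge()` keeps both; easy enough to drop one after the fact if it bothers you:

```python
df_m = pd.merge(left, right, left_on='department', right_on='dept')
# 'department' and 'dept' now hold the same values, so drop one of them
display(df_m.drop(columns='dept'))
```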
Done For Now
There’s all sorts more uses for pd.merge()
. And, I never looked at DataFrame.join()
. But the docs in the resources section will take you a lot further than this post was ever meant to do.
If you wish to play with the above, feel free to download my notebook covering the contents of this post.
Resources
- pandas docs: Merge, join, concatenate and compare
- selecting from multi-index pandas
- pandas.IndexSlice
- pandas.DataFrame.append
- pandas.DataFrame.dropna
- Using Hierarchical Indexes With Pandas
- How to change standard columns to MultiIndex
- Solve Pandas “ValueError: cannot reindex from a duplicate axis”
- How to Check if a File or Directory Exists in Python
- Sort Dataframe by substrings of a column
- Pandas make new column from string slice of another column
- How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s
- alejio / display_side_by_side