Combining data from different sources is a requirement in many, if not most, data science projects. Pandas provides a couple of ways of combining the data in different objects (i.e. Series and DataFrames): `pandas.concat()` and `pandas.merge()`. We have seen both of them used in the series of posts on golf stats.
The former, `pd.concat()`, provides fairly straightforward concatenation of two or more datasets. The latter, `pd.merge()`, provides more complicated database-like joins and merges. Knowing how to combine different datasets is, I expect, a key data science skill.
Though I have already used both, I felt a bit of discussion was in order, if for no other reason than to help me reinforce what little I actually know about them and their strengths and weaknesses.
Python Package
I am going to use the golf stats DataFrames in this discussion of joining datasets with pandas. So, I decided to put the code from the previous post into a package I could import in the notebook I will be using to generate the code for this post. It took a bit of time and testing, but all in all it seems to do the job. I did add a couple of extra functions not in the previous post. One, `sort_player_name(df_in)`, sorts the golf stats on the player's last name (kept it simple; it doesn't involve the player's first name, for better or worse). The other, `csv_2_df_base(t_yr, t_id, p_st)`, loads the specified CSV file into a DataFrame without any of the multi-indexing I was doing in `csv_2_df(t_yr, t_id, p_st)`.
There are no tests in the package file; I did my testing in a separate notebook, which I have not kept.
The code is provided in a separate post. The post is not listed in the site’s post history.
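For context, here is roughly what those two helpers look like. This is a simplified sketch only: the data directory and file naming scheme shown here are assumptions on my part, and the real implementations are in the code post mentioned above.

```python
from pathlib import Path

import pandas as pd


def csv_2_df_base(t_yr, t_id, p_st):
    # Simplified sketch: load one stat CSV into a DataFrame keyed on player name.
    # The './data/{year}_{event}_{stat}.csv' layout is an assumption on my part.
    csv_fl = Path(f'./data/{t_yr}_{t_id}_{p_st}.csv')
    return pd.read_csv(csv_fl, index_col='PLAYER NAME')


def sort_player_name(df_in):
    # Sort the rows on the last word of the player-name index (i.e. the surname only).
    return df_in.reindex(sorted(df_in.index, key=lambda nm: nm.split()[-1]))
```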
import numpy as np
import pandas as pd
from pathlib import Path
import re

import golf_stats_lib as gs

# quick check that the package imported okay (IPython's help operator)
gs?
gs.tourney_2_df?
pd.concat()
Concatenating Series and DataFrames is similar to concatenating NumPy arrays, something we briefly looked at previously.
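As a quick reminder, the NumPy version looks like this:

```python
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
np.concatenate([a1, a2])
# -> array([1, 2, 3, 4, 5, 6])
```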
Okay, let’s load a couple of CSV files into DataFrames without messing with the indexing. Then concatenate them with pd.concat()
. And display the result.
df1 = gs.csv_2_df_base('2019', 'pga', 'drv')
display(df1)
df2 = gs.csv_2_df_base('2019', 'pga', 'gir')
df3 = pd.concat([df1, df2])
display(df3)
Definitely not what I wanted. And, the first time I ran the above, it was somewhat unexpected.
`concat()` Defaults
A couple of things are going on here:

1. By default, concatenation takes place along the row axis, i.e. the rows of each DataFrame being concatenated are appended to the result. This can be altered with the `axis=` named parameter.
2. Pandas preserves indices, as does `concat()`, even if the result ends up with duplicate index labels. The above does (though it's not really obvious); compare the row counts in the DataFrames above.
3. `concat()` uses a union-like join by default. This can be altered, somewhat, using the `join=` named parameter.
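A quick toy example (nothing to do with the golf data) may make items 2 and 3 more concrete:

```python
t1 = pd.DataFrame({'A': [1, 2]}, index=['w', 'x'])
t2 = pd.DataFrame({'A': [3, 4], 'B': [5, 6]}, index=['x', 'y'])

pd.concat([t1, t2])                     # index label 'x' appears twice; t1 rows get NaN in 'B'
pd.concat([t1, t2], join='inner')       # intersection: keep only the shared column 'A'
pd.concat([t1, t2], ignore_index=True)  # one way to avoid the duplicate index labels
```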
There are some options for dealing with item 2. But at the moment that is not really a problem for me. I am also currently fine with the default of item 3. So, I will look at changing the default axis.
axis=1
# this time concatenate along the column axis
df3 = pd.concat([df1, df2], axis=1)
df_sort = gs.sort_player_name(df3)
display(df_sort)
Check the row count. That appears to have worked, more or less as desired. Okay, let's add another stat, from a different tournament.
df4 = gs.csv_2_df_base('2019', 'rbch', 'gir')
#display(df4)
df5 = pd.concat([df3, df4], axis=1)
#display(df5)
df_sort = gs.sort_player_name(df5)
display(df_sort)
Well, the stats are there, but there is no indication as to which tournament they are from. And that is not really something we can ignore or would want to track separately. After all, if we did, what would be the point of using pandas? So, let's add some multi-indexing on the columns of the individual stat DataFrames before we concatenate.
# unfortunately the above does not tell us to which tournament and year the stat belongs
# and likely not something we want to keep track of separately
# So let's try adding that info to each csv dataframe
df6 = gs.csv_2_df_base('2019', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'drv')])
df6.columns = cols
df6 = gs.sort_player_name(df6)
display(df6)
# now another stat same year and tourney
df7 = gs.csv_2_df_base('2019', 'pga', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'pga', 'gir')])
df7.columns = cols
df7 = gs.sort_player_name(df7)
df_comb1 = pd.concat([df6, df7], axis=1)
display(df_comb1.head())
display(df_comb1.loc['Rory McIlroy'])
# and a different tournament
df8 = gs.csv_2_df_base('2019', 'rbch', 'gir')
cols = pd.MultiIndex.from_tuples([('2019', 'rbch', 'gir')])
df8.columns = cols
df8 = gs.sort_player_name(df8)
df_comb2 = pd.concat([df6, df7, df8], axis=1)
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
.columns.names
To make things a touch nicer, let's give our column index levels names. Then get just the gir stats: a simple example of filtering on multi-indexed columns.
df_comb2.columns.names = ['year', 'event', 'stat']
# boolean mask: columns whose 'stat' level equals 'gir'
filter = df_comb2.columns.get_level_values('stat') == 'gir'
df_comb2.iloc[:, filter].dropna(how='all')
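Incidentally, the same selection can be made without building the boolean mask by hand, using `DataFrame.xs()` or `pd.IndexSlice` (the latter is covered in the resources):

```python
# cross-section on the 'stat' level (drops that level from the result)
df_comb2.xs('gir', axis=1, level='stat').dropna(how='all')

# or slice the columns with pd.IndexSlice, keeping all three levels
df_comb2.loc[:, pd.IndexSlice[:, :, 'gir']].dropna(how='all')
```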
Let’s add a stat from a different year.
# add another year
# was going to try the append() method, but...
df9 = gs.csv_2_df_base('2020', 'pga', 'drv')
cols = pd.MultiIndex.from_tuples([('2020', 'pga', 'drv')])
df9.columns = cols
df_comb2 = pd.concat([df6, df7, df8, df9], axis=1)
df_comb2.columns.names = ['year', 'event', 'stat']
df_comb2 = gs.sort_player_name(df_comb2)
display(df_comb2)
Finally, let’s get only the drv stat for players without any missing data.
filter = df_comb2.columns.get_level_values('stat') == 'drv'
df_comb2.iloc[:, filter].dropna(how='any')
And I think that's enough for `pd.concat()`. There is also an `.append()` method on DataFrames and Series, but it is inefficient and has since been deprecated (and removed as of pandas 2.0) in favour of `concat()`. So, I haven't bothered to cover it, but there is a link in the resources if you are interested.
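For the record, the old `append()` pattern translates directly to `concat()`:

```python
df_a = pd.DataFrame({'A': [1, 2]})
df_b = pd.DataFrame({'A': [3]})

# the deprecated pattern: df_a = df_a.append(df_b, ignore_index=True)
# becomes:
df_a = pd.concat([df_a, df_b], ignore_index=True)
```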
pd.merge()
If you have ever worked with a relational database, what `pd.merge()` provides will be immediately clear to you. If not, it shouldn't take much effort to understand. Relational databases, e.g. MySQL or PostgreSQL, are based on tables, much like a pandas DataFrame, albeit with some rules for building those tables that you may not have followed when building your own tables in other applications.

Typically, each row in a database table has one or more columns that uniquely identify that row, usually referred to as the key. You can think of this like the index, single or multi-level, of a DataFrame. Databases include the concept and/or operation of a join (something built on relational algebra). Pandas implements a few of the fundamental building blocks of relational algebra. These are provided via `merge()` and the related DataFrame method `join()`.
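To give a rough sense of the relationship between the two (a toy example, not the golf data): `join()` is essentially a merge on the row index.

```python
lf = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
rt = pd.DataFrame({'B': [3, 4]}, index=['x', 'z'])

lf.join(rt)  # joins on the index; how='left' by default
# the equivalent spelled out with merge():
pd.merge(lf, rt, left_index=True, right_index=True, how='left')
```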
Let’s look at few approaches to joining datasets. We’ll start with a simple one-to-one join.
Simple Join
Let’s build a couple of small datasets so that things will be easier to see. I have added a function to generate the dataframes I want for this example. Took a bit of work but it’s all about learning. Also got a function to display DataFrames side by side in the output frame of the notebook from GitHub. Not sure how that will translate to the output in the post. (The two functions are in the notebook related to this post.)
In this case, the matching column (the key) in each DataFrame will be the same size, with no duplicate values.
df1 = get_small_df('2020', 'pga', 'drv')
df2 = get_small_df('2020', 'pga', 'gir')
df3 = pd.merge(df1, df2)
display_side_by_side(df1, df2, df3)
This is a simple example, but `merge()` recognized that the two DataFrames had the PLAYER NAME column in common and joined them using that column as the key, reindexing the rows in the process (i.e. discarding the original indices). (Sorry, I don't quite have the display sorted out.)

We could also have specified the column(s) to use, e.g. `df4 = pd.merge(df1, df2, on='PLAYER NAME', sort=False)`. The notebook shows that the output is identical.
Something A Little More Complicated
Let’s look at joining on multipe columns, then at how
the merges/joins can be done.
# let's try using multiple columns
left = pd.DataFrame(
{
'firstname': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
'surname': ['Smith', 'Smith', 'Brown', 'White', 'Black'],
'drv1': df1.drv.to_list(),
'gir1': df2.gir.to_list()
}
)
right = pd.DataFrame(
{
'firstname': ['John', 'Harry', 'Joseph', 'Morris', 'Arthur'],
'surname': ['Smith', 'Brown', 'Black', 'Brown', 'White'],
'drv2': df1.drv.to_list(),
'gir2': df2.gir.to_list()
}
)
df_comb = pd.merge(left, right, on=['firstname', 'surname'])
# default join is "inner", i.e. intersection
display_side_by_side(left, right)
display(df_comb)
Join Types
You will notice that the resulting DataFrame contains only three rows. There are different "ways" in which `pd.merge()` determines which keys to include in the result. These are similar to the join types used with databases. In the above case, the default is `how='inner'`, which says to only include keys appearing in both `left` and `right`. Once again quoting the pandas docs:

| Merge method | SQL Join Name | Description |
| --- | --- | --- |
| `left` | LEFT OUTER JOIN | Use keys from left frame only |
| `right` | RIGHT OUTER JOIN | Use keys from right frame only |
| `outer` | FULL OUTER JOIN | Use union of keys from both frames |
| `inner` | INNER JOIN | Use intersection of keys from both frames |
And now a couple of examples. Do note: where a key combination appears in only one of the two frames, the values from the other frame will be NA in the combined result.
# specify 'outer' to get a union of the keys
df_union = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='outer')
display_side_by_side(left, right)
display(df_union)
# or we can use the keys from one frame or the other
# let's try the left frame
df_left_join = pd.merge(left, right, on=['firstname', 'surname'], sort=False, how='left')
display_side_by_side(left, right)
display(df_left_join)
One Last Example
If you have not worked with database tables, you may not anticipate the result of the following merge. Here the columns we want to merge on do not have the same name in the two DataFrames, so we have to use the `left_on=` and `right_on=` named parameters to specify which columns to use.
left = pd.DataFrame(
{
'name': ['John', 'Frank', 'Harry', 'Morris', 'Joseph'],
'department': ['finance', 'sales', 'engineering', 'sales', 'engineering'],
'location': ['Vancouver', 'Vancouver', 'Vancouver', 'Calgary', 'Calgary'],
}
)
right = pd.DataFrame(
{
'dept': ['engineering', 'finance', 'sales'],
'supervisor': ['George', 'Helen', 'Barbara']
}
)
display_side_by_side(left, right)
display(pd.merge(left, right, left_on='department', right_on='dept'))
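Note that both department and dept show up in the result. When the key columns have different names, `merge()` keeps both; easy enough to drop one after the fact if it bothers you:

```python
df_m = pd.merge(left, right, left_on='department', right_on='dept')
# 'department' and 'dept' now hold the same values, so drop one of them
display(df_m.drop(columns='dept'))
```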
Done For Now
There’s all sorts more uses for pd.merge()
. And, I never looked at DataFrame.join()
. But the docs in the resources section will take you a lot further than this post was ever meant to do.
If you wish to play with the above, feel free to download my notebook covering the contents of this post.
Resources
- pandas docs: Merge, join, concatenate and compare
- selecting from multi-index pandas
- pandas.IndexSlice
- pandas.DataFrame.append
- pandas.DataFrame.dropna
- Using Hierarchical Indexes With Pandas
- How to change standard columns to MultiIndex
- Solve Pandas “ValueError: cannot reindex from a duplicate axis”
- How to Check if a File or Directory Exists in Python
- Sort Dataframe by substrings of a column
- Pandas make new column from string slice of another column
- How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s
- alejio / display_side_by_side