Open In Colab

Combining Datasets: Merge and Join

Let's see how we can combine several dataframes to one. Let's again start by defining some dummy dataframes.

In [ ]:
# student DataFrame
student_df = pd.DataFrame([{'Name': 'Thomas', 'School': 'Computer Science'}, {'Name': 'Reto', 'School': 'Law'}, {'Name': 'Christina', 'School': 'Business'}])
student_df = student_df.set_index('Name')

student_df
Out[ ]:
School
Name
Thomas Computer Science
Reto Law
Christina Business
In [ ]:
# staff DataFrame
staff_df = pd.DataFrame([{'Name': 'Myriam', 'Role': 'Director of HR'},
{'Name': 'Christina', 'Role': 'Course liasion'}, {'Name': 'Thomas', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')

staff_df
Out[ ]:
Role
Name
Myriam Director of HR
Christina Course liasion
Thomas Grader

We can merge two dataframes with pandas' merge() function. The syntax looks like this.

pd.merge(left_df, right_df, how, left_index, right_index)

For the "how" parameter we must understand that four different ways of joining two dataframes exist. Let's have a look at all of them.

In general, four different ways of joining to dataframes exist

Merge Outer (Union)

In a full outer join, we We look at the keys from both DataFrames and create a new DataFrame with the columns of each. The row values in the columns correspond to the values in the original DataFrame.

In [ ]:
pd.merge(student_df, staff_df, how='outer', left_index=True, right_index=True)
Out[ ]:
School Role
Name
Christina Business Course liasion
Myriam NaN Director of HR
Reto Law NaN
Thomas Computer Science Grader

Inner join (Intersection)

We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in both DataFrames.

In [ ]:
pd.merge(student_df, staff_df, how='inner', left_index=True, right_index=True)
Out[ ]:
School Role
Name
Thomas Computer Science Grader
Christina Business Course liasion

Left join

We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in the left DataFrame.

In [ ]:
pd.merge(student_df, staff_df, how='left', left_index=True, right_index=True)
Out[ ]:
School Role
Name
Thomas Computer Science Grader
Reto Law NaN
Christina Business Course liasion

Right join

We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in the right DataFrame.

In [ ]:
pd.merge(student_df, staff_df, how='right', left_index=True, right_index=True)
Out[ ]:
School Role
Name
Myriam NaN Director of HR
Christina Business Course liasion
Thomas Computer Science Grader

Merging on Columns

So far, we have only merged the dataframes, according to row values. We can also specify columns according which the two dataframes should be merged. Here we use the "left/right_on"parameter.

In [ ]:
pd.merge(staff_df, student_df, how='outer', left_on='Name', right_on='Name')
Out[ ]:
Role School
Name
Myriam Director of HR NaN
Christina Course liasion Business
Thomas Grader Computer Science
Reto NaN Law

Merge conflict

Obviously, merging bigger dataframes can lead to problems. Let's look at the following example

In [ ]:
student_df = pd.DataFrame([{'Name': 'Thomas', 'School': 'Computer Science', 'Location': 'Museumsstrasse'}, {'Name': 'Reto', 'School': 'Law', 'Location':'Poststrasse'},
                           {'Name': 'Christina', 'School': 'Business',
                            'Location': 'Spisergasse'}])

student_df
Out[ ]:
Name School Location
0 Thomas Computer Science Museumsstrasse
1 Reto Law Poststrasse
2 Christina Business Spisergasse
In [ ]:
staff_df = pd.DataFrame([{'Name': 'Myriam', 'Role': 'Director of HR', 'Location': 'Dufourstrasse'},
                         {'Name': 'Christina', 'Role': 'Course liasion', 'Location': 'MF Strasse'}, 
                         {'Name': 'Thomas', 'Role': 'Grader', 'Location': 'Rosenbergstrasse'}])

staff_df
Out[ ]:
Name Role Location
0 Myriam Director of HR Dufourstrasse
1 Christina Course liasion MF Strasse
2 Thomas Grader Rosenbergstrasse
In [ ]:
pd.merge(student_df, staff_df, how='outer', left_on='Name', right_on='Name')
Out[ ]:
Name School Location_x Role Location_y
0 Thomas Computer Science Museumsstrasse Grader Rosenbergstrasse
1 Reto Law Poststrasse NaN NaN
2 Christina Business Spisergasse Course liasion MF Strasse
3 Myriam NaN NaN Director of HR Dufourstrasse

We can see that the merge function automatically created two columns for the Location. This is due to the fact that the location column in the staff_df specified the location of the office, whereas the location for the student_df stated their home adress.

Therefore, there are two different values for the same column name, which resulted in the creation of two seperate columns (x being from left, y being from right dataframe).

Joining over multiple columns

In [ ]:
staff_df = pd.DataFrame([{'First Name': 'Myriam', 'Last Name': 'Schmuck', 'Role': 'Director of HR'}, 
                         {'First Name': 'Christina', 'Last Name': 'Zenker','Role': 'Course liasion'}, 
                         {'First Name': 'Thomas', 'Last Name': 'Huber', 'Role': 'Grader'}])

staff_df
Out[ ]:
First Name Last Name Role
0 Myriam Schmuck Director of HR
1 Christina Zenker Course liasion
2 Thomas Huber Grader
In [ ]:
student_df = pd.DataFrame([{'First Name': 'Thomas', 'Last Name': 'Mayer','School': 'Business'},
                           {'First Name': 'Reto', 'Last Name': 'Schmied', 'School': 'Law'},
                           {'First Name': 'Christina', 'Last Name': 'Zenker', 'School': 'Business'}])

student_df
Out[ ]:
First Name Last Name School
0 Thomas Mayer Business
1 Reto Schmied Law
2 Christina Zenker Business

Often a single value is not enough to merge correctly. In this example there are two rows with ”First Name”: ”Thomas”, but they are different people with different last names. Merging on ”First Name” is not good enough here!

Therefore, we specify that we want to merge both on the First and Last Name Columns. As Christina Zenker is the only one that exists in both datasets (we want an intersection), she will be added to the new dataframe.

In [ ]:
pd.merge(student_df, staff_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
Out[ ]:
First Name Last Name School Role
0 Christina Zenker Business Course liasion