< Previous Page | Home Page | Next Page >
Let's see how we can combine several dataframes to one. Let's again start by defining some dummy dataframes.
# student DataFrame
student_df = pd.DataFrame([{'Name': 'Thomas', 'School': 'Computer Science'}, {'Name': 'Reto', 'School': 'Law'}, {'Name': 'Christina', 'School': 'Business'}])
student_df = student_df.set_index('Name')
student_df
# staff DataFrame
staff_df = pd.DataFrame([{'Name': 'Myriam', 'Role': 'Director of HR'},
{'Name': 'Christina', 'Role': 'Course liasion'}, {'Name': 'Thomas', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
staff_df
We can merge two dataframes with pandas' merge()
function. The syntax looks like this.
pd.merge(left_df, right_df, how, left_index, right_index)
For the "how" parameter we must understand that four different ways of joining two dataframes exist. Let's have a look at all of them.
In a full outer join, we We look at the keys from both DataFrames and create a new DataFrame with the columns of each. The row values in the columns correspond to the values in the original DataFrame.
pd.merge(student_df, staff_df, how='outer', left_index=True, right_index=True)
We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in both DataFrames.
pd.merge(student_df, staff_df, how='inner', left_index=True, right_index=True)
We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in the left DataFrame.
pd.merge(student_df, staff_df, how='left', left_index=True, right_index=True)
We look at the keys from both DataFrames and create a new DataFrame with only the indices that appear in the right DataFrame.
pd.merge(student_df, staff_df, how='right', left_index=True, right_index=True)
So far, we have only merged the dataframes, according to row values. We can also specify columns according which the two dataframes should be merged. Here we use the "left/right_on"
parameter.
pd.merge(staff_df, student_df, how='outer', left_on='Name', right_on='Name')
Obviously, merging bigger dataframes can lead to problems. Let's look at the following example
student_df = pd.DataFrame([{'Name': 'Thomas', 'School': 'Computer Science', 'Location': 'Museumsstrasse'}, {'Name': 'Reto', 'School': 'Law', 'Location':'Poststrasse'},
{'Name': 'Christina', 'School': 'Business',
'Location': 'Spisergasse'}])
student_df
staff_df = pd.DataFrame([{'Name': 'Myriam', 'Role': 'Director of HR', 'Location': 'Dufourstrasse'},
{'Name': 'Christina', 'Role': 'Course liasion', 'Location': 'MF Strasse'},
{'Name': 'Thomas', 'Role': 'Grader', 'Location': 'Rosenbergstrasse'}])
staff_df
pd.merge(student_df, staff_df, how='outer', left_on='Name', right_on='Name')
We can see that the merge function automatically created two columns for the Location. This is due to the fact that the location column in the staff_df specified the location of the office, whereas the location for the student_df stated their home adress.
Therefore, there are two different values for the same column name, which resulted in the creation of two seperate columns (x being from left, y being from right dataframe).
staff_df = pd.DataFrame([{'First Name': 'Myriam', 'Last Name': 'Schmuck', 'Role': 'Director of HR'},
{'First Name': 'Christina', 'Last Name': 'Zenker','Role': 'Course liasion'},
{'First Name': 'Thomas', 'Last Name': 'Huber', 'Role': 'Grader'}])
staff_df
student_df = pd.DataFrame([{'First Name': 'Thomas', 'Last Name': 'Mayer','School': 'Business'},
{'First Name': 'Reto', 'Last Name': 'Schmied', 'School': 'Law'},
{'First Name': 'Christina', 'Last Name': 'Zenker', 'School': 'Business'}])
student_df
Often a single value is not enough to merge correctly. In this example there are two rows with ”First Name”: ”Thomas”, but they are different people with different last names. Merging on ”First Name” is not good enough here!
Therefore, we specify that we want to merge both on the First and Last Name Columns. As Christina Zenker is the only one that exists in both datasets (we want an intersection), she will be added to the new dataframe.
pd.merge(student_df, staff_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
< Previous Page | Home Page | Next Page >