In this article, we will discuss how to join two pandas DataFrames based on multiple columns. We are going to use pandas.merge function and will cover different scenarios as below.
Table of Contents
To quickly get started, let’s create two sample dataframe to experiment. We’ll use the pandas library with some random data.
import pandas as pd # List of Tuples employees = [('Shubham', 'India', 'Tech', 5), ('Riti', 'India', 'Tech' , 7), ('Shanky', 'India', 'PMO' , 2), ('Shreya', 'India', 'Design' , 2), ('Aadi', 'US', 'Tech', 11), ('Sim', 'US', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Location', 'Team', 'Experience']) print(df)
Contents of the created dataframe are,
Name Location Team Experience 0 Shubham India Tech 5 1 Riti India Tech 7 2 Shanky India PMO 2 3 Shreya India Design 2 4 Aadi US Tech 11 5 Sim US Tech 4
Let’s create another DataFrame with some sample data.
import pandas as pd # random data df1 = pd.DataFrame([['India', 'Tech', 40], ['India', 'PMO', 26], ['India', 'Design', 15], ['US', 'Tech', 100]], columns = ['Location', 'Team', 'Salary']) print(df1)
Contents of the created dataframe are,
Location Team Salary 0 India Tech 40 1 India PMO 26 2 India Design 15 3 US Tech 100
Merge without any column keys mentioned
To merge these two DataFrames, we will be using pandas.merge function. It is very simple to use with minimal arguments, let’s try it below.
Frequently Asked:
- Select Rows by Index Names in Pandas
- How to add Pandas DataFrame to an existing csv file?
- Change the Order of Columns in Pandas DataFrame
- Change Column Names in Pandas Dataframe
# inner join both the dataframes print (df.merge(df1, how = 'inner'))
Output
Name Location Team Experience Salary 0 Shubham India Tech 5 40 1 Riti India Tech 7 40 2 Shanky India PMO 2 26 3 Shreya India Design 2 15 4 Aadi US Tech 11 100 5 Sim US Tech 4 100
The merge functionality searches for the common column headers between the two DataFrames, here it will be the columns Location and Team. Basis that, it has successfully merged the “Salary” information from df1 to df.
Merge with column keys mentioned
In cases where we directly want to specify the columns instead of letting the merge function figure it out, we can do that using the “on” attribute as below.
# left join both the dataframes print (df.merge(df1, on = ['Location', 'Team'], how = 'left'))
Output
Name Location Team Experience Salary 0 Shubham India Tech 5 40 1 Riti India Tech 7 40 2 Shanky India PMO 2 26 3 Shreya India Design 2 15 4 Aadi US Tech 11 100 5 Sim US Tech 4 100
Note that the output still remains the same because there were no other common columns between these two DataFrames.
Merge with different column keys
Many a time, there can be cases where the column names are different between the two DataFrames but still, we want to merge the DataFrame on those columns. For demonstration, let’s change the column names of the second DataFrame as below.
# rename the columns df1.rename(columns = {'Location':'employee_location', 'Team':'team'}, inplace=True) print (df1)
Output
employee_location team Salary 0 India Tech 40 1 India PMO 26 2 India Design 15 3 US Tech 100
Now to merge these two DataFrames, we might have to rename the DataFrames and then merge them using the above method. But luckily, the merge function can take care of this by using the different “on” attributes for left and right DataFrames.
# left join both the dataframes print (df.merge(df1, left_on = ['Location', 'Team'], right_on = ['employee_location', 'team'], how='left'))
Output
Name Location Team Experience employee_location team Salary 0 Shubham India Tech 5 India Tech 40 1 Riti India Tech 7 India Tech 40 2 Shanky India PMO 2 India PMO 26 3 Shreya India Design 2 India Design 15 4 Aadi US Tech 11 US Tech 100 5 Sim US Tech 4 US Tech 100
Here you go, it has merged the different named columns as well. However, this will keep columns from both the DataFrames, but they can be dropped separately.
The complete example is as follows,
import pandas as pd # List of Tuples employees = [('Shubham', 'India', 'Tech', 5), ('Riti', 'India', 'Tech' , 7), ('Shanky', 'India', 'PMO' , 2), ('Shreya', 'India', 'Design' , 2), ('Aadi', 'US', 'Tech', 11), ('Sim', 'US', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Location', 'Team', 'Experience']) print(df) # random data df1 = pd.DataFrame([['India', 'Tech', 40], ['India', 'PMO', 26], ['India', 'Design', 15], ['US', 'Tech', 100]], columns = ['Location', 'Team', 'Salary']) print(df1) # inner join both the dataframes print (df.merge(df1, how = 'inner')) # left join both the dataframes print (df.merge(df1, on = ['Location', 'Team'], how = 'left')) # rename the columns df1.rename(columns = {'Location':'employee_location', 'Team':'team'}, inplace=True) print (df1) # left join both the dataframes print (df.merge(df1, left_on = ['Location', 'Team'], right_on = ['employee_location', 'team'], how='left'))
Output:
Name Location Team Experience 0 Shubham India Tech 5 1 Riti India Tech 7 2 Shanky India PMO 2 3 Shreya India Design 2 4 Aadi US Tech 11 5 Sim US Tech 4 Location Team Salary 0 India Tech 40 1 India PMO 26 2 India Design 15 3 US Tech 100 Name Location Team Experience Salary 0 Shubham India Tech 5 40 1 Riti India Tech 7 40 2 Shanky India PMO 2 26 3 Shreya India Design 2 15 4 Aadi US Tech 11 100 5 Sim US Tech 4 100 Name Location Team Experience Salary 0 Shubham India Tech 5 40 1 Riti India Tech 7 40 2 Shanky India PMO 2 26 3 Shreya India Design 2 15 4 Aadi US Tech 11 100 5 Sim US Tech 4 100 employee_location team Salary 0 India Tech 40 1 India PMO 26 2 India Design 15 3 US Tech 100 Name Location ... team Salary 0 Shubham India ... Tech 40 1 Riti India ... Tech 40 2 Shanky India ... PMO 26 3 Shreya India ... Design 15 4 Aadi US ... Tech 100 5 Sim US ... Tech 100 [6 rows x 7 columns]
Summary
In this article, we have discussed how to join two pandas Dataframes based on multiple columns.