Join two Dataframes based on multiple columns

In this article, we will discuss how to join two pandas DataFrames based on multiple columns. We are going to use pandas.merge function and will cover different scenarios as below.

Table of Contents

To quickly get started, let’s create two sample dataframe to experiment. We’ll use the pandas library with some random data.

import pandas as pd

# List of Tuples
employees = [('Shubham', 'India', 'Tech',   5),
            ('Riti', 'India', 'Tech' ,   7),
            ('Shanky', 'India', 'PMO' ,   2),
            ('Shreya', 'India', 'Design' ,   2),
            ('Aadi', 'US', 'Tech', 11),
            ('Sim', 'US', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])
print(df)

Contents of the created dataframe are,

      Name Location    Team  Experience
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
4     Aadi       US    Tech          11
5      Sim       US    Tech           4

Let’s create another DataFrame with some sample data.

Advertisements
import pandas as pd

# random data
df1 = pd.DataFrame([['India', 'Tech', 40],
                   ['India', 'PMO', 26],
                   ['India', 'Design', 15],
                   ['US', 'Tech', 100]], columns = ['Location', 'Team', 'Salary'])

print(df1)

Contents of the created dataframe are,

  Location    Team  Salary
0    India    Tech      40
1    India     PMO      26
2    India  Design      15
3       US    Tech     100

Merge without any column keys mentioned

To merge these two DataFrames, we will be using pandas.merge function. It is very simple to use with minimal arguments, let’s try it below.

# inner join both the dataframes
print (df.merge(df1, how = 'inner'))

Output

      Name Location    Team  Experience  Salary
0  Shubham    India    Tech           5      40
1     Riti    India    Tech           7      40
2   Shanky    India     PMO           2      26
3   Shreya    India  Design           2      15
4     Aadi       US    Tech          11     100
5      Sim       US    Tech           4     100

The merge functionality searches for the common column headers between the two DataFrames, here it will be the columns Location and Team. Basis that, it has successfully merged the “Salary” information from df1 to df.

Merge with column keys mentioned

In cases where we directly want to specify the columns instead of letting the merge function figure it out, we can do that using the “on” attribute as below.

# left join both the dataframes
print (df.merge(df1, on = ['Location', 'Team'], how = 'left'))

Output

      Name Location    Team  Experience  Salary
0  Shubham    India    Tech           5      40
1     Riti    India    Tech           7      40
2   Shanky    India     PMO           2      26
3   Shreya    India  Design           2      15
4     Aadi       US    Tech          11     100
5      Sim       US    Tech           4     100

Note that the output still remains the same because there were no other common columns between these two DataFrames.

Merge with different column keys

Many a time, there can be cases where the column names are different between the two DataFrames but still, we want to merge the DataFrame on those columns. For demonstration, let’s change the column names of the second DataFrame as below.

# rename the columns
df1.rename(columns = {'Location':'employee_location', 'Team':'team'}, inplace=True)

print (df1)

Output

  employee_location    team  Salary
0             India    Tech      40
1             India     PMO      26
2             India  Design      15
3                US    Tech     100

Now to merge these two DataFrames, we might have to rename the DataFrames and then merge them using the above method. But luckily, the merge function can take care of this by using the different “on” attributes for left and right DataFrames.

# left join both the dataframes
print (df.merge(df1, left_on = ['Location', 'Team'], right_on = ['employee_location', 'team'], how='left'))

Output

      Name Location    Team  Experience employee_location    team  Salary
0  Shubham    India    Tech           5             India    Tech      40
1     Riti    India    Tech           7             India    Tech      40
2   Shanky    India     PMO           2             India     PMO      26
3   Shreya    India  Design           2             India  Design      15
4     Aadi       US    Tech          11                US    Tech     100
5      Sim       US    Tech           4                US    Tech     100

Here you go, it has merged the different named columns as well. However, this will keep columns from both the DataFrames, but they can be dropped separately.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'India', 'Tech',   5),
            ('Riti', 'India', 'Tech' ,   7),
            ('Shanky', 'India', 'PMO' ,   2),
            ('Shreya', 'India', 'Design' ,   2),
            ('Aadi', 'US', 'Tech', 11),
            ('Sim', 'US', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])
print(df)

# random data
df1 = pd.DataFrame([['India', 'Tech', 40],
                   ['India', 'PMO', 26],
                   ['India', 'Design', 15],
                   ['US', 'Tech', 100]], columns = ['Location', 'Team', 'Salary'])

print(df1)

# inner join both the dataframes
print (df.merge(df1, how = 'inner'))

# left join both the dataframes
print (df.merge(df1, on = ['Location', 'Team'], how = 'left'))

# rename the columns
df1.rename(columns = {'Location':'employee_location', 'Team':'team'}, inplace=True)

print (df1)

# left join both the dataframes
print (df.merge(df1, left_on = ['Location', 'Team'], right_on = ['employee_location', 'team'], how='left'))

Output:

      Name Location    Team  Experience
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
4     Aadi       US    Tech          11
5      Sim       US    Tech           4
  Location    Team  Salary
0    India    Tech      40
1    India     PMO      26
2    India  Design      15
3       US    Tech     100
      Name Location    Team  Experience  Salary
0  Shubham    India    Tech           5      40
1     Riti    India    Tech           7      40
2   Shanky    India     PMO           2      26
3   Shreya    India  Design           2      15
4     Aadi       US    Tech          11     100
5      Sim       US    Tech           4     100
      Name Location    Team  Experience  Salary
0  Shubham    India    Tech           5      40
1     Riti    India    Tech           7      40
2   Shanky    India     PMO           2      26
3   Shreya    India  Design           2      15
4     Aadi       US    Tech          11     100
5      Sim       US    Tech           4     100
  employee_location    team  Salary
0             India    Tech      40
1             India     PMO      26
2             India  Design      15
3                US    Tech     100
      Name Location  ...    team  Salary
0  Shubham    India  ...    Tech      40
1     Riti    India  ...    Tech      40
2   Shanky    India  ...     PMO      26
3   Shreya    India  ...  Design      15
4     Aadi       US  ...    Tech     100
5      Sim       US  ...    Tech     100

[6 rows x 7 columns]

Summary

In this article, we have discussed how to join two pandas Dataframes based on multiple columns.

Pandas Tutorials -Learn Data Analysis with Python

   

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top