Pandas : Merge Dataframes on specific columns or on index in Python – Part 2

In this article we will discuss how to merge dataframes on given columns or index as Join keys.

First let’s get a little intro about Dataframe.merge() again,

Dataframe.merge()

In Python’s Pandas Library Dataframe class provides a function to merge Dataframes i.e.

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

It accepts a hell lot of arguments. Let’s discuss some of them,
Imp Arguments :

  • right : A dataframe or series to be merged with calling dataframe
  • how : Merge type, values are : left, right, outer, inner. Default is ‘inner’. If both dataframes has some different columns, then based on this value, it will be decided which columns will be in the merged dataframe.
  • on : Column name on which merge will be done. If not provided then merged on indexes.
  • left_on : Specific column names in left dataframe, on which merge will be done.
  • right_on : Specific column names in right dataframe, on which merge will be done.
  • left_index : bool (default False)
    • If True will choose index from left dataframe as join key.
  • right_index : bool (default False)
    • If True will choose index from right dataframe as join key.
  • suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
    • Suffex to be applied on overlapping columns in left & right dataframes respectively.

In our previous article our focus was on merging using ‘how’ argument i.e. basically merging Dataframes by default on common columns using different join types. But in this article we will mainly focus on other arguments like what if don’t want to join an all common columns ? What if we want to join on some selected columns only? Let’s see some examples to understand this,

First of all, let’s create two dataframes to be merged.

Dataframe 1:
This dataframe contains the details of the employees like, ID, name, city, experience & Age i.e.

# List of Tuples
empoyees = [ (11, 'jack', 34, 'Sydney', 5) ,
           (12, 'Riti', 31, 'Delhi' , 7) ,
           (13, 'Aadi', 16, 'New York', 11) ,
           (14, 'Mohit', 32,'Delhi' , 15) ,
           (15, 'Veena', 33, 'Delhi' , 4) ,
           (16, 'Shaunak', 35, 'Mumbai', 5 ),
            (17, 'Shaun', 35, 'Colombo', 11)
            ]

# Create a DataFrame object
empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

Contents of the first dataframe created are,

   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
g  17    Shaun   35   Colombo          11

Dataframe 2:
This dataframe contains the details of the employees like, ID, salary, bonus and experience i.e.

# Create a DataFrame object
salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience', 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

Contents of the second dataframe created are,

   ID Experience  Salary  Bonus
a  11     Junior   70000   1000
b  12     Senior   72200   1100
c  13     Expert   84999   1000
d  14     Expert   90000   2000
e  15     Junior   61000   1500
f  16     Junior   71000   1000
g  21     Senior   81000   2000

Merging Dataframe on a given column name as join key

In both the above dataframes two column names are common i.e. ID & Experience. But contents of Experience column in both the dataframes are of different types, one is int and other is string. There is no point in merging based on that column. By default if we don’t pass the on argument then Dataframe.merge() will merge it on both the columns ID & Experience as we saw in previous post i.e.
https://thispointer.com/pandas-how-to-merge-dataframes-using-dataframe-merge-in-python-part-1/

Which will not work here. Therefore, here we need to merge these two dataframes on a single column i.e. ID. To do that pass the ‘on’ argument in the Datfarame.merge() with column name on which we want to join / merge these 2 dataframes i.e.

# Merge two Dataframes on single column 'ID'
mergedDf = empDfObj.merge(salaryDfObj, on='ID')

Contents of the merged dataframe,

   ID     Name  Age      City  Experience_x Experience_y  Salary  Bonus
0  11     jack   34    Sydney             5       Junior   70000   1000
1  12     Riti   31     Delhi             7       Senior   72200   1100
2  13     Aadi   16  New York            11       Expert   84999   1000
3  14    Mohit   32     Delhi            15       Expert   90000   2000
4  15    Veena   33     Delhi             4       Junior   61000   1500
5  16  Shaunak   35    Mumbai             5       Junior   71000   1000

It merged both the above two dataframes on ‘ID’ column. As both the dataframes had a columns with name ‘Experience’, so both the columns were added with default suffix to differentiate between them i.e. Experience_x for column from Left Dataframe and Experience_y for column from Right Dataframe.

In you want to join on multiple columns instead of  a single column, then you can pass a list of column names to Dataframe.merge() instead of single column name. Also, as we didn’t specified the value of ‘how’ argument, therefore by default Dataframe.merge() uses inner join. You can also specify the join type using ‘how’ argument as explained in previous article i.e.

Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1

 

Merging Dataframe on a given column with suffix for similar column names

If there are some similar column names in both the dataframes which are not in join key then by default x & y is added as suffix to them.
Like in previous example merged dataframe contains Experience_x & Experience_y. Instead of default suffix, we can pass our custom suffix too i.e.

# Merge two Dataframes on column 'ID' with custom suffix for non-key common columns
mergedDf = empDfObj.merge(salaryDfObj, on='ID', suffixes=('_In_Years', '_Levels'))

Contents of the merged dataframe,

   ID     Name  Age      City  Experience_In_Years Experience_Levels  Salary  Bonus
0  11     jack   34    Sydney                    5            Junior   70000   1000
1  12     Riti   31     Delhi                    7            Senior   72200   1100
2  13     Aadi   16  New York                   11            Expert   84999   1000
3  14    Mohit   32     Delhi                   15            Expert   90000   2000
4  15    Veena   33     Delhi                    4            Junior   61000   1500
5  16  Shaunak   35    Mumbai                    5            Junior   71000   1000

We passed a tuple (‘_In_Years’, ‘_Levels’) in suffixes argument and ‘_In_Years’ is added to column from left dataframe and ‘_Levels’ is added to columns from right dataframe.

Merging Dataframe different columns

What if both the dataframes was completely different column names. For example let’s rename column ‘ID’ in dataframe 2 i.e.

# Rename column ID to EmpID
salaryDfObj.rename(columns={'ID': 'EmpID'}, inplace=True)

Now out second dataframe salaryDFObj contents are,

   EmpID Experience  Salary  Bonus
a     11     Junior   70000   1000
b     12     Senior   72200   1100
c     13     Expert   84999   1000
d     14     Expert   90000   2000
e     15     Junior   61000   1500
f     16     Junior   71000   1000
g     21     Senior   81000   2000

Now let’s see how to merge these two dataframes on ‘ID‘ column from Dataframe 1 and ‘EmpID‘ column from dataframe 2 i.e.

# Merge two Dataframes on different columns
mergedDf = empDfObj.merge(salaryDfObj, left_on='ID', right_on='EmpID')

Contents of the merged dataframe,

   ID     Name  Age      City  Experience_x  EmpID Experience_y  Salary  Bonus
0  11     jack   34    Sydney             5     11       Junior   70000   1000
1  12     Riti   31     Delhi             7     12       Senior   72200   1100
2  13     Aadi   16  New York            11     13       Expert   84999   1000
3  14    Mohit   32     Delhi            15     14       Expert   90000   2000
4  15    Veena   33     Delhi             4     15       Junior   61000   1500
5  16  Shaunak   35    Mumbai             5     16       Junior   71000   1000

Till now we have seen merging on columns either by default on specifically given columns. But what if we want to merge in index of both the dataframe ?

We will discuss that in next article.

Complete example is as follows,

import pandas as pd

def main():
    pd.set_option('display.max_rows', 100)
    pd.set_option('display.max_columns', 100)
    pd.set_option('display.width', 1500)

    print('*** Creating Dataframe 1 ***')
    # List of Tuples
    empoyees = [ (11, 'jack', 34, 'Sydney', 5) ,
               (12, 'Riti', 31, 'Delhi' , 7) ,
               (13, 'Aadi', 16, 'New York', 11) ,
               (14, 'Mohit', 32,'Delhi' , 15) ,
               (15, 'Veena', 33, 'Delhi' , 4) ,
               (16, 'Shaunak', 35, 'Mumbai', 5 ),
                (17, 'Shaun', 35, 'Colombo', 11)
                ]

    # Create a DataFrame object
    empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
    print("Dataframe 1 : ")
    print(empDfObj)

    print('*** Creating Dataframe 2 ***')

    # List of Tuples
    salaries = [(11, 'Junior', 70000, 1000) ,
               (12, 'Senior', 72200, 1100) ,
               (13, 'Expert', 84999, 1000) ,
               (14, 'Expert', 90000, 2000) ,
               (15, 'Junior', 61000, 1500) ,
               (16, 'Junior', 71000, 1000),
               (21, 'Senior',81000, 2000)
                ]

    # Create a DataFrame object
    salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience', 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
    print("Dataframe 2 : ")
    print(salaryDfObj)


    print('**** Merge two Dataframes on a single Column ****')

    # Merge two Dataframes on single column 'ID'
    mergedDf = empDfObj.merge(salaryDfObj, on='ID')
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

    print('**** Merge two Dataframes with custom suffix ****')

    # Merge two Dataframes on column 'ID' with custom suffix for non-key common columns
    mergedDf = empDfObj.merge(salaryDfObj, on='ID', suffixes=('_In_Years', '_Levels'))
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

    # Rename column ID to EmpID
    salaryDfObj.rename(columns={'ID': 'EmpID'}, inplace=True)
    print("Contents of the Dataframe 2 : ")
    print(salaryDfObj)

    print('**** Merge two Dataframes on different columns ****')

    # Merge two Dataframes on different columns
    mergedDf = empDfObj.merge(salaryDfObj, left_on='ID', right_on='EmpID')
    print('Contents of the Merged Dataframe :')
    print(mergedDf)

if __name__ == '__main__':
  main()

Output:

*** Creating Dataframe 1 ***
Dataframe 1 : 
   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
g  17    Shaun   35   Colombo          11
*** Creating Dataframe 2 ***
Dataframe 2 : 
   ID Experience  Salary  Bonus
a  11     Junior   70000   1000
b  12     Senior   72200   1100
c  13     Expert   84999   1000
d  14     Expert   90000   2000
e  15     Junior   61000   1500
f  16     Junior   71000   1000
g  21     Senior   81000   2000
**** Merge two Dataframes on a single Column ****
Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience_x Experience_y  Salary  Bonus
0  11     jack   34    Sydney             5       Junior   70000   1000
1  12     Riti   31     Delhi             7       Senior   72200   1100
2  13     Aadi   16  New York            11       Expert   84999   1000
3  14    Mohit   32     Delhi            15       Expert   90000   2000
4  15    Veena   33     Delhi             4       Junior   61000   1500
5  16  Shaunak   35    Mumbai             5       Junior   71000   1000
**** Merge two Dataframes with custom suffix ****
Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience_In_Years Experience_Levels  Salary  Bonus
0  11     jack   34    Sydney                    5            Junior   70000   1000
1  12     Riti   31     Delhi                    7            Senior   72200   1100
2  13     Aadi   16  New York                   11            Expert   84999   1000
3  14    Mohit   32     Delhi                   15            Expert   90000   2000
4  15    Veena   33     Delhi                    4            Junior   61000   1500
5  16  Shaunak   35    Mumbai                    5            Junior   71000   1000
Contents of the Dataframe 2 : 
   EmpID Experience  Salary  Bonus
a     11     Junior   70000   1000
b     12     Senior   72200   1100
c     13     Expert   84999   1000
d     14     Expert   90000   2000
e     15     Junior   61000   1500
f     16     Junior   71000   1000
g     21     Senior   81000   2000
**** Merge two Dataframes on different columns ****
Contents of the Merged Dataframe :
   ID     Name  Age      City  Experience_x  EmpID Experience_y  Salary  Bonus
0  11     jack   34    Sydney             5     11       Junior   70000   1000
1  12     Riti   31     Delhi             7     12       Senior   72200   1100
2  13     Aadi   16  New York            11     13       Expert   84999   1000
3  14    Mohit   32     Delhi            15     14       Expert   90000   2000
4  15    Veena   33     Delhi             4     15       Junior   61000   1500
5  16  Shaunak   35    Mumbai             5     16       Junior   71000   1000

 

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top