In this article we will discuss how to merge dataframes on given columns or index as Join keys.
First let’s get a little intro about Dataframe.merge() again,
Dataframe.merge()
In Python’s Pandas Library Dataframe class provides a function to merge Dataframes i.e.
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
It accepts a hell lot of arguments. Let’s discuss some of them,
Imp Arguments :
- right : A dataframe or series to be merged with calling dataframe
- how : Merge type, values are : left, right, outer, inner. Default is ‘inner’. If both dataframes has some different columns, then based on this value, it will be decided which columns will be in the merged dataframe.
- on : Column name on which merge will be done. If not provided then merged on indexes.
- left_on : Specific column names in left dataframe, on which merge will be done.
- right_on : Specific column names in right dataframe, on which merge will be done.
- left_index : bool (default False)
- If True will choose index from left dataframe as join key.
- right_index : bool (default False)
- If True will choose index from right dataframe as join key.
- suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
- Suffex to be applied on overlapping columns in left & right dataframes respectively.
In our previous article our focus was on merging using ‘how’ argument i.e. basically merging Dataframes by default on common columns using different join types. But in this article we will mainly focus on other arguments like what if don’t want to join an all common columns ? What if we want to join on some selected columns only? Let’s see some examples to understand this,
First of all, let’s create two dataframes to be merged.
Frequently Asked:
Dataframe 1:
This dataframe contains the details of the employees like, ID, name, city, experience & Age i.e.
# List of Tuples empoyees = [ (11, 'jack', 34, 'Sydney', 5) , (12, 'Riti', 31, 'Delhi' , 7) , (13, 'Aadi', 16, 'New York', 11) , (14, 'Mohit', 32,'Delhi' , 15) , (15, 'Veena', 33, 'Delhi' , 4) , (16, 'Shaunak', 35, 'Mumbai', 5 ), (17, 'Shaun', 35, 'Colombo', 11) ] # Create a DataFrame object empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
Contents of the first dataframe created are,
ID Name Age City Experience a 11 jack 34 Sydney 5 b 12 Riti 31 Delhi 7 c 13 Aadi 16 New York 11 d 14 Mohit 32 Delhi 15 e 15 Veena 33 Delhi 4 f 16 Shaunak 35 Mumbai 5 g 17 Shaun 35 Colombo 11
Dataframe 2:
This dataframe contains the details of the employees like, ID, salary, bonus and experience i.e.
# Create a DataFrame object salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience', 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
Contents of the second dataframe created are,
ID Experience Salary Bonus a 11 Junior 70000 1000 b 12 Senior 72200 1100 c 13 Expert 84999 1000 d 14 Expert 90000 2000 e 15 Junior 61000 1500 f 16 Junior 71000 1000 g 21 Senior 81000 2000
Merging Dataframe on a given column name as join key
In both the above dataframes two column names are common i.e. ID & Experience. But contents of Experience column in both the dataframes are of different types, one is int and other is string. There is no point in merging based on that column. By default if we don’t pass the on argument then Dataframe.merge() will merge it on both the columns ID & Experience as we saw in previous post i.e.
https://thispointer.com/pandas-how-to-merge-dataframes-using-dataframe-merge-in-python-part-1/
Which will not work here. Therefore, here we need to merge these two dataframes on a single column i.e. ID. To do that pass the ‘on’ argument in the Datfarame.merge() with column name on which we want to join / merge these 2 dataframes i.e.
# Merge two Dataframes on single column 'ID' mergedDf = empDfObj.merge(salaryDfObj, on='ID')
Contents of the merged dataframe,
ID Name Age City Experience_x Experience_y Salary Bonus 0 11 jack 34 Sydney 5 Junior 70000 1000 1 12 Riti 31 Delhi 7 Senior 72200 1100 2 13 Aadi 16 New York 11 Expert 84999 1000 3 14 Mohit 32 Delhi 15 Expert 90000 2000 4 15 Veena 33 Delhi 4 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 Junior 71000 1000
It merged both the above two dataframes on ‘ID’ column. As both the dataframes had a columns with name ‘Experience’, so both the columns were added with default suffix to differentiate between them i.e. Experience_x for column from Left Dataframe and Experience_y for column from Right Dataframe.
In you want to join on multiple columns instead of a single column, then you can pass a list of column names to Dataframe.merge() instead of single column name. Also, as we didn’t specified the value of ‘how’ argument, therefore by default Dataframe.merge() uses inner join. You can also specify the join type using ‘how’ argument as explained in previous article i.e.
Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1
Merging Dataframe on a given column with suffix for similar column names
If there are some similar column names in both the dataframes which are not in join key then by default x & y is added as suffix to them.
Like in previous example merged dataframe contains Experience_x & Experience_y. Instead of default suffix, we can pass our custom suffix too i.e.
# Merge two Dataframes on column 'ID' with custom suffix for non-key common columns mergedDf = empDfObj.merge(salaryDfObj, on='ID', suffixes=('_In_Years', '_Levels'))
Contents of the merged dataframe,
ID Name Age City Experience_In_Years Experience_Levels Salary Bonus 0 11 jack 34 Sydney 5 Junior 70000 1000 1 12 Riti 31 Delhi 7 Senior 72200 1100 2 13 Aadi 16 New York 11 Expert 84999 1000 3 14 Mohit 32 Delhi 15 Expert 90000 2000 4 15 Veena 33 Delhi 4 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 Junior 71000 1000
We passed a tuple (‘_In_Years’, ‘_Levels’) in suffixes argument and ‘_In_Years’ is added to column from left dataframe and ‘_Levels’ is added to columns from right dataframe.
Merging Dataframe different columns
What if both the dataframes was completely different column names. For example let’s rename column ‘ID’ in dataframe 2 i.e.
# Rename column ID to EmpID salaryDfObj.rename(columns={'ID': 'EmpID'}, inplace=True)
Now out second dataframe salaryDFObj contents are,
EmpID Experience Salary Bonus a 11 Junior 70000 1000 b 12 Senior 72200 1100 c 13 Expert 84999 1000 d 14 Expert 90000 2000 e 15 Junior 61000 1500 f 16 Junior 71000 1000 g 21 Senior 81000 2000
Now let’s see how to merge these two dataframes on ‘ID‘ column from Dataframe 1 and ‘EmpID‘ column from dataframe 2 i.e.
# Merge two Dataframes on different columns mergedDf = empDfObj.merge(salaryDfObj, left_on='ID', right_on='EmpID')
Contents of the merged dataframe,
ID Name Age City Experience_x EmpID Experience_y Salary Bonus 0 11 jack 34 Sydney 5 11 Junior 70000 1000 1 12 Riti 31 Delhi 7 12 Senior 72200 1100 2 13 Aadi 16 New York 11 13 Expert 84999 1000 3 14 Mohit 32 Delhi 15 14 Expert 90000 2000 4 15 Veena 33 Delhi 4 15 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 16 Junior 71000 1000
Till now we have seen merging on columns either by default on specifically given columns. But what if we want to merge in index of both the dataframe ?
We will discuss that in next article.
Complete example is as follows,
import pandas as pd def main(): pd.set_option('display.max_rows', 100) pd.set_option('display.max_columns', 100) pd.set_option('display.width', 1500) print('*** Creating Dataframe 1 ***') # List of Tuples empoyees = [ (11, 'jack', 34, 'Sydney', 5) , (12, 'Riti', 31, 'Delhi' , 7) , (13, 'Aadi', 16, 'New York', 11) , (14, 'Mohit', 32,'Delhi' , 15) , (15, 'Veena', 33, 'Delhi' , 4) , (16, 'Shaunak', 35, 'Mumbai', 5 ), (17, 'Shaun', 35, 'Colombo', 11) ] # Create a DataFrame object empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g']) print("Dataframe 1 : ") print(empDfObj) print('*** Creating Dataframe 2 ***') # List of Tuples salaries = [(11, 'Junior', 70000, 1000) , (12, 'Senior', 72200, 1100) , (13, 'Expert', 84999, 1000) , (14, 'Expert', 90000, 2000) , (15, 'Junior', 61000, 1500) , (16, 'Junior', 71000, 1000), (21, 'Senior',81000, 2000) ] # Create a DataFrame object salaryDfObj = pd.DataFrame(salaries, columns=['ID', 'Experience', 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g']) print("Dataframe 2 : ") print(salaryDfObj) print('**** Merge two Dataframes on a single Column ****') # Merge two Dataframes on single column 'ID' mergedDf = empDfObj.merge(salaryDfObj, on='ID') print('Contents of the Merged Dataframe :') print(mergedDf) print('**** Merge two Dataframes with custom suffix ****') # Merge two Dataframes on column 'ID' with custom suffix for non-key common columns mergedDf = empDfObj.merge(salaryDfObj, on='ID', suffixes=('_In_Years', '_Levels')) print('Contents of the Merged Dataframe :') print(mergedDf) # Rename column ID to EmpID salaryDfObj.rename(columns={'ID': 'EmpID'}, inplace=True) print("Contents of the Dataframe 2 : ") print(salaryDfObj) print('**** Merge two Dataframes on different columns ****') # Merge two Dataframes on different columns mergedDf = empDfObj.merge(salaryDfObj, left_on='ID', right_on='EmpID') print('Contents of the Merged Dataframe :') print(mergedDf) if __name__ == '__main__': main()
Output:
*** Creating Dataframe 1 *** Dataframe 1 : ID Name Age City Experience a 11 jack 34 Sydney 5 b 12 Riti 31 Delhi 7 c 13 Aadi 16 New York 11 d 14 Mohit 32 Delhi 15 e 15 Veena 33 Delhi 4 f 16 Shaunak 35 Mumbai 5 g 17 Shaun 35 Colombo 11 *** Creating Dataframe 2 *** Dataframe 2 : ID Experience Salary Bonus a 11 Junior 70000 1000 b 12 Senior 72200 1100 c 13 Expert 84999 1000 d 14 Expert 90000 2000 e 15 Junior 61000 1500 f 16 Junior 71000 1000 g 21 Senior 81000 2000 **** Merge two Dataframes on a single Column **** Contents of the Merged Dataframe : ID Name Age City Experience_x Experience_y Salary Bonus 0 11 jack 34 Sydney 5 Junior 70000 1000 1 12 Riti 31 Delhi 7 Senior 72200 1100 2 13 Aadi 16 New York 11 Expert 84999 1000 3 14 Mohit 32 Delhi 15 Expert 90000 2000 4 15 Veena 33 Delhi 4 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 Junior 71000 1000 **** Merge two Dataframes with custom suffix **** Contents of the Merged Dataframe : ID Name Age City Experience_In_Years Experience_Levels Salary Bonus 0 11 jack 34 Sydney 5 Junior 70000 1000 1 12 Riti 31 Delhi 7 Senior 72200 1100 2 13 Aadi 16 New York 11 Expert 84999 1000 3 14 Mohit 32 Delhi 15 Expert 90000 2000 4 15 Veena 33 Delhi 4 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 Junior 71000 1000 Contents of the Dataframe 2 : EmpID Experience Salary Bonus a 11 Junior 70000 1000 b 12 Senior 72200 1100 c 13 Expert 84999 1000 d 14 Expert 90000 2000 e 15 Junior 61000 1500 f 16 Junior 71000 1000 g 21 Senior 81000 2000 **** Merge two Dataframes on different columns **** Contents of the Merged Dataframe : ID Name Age City Experience_x EmpID Experience_y Salary Bonus 0 11 jack 34 Sydney 5 11 Junior 70000 1000 1 12 Riti 31 Delhi 7 12 Senior 72200 1100 2 13 Aadi 16 New York 11 13 Expert 84999 1000 3 14 Mohit 32 Delhi 15 14 Expert 90000 2000 4 15 Veena 33 Delhi 4 15 Junior 61000 1500 5 16 Shaunak 35 Mumbai 5 16 Junior 71000 1000