In this tutorial, we will discuss how to join, merge or concatenate DataFrames in Pandas.
Table Of Contents
Concatenating DataFrames
We can concatenate one or more DataFrames along the Rows & Columns using the concat() function.
Concatenate DataFrames along the Rows
Let’s first create three different DataFrames with similar column names but different index labels i.e.
import pandas as pd # List of Tuples data1= [('Jack', 34, 'Sydney', 5) , ('Riti', 31, 'Delhi' , 7) , ('Aadi', 46, 'New York', 11)] # List of Tuples data2= [('Mohit', 34, 'Tokyo', 11) , ('Veena', 31, 'London' , 10) , ('Shaun', 36, 'Las Vegas', 12)] # List of Tuples data3= [('Mark', 47, 'Mumbai', 13) , ('Jose', 43, 'Yokohama', 14) , ('Ramu', 49, 'Paris', 15)] # Create a DataFrame object from list of tuples firstDf = pd.DataFrame( data1, columns=['Name', 'Age', 'City', 'Experience'], index = ['a', 'b', 'c']) print('First DataFrame:') # Display the First DataFrame print(firstDf) # Create a DataFrame object from list of tuples secondDF = pd.DataFrame(data2, columns=['Name', 'Age', 'City', 'Experience'], index = ['d', 'e', 'f']) print('Second DataFrame:') # Display the second DataFrame print(secondDF) # Create a DataFrame object from list of tuples thirdDF = pd.DataFrame( data3, columns=['Name', 'Age', 'City', 'Experience'], index = ['g', 'h', 'i']) print('Third DataFrame:') # Display the third DataFrame print(thirdDF)
Output:
Frequently Asked:
First DataFrame: Name Age City Experience a Jack 34 Sydney 5 b Riti 31 Delhi 7 c Aadi 46 New York 11 Second DataFrame: Name Age City Experience d Mohit 34 Tokyo 11 e Veena 31 London 10 f Shaun 36 Las Vegas 12 Third DataFrame: Name Age City Experience g Mark 47 Mumbai 13 h Jose 43 Yokohama 14 i Ramu 49 Paris 15
All the three DataFrames have same number of columns.
In Python, the Pandas library provides a function concat() to concatenate two or more DataFrames along a particular axis. But the dimensions must match be same along the axis of concatenation. For example, all the three DataFrames created above have similar columns, so we can easily concatenate them along the Rows. For example,
# Concatenate three DataFrames along the Rows df = pd.concat([firstDf, secondDF, thirdDF]) # Display the Concatenated DataFrame print(df)
Output:
Name Age City Experience a Jack 34 Sydney 5 b Riti 31 Delhi 7 c Aadi 46 New York 11 d Mohit 34 Tokyo 11 e Veena 31 London 10 f Shaun 36 Las Vegas 12 g Mark 47 Mumbai 13 h Jose 43 Yokohama 14 i Ramu 49 Paris 15
We passed the list of DataFrames in the concat() function and the default value of axis argument is 0 i.e. axis=0. If axis is 0, it means that DataFrames will be merged along the rows i.e. rows of all the DataFrames will be added on top of each other.
Latest Python - Video Tutorial
Concatenate DataFrames along the Columns
To concatenate the DataFrames along the columns, we need to pass axis=1 in the conact() function. For example,
# Concatenate three DataFrames along the Columns df = pd.concat([firstDf, secondDF, thirdDF], axis=1) # Display the Concatenated DataFrame print(df)
Output:
Name Age City Experience Name ... Experience Name Age City Experience a Jack 34.0 Sydney 5.0 NaN ... NaN NaN NaN NaN NaN b Riti 31.0 Delhi 7.0 NaN ... NaN NaN NaN NaN NaN c Aadi 46.0 New York 11.0 NaN ... NaN NaN NaN NaN NaN d NaN NaN NaN NaN Mohit ... 11.0 NaN NaN NaN NaN e NaN NaN NaN NaN Veena ... 10.0 NaN NaN NaN NaN f NaN NaN NaN NaN Shaun ... 12.0 NaN NaN NaN NaN g NaN NaN NaN NaN NaN ... NaN Mark 47.0 Mumbai 13.0 h NaN NaN NaN NaN NaN ... NaN Jose 43.0 Yokohama 14.0 i NaN NaN NaN NaN NaN ... NaN Ramu 49.0 Paris 15.0 [9 rows x 12 columns]
As we concatenated the DataFrames along the columns and all the DataFrames do not have similar Index labels, therefore for missing values the NaN values got Added. Like,
- The first DataFrame had index labels ‘a’, ‘b’ and ‘c’ but it didn’t have index labels from ‘d’ to ‘i’. Therefore, for missing index labels, NaN values got added.
- The second DataFrame had index labels ‘d’, ‘e’ and ‘f’ but it didn’t have index labels from ‘a’ to ‘c’ and ‘g’ to ‘i’. Therefore, for missing index labels, NaN values got added.
- The third DataFrame had index labels ‘g’, ‘h’ and ‘i’ but it didn’t have index labels from ‘a’ to ‘f’. Therefore, for missing index labels, NaN values got added.
Therefore, it is essential to make sure the dimensions of DataFrame is correct along the axis of concatenation. Like in above example, we merged along the columns but all the three DataFrames didn’t have similar index labels. Due to which so many NaN values got added.
Merging DataFrames
In Pandas, the DataFrame provides a function merge() to merge the DataFrames. Let’s start with examples,
Suppose we have two DataFrames,
Dataframe 1:
This DataFrame contains the details of the employees like, ID, name, city, experience & Age i.e.
import pandas as pd # List of Tuples empoyees = [(11, 'jack', 34, 'Sydney', 5) , (12, 'Riti', 31, 'Delhi' , 7) , (13, 'Aadi', 16, 'New York', 11) , (14, 'Mohit', 32,'Delhi' , 15) , (15, 'Veena', 33, 'Delhi' , 4) , (16, 'Shaunak', 35, 'Mumbai', 5 ), (17, 'Shaun', 35, 'Colombo', 11)] # Create a DataFrame object empDfObj = pd.DataFrame(empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h']) print("Dataframe 1 : ") print(empDfObj)
Contents of the first dataframe created are,
Dataframe 1 : ID Name Age City Experience a 11 jack 34 Sydney 5 b 12 Riti 31 Delhi 7 c 13 Aadi 16 New York 11 d 14 Mohit 32 Delhi 15 e 15 Veena 33 Delhi 4 f 16 Shaunak 35 Mumbai 5 h 17 Shaun 35 Colombo 11
Dataframe 2:
This DataFrame contains the details of the employees like, ID, salary, bonus and experience i.e.
import pandas as pd # List of Tuples salaries = [(11, 5, 70000, 1000) , (12, 7, 72200, 1100) , (13, 11, 84999, 1000) , (14, 15, 90000, 2000) , (15, 4, 61000, 1500) , (16, 5, 71000, 1000), (21, 10,81000, 2000)] # Create a DataFrame object salaryDfObj = pd.DataFrame( salaries, columns=['ID', 'Experience' , 'Salary', 'Bonus'], index=['a', 'b', 'c', 'd', 'e', 'f', 'h']) print("Dataframe 2 : ") print(salaryDfObj)
Contents of the second dataframe created are,
Dataframe 2 : ID Experience Salary Bonus a 11 5 70000 1000 b 12 7 72200 1100 c 13 11 84999 1000 d 14 15 90000 2000 e 15 4 61000 1500 f 16 5 71000 1000 h 21 10 81000 2000
Now let’s see how to merge these DataFrames along the Columns.
Merge DataFrames along columns (Default Inner Join)
Our two DataFrames have 2 common column names i.e. ‘ID’ & ‘Experience’. If we directly call Dataframe.merge() on these two DataFrames, without any additional arguments. Then it will merge the columns of the both the DataFrames for the rows in which values of ‘ID’ & ‘Experience’ are equal i.e.
# Merge two Dataframes on common columns using default inner join mergedDf = empDfObj.merge(salaryDfObj) print('Contents of the Merged Dataframe :') print(mergedDf)
Merged Dataframe contents are:
ID Name Age City Experience Salary Bonus 0 11 jack 34 Sydney 5 70000 1000 1 12 Riti 31 Delhi 7 72200 1100 2 14 Mohit 32 Delhi 15 90000 2000 3 15 Veena 33 Delhi 4 61000 1500
It merged the contents of the both the DataFrames, but it picked only those rows where value of both the key columns “ID” & “Experience” were same. All other rows got skipped. It happened because by default “inner” join was used and key columns were “ID” and “Experience”.
In the above example,
Key column value combinations that were unique in DataFrame empDfObj (left DataFrame) were,
- ID 13 and Experience 11
- ID 16 and Experience 5
- ID 17 and Experience 11
Key column value combinations that were unique in DataFrame salaryDfObj (right DataFrame) were,
- ID 13 and Experience 33
- ID 16 and Experience 14
- ID 21 and Experience 10
As by default key is the common column names. Therefore key in our case was a combination of column “ID” and “Experience”. Therefore the above mentioned unique rows from both the DataFrames got skipped while merging.
So basically by default merge() function uses the Inner Join and it does the intersaction of keys in both the DataFrames. Results will be same if we pass how argument with value ‘inner’ i.e.
# Merge two Dataframes on common columns using default inner join mergedDf = empDfObj.merge(salaryDfObj, how='inner') print('Contents of the Merged Dataframe :') print(mergedDf)
Merge DataFrames along columns Using Left Join
What is Left Join ?
If Merging the two Dataframes on Key Columns and the Join type is “Left Join” then,
- Include all rows from the Left DataFrame.
- For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
- Skip rows from the Right DataFrame which have unique values in the key combinations.
By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘left’. It will merge two datframes on key columns using left join,
# Merge two Dataframes on common columns using Left join mergedDf = empDfObj.merge(salaryDfObj, how='left') print('Contents of the Merged Dataframe :') print(mergedDf)
Contents of the merged dataframe :
ID Name Age City Experience Salary Bonus 0 11 jack 34 Sydney 5 70000.0 1000.0 1 12 Riti 31 Delhi 7 72200.0 1100.0 2 13 Aadi 16 New York 11 NaN NaN 3 14 Mohit 32 Delhi 15 90000.0 2000.0 4 15 Veena 33 Delhi 4 61000.0 1500.0 5 16 Shaunak 35 Mumbai 5 NaN NaN 6 17 Shaun 35 Colombo 11 NaN NaN
It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.
- ID 13 and Experience 11
- ID 16 and Experience 5
- ID 17 and Experience 11
For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.
Key column value combinations that were unique in DataFrame salaryDfObj (right side DataFrame) were just skipped while merging i.e.
- ID 13 and Experience 33
- ID 16 and Experience 14
- ID 21 and Experience 10
So, basically in the Left Join, all the preference is given to the left side DataFrame.
Merge DataFrames along columns Using Right Join
What is Right Join ?
If Merging the two Dataframes on Key Columns and the Join type is “Right Join” then,
- Include all rows from the Right DataFrame.
- For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.
- Skip rows from the Left DataFrame which have unique values in the key combinations.
By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘right’. It will merge two datframes on key columns using right join,
# Merge two Dataframes on common columns using right join mergedDf = empDfObj.merge(salaryDfObj, how='right') print('Contents of the Merged Dataframe :') print(mergedDf)
Contents of the merged dataframe :
ID Name Age City Experience Salary Bonus 0 11 jack 34.0 Sydney 5 70000 1000 1 12 Riti 31.0 Delhi 7 72200 1100 2 13 NaN NaN NaN 33 84999 1000 3 14 Mohit 32.0 Delhi 15 90000 2000 4 15 Veena 33.0 Delhi 4 61000 1500 5 16 NaN NaN NaN 14 71000 1000 6 21 NaN NaN NaN 10 81000 2000
It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.
- ID 13 and Experience 33
- ID 16 and Experience 14
- ID 21 and Experience 10
For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.
Key column value combinations that were unique in DataFrame empDfObj (left side DataFrame) were just skipped while merging i.e.
- ID 13 and Experience 11
- ID 16 and Experience 5
- ID 17 and Experience 11
So, basically in the Right Join, all the preference is given to the right side DataFrame.
Merge DataFrames along columns Using Outer Join
What is Outer Join ?
If Merging the two Dataframes on Key Columns and the Join type is “Outer Join” then,
- Include all rows from the Right & Left DataFrame.
- For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
- For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.
By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘outer’. It will merge two datframes on key columns using Outer join,
# Merge two Dataframes on common columns using Outer join mergedDf = empDfObj.merge(salaryDfObj, how='outer') print('Contents of the Merged Dataframe :') print(mergedDf)
Contents of the merged dataframe :
Contents of the Merged Dataframe : ID Name Age City Experience Salary Bonus 0 11 jack 34.0 Sydney 5 70000.0 1000.0 1 12 Riti 31.0 Delhi 7 72200.0 1100.0 2 13 Aadi 16.0 New York 11 NaN NaN 3 14 Mohit 32.0 Delhi 15 90000.0 2000.0 4 15 Veena 33.0 Delhi 4 61000.0 1500.0 5 16 Shaunak 35.0 Mumbai 5 NaN NaN 6 17 Shaun 35.0 Colombo 11 NaN NaN 7 13 NaN NaN NaN 33 84999.0 1000.0 8 16 NaN NaN NaN 14 71000.0 1000.0 9 21 NaN NaN NaN 10 81000.0 2000.0
It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.
- ID 13 and Experience 33
- ID 16 and Experience 14
- ID 21 and Experience 10
For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.
It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.
- ID 13 and Experience 11
- ID 16 and Experience 5
- ID 17 and Experience 11
For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.
Basically in case of Outer Join, equal weightage is given to both Left and Right DataFrames.
Summary
We learned about different ways to merge or concatenate DataFrames in Pandas.
Latest Video Tutorials