Pandas Tutorial Part #15 – Merging DataFrames

In this tutorial, we will discuss how to join, merge or concatenate DataFrames in Pandas.

Table Of Contents

Concatenating DataFrames

We can concatenate one or more DataFrames along the Rows & Columns using the concat() function.

Concatenate DataFrames along the Rows

Let’s first create three different DataFrames with similar column names but different index labels i.e.

import pandas as pd

# List of Tuples
data1= [('Jack', 34, 'Sydney', 5) ,
        ('Riti', 31, 'Delhi' , 7) ,
        ('Aadi', 46, 'New York', 11)]

# List of Tuples
data2= [('Mohit', 34, 'Tokyo', 11) ,
        ('Veena', 31, 'London' , 10) ,
        ('Shaun', 36, 'Las Vegas', 12)]

# List of Tuples
data3= [('Mark', 47, 'Mumbai',   13) ,
        ('Jose', 43, 'Yokohama', 14) ,
        ('Ramu', 49, 'Paris',    15)]


# Create a DataFrame object from list of tuples
firstDf = pd.DataFrame( data1,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['a', 'b', 'c'])

print('First DataFrame:')
# Display the First DataFrame
print(firstDf)

# Create a DataFrame object from list of tuples
secondDF = pd.DataFrame(data2,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['d', 'e', 'f'])

print('Second DataFrame:')
# Display the second DataFrame
print(secondDF)

# Create a DataFrame object from list of tuples
thirdDF = pd.DataFrame( data3,
                        columns=['Name', 'Age', 'City', 'Experience'],
                        index = ['g', 'h', 'i'])

print('Third DataFrame:')
# Display the third DataFrame
print(thirdDF)

Output:

Advertisements
First DataFrame:
   Name  Age      City  Experience
a  Jack   34    Sydney           5
b  Riti   31     Delhi           7
c  Aadi   46  New York          11

Second DataFrame:
    Name  Age       City  Experience
d  Mohit   34      Tokyo          11
e  Veena   31     London          10
f  Shaun   36  Las Vegas          12

Third DataFrame:
   Name  Age      City  Experience
g  Mark   47    Mumbai          13
h  Jose   43  Yokohama          14
i  Ramu   49     Paris          15

All the three DataFrames have same number of columns.

In Python, the Pandas library provides a function concat() to concatenate two or more DataFrames along a particular axis. But the dimensions must match be same along the axis of concatenation. For example, all the three DataFrames created above have similar columns, so we can easily concatenate them along the Rows. For example,

# Concatenate three DataFrames along the Rows
df = pd.concat([firstDf, secondDF, thirdDF])

# Display the Concatenated DataFrame
print(df)

Output:

    Name  Age       City  Experience
a   Jack   34     Sydney           5
b   Riti   31      Delhi           7
c   Aadi   46   New York          11
d  Mohit   34      Tokyo          11
e  Veena   31     London          10
f  Shaun   36  Las Vegas          12
g   Mark   47     Mumbai          13
h   Jose   43   Yokohama          14
i   Ramu   49      Paris          15

We passed the list of DataFrames in the concat() function and the default value of axis argument is 0 i.e. axis=0. If axis is 0, it means that DataFrames will be merged along the rows i.e. rows of all the DataFrames will be added on top of each other.

Concatenate DataFrames along the Columns

To concatenate the DataFrames along the columns, we need to pass axis=1 in the conact() function. For example,

# Concatenate three DataFrames along the Columns
df = pd.concat([firstDf, secondDF, thirdDF], axis=1)

# Display the Concatenated DataFrame
print(df)

Output:

   Name   Age      City  Experience   Name  ...  Experience  Name   Age      City  Experience
a  Jack  34.0    Sydney         5.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
b  Riti  31.0     Delhi         7.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
c  Aadi  46.0  New York        11.0    NaN  ...         NaN   NaN   NaN       NaN         NaN
d   NaN   NaN       NaN         NaN  Mohit  ...        11.0   NaN   NaN       NaN         NaN
e   NaN   NaN       NaN         NaN  Veena  ...        10.0   NaN   NaN       NaN         NaN
f   NaN   NaN       NaN         NaN  Shaun  ...        12.0   NaN   NaN       NaN         NaN
g   NaN   NaN       NaN         NaN    NaN  ...         NaN  Mark  47.0    Mumbai        13.0
h   NaN   NaN       NaN         NaN    NaN  ...         NaN  Jose  43.0  Yokohama        14.0
i   NaN   NaN       NaN         NaN    NaN  ...         NaN  Ramu  49.0     Paris        15.0

[9 rows x 12 columns]

As we concatenated the DataFrames along the columns and all the DataFrames do not have similar Index labels, therefore for missing values the NaN values got Added. Like,

  • The first DataFrame had index labels ‘a’, ‘b’ and ‘c’ but it didn’t have index labels from ‘d’ to ‘i’. Therefore, for missing index labels, NaN values got added.
  • The second DataFrame had index labels ‘d’, ‘e’ and ‘f’ but it didn’t have index labels from ‘a’ to ‘c’ and ‘g’ to ‘i’. Therefore, for missing index labels, NaN values got added.
  • The third DataFrame had index labels ‘g’, ‘h’ and ‘i’ but it didn’t have index labels from ‘a’ to ‘f’. Therefore, for missing index labels, NaN values got added.

Therefore, it is essential to make sure the dimensions of DataFrame is correct along the axis of concatenation. Like in above example, we merged along the columns but all the three DataFrames didn’t have similar index labels. Due to which so many NaN values got added.

Merging DataFrames

In Pandas, the DataFrame provides a function merge() to merge the DataFrames. Let’s start with examples,

Suppose we have two DataFrames,

Dataframe 1:

This DataFrame contains the details of the employees like, ID, name, city, experience & Age i.e.

import pandas as pd

# List of Tuples
empoyees = [(11, 'jack', 34, 'Sydney', 5) ,
            (12, 'Riti', 31, 'Delhi' , 7) ,
            (13, 'Aadi', 16, 'New York', 11) ,
            (14, 'Mohit', 32,'Delhi' , 15) ,
            (15, 'Veena', 33, 'Delhi' , 4) ,
            (16, 'Shaunak', 35, 'Mumbai', 5 ),
            (17, 'Shaun', 35, 'Colombo', 11)]

# Create a DataFrame object
empDfObj = pd.DataFrame(empoyees,
                        columns=['ID', 'Name', 'Age', 'City', 'Experience'],
                        index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

print("Dataframe 1 : ")
print(empDfObj)

Contents of the first dataframe created are,

Dataframe 1 : 

   ID     Name  Age      City  Experience
a  11     jack   34    Sydney           5
b  12     Riti   31     Delhi           7
c  13     Aadi   16  New York          11
d  14    Mohit   32     Delhi          15
e  15    Veena   33     Delhi           4
f  16  Shaunak   35    Mumbai           5
h  17    Shaun   35   Colombo          11

Dataframe 2:

This DataFrame contains the details of the employees like, ID, salary, bonus and experience i.e.

import pandas as pd

# List of Tuples
salaries = [(11, 5, 70000, 1000) ,
            (12, 7, 72200, 1100) ,
            (13, 11, 84999, 1000) ,
            (14, 15, 90000, 2000) ,
            (15, 4, 61000, 1500) ,
            (16, 5, 71000, 1000),
            (21, 10,81000, 2000)]

# Create a DataFrame object
salaryDfObj = pd.DataFrame( salaries,
                            columns=['ID', 'Experience' , 'Salary', 'Bonus'],
                            index=['a', 'b', 'c', 'd', 'e', 'f', 'h'])

print("Dataframe 2 : ")
print(salaryDfObj)

Contents of the second dataframe created are,

Dataframe 2 :
 
   ID  Experience  Salary  Bonus
a  11           5   70000   1000
b  12           7   72200   1100
c  13          11   84999   1000
d  14          15   90000   2000
e  15           4   61000   1500
f  16           5   71000   1000
h  21          10   81000   2000

Now let’s see how to merge these DataFrames along the Columns.

Merge DataFrames along columns (Default Inner Join)

Our two DataFrames have 2 common column names i.e. ‘ID’ & ‘Experience’. If we directly call Dataframe.merge() on these two DataFrames, without any additional arguments. Then it will merge the columns of the both the DataFrames for the rows in which values of ‘ID’ & ‘Experience’ are equal i.e.

# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj)

print('Contents of the Merged Dataframe :')
print(mergedDf)

Merged Dataframe contents are:

   ID   Name  Age    City  Experience  Salary  Bonus
0  11   jack   34  Sydney           5   70000   1000
1  12   Riti   31   Delhi           7   72200   1100
2  14  Mohit   32   Delhi          15   90000   2000
3  15  Veena   33   Delhi           4   61000   1500

It merged the contents of the both the DataFrames, but it picked only those rows where value of both the key columns “ID” & “Experience” were same. All other rows got skipped. It happened because by default “inner” join was used and key columns were “ID” and “Experience”.

In the above example,

Key column value combinations that were unique in DataFrame empDfObj (left DataFrame) were,

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

Key column value combinations that were unique in DataFrame salaryDfObj (right DataFrame) were,

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

As by default key is the common column names. Therefore key in our case was a combination of column “ID” and “Experience”. Therefore the above mentioned unique rows from both the DataFrames got skipped while merging.

So basically by default merge() function uses the Inner Join and it does the intersaction of keys in both the DataFrames. Results will be same if we pass how argument with value ‘inner’ i.e.

# Merge two Dataframes on common columns using default inner join
mergedDf = empDfObj.merge(salaryDfObj, how='inner')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Merge DataFrames along columns Using Left Join

What is Left Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Left Join” then,

  • Include all rows from the Left DataFrame.
  • For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
  • Skip rows from the Right DataFrame which have unique values in the key combinations.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘left’. It will merge two datframes on key columns using left join,

# Merge two Dataframes on common columns using Left join
mergedDf = empDfObj.merge(salaryDfObj, how='left')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

   ID     Name  Age      City  Experience   Salary   Bonus
0  11     jack   34    Sydney           5  70000.0  1000.0
1  12     Riti   31     Delhi           7  72200.0  1100.0
2  13     Aadi   16  New York          11      NaN     NaN
3  14    Mohit   32     Delhi          15  90000.0  2000.0
4  15    Veena   33     Delhi           4  61000.0  1500.0
5  16  Shaunak   35    Mumbai           5      NaN     NaN
6  17    Shaun   35   Colombo          11      NaN     NaN

It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.

Key column value combinations that were unique in DataFrame salaryDfObj (right side DataFrame) were just skipped while merging i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

So, basically in the Left Join, all the preference is given to the left side DataFrame.

Merge DataFrames along columns Using Right Join

What is Right Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Right Join” then,

  • Include all rows from the Right DataFrame.
  • For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.
  • Skip rows from the Left DataFrame which have unique values in the key combinations.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘right’. It will merge two datframes on key columns using right join,

# Merge two Dataframes on common columns using right join
mergedDf = empDfObj.merge(salaryDfObj, how='right')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

   ID   Name   Age    City  Experience  Salary  Bonus
0  11   jack  34.0  Sydney           5   70000   1000
1  12   Riti  31.0   Delhi           7   72200   1100
2  13    NaN   NaN     NaN          33   84999   1000
3  14  Mohit  32.0   Delhi          15   90000   2000
4  15  Veena  33.0   Delhi           4   61000   1500
5  16    NaN   NaN     NaN          14   71000   1000
6  21    NaN   NaN     NaN          10   81000   2000

It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.

Key column value combinations that were unique in DataFrame empDfObj (left side DataFrame) were just skipped while merging i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

So, basically in the Right Join, all the preference is given to the right side DataFrame.

Merge DataFrames along columns Using Outer Join

What is Outer Join ?

If Merging the two Dataframes on Key Columns and the Join type is “Outer Join” then,

  • Include all rows from the Right & Left DataFrame.
  • For the key combinations which are unique to Left DataFrame only, add NaN values for the columns included from Right DataFrame.
  • For the key combinations which are unique to Right DataFrame only, add NaN values for the columns included from Left DataFrame.

By default key is the combination of common column names. Like in our case it key will be a combination of column “ID” and “Experience”. In this example if we will pass how argument with value ‘outer’. It will merge two datframes on key columns using Outer join,

# Merge two Dataframes on common columns using Outer join
mergedDf = empDfObj.merge(salaryDfObj, how='outer')

print('Contents of the Merged Dataframe :')
print(mergedDf)

Contents of the merged dataframe :

Contents of the Merged Dataframe :


   ID     Name   Age      City  Experience   Salary   Bonus
0  11     jack  34.0    Sydney           5  70000.0  1000.0
1  12     Riti  31.0     Delhi           7  72200.0  1100.0
2  13     Aadi  16.0  New York          11      NaN     NaN
3  14    Mohit  32.0     Delhi          15  90000.0  2000.0
4  15    Veena  33.0     Delhi           4  61000.0  1500.0
5  16  Shaunak  35.0    Mumbai           5      NaN     NaN
6  17    Shaun  35.0   Colombo          11      NaN     NaN
7  13      NaN   NaN       NaN          33  84999.0  1000.0
8  16      NaN   NaN       NaN          14  71000.0  1000.0
9  21      NaN   NaN       NaN          10  81000.0  2000.0

It picked all rows from right dataframe (salaryDfObj ) and but there were 3 key unique column combinations in the Dataframe salaryDfObj (right DataFrame) i.e.

  • ID 13 and Experience 33
  • ID 16 and Experience 14
  • ID 21 and Experience 10

For these key column combinations, NaN values were used for columns (Name, Age and City) merged from the Left DataFrame i.e. empDfObj.

It picked all rows from left dataframe (empDfObj) and but there were 3 key unique column combinations in the Dataframe empDfObj (left DataFrame) i.e.

  • ID 13 and Experience 11
  • ID 16 and Experience 5
  • ID 17 and Experience 11

For these key column combinations, NaN values were used for columns (Salary and Bonus) merged from the Right DataFrame i.e. salaryDfObj.

Basically in case of Outer Join, equal weightage is given to both Left and Right DataFrames.

Summary

We learned about different ways to merge or concatenate DataFrames in Pandas.

Pandas Tutorials -Learn Data Analysis with Python

   

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

Leave a Comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top