Pandas : Find duplicate rows based on all or few columns

In this article we will discuss ways to find and select duplicate rows in a Dataframe based on all or given column names only.

DataFrame.duplicated()

In Python’s Pandas library, Dataframe class provides a member function to find duplicate rows based on all columns or some specific columns i.e.

DataFrame.duplicated(subset=None, keep='first')

It returns a Boolean Series with True value for each duplicated row.

Arguments:

  • subset :
    • Single or multiple column labels which should used for duplication check. If not provides all columns will
      be checked for finding duplicate rows.
  • keep :
    • Denotes the occurrence which should be marked as duplicate. It’s value can be {‘first’, ‘last’, False},
      default value is ‘first’.

      • first : All duplicates except their first occurrence will be marked as True
      • last : All duplicates except their last occurrence will be marked as True
      • False : All duplicates except will be marked as True

Some Examples :

Let’s create a Dataframe with some duplicate rows i.e.

# List of Tuples
students = [('jack', 34, 'Sydeny'),
            ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'),
            ('Sachin', 30, 'Delhi')
            ]

# Create a DataFrame object
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

Contents of this dataframe are,

     Name  Age      City
0    jack   34    Sydeny
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi

Now let’s find duplicate rows in it.

Find Duplicate Rows based on all columns

To find & select the duplicate all rows based on all columns call the Daraframe.duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence (default value of keep argument is ‘first’). Then pass this Boolean Series to [] operator of Dataframe to select the rows which are duplicate i.e.

# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = dfObj[dfObj.duplicated()]

print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Output:

Duplicate Rows except first occurrence based on all columns are :
   Name  Age   City
3  Riti   30  Delhi
4  Riti   30  Delhi

Here all duplicate rows except their first occurrence are returned because default value of keep argument was ‘first’.

If we want to select all duplicate rows except their last occurrence then we need to pass the keep argument as ‘last’ i.e.

# Select duplicate rows except last occurrence based on all columns
duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')]

print("Duplicate Rows except last occurrence based on all columns are :")
print(duplicateRowsDF)

Output:

Duplicate Rows except last occurrence based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi

Find Duplicate Rows based on selected columns

If we want to compare rows & find duplicates based on selected columns only then we should pass list of column names in subset argument of the Dataframe.duplicate() function. It will select & return duplicate rows based on these passed columns only.

For example let’s find & select rows based on a single column,

# Select all duplicate rows based on one column
duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])]

print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')

Output:

Duplicate Rows based on a single column are:
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai
6  Aadi   40  London

Here rows which has same value in ‘Name’ column are marked as duplicate and returned.

Another example : Find & select rows based on a two column names,

# Select all duplicate rows based on multiple column names in list
duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])]

print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')

Output:

Duplicate Rows based on 2 columns are:
     Name  Age   City
3    Riti   30  Delhi
4    Riti   30  Delhi
7  Sachin   30  Delhi

Here rows which has same values in ‘Age’  & ‘City’ columns are marked as duplicate and returned.

Complete executable code is as follows,

import pandas as pd

def main():
    # List of Tuples
    students = [('jack', 34, 'Sydeny'),
                ('Riti', 30, 'Delhi'),
                ('Aadi', 16, 'New York'),
                ('Riti', 30, 'Delhi'),
                ('Riti', 30, 'Delhi'),
                ('Riti', 30, 'Mumbai'),
                ('Aadi', 40, 'London'),
                ('Sachin', 30, 'Delhi')
                ]

    # Create a DataFrame object
    dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

    print("Original Dataframe", dfObj, sep='\n')

    print('*** Find Duplicate Rows based on all columns ***')

    # Select duplicate rows except first occurrence based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated()]

    print("Duplicate Rows except first occurrence based on all columns are :")
    print(duplicateRowsDF)

    # Select duplicate rows except last occurrence based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')]

    print("Duplicate Rows except last occurrence based on all columns are :")
    print(duplicateRowsDF)


    # Select all duplicate rows based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated(keep=False)]
    print("All Duplicate Rows based on all columns are :")
    print(duplicateRowsDF)


    # Select all duplicate rows based on one column
    duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])]

    print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')

    # Select all duplicate rows based on multiple column names in list
    duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])]

    print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')


if __name__ == '__main__':
    main()

Output:

Original Dataframe
     Name  Age      City
0    jack   34    Sydeny
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi
*** Find Duplicate Rows based on all columns ***
Duplicate Rows except first occurrence based on all columns are :
   Name  Age   City
3  Riti   30  Delhi
4  Riti   30  Delhi
Duplicate Rows except last occurrence based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
All Duplicate Rows based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
4  Riti   30  Delhi
Duplicate Rows based on a single column are:
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai
6  Aadi   40  London
Duplicate Rows based on 2 columns are:
     Name  Age   City
3    Riti   30  Delhi
4    Riti   30  Delhi
7  Sachin   30  Delhi

 

1 thought on “Pandas : Find duplicate rows based on all or few columns”

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top