In this article we will discuss ways to find and select duplicate rows in a Dataframe based on all or given column names only.
DataFrame.duplicated()
In Python’s Pandas library, Dataframe class provides a member function to find duplicate rows based on all columns or some specific columns i.e.
DataFrame.duplicated(subset=None, keep='first')
It returns a Boolean Series with True value for each duplicated row.
Arguments:
- subset :
- Single or multiple column labels which should used for duplication check. If not provides all columns will
be checked for finding duplicate rows.
- Single or multiple column labels which should used for duplication check. If not provides all columns will
- keep :
- Denotes the occurrence which should be marked as duplicate. It’s value can be {‘first’, ‘last’, False},
default value is ‘first’.- first : All duplicates except their first occurrence will be marked as True
- last : All duplicates except their last occurrence will be marked as True
- False : All duplicates except will be marked as True
- Denotes the occurrence which should be marked as duplicate. It’s value can be {‘first’, ‘last’, False},
Some Examples :
Let’s create a Dataframe with some duplicate rows i.e.
Frequently Asked:
- Pandas Tutorial #6 – Introduction to DataFrame
- Replace Column values with Dictionary in Pandas Dataframe
- Pandas Dataframe.sum() method – Tutorial & Examples
- Replace NaN with given string in DataFrame in Pandas
# List of Tuples students = [('jack', 34, 'Sydeny'), ('Riti', 30, 'Delhi'), ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'), ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'), ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi') ] # Create a DataFrame object dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
Contents of this dataframe are,
Name Age City 0 jack 34 Sydeny 1 Riti 30 Delhi 2 Aadi 16 New York 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London 7 Sachin 30 Delhi
Now let’s find duplicate rows in it.
Find Duplicate Rows based on all columns
To find & select the duplicate all rows based on all columns call the Daraframe.duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence (default value of keep argument is ‘first’). Then pass this Boolean Series to [] operator of Dataframe to select the rows which are duplicate i.e.
# Select duplicate rows except first occurrence based on all columns duplicateRowsDF = dfObj[dfObj.duplicated()] print("Duplicate Rows except first occurrence based on all columns are :") print(duplicateRowsDF)
Output:
Duplicate Rows except first occurrence based on all columns are : Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi
Here all duplicate rows except their first occurrence are returned because default value of keep argument was ‘first’.
If we want to select all duplicate rows except their last occurrence then we need to pass the keep argument as ‘last’ i.e.
# Select duplicate rows except last occurrence based on all columns duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')] print("Duplicate Rows except last occurrence based on all columns are :") print(duplicateRowsDF)
Output:
Duplicate Rows except last occurrence based on all columns are : Name Age City 1 Riti 30 Delhi 3 Riti 30 Delhi
Find Duplicate Rows based on selected columns
If we want to compare rows & find duplicates based on selected columns only then we should pass list of column names in subset argument of the Dataframe.duplicate() function. It will select & return duplicate rows based on these passed columns only.
For example let’s find & select rows based on a single column,
# Select all duplicate rows based on one column duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])] print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')
Output:
Duplicate Rows based on a single column are: Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London
Here rows which has same value in ‘Name’ column are marked as duplicate and returned.
Another example : Find & select rows based on a two column names,
# Select all duplicate rows based on multiple column names in list duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])] print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')
Output:
Duplicate Rows based on 2 columns are: Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi 7 Sachin 30 Delhi
Here rows which has same values in ‘Age’Â & ‘City’ columns are marked as duplicate and returned.
Complete executable code is as follows,
import pandas as pd def main(): # List of Tuples students = [('jack', 34, 'Sydeny'), ('Riti', 30, 'Delhi'), ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'), ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'), ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi') ] # Create a DataFrame object dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City']) print("Original Dataframe", dfObj, sep='\n') print('*** Find Duplicate Rows based on all columns ***') # Select duplicate rows except first occurrence based on all columns duplicateRowsDF = dfObj[dfObj.duplicated()] print("Duplicate Rows except first occurrence based on all columns are :") print(duplicateRowsDF) # Select duplicate rows except last occurrence based on all columns duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')] print("Duplicate Rows except last occurrence based on all columns are :") print(duplicateRowsDF) # Select all duplicate rows based on all columns duplicateRowsDF = dfObj[dfObj.duplicated(keep=False)] print("All Duplicate Rows based on all columns are :") print(duplicateRowsDF) # Select all duplicate rows based on one column duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])] print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n') # Select all duplicate rows based on multiple column names in list duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])] print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n') if __name__ == '__main__': main()
Output:
Original Dataframe Name Age City 0 jack 34 Sydeny 1 Riti 30 Delhi 2 Aadi 16 New York 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London 7 Sachin 30 Delhi *** Find Duplicate Rows based on all columns *** Duplicate Rows except first occurrence based on all columns are : Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi Duplicate Rows except last occurrence based on all columns are : Name Age City 1 Riti 30 Delhi 3 Riti 30 Delhi All Duplicate Rows based on all columns are : Name Age City 1 Riti 30 Delhi 3 Riti 30 Delhi 4 Riti 30 Delhi Duplicate Rows based on a single column are: Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London Duplicate Rows based on 2 columns are: Name Age City 3 Riti 30 Delhi 4 Riti 30 Delhi 7 Sachin 30 Delhi
amazing explanation