In this article we will discuss how to find and select duplicate rows in a DataFrame, based on all columns or on a given subset of columns only.
DataFrame.duplicated()
In Python’s Pandas library, the DataFrame class provides a member function to find duplicate rows based on all columns or on some specific columns, i.e.
DataFrame.duplicated(subset=None, keep='first')
It returns a boolean Series with a True value for each duplicate row.
Arguments:
- subset : Single or multiple column labels to use for the duplication check. If not provided, all columns will be checked for duplicate rows.
- keep : Denotes which occurrence should be marked as duplicate. Its value can be {‘first’, ‘last’, False}; the default value is ‘first’.
  - first : All duplicates except their first occurrence will be marked as True
  - last : All duplicates except their last occurrence will be marked as True
  - False : All duplicates will be marked as True
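Before working through the main examples, here is a minimal, self-contained sketch (using made-up data, not the dataset from this article) that shows what DataFrame.duplicated() itself returns: a boolean Series aligned with the DataFrame’s index.

import pandas as pd

# Small hypothetical DataFrame with one duplicate row (index 2 repeats index 1)
df = pd.DataFrame({'A': [1, 2, 2, 3],
                   'B': ['x', 'y', 'y', 'z']})

# With the default keep='first', only the later occurrence is flagged
print(df.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool

This boolean Series is what we will pass to the DataFrame’s [] operator in the examples below to pick out the duplicate rows.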
Some Examples :
Let’s create a DataFrame with some duplicate rows, i.e.
import pandas as pd

# List of Tuples
students = [('jack', 34, 'Sydney'),
            ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'),
            ('Sachin', 30, 'Delhi')]

# Create a DataFrame object
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
The contents of this DataFrame are,
     Name  Age      City
0    jack   34    Sydney
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi
Now let’s find duplicate rows in it.
Find Duplicate Rows based on all columns
To find & select the duplicate rows based on all columns, call DataFrame.duplicated() without any subset argument. It will return a boolean Series with True at the position of each duplicate row except its first occurrence (the default value of the keep argument is ‘first’). Then pass this boolean Series to the [] operator of the DataFrame to select the rows which are duplicates, i.e.
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = dfObj[dfObj.duplicated()]

print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)
Output:
Duplicate Rows except first occurrence based on all columns are :
   Name  Age   City
3  Riti   30  Delhi
4  Riti   30  Delhi
Here all duplicate rows except their first occurrence are returned, because the default value of the keep argument is ‘first’.
If we want to select all duplicate rows except their last occurrence, then we need to pass the keep argument as ‘last’, i.e.
# Select duplicate rows except last occurrence based on all columns
duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')]

print("Duplicate Rows except last occurrence based on all columns are :")
print(duplicateRowsDF)
Output:
Duplicate Rows except last occurrence based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
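If we want to select every occurrence of the duplicate rows, i.e. mark all of them as True, we can pass keep=False. This snippet and its output are also part of the complete example at the end of this article:

# Select all duplicate rows based on all columns
duplicateRowsDF = dfObj[dfObj.duplicated(keep=False)]

print("All Duplicate Rows based on all columns are :")
print(duplicateRowsDF)

Output:

All Duplicate Rows based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
4  Riti   30  Delhi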
Find Duplicate Rows based on selected columns
If we want to compare rows & find duplicates based on selected columns only, then we should pass a list of column names in the subset argument of the DataFrame.duplicated() function. It will select & return duplicate rows based on these passed columns only.
For example, let’s find & select duplicate rows based on a single column,
# Select all duplicate rows based on one column
duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])]

print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')
Output:
Duplicate Rows based on a single column are:
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai
6  Aadi   40  London
Here rows which have the same value in the ‘Name’ column are marked as duplicates and returned.
Another example: find & select rows based on two column names,
# Select all duplicate rows based on multiple column names in list
duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])]

print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')
Output:
Duplicate Rows based on 2 columns are:
     Name  Age   City
3    Riti   30  Delhi
4    Riti   30  Delhi
7  Sachin   30  Delhi
Here rows which have the same values in the ‘Age’ & ‘City’ columns are marked as duplicates and returned.
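The subset and keep arguments can also be combined. As a small sketch (not part of the original example run), here is how we could select every occurrence of the rows which are duplicates on the ‘Age’ & ‘City’ columns:

# Select all occurrences of rows duplicated on 'Age' & 'City'
duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'], keep=False)]

print("All duplicate rows based on 2 columns are:", duplicateRowsDF, sep='\n')
# With the sample data above this should select rows 1, 3, 4 and 7,
# i.e. every row with Age 30 & City Delhi.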
Complete executable code is as follows,
import pandas as pd


def main():
    # List of Tuples
    students = [('jack', 34, 'Sydney'),
                ('Riti', 30, 'Delhi'),
                ('Aadi', 16, 'New York'),
                ('Riti', 30, 'Delhi'),
                ('Riti', 30, 'Delhi'),
                ('Riti', 30, 'Mumbai'),
                ('Aadi', 40, 'London'),
                ('Sachin', 30, 'Delhi')]

    # Create a DataFrame object
    dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
    print("Original Dataframe", dfObj, sep='\n')

    print('*** Find Duplicate Rows based on all columns ***')

    # Select duplicate rows except first occurrence based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated()]
    print("Duplicate Rows except first occurrence based on all columns are :")
    print(duplicateRowsDF)

    # Select duplicate rows except last occurrence based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated(keep='last')]
    print("Duplicate Rows except last occurrence based on all columns are :")
    print(duplicateRowsDF)

    # Select all duplicate rows based on all columns
    duplicateRowsDF = dfObj[dfObj.duplicated(keep=False)]
    print("All Duplicate Rows based on all columns are :")
    print(duplicateRowsDF)

    # Select all duplicate rows based on one column
    duplicateRowsDF = dfObj[dfObj.duplicated(['Name'])]
    print("Duplicate Rows based on a single column are:", duplicateRowsDF, sep='\n')

    # Select all duplicate rows based on multiple column names in list
    duplicateRowsDF = dfObj[dfObj.duplicated(['Age', 'City'])]
    print("Duplicate Rows based on 2 columns are:", duplicateRowsDF, sep='\n')


if __name__ == '__main__':
    main()
Output:
Original Dataframe
     Name  Age      City
0    jack   34    Sydney
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi
*** Find Duplicate Rows based on all columns ***
Duplicate Rows except first occurrence based on all columns are :
   Name  Age   City
3  Riti   30  Delhi
4  Riti   30  Delhi
Duplicate Rows except last occurrence based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
All Duplicate Rows based on all columns are :
   Name  Age   City
1  Riti   30  Delhi
3  Riti   30  Delhi
4  Riti   30  Delhi
Duplicate Rows based on a single column are:
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai
6  Aadi   40  London
Duplicate Rows based on 2 columns are:
     Name  Age   City
3    Riti   30  Delhi
4    Riti   30  Delhi
7  Sachin   30  Delhi
Pandas Tutorials - Learn Data Analysis with Python
- Pandas Tutorial Part #1 - Introduction to Data Analysis with Python
- Pandas Tutorial Part #2 - Basics of Pandas Series
- Pandas Tutorial Part #3 - Get & Set Series values
- Pandas Tutorial Part #4 - Attributes & methods of Pandas Series
- Pandas Tutorial Part #5 - Add or Remove Pandas Series elements
- Pandas Tutorial Part #6 - Introduction to DataFrame
- Pandas Tutorial Part #7 - DataFrame.loc[] - Select Rows / Columns by Indexing
- Pandas Tutorial Part #8 - DataFrame.iloc[] - Select Rows / Columns by Label Names
- Pandas Tutorial Part #9 - Filter DataFrame Rows
- Pandas Tutorial Part #10 - Add/Remove DataFrame Rows & Columns
- Pandas Tutorial Part #11 - DataFrame attributes & methods
- Pandas Tutorial Part #12 - Handling Missing Data or NaN values
- Pandas Tutorial Part #13 - Iterate over Rows & Columns of DataFrame
- Pandas Tutorial Part #14 - Sorting DataFrame by Rows or Columns
- Pandas Tutorial Part #15 - Merging or Concatenating DataFrames
- Pandas Tutorial Part #16 - DataFrame GroupBy explained with examples