In this article we will discuss how to find and select duplicate rows in a DataFrame, based either on all columns or on selected columns only.

DataFrame.duplicated()

In Python’s Pandas library, the DataFrame class provides a member function, DataFrame.duplicated(subset=None, keep='first'), to find duplicate rows based on all columns or on some specific columns.

It returns a Boolean Series with True value for each duplicated row.

Arguments:

  • subset :
    • Single or multiple column labels which should be used for the duplication check. If not provided, all columns will
      be checked for finding duplicate rows.
  • keep :
    • Determines which occurrence of a duplicated row, if any, is left unmarked. Its value can be {‘first’, ‘last’, False};
      the default value is ‘first’.

      • first : All duplicates except their first occurrence will be marked as True
      • last : All duplicates except their last occurrence will be marked as True
      • False : All duplicates, including the first and last occurrences, will be marked as True
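The effect of the three keep values can be seen on a tiny example. This is a minimal sketch; the single-column frame and its column name 'col' are made up for illustration:

```python
import pandas as pd

# Hypothetical one-column frame: the value 'a' appears at positions 0, 2 and 3.
df = pd.DataFrame({'col': ['a', 'b', 'a', 'a']})

mask_first = df.duplicated(keep='first')  # first 'a' is left unmarked
mask_last = df.duplicated(keep='last')    # last 'a' is left unmarked
mask_all = df.duplicated(keep=False)      # every 'a' is marked

print(mask_first.tolist())  # [False, False, True, True]
print(mask_last.tolist())   # [True, False, True, False]
print(mask_all.tolist())    # [True, False, True, True]
```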

Some Examples :

Let’s create a DataFrame with some duplicate rows and look at its contents.
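A DataFrame for the examples could be built as follows. This is only a sketch: the names, ages and cities are assumed sample data, chosen so that some rows repeat completely and others repeat only in some columns.

```python
import pandas as pd

# Illustrative sample data (assumed, not taken from a real dataset).
students = [
    ('jack', 34, 'Sydney'),
    ('Riti', 30, 'Delhi'),
    ('Aadi', 16, 'New York'),
    ('Riti', 30, 'Delhi'),    # full duplicate of row 1
    ('Riti', 30, 'Delhi'),    # full duplicate of row 1
    ('Riti', 30, 'Mumbai'),   # same Name and Age, different City
    ('Aadi', 40, 'London'),   # same Name only
    ('Sachin', 30, 'Delhi'),  # same Age and City as row 1
]

df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
print(df)
```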

Now let’s find duplicate rows in it.

Find Duplicate Rows based on all columns

To find and select all duplicate rows based on all columns, call DataFrame.duplicated() without any subset argument. It returns a Boolean Series with True at the position of each duplicated row except its first occurrence (the default value of the keep argument is ‘first’). Then pass this Boolean Series to the [] operator of the DataFrame to select the rows which are duplicates.
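As a minimal sketch of this step, assuming illustrative sample data with the columns 'Name', 'Age' and 'City':

```python
import pandas as pd

# Illustrative sample data (assumed).
students = [('jack', 34, 'Sydney'), ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# Boolean Series: True for every repeated row except its first occurrence.
mask = df.duplicated()

# Boolean indexing keeps only the rows where the mask is True.
duplicate_rows = df[mask]
print(duplicate_rows)  # rows with index 3 and 4
```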

Here all duplicate rows except their first occurrence are returned, because the default value of the keep argument is ‘first’.

If we want to select all duplicate rows except their last occurrence, then we need to pass the keep argument as ‘last’.
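A short sketch with keep='last', again assuming illustrative sample data:

```python
import pandas as pd

# Illustrative sample data (assumed).
students = [('jack', 34, 'Sydney'), ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# The last occurrence (index 4) is left unmarked; earlier copies are selected.
duplicate_rows_last = df[df.duplicated(keep='last')]
print(duplicate_rows_last)  # rows with index 1 and 3
```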

Find Duplicate Rows based on selected columns

If we want to compare rows and find duplicates based on selected columns only, then we should pass a list of column names in the subset argument of the DataFrame.duplicated() function. It will mark rows as duplicates based on these columns only, and those rows can then be selected.

For example, let’s find and select duplicate rows based on a single column.
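A sketch of the single-column case, assuming illustrative sample data and using the ‘Name’ column as the subset:

```python
import pandas as pd

# Illustrative sample data (assumed).
students = [('jack', 34, 'Sydney'), ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# Only the 'Name' column is compared; the other columns are ignored.
duplicate_names = df[df.duplicated(subset=['Name'])]
print(duplicate_names)  # rows with index 3, 4, 5 and 6
```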

Here rows which have the same value in the ‘Name’ column are marked as duplicates and returned.

Another example: find and select duplicate rows based on two column names.
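A sketch of the two-column case, assuming illustrative sample data and using ‘Age’ and ‘City’ as the subset:

```python
import pandas as pd

# Illustrative sample data (assumed).
students = [('jack', 34, 'Sydney'), ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])

# A row is a duplicate only if both 'Age' and 'City' match an earlier row.
duplicate_age_city = df[df.duplicated(subset=['Age', 'City'])]
print(duplicate_age_city)  # rows with index 3, 4 and 7
```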

Here rows which have the same values in both the ‘Age’ and ‘City’ columns are marked as duplicates and returned.

The complete executable code is as follows.
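A minimal end-to-end sketch that combines the steps above might look like this. The sample data and the variable names are assumptions made for illustration:

```python
import pandas as pd

# Illustrative sample data (assumed).
students = [('jack', 34, 'Sydney'), ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'), ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'), ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'), ('Sachin', 30, 'Delhi')]
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
print('Original DataFrame:', df, sep='\n')

# Duplicates based on all columns, keeping the first occurrence unmarked.
dup_all_first = df[df.duplicated()]
print('Duplicates except first occurrence:', dup_all_first, sep='\n')

# Duplicates based on all columns, keeping the last occurrence unmarked.
dup_all_last = df[df.duplicated(keep='last')]
print('Duplicates except last occurrence:', dup_all_last, sep='\n')

# Duplicates based on a single column.
dup_by_name = df[df.duplicated(subset=['Name'])]
print("Duplicates based on 'Name':", dup_by_name, sep='\n')

# Duplicates based on two columns.
dup_by_age_city = df[df.duplicated(subset=['Age', 'City'])]
print("Duplicates based on 'Age' and 'City':", dup_by_age_city, sep='\n')
```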
