Drop Duplicate Rows from Pandas Dataframe

In this article, we will discuss different ways to delete duplicate rows in a pandas DataFrame.

A DataFrame is a data structure that stores data in rows and columns. We can create one with the pandas.DataFrame() constructor. Let's create a dataframe with 4 rows and 5 columns.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Drop duplicate rows from DataFrame using drop_duplicates()

To drop means to remove data from the dataframe, and a duplicate is a row whose data has already occurred in another row.
To drop duplicate rows, we use the drop_duplicates() method of the dataframe. It returns a new dataframe and leaves the original unchanged. Syntax is as follows:

df.drop_duplicates(subset=None, keep='first')

where df is the input dataframe and the other parameters are as follows:

  • subset takes a column label or a list of column labels to compare when identifying duplicates. The default, None, compares all columns.
  • keep controls which of the duplicate rows to keep. It accepts three values:
    • 'first': the default. The first occurrence is kept and the later occurrences are dropped.
    • 'last': the last occurrence is kept and the earlier occurrences are dropped.
    • False: every row that has a duplicate is dropped, so none of the occurrences is kept.
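To make the three keep options concrete, here is a minimal sketch that applies each of them to the four-row dataframe created above. The variable names kept_first, kept_last and kept_none are only illustrative.

```python
import pandas as pd

# Same dataframe as in the article: rows 0 and 3 match, rows 1 and 2 match
df = pd.DataFrame({'one':[0,0,0,0],
                   'two':[0,1,1,0],
                   'three':[0,0,0,0],
                   'four':[0,1,1,0],
                   'five':[34,56,56,34]})

# keep='first' (default): the first row of each duplicate group survives
kept_first = df.drop_duplicates(keep='first')
print(kept_first.index.tolist())   # [0, 1]

# keep='last': the last row of each duplicate group survives
kept_last = df.drop_duplicates(keep='last')
print(kept_last.index.tolist())    # [2, 3]

# keep=False: every row that has a duplicate is removed
kept_none = df.drop_duplicates(keep=False)
print(kept_none.index.tolist())    # []
```

Because every row in this dataframe has a twin, keep=False leaves an empty dataframe.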

Drop Duplicate Rows from Dataframe by one column

To treat rows as duplicates whenever they share the same value in a single column, pass that column to the subset parameter of drop_duplicates(). Syntax is as follows:

df.drop_duplicates(subset=['column name'])

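where,
1. df is the input dataframe
2. 'column name' is the label of the column that is compared to identify duplicate rows.

Example: In this example, we drop the rows that have a duplicate value in the column 'one'. Every row has the value 0 in that column, so only the first row is kept.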

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicate rows based on one column
df = df.drop_duplicates(subset=['one'])

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34

Drop duplicate rows from dataframe by multiple columns

To identify duplicate rows by the combination of values in several columns, pass a list of those columns to the subset parameter of drop_duplicates(). Syntax is as follows:

df.drop_duplicates(subset=['column1','column2',...........,'column n'])

where,
1. df is the input dataframe
2. subset is the list of column names that are compared together to identify duplicate rows.

Example: In this example, we drop duplicate rows based on the first three columns: 'one', 'two' and 'three'.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicate rows based on multiple columns
df = df.drop_duplicates(subset=['one','two','three'])

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56

Drop duplicate rows from dataframe by all columns

To drop rows that are duplicated across all columns, call drop_duplicates() with no arguments. By default it compares every column.
Syntax:

df.drop_duplicates()

Example: In this example, we drop duplicate rows from the entire dataframe.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicate rows from the entire dataframe
df = df.drop_duplicates()

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56

Drop duplicate rows from dataframe using groupby()

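Here we use the groupby() function to get unique rows from the dataframe. Grouping by a list of columns places all rows that share the same values in those columns into one group, and calling first() on the groups then returns one row per group, so each combination of values appears only once. Note that the grouping columns become the index of the result.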

Syntax:

df.groupby(['column1', 'column2',....,'column n']).first()

where,

  • df is the input dataframe
  • 'column1', 'column2', ..., 'column n' are the columns whose combined values are used to identify duplicate rows
  • first() returns the first value of each remaining column within each group

Example: Here, we remove rows that are duplicated in the 'one', 'five' and 'three' columns

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicate rows by multiple columns
df = df.groupby(['one', 'five','three']).first()

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

                two  four
one five three
0   34   0        0     0
    56   0        1     1

Summary

In this article, we discussed how to drop duplicate rows from a dataframe using drop_duplicates() in three scenarios (by one column, by multiple columns, and by all columns) and using the groupby() function.
