Drop Duplicate Rows from Pandas Dataframe

In this article, we will discuss different ways to delete duplicate rows in a pandas DataFrame.

Table of Contents:

Drop duplicate rows from DataFrame using drop_duplicates()
Drop duplicate rows from dataframe using groupby()

A DataFrame is a data structure that stores data in rows and columns. We can create one using the pandas.DataFrame() constructor. Let's create a DataFrame with 4 rows and 5 columns.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Drop duplicate rows from DataFrame using drop_duplicates()

To drop means to remove data from the DataFrame, and a duplicate is a row whose values occur more than once.
To drop duplicate rows, we use the drop_duplicates() method of the DataFrame. It returns a new DataFrame and leaves the original unchanged unless inplace=True is passed. The syntax is as follows:

df.drop_duplicates(subset=None, keep='first')

where df is the input DataFrame and the other parameters are as follows:

subset takes a column label or a list of column labels to compare when identifying duplicates. By default, all columns are compared.
keep controls which duplicate to keep. It accepts one of three values:
- 'first' – the default; keeps the first occurrence of each set of duplicates and drops the rest.
- 'last' – keeps the last occurrence of each set of duplicates and drops the rest.
- False – drops every row that has a duplicate, keeping none of them.
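The effect of the three keep values can be seen in a minimal sketch that reuses the sample DataFrame from above. The variable names first_kept, last_kept and none_kept are only illustrative.

```python
import pandas as pd

# Same sample data as used throughout the article
df = pd.DataFrame({'one':   [0, 0, 0, 0],
                   'two':   [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four':  [0, 1, 1, 0],
                   'five':  [34, 56, 56, 34]})

# keep='first' (default): keep the first row of each set of duplicates
first_kept = df.drop_duplicates(keep='first')

# keep='last': keep the last row of each set of duplicates
last_kept = df.drop_duplicates(keep='last')

# keep=False: drop every row that has a duplicate
none_kept = df.drop_duplicates(keep=False)

# The surviving index labels show which rows were kept
print(first_kept.index.tolist())  # [0, 1]
print(last_kept.index.tolist())   # [2, 3]
print(none_kept.index.tolist())   # []
```

Because every row in this sample has a twin, keep=False leaves an empty DataFrame. The original index labels are preserved in the result; passing ignore_index=True relabels the rows 0, 1, 2, and so on.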

Drop Duplicate Rows from Dataframe by one column

We can use the drop_duplicates() method to drop rows that have duplicate values in a single column. The syntax is as follows:

df.drop_duplicates(subset=['column name'])

where,
1. df is the input DataFrame
2. 'column name' is the label of the column that is checked for duplicate values.

Example: In this example, we drop the rows that have duplicate values in the column 'one'. Every value in this column is 0, so only the first row is kept.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicates in one column
df = df.drop_duplicates(subset=['one'])

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34

Drop duplicate rows from dataframe by multiple columns

We can also identify duplicate rows by comparing several columns with the drop_duplicates() method. The syntax is as follows:

df.drop_duplicates(subset=['column1','column2',...........,'column n'])

where,
1. df is the input DataFrame
2. subset is the list of column names that are compared to identify duplicate rows.

Example: In this example, we drop duplicate rows based on the first three columns: 'one', 'two' and 'three'.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicates from multiple columns
df = df.drop_duplicates(subset=['one','two','three'])

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56

Drop duplicate rows from dataframe by all columns

To drop rows that are duplicated across all columns, we can simply call the drop_duplicates() method with no arguments.
Syntax:

df.drop_duplicates()

Example: In this example, we drop duplicate rows from the entire DataFrame.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicates from entire Dataframe
df = df.drop_duplicates()

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56

Drop duplicate rows from dataframe using groupby()

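Here we use the groupby() function to get the unique rows of the DataFrame. Grouping by a set of columns puts all rows with the same values in those columns into one group, and calling first() on the result returns one row per group. In this way we can remove duplicate rows based on multiple columns. Note that the grouping columns become the index of the result.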

Syntax:

df.groupby(['column1', 'column2',....,'column n']).first()

where,

df is the input DataFrame
column1, column2, ..., column n are the column names that are compared to identify duplicate rows
first() returns the first value of each remaining column within each group

Example: Here, we remove rows with duplicate values in the 'one', 'five' and 'three' columns.

import pandas as pd

# Create dataframe with 4 rows and 5 columns
df= pd.DataFrame({'one':[0,0,0,0],
                  'two':[0,1,1,0],
                  'three':[0,0,0,0],
                  'four':[0,1,1,0],
                  'five':[34,56,56,34]})

# Display The dataframe
print(df)

# Drop duplicate rows by multiple columns
df = df.groupby(['one', 'five','three']).first()

print('Modified Dataframe')

# Display The dataframe
print(df)

Output:

   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34

Modified Dataframe

                two  four
one five three
0   34   0        0     0
    56   0        1     1

Summary

In this article, we discussed how to drop duplicate rows from a DataFrame using drop_duplicates() in three scenarios (by one column, by multiple columns, and by all columns) and using the groupby() function.
