In this article, we will discuss different ways to delete duplicate rows in a pandas DataFrame.
Table of Contents:
- Drop duplicate rows from DataFrame using drop_duplicates()
- Drop duplicate rows from dataframe using groupby()
A DataFrame is a data structure that stores data in rows and columns. We can create a DataFrame using the pandas.DataFrame() constructor. Let’s create a dataframe with 4 rows and 5 columns.
import pandas as pd

# Create a dataframe with 4 rows and 5 columns
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# Display the dataframe
print(df)
Output:
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34
Drop duplicate rows from DataFrame using drop_duplicates()
To drop means to remove data from the dataframe, and a duplicate is a row whose values occur more than once.
To drop duplicate rows, we use the drop_duplicates() method of the dataframe. The syntax is as follows:
df.drop_duplicates(subset=None, keep='first')
where, df is the input dataframe and other parameters are as follows:
- subset is a column label or a list of column labels to consider when identifying duplicates. The default, None, uses all columns.
- keep controls which of the duplicate rows is kept. It accepts one of three values:
  - 'first': the default. The first occurrence is kept and the later occurrences are dropped as duplicates.
  - 'last': the last occurrence is kept and the earlier occurrences are dropped as duplicates.
  - False: no occurrence is kept, so every row that has a duplicate is dropped.
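The effect of the three keep values can be compared side by side. The following is a minimal sketch that reuses the sample dataframe from this article; the variable names kept_first, kept_last and kept_none are only illustrative.

```python
import pandas as pd

# Same sample data as used throughout the article
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# keep='first' (default): keep the first row of each group of duplicates
kept_first = df.drop_duplicates(keep='first')

# keep='last': keep the last row of each group of duplicates
kept_last = df.drop_duplicates(keep='last')

# keep=False: drop every row that has a duplicate
kept_none = df.drop_duplicates(keep=False)

# Print the index labels of the surviving rows
print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

In this data, rows 0 and 3 are identical and rows 1 and 2 are identical, so keep='first' leaves rows 0 and 1, keep='last' leaves rows 2 and 3, and keep=False leaves an empty dataframe.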
Drop Duplicate Rows from Dataframe by one column
We are going to use the drop_duplicates() method to drop rows that have duplicate values in one column. The syntax is as follows:
df.drop_duplicates(subset=['column name'])
where,
1. df is the input dataframe
2. column name is the label of the column that is checked for duplicate values.
Example: In this example, we are going to drop duplicate rows based on the column 'one'. Every value in that column is 0, so only the first row is kept.
import pandas as pd

# Create a dataframe with 4 rows and 5 columns
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# Display the dataframe
print(df)

# Drop duplicates based on one column
df = df.drop_duplicates(subset=['one'])

print('Modified Dataframe')

# Display the dataframe
print(df)
Output:
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34
Modified Dataframe
   one  two  three  four  five
0    0    0      0     0    34
Drop duplicate rows from dataframe by multiple columns
We are going to drop rows that are duplicated across multiple columns using the drop_duplicates() method. The syntax is as follows:
df.drop_duplicates(subset=['column1', 'column2', ..., 'column n'])
where,
1. df is the input dataframe
2. subset is the list of column names that are compared to identify duplicate rows.
Example: In this example, we are going to drop duplicate rows based on the first three columns: 'one', 'two' and 'three'.
import pandas as pd

# Create a dataframe with 4 rows and 5 columns
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# Display the dataframe
print(df)

# Drop duplicates based on multiple columns
df = df.drop_duplicates(subset=['one', 'two', 'three'])

print('Modified Dataframe')

# Display the dataframe
print(df)
Output:
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34
Modified Dataframe
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
Drop duplicate rows from dataframe by all columns
We are going to drop rows that are duplicated across all columns. For that, we can simply call the drop_duplicates() method with no parameters.
Syntax:
df.drop_duplicates()
Example: In this example, we are going to drop duplicate rows from the entire dataframe.
import pandas as pd

# Create a dataframe with 4 rows and 5 columns
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# Display the dataframe
print(df)

# Drop duplicates from the entire dataframe
df = df.drop_duplicates()

print('Modified Dataframe')

# Display the dataframe
print(df)
Output:
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34
Modified Dataframe
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
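Before dropping rows, it can help to see which rows count as duplicates. The sketch below uses the related duplicated() method of the dataframe, which this article does not otherwise cover; it returns a boolean Series that is True for every row repeating an earlier one. The names mask and deduped are illustrative.

```python
import pandas as pd

# Same sample data as used throughout the article
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# True marks a row that repeats an earlier row (keep='first' is the default)
mask = df.duplicated()
print(mask.tolist())

# The rows that drop_duplicates() would remove
print(df[mask])

# Filtering with the inverted mask gives the same rows as drop_duplicates()
deduped = df[~mask]
print(deduped.equals(df.drop_duplicates()))
```

Here the mask is True for rows 2 and 3, which are exactly the rows missing from the modified dataframe above.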
Drop duplicate rows from dataframe using groupby()
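Here we use the groupby() function to get the unique rows from the dataframe. Grouping by a set of columns collects the rows that share the same values in those columns, and calling first() on the groups then returns one row per group. This lets us remove duplicate rows based on multiple columns. Note that the grouping columns become the index of the result.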
Syntax:
df.groupby(['column1', 'column2', ..., 'column n']).first()
where,
- df is the input dataframe
- columns are the column names that are compared to identify duplicate rows
- first() returns the first (non-null) value of each remaining column within every group
Example: Here, we are going to remove rows that are duplicated in the 'one', 'five' and 'three' columns.
import pandas as pd

# Create a dataframe with 4 rows and 5 columns
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# Display the dataframe
print(df)

# Drop duplicate rows based on multiple columns
df = df.groupby(['one', 'five', 'three']).first()

print('Modified Dataframe')

# Display the dataframe
print(df)
Output:
   one  two  three  four  five
0    0    0      0     0    34
1    0    1      0     1    56
2    0    1      0     1    56
3    0    0      0     0    34
Modified Dataframe
                 two  four
one five three
0   34   0        0     0
    56   0        1     1
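In the output above, the grouping columns have moved into the index. If they should stay as ordinary columns, one option is to pass as_index=False to groupby(). The sketch below shows this and compares the result with drop_duplicates() on the same columns; the names grouped and deduped are illustrative.

```python
import pandas as pd

# Same sample data as used throughout the article
df = pd.DataFrame({'one': [0, 0, 0, 0],
                   'two': [0, 1, 1, 0],
                   'three': [0, 0, 0, 0],
                   'four': [0, 1, 1, 0],
                   'five': [34, 56, 56, 34]})

# as_index=False keeps the grouping columns as regular columns
grouped = df.groupby(['one', 'five', 'three'], as_index=False).first()
print(grouped)

# The same rows obtained with drop_duplicates(), renumbered from 0
deduped = df.drop_duplicates(subset=['one', 'five', 'three']).reset_index(drop=True)
print(deduped)
```

For this data both approaches keep the same two rows, although groupby() places the grouping columns first. The two are not interchangeable in general: groupby() sorts the result by the group keys by default, first() takes the first non-null value of each column separately, and rows with NaN in a grouping column are left out by default.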
Summary
In this article, we discussed how to drop duplicate rows from a dataframe using drop_duplicates() in three scenarios (by one column, by multiple columns, and by all columns) and using the groupby() function.