In this article we will discuss how to find duplicate columns in a Pandas DataFrame and drop them.
In Python’s pandas library there are direct APIs to find out the duplicate rows, but there is no direct API to find the duplicate columns.
So, we have to build our API for that.
First of all, create a DataFrame with duplicate columns i.e.
# List of Tuples students = [('jack', 34, 'Sydeny', 34, 'Sydeny', 34), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Aadi', 16, 'New York', 16, 'New York', 16), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30), ('Aadi', 40, 'London', 40, 'London', 40), ('Sachin', 30, 'Delhi', 30, 'Delhi', 30) ] # Create a DataFrame object dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin']) print("Original Dataframe", dfObj, sep='\n')
Contents of the DataFrame created are,
Name Age City Marks Address Pin 0 jack 34 Sydeny 34 Sydeny 34 1 Riti 30 Delhi 30 Delhi 30 2 Aadi 16 New York 16 New York 16 3 Riti 30 Delhi 30 Delhi 30 4 Riti 30 Delhi 30 Delhi 30 5 Riti 30 Mumbai 30 Mumbai 30 6 Aadi 40 London 40 London 40 7 Sachin 30 Delhi 30 Delhi 30
Now as we can observer there are 3 duplicate columns in this DataFrame i.e. Marks, Address & Pin. Let’s see how to find them.
Frequently Asked:
- Replace NaN with zero in a column in Pandas
- Replace NaN with None in Pandas DataFrame
- Pandas: Get first N rows of dataframe
- How to Check if a Pandas Column contains a value?
Find duplicate columns in a DataFrame
To find these duplicate columns we need to iterate over DataFrame column wise and for every column it will search if any other column exists in DataFrame with same contents. If yes then then that column name will be stored in duplicate column list. In the end API will return the list of column names of duplicate columns i.e.
def getDuplicateColumns(df): ''' Get a list of duplicate columns. It will iterate over all the columns in dataframe and find the columns whose contents are duplicate. :param df: Dataframe object :return: List of columns whose contents are duplicates. ''' duplicateColumnNames = set() # Iterate over all the columns in dataframe for x in range(df.shape[1]): # Select column at xth index. col = df.iloc[:, x] # Iterate over all the columns in DataFrame from (x+1)th index till end for y in range(x + 1, df.shape[1]): # Select column at yth index. otherCol = df.iloc[:, y] # Check if two columns at x 7 y index are equal if col.equals(otherCol): duplicateColumnNames.add(df.columns.values[y]) return list(duplicateColumnNames)
Now let’s use this API to find the duplicate columns in above created DataFrame object dfObj i.e.
# Get list of duplicate columns duplicateColumnNames = getDuplicateColumns(dfObj) print('Duplicate Columns are as follows') for col in duplicateColumnNames: print('Column name : ', col)
Output:
Duplicate Columns are as follows Column name : Address Column name : Marks Column name : Pin
Drop duplicate columns in a DataFrame
To remove the duplicate columns we can pass the list of duplicate column’s names returned by our API to the dataframe.drop() i.e.
# Delete duplicate columns newDf = dfObj.drop(columns=getDuplicateColumns(dfObj)) print("Modified Dataframe", newDf, sep='\n')
Output:
Modified Dataframe Name Age City 0 jack 34 Sydeny 1 Riti 30 Delhi 2 Aadi 16 New York 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London 7 Sachin 30 Delhi
It will return a copy of existing DataFrame without duplicate columns.
Complete example is as follows,
import pandas as pd def getDuplicateColumns(df): ''' Get a list of duplicate columns. It will iterate over all the columns in dataframe and find the columns whose contents are duplicate. :param df: Dataframe object :return: List of columns whose contents are duplicates. ''' duplicateColumnNames = set() # Iterate over all the columns in dataframe for x in range(df.shape[1]): # Select column at xth index. col = df.iloc[:, x] # Iterate over all the columns in DataFrame from (x+1)th index till end for y in range(x + 1, df.shape[1]): # Select column at yth index. otherCol = df.iloc[:, y] # Check if two columns at x 7 y index are equal if col.equals(otherCol): duplicateColumnNames.add(df.columns.values[y]) return list(duplicateColumnNames) def main(): # List of Tuples students = [('jack', 34, 'Sydeny', 34, 'Sydeny', 34), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Aadi', 16, 'New York', 16, 'New York', 16), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Riti', 30, 'Delhi', 30, 'Delhi', 30), ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30), ('Aadi', 40, 'London', 40, 'London', 40), ('Sachin', 30, 'Delhi', 30, 'Delhi', 30) ] # Create a DataFrame object dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin']) print("Original Dataframe", dfObj, sep='\n') print('*** Find duplicate Columns in a DataFrame ***') # Get list of duplicate columns duplicateColumnNames = getDuplicateColumns(dfObj) print('Duplicate Columns are as follows') for col in duplicateColumnNames: print('Column name : ', col) print('*** Remove duplicate Columns in a DataFrame ***') # Delete duplicate columns newDf = dfObj.drop(columns=getDuplicateColumns(dfObj)) print("Modified Dataframe", newDf, sep='\n') if __name__ == '__main__': main()
Output:
Original Dataframe Name Age City Marks Address Pin 0 jack 34 Sydeny 34 Sydeny 34 1 Riti 30 Delhi 30 Delhi 30 2 Aadi 16 New York 16 New York 16 3 Riti 30 Delhi 30 Delhi 30 4 Riti 30 Delhi 30 Delhi 30 5 Riti 30 Mumbai 30 Mumbai 30 6 Aadi 40 London 40 London 40 7 Sachin 30 Delhi 30 Delhi 30 *** Find duplicate Columns in a DataFrame *** Duplicate Columns are as follows Column name : Address Column name : Pin Column name : Marks *** Remove duplicate Columns in a DataFrame *** Modified Dataframe Name Age City 0 jack 34 Sydeny 1 Riti 30 Delhi 2 Aadi 16 New York 3 Riti 30 Delhi 4 Riti 30 Delhi 5 Riti 30 Mumbai 6 Aadi 40 London 7 Sachin 30 Delhi