How to Find & Drop duplicate columns in a DataFrame | Python Pandas

In this article we will discuss how to find duplicate columns in a Pandas DataFrame and drop them.

In Python’s pandas library there are direct APIs to find out the duplicate rows, but there is no direct API to find the duplicate columns.
So, we have to build our API for that.

First of all, create a DataFrame with duplicate columns i.e.

# List of Tuples
students = [('jack', 34, 'Sydeny', 34, 'Sydeny', 34),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Aadi', 16, 'New York', 16, 'New York', 16),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30),
            ('Aadi', 40, 'London', 40, 'London', 40),
            ('Sachin', 30, 'Delhi', 30, 'Delhi', 30)
            ]

# Create a DataFrame object
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin'])

print("Original Dataframe", dfObj, sep='\n')

Contents of the DataFrame created are,

     Name  Age      City  Marks   Address  Pin
0    jack   34    Sydeny     34    Sydeny   34
1    Riti   30     Delhi     30     Delhi   30
2    Aadi   16  New York     16  New York   16
3    Riti   30     Delhi     30     Delhi   30
4    Riti   30     Delhi     30     Delhi   30
5    Riti   30    Mumbai     30    Mumbai   30
6    Aadi   40    London     40    London   40
7  Sachin   30     Delhi     30     Delhi   30

 

Now as we can observer there are 3 duplicate columns in this DataFrame i.e. Marks, Address & Pin. Let’s see how to find them.

Find duplicate columns in a DataFrame

To find these duplicate columns we need to iterate over DataFrame column wise and for every column it will search if any other column exists in DataFrame with same contents. If yes then then that column name will be stored in duplicate column list. In the end API will return the list of column names of duplicate columns i.e.

def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It will iterate over all the columns in dataframe and find the columns whose contents are duplicate.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns in dataframe
    for x in range(df.shape[1]):
        # Select column at xth index.
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index.
            otherCol = df.iloc[:, y]
            # Check if two columns at x 7 y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])

    return list(duplicateColumnNames)

Now let’s use this API to find the duplicate columns in above created DataFrame object dfObj i.e.

# Get list of duplicate columns
duplicateColumnNames = getDuplicateColumns(dfObj)

print('Duplicate Columns are as follows')
for col in duplicateColumnNames:
    print('Column name : ', col)

Output:

Duplicate Columns are as follows
Column name :  Address
Column name :  Marks
Column name :  Pin

Drop duplicate columns in a DataFrame

To remove the duplicate columns we can pass the list of duplicate column’s names returned by our API to the dataframe.drop() i.e.

# Delete duplicate columns
newDf = dfObj.drop(columns=getDuplicateColumns(dfObj))

print("Modified Dataframe", newDf, sep='\n')

Output:

Modified Dataframe
     Name  Age      City
0    jack   34    Sydeny
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi

It will return a copy of existing DataFrame without duplicate columns.

Complete example is as follows,

import pandas as pd


def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It will iterate over all the columns in dataframe and find the columns whose contents are duplicate.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns in dataframe
    for x in range(df.shape[1]):
        # Select column at xth index.
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index.
            otherCol = df.iloc[:, y]
            # Check if two columns at x 7 y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])

    return list(duplicateColumnNames)


def main():
    # List of Tuples
    students = [('jack', 34, 'Sydeny', 34, 'Sydeny', 34),
                ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
                ('Aadi', 16, 'New York', 16, 'New York', 16),
                ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
                ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
                ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30),
                ('Aadi', 40, 'London', 40, 'London', 40),
                ('Sachin', 30, 'Delhi', 30, 'Delhi', 30)
                ]

    # Create a DataFrame object
    dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin'])

    print("Original Dataframe", dfObj, sep='\n')

    print('*** Find duplicate Columns in a DataFrame ***')

    # Get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(dfObj)

    print('Duplicate Columns are as follows')
    for col in duplicateColumnNames:
        print('Column name : ', col)

    print('*** Remove duplicate Columns in a DataFrame ***')

    # Delete duplicate columns
    newDf = dfObj.drop(columns=getDuplicateColumns(dfObj))

    print("Modified Dataframe", newDf, sep='\n')

if __name__ == '__main__':
    main()

Output:

Original Dataframe
     Name  Age      City  Marks   Address  Pin
0    jack   34    Sydeny     34    Sydeny   34
1    Riti   30     Delhi     30     Delhi   30
2    Aadi   16  New York     16  New York   16
3    Riti   30     Delhi     30     Delhi   30
4    Riti   30     Delhi     30     Delhi   30
5    Riti   30    Mumbai     30    Mumbai   30
6    Aadi   40    London     40    London   40
7  Sachin   30     Delhi     30     Delhi   30
*** Find duplicate Columns in a DataFrame ***
Duplicate Columns are as follows
Column name :  Address
Column name :  Pin
Column name :  Marks
*** Remove duplicate Columns in a DataFrame ***
Modified Dataframe
     Name  Age      City
0    jack   34    Sydeny
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi

 

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top