How to compare two columns in a pandas DataFrame?

In this article, we will discuss how to compare two columns in a pandas DataFrame.

Table of Contents

Preparing DataSet

To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.

import pandas as pd
import numpy as np

# List of Tuples
employees = [('Shubham', 'India', 'Tech', 5, 4),
            ('Riti', 'India', 'Design', 7, 7),
            ('Shanky', 'India', 'PMO', 2, 2),
            ('Shreya', 'India', 'Design', 2, 0),
            ('Aadi', 'US', 'PMO', 11, 5),
            ('Sim', 'US', 'Tech', 4, 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience', 'RelevantExperience'])
print(df)

Contents of the created dataframe are,

      Name Location    Team  Experience  RelevantExperience
0  Shubham    India    Tech           5                   4
1     Riti    India  Design           7                   7
2   Shanky    India     PMO           2                   2
3   Shreya    India  Design           2                   0
4     Aadi       US     PMO          11                   5
5      Sim       US    Tech           4                   4

Let’s say, we need to compare columns “Experience” and “RelevantExperience”, and check where these two column values are equal.

Using numpy.where() method

The numpy.where() is a very important and handy function to compare or create a new column based on some conditions on the existing columns. Let’s look at the function’s main attribute

np.where(condition, value if condition is True, value if condition is False)

Now let’s use that to compare the columns “Experience” and “RelevantExperience”.

# compare the columns
df['condition'] = np.where(df['Experience'] == df['RelevantExperience'], True, False)

print (df)

Output

      Name Location    Team  Experience  RelevantExperience  condition
0  Shubham    India    Tech           5                   4      False
1     Riti    India  Design           7                   7       True
2   Shanky    India     PMO           2                   2       True
3   Shreya    India  Design           2                   0      False
4     Aadi       US     PMO          11                   5      False
5      Sim       US    Tech           4                   4       True

We have used the equals condition between the two columns, which returns True if the condition is True and False if the condition is False.

Using numpy.select() method

The numpy.select() is a similar method but the implementation is slightly different. Here, we need to separately provide the conditions and choices (if the condition returns True), which are ultimately put into the numpy.select method. Let’s understand with the same example.

# compare the columns
conditions = [
    (df['Experience'] == df['RelevantExperience'])
]

choices = [True]

# default is the value if none of the conditions satisfies
df['condition'] = np.select(conditions, choices, default=False)

print (df)

Output

      Name Location    Team  Experience  RelevantExperience  condition
0  Shubham    India    Tech           5                   4      False
1     Riti    India  Design           7                   7       True
2   Shanky    India     PMO           2                   2       True
3   Shreya    India  Design           2                   0      False
4     Aadi       US     PMO          11                   5      False
5      Sim       US    Tech           4                   4       True

This also yields a similar output as the above method. However, this is more used when there are many conditions that might make the numpy.where method a little difficult to understand.

Using apply() method

We can also use the apply() method to compare two different columns. Let’s understand with the same example as above.

# using apply function
df['condition']=df.apply(lambda x: True if x['Experience'] == x['RelevantExperience'] else False, axis=1)

print (df)

Output

      Name Location    Team  Experience  RelevantExperience  condition
0  Shubham    India    Tech           5                   4      False
1     Riti    India  Design           7                   7       True
2   Shanky    India     PMO           2                   2       True
3   Shreya    India  Design           2                   0      False
4     Aadi       US     PMO          11                   5      False
5      Sim       US    Tech           4                   4       True

In case of complicated conditions, we can create a custom function and feed that into the apply method to keep the code clean and easy to understand.

Using equals() method

The equals method is generally used to check if any two Series or DataFrames have the same shape and values. It gives an overall comparison rather than the row-wise comparison that we have seen in all the above methods.

# check if the columns are equals
print (df['Experience'].equals(df['RelevantExperience']))

Output

False

This gives a single-value boolean output, False meaning that all values within these two columns are not the same.

Summary

Today, we learned how to compare two columns in a pandas DataFrame.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top