In this article, we will discuss how to compare two columns in a pandas DataFrame.
Table of Contents
Preparing DataSet
To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.
import pandas as pd import numpy as np # List of Tuples employees = [('Shubham', 'India', 'Tech', 5, 4), ('Riti', 'India', 'Design', 7, 7), ('Shanky', 'India', 'PMO', 2, 2), ('Shreya', 'India', 'Design', 2, 0), ('Aadi', 'US', 'PMO', 11, 5), ('Sim', 'US', 'Tech', 4, 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Location', 'Team', 'Experience', 'RelevantExperience']) print(df)
Contents of the created dataframe are,
Name Location Team Experience RelevantExperience 0 Shubham India Tech 5 4 1 Riti India Design 7 7 2 Shanky India PMO 2 2 3 Shreya India Design 2 0 4 Aadi US PMO 11 5 5 Sim US Tech 4 4
Let’s say, we need to compare columns “Experience” and “RelevantExperience”, and check where these two column values are equal.
Using numpy.where() method
The numpy.where() is a very important and handy function to compare or create a new column based on some conditions on the existing columns. Let’s look at the function’s main attribute
np.where(condition, value if condition is True, value if condition is False)
Now let’s use that to compare the columns “Experience” and “RelevantExperience”.
Frequently Asked:
# compare the columns df['condition'] = np.where(df['Experience'] == df['RelevantExperience'], True, False) print (df)
Output
Name Location Team Experience RelevantExperience condition 0 Shubham India Tech 5 4 False 1 Riti India Design 7 7 True 2 Shanky India PMO 2 2 True 3 Shreya India Design 2 0 False 4 Aadi US PMO 11 5 False 5 Sim US Tech 4 4 True
We have used the equals condition between the two columns, which returns True if the condition is True and False if the condition is False.
Using numpy.select() method
The numpy.select() is a similar method but the implementation is slightly different. Here, we need to separately provide the conditions and choices (if the condition returns True), which are ultimately put into the numpy.select method. Let’s understand with the same example.
# compare the columns conditions = [ (df['Experience'] == df['RelevantExperience']) ] choices = [True] # default is the value if none of the conditions satisfies df['condition'] = np.select(conditions, choices, default=False) print (df)
Output
Name Location Team Experience RelevantExperience condition 0 Shubham India Tech 5 4 False 1 Riti India Design 7 7 True 2 Shanky India PMO 2 2 True 3 Shreya India Design 2 0 False 4 Aadi US PMO 11 5 False 5 Sim US Tech 4 4 True
This also yields a similar output as the above method. However, this is more used when there are many conditions that might make the numpy.where method a little difficult to understand.
Using apply() method
We can also use the apply() method to compare two different columns. Let’s understand with the same example as above.
# using apply function df['condition']=df.apply(lambda x: True if x['Experience'] == x['RelevantExperience'] else False, axis=1) print (df)
Output
Name Location Team Experience RelevantExperience condition 0 Shubham India Tech 5 4 False 1 Riti India Design 7 7 True 2 Shanky India PMO 2 2 True 3 Shreya India Design 2 0 False 4 Aadi US PMO 11 5 False 5 Sim US Tech 4 4 True
In case of complicated conditions, we can create a custom function and feed that into the apply method to keep the code clean and easy to understand.
Using equals() method
The equals method is generally used to check if any two Series or DataFrames have the same shape and values. It gives an overall comparison rather than the row-wise comparison that we have seen in all the above methods.
# check if the columns are equals print (df['Experience'].equals(df['RelevantExperience']))
Output
False
This gives a single-value boolean output, False meaning that all values within these two columns are not the same.
Summary
Today, we learned how to compare two columns in a pandas DataFrame.