In this article, we will discuss multiple ways to filter a pandas DataFrame using the “in” and “not in” like functions in SQL. These functions are very handy to filter any DataFrame based on some custom conditions or predecided values.
Table of Contents
Preparation of solution
To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Tech', 5), ('Riti', 'Data Engineer', 'Tech' , 7), ('Shanky', 'Program Manager', 'PMO' , 2), ('Shreya', 'Graphic Designer', 'Design' , 2), ('Aadi', 'Backend Developer', 'Tech', 11), ('Sim', 'Data Engineer', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'Team', 'Experience'], index=[0, 1, 2, 3, 4, 5]) print(df)
Contents of the created dataframe are,
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 5 Sim Data Engineer Tech 4
Let’s try to filter the DataFrame on the following values in the column “Designation”.
filter_values = ['Data Scientist', 'Data Engineer']
There are different ways to filter DataFrame like SQL. Let’s discuss them one by one.
Using pandas.DataFrame.isin() function
The pandas.DataFrame.isin() function is the most straightforward approach to filter any column using certain values. Let’s look below to filter the column “Designation” using the values defined.
Frequently Asked:
# filter rows containing values present in filter_values list print (df[df['Designation'].isin(filter_values)])
Output
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4
It returns all the rows containing the following values, the same as the “in” condition in SQL. Now, let’s take a look at the case where we don’t want these values, i.e., the “not in” condition in SQL.
# filter rows that do not contain values present in the filter_values list print (df[~df['Designation'].isin(filter_values)])
Output
Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
The “~” (negation operator) resulted in rows where the values ‘Data Scientist’ and ‘Data Engineer’ are not present. Is there another similar way to achieving this, by setting the “isin” condition as False.
# filter rows that does not contains values present in filter_values list print (df[df['Designation'].isin(filter_values) == False])
Output
Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
This will also result in the same output as the “not in” function in SQL.
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Tech', 5), ('Riti', 'Data Engineer', 'Tech' , 7), ('Shanky', 'Program Manager', 'PMO' , 2), ('Shreya', 'Graphic Designer', 'Design' , 2), ('Aadi', 'Backend Developer', 'Tech', 11), ('Sim', 'Data Engineer', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'Team', 'Experience'], index=[0, 1, 2, 3, 4, 5]) print(df) filter_values = ['Data Scientist', 'Data Engineer'] # filter rows containing values present in filter_values list print (df[df['Designation'].isin(filter_values)]) # filter rows that do not contain values present in the filter_values list print (df[~df['Designation'].isin(filter_values)]) # filter rows that does not contains values present in filter_values list print (df[df['Designation'].isin(filter_values) == False])
Output:
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 5 Sim Data Engineer Tech 4 Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4 Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
Using numpy.isin() function
The numpy.isin() function is also very similar to the pandas one, with some minor changes in the implementation. Let’s look at the code below to understand both “in” and “not in” equivalent operators.
# filter rows containing values present in filter_values list print (df[np.isin(df['Designation'], filter_values)])
Output
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4
As expected, it returns all the rows containing the selected values. To use the “not in” function, the np.isin function requires setting another attribute called “invert” as True.
# filter rows that do not contain values present in the filter_values list print (df[np.isin(df['Designation'], filter_values, invert=True)])
Output
Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
The output returns all the rows that do not contain any selected values.
Using query() method
In case you’re missing the SQL, we have a backup option for you as well. The query() function can help you use the “in” and “not in” in the same way as the general SQL query. Let’s take a look at the code below.
# filter rows containing values present in filter_values list print (df.query("Designation in @filter_values"))
Output
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4
Here, we just need to mention the column name, then the “in” or “not in” operator, and then the values list.
# filter rows that do not contain values present in the filter_values list print (df.query("Designation not in @filter_values"))
Output
Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
The output returns all the rows that do not contain any selected values.
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Tech', 5), ('Riti', 'Data Engineer', 'Tech' , 7), ('Shanky', 'Program Manager', 'PMO' , 2), ('Shreya', 'Graphic Designer', 'Design' , 2), ('Aadi', 'Backend Developer', 'Tech', 11), ('Sim', 'Data Engineer', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'Team', 'Experience'], index=[0, 1, 2, 3, 4, 5]) print(df) filter_values = ['Data Scientist', 'Data Engineer'] # filter rows containing values present in filter_values list print (df.query("Designation in @filter_values")) # filter rows that do not contain values present in the filter_values list print (df.query("Designation not in @filter_values"))
Output:
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 5 Sim Data Engineer Tech 4 Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4 Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
Using apply() function
The apply() function can be used to achieve almost every task in the data analysis world. Let’s take a look at the implementation below.
# filter rows containing values present in filter_values list print (df[df.apply(lambda x: x['Designation'] in filter_values, axis=1)])
Output
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 5 Sim Data Engineer Tech 4
Here, we are performing a row-wise check that whether the row values contain or do not contains values from the defined list (filter_values).
# filter rows that do not contain values present in the filter_values list print (df[df.apply(lambda x: x['Designation'] not in filter_values, axis=1)])
Output
Name Designation Team Experience 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11
Summary
In this article, we have discussed multiple ways to filter a pandas DataFrame using ‘in’ and ‘not in’ like in SQL. Thanks.