Filter Pandas dataframe using ‘in’ & ‘not in’ like SQL

In this article, we will discuss multiple ways to filter a pandas DataFrame using the “in” and “not in” like functions in SQL. These functions are very handy to filter any DataFrame based on some custom conditions or predecided values.

Table of Contents

Preparation of solution

To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.

import pandas as pd
import numpy as np

# List of Tuples
employees= [('Shubham', 'Data Scientist', 'Tech',   5),
            ('Riti', 'Data Engineer', 'Tech' ,   7),
            ('Shanky', 'Program Manager', 'PMO' ,   2),
            ('Shreya', 'Graphic Designer', 'Design' ,   2),
            ('Aadi', 'Backend Developer', 'Tech', 11),
            ('Sim', 'Data Engineer', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'Team', 'Experience'],
                  index=[0, 1, 2, 3, 4, 5])
print(df)

Contents of the created dataframe are,

      Name        Designation    Team  Experience
0  Shubham     Data Scientist    Tech           5
1     Riti      Data Engineer    Tech           7
2   Shanky    Program Manager     PMO           2
3   Shreya   Graphic Designer  Design           2
4     Aadi  Backend Developer    Tech          11
5      Sim      Data Engineer    Tech           4

Let’s try to filter the DataFrame on the following values in the column “Designation”.

Advertisements
filter_values = ['Data Scientist', 'Data Engineer']

There are different ways to filter DataFrame like SQL. Let’s discuss them one by one.

Using pandas.DataFrame.isin() function

The pandas.DataFrame.isin() function is the most straightforward approach to filter any column using certain values. Let’s look below to filter the column “Designation” using the values defined.

# filter rows containing values present in filter_values list
print (df[df['Designation'].isin(filter_values)])

Output

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

It returns all the rows containing the following values, the same as the “in” condition in SQL. Now, let’s take a look at the case where we don’t want these values, i.e., the “not in” condition in SQL.

# filter rows that do not contain values present in the filter_values list
print (df[~df['Designation'].isin(filter_values)])

Output

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

The “~” (negation operator) resulted in rows where the values ‘Data Scientist’ and ‘Data Engineer’ are not present. Is there another similar way to achieving this, by setting the “isin” condition as False.

# filter rows that does not contains values present in filter_values list
print (df[df['Designation'].isin(filter_values) == False])

Output

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

This will also result in the same output as the “not in” function in SQL.

The complete example is as follows,

import pandas as pd
import numpy as np

# List of Tuples
employees= [('Shubham', 'Data Scientist', 'Tech',   5),
            ('Riti', 'Data Engineer', 'Tech' ,   7),
            ('Shanky', 'Program Manager', 'PMO' ,   2),
            ('Shreya', 'Graphic Designer', 'Design' ,   2),
            ('Aadi', 'Backend Developer', 'Tech', 11),
            ('Sim', 'Data Engineer', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'Team', 'Experience'],
                  index=[0, 1, 2, 3, 4, 5])
print(df)

filter_values = ['Data Scientist', 'Data Engineer']

# filter rows containing values present in filter_values list
print (df[df['Designation'].isin(filter_values)])

# filter rows that do not contain values present in the filter_values list
print (df[~df['Designation'].isin(filter_values)])

# filter rows that does not contains values present in filter_values list
print (df[df['Designation'].isin(filter_values) == False])

Output:

      Name        Designation    Team  Experience
0  Shubham     Data Scientist    Tech           5
1     Riti      Data Engineer    Tech           7
2   Shanky    Program Manager     PMO           2
3   Shreya   Graphic Designer  Design           2
4     Aadi  Backend Developer    Tech          11
5      Sim      Data Engineer    Tech           4

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

Using numpy.isin() function

The numpy.isin() function is also very similar to the pandas one, with some minor changes in the implementation. Let’s look at the code below to understand both “in” and “not in” equivalent operators.

# filter rows containing values present in filter_values list
print (df[np.isin(df['Designation'], filter_values)])

Output

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

As expected, it returns all the rows containing the selected values. To use the “not in” function, the np.isin function requires setting another attribute called “invert” as True.

# filter rows that do not contain values present in the filter_values list
print (df[np.isin(df['Designation'], filter_values, invert=True)])

Output

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

The output returns all the rows that do not contain any selected values.

Using query() method

In case you’re missing the SQL, we have a backup option for you as well. The query() function can help you use the “in” and “not in” in the same way as the general SQL query. Let’s take a look at the code below.

# filter rows containing values present in filter_values list
print (df.query("Designation in @filter_values"))

Output

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

Here, we just need to mention the column name, then the “in” or “not in” operator, and then the values list.

# filter rows that do not contain values present in the filter_values list
print (df.query("Designation not in @filter_values"))

Output

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

The output returns all the rows that do not contain any selected values.

The complete example is as follows,

import pandas as pd
import numpy as np

# List of Tuples
employees= [('Shubham', 'Data Scientist', 'Tech',   5),
            ('Riti', 'Data Engineer', 'Tech' ,   7),
            ('Shanky', 'Program Manager', 'PMO' ,   2),
            ('Shreya', 'Graphic Designer', 'Design' ,   2),
            ('Aadi', 'Backend Developer', 'Tech', 11),
            ('Sim', 'Data Engineer', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'Team', 'Experience'],
                  index=[0, 1, 2, 3, 4, 5])
print(df)

filter_values = ['Data Scientist', 'Data Engineer']

# filter rows containing values present in filter_values list
print (df.query("Designation in @filter_values"))

# filter rows that do not contain values present in the filter_values list
print (df.query("Designation not in @filter_values"))

Output:

      Name        Designation    Team  Experience
0  Shubham     Data Scientist    Tech           5
1     Riti      Data Engineer    Tech           7
2   Shanky    Program Manager     PMO           2
3   Shreya   Graphic Designer  Design           2
4     Aadi  Backend Developer    Tech          11
5      Sim      Data Engineer    Tech           4

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

Using apply() function

The apply() function can be used to achieve almost every task in the data analysis world. Let’s take a look at the implementation below.

# filter rows containing values present in filter_values list
print (df[df.apply(lambda x: x['Designation'] in filter_values, axis=1)])

Output

      Name     Designation  Team  Experience
0  Shubham  Data Scientist  Tech           5
1     Riti   Data Engineer  Tech           7
5      Sim   Data Engineer  Tech           4

Here, we are performing a row-wise check that whether the row values contain or do not contains values from the defined list (filter_values).

# filter rows that do not contain values present in the filter_values list
print (df[df.apply(lambda x: x['Designation'] not in filter_values, axis=1)])

Output

     Name        Designation    Team  Experience
2  Shanky    Program Manager     PMO           2
3  Shreya   Graphic Designer  Design           2
4    Aadi  Backend Developer    Tech          11

Summary

In this article, we have discussed multiple ways to filter a pandas DataFrame using ‘in’ and ‘not in’ like in SQL. Thanks.

Pandas Tutorials -Learn Data Analysis with Python

   

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

Leave a Comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top