Select Rows from Pandas DataFrame based on column values

In this article, we will discuss different scenarios to select rows from a Pandas DataFrame based on the column values.

Table of Contents

Introduction

Let’s create a sample dataframe to experiment with different scenarios. We’ll use the pandas library with some random data.

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney',   5),
            ('Riti', 'Data Analyst', 'Delhi' ,   7),
            ('Shanky', 'Program Manager', 'Delhi' ,   2),
            ('Shreya', 'Graphic Designer', 'Mumbai' ,   2),
            ('Aadi', 'Data Engineering', 'New York', 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=['a', 'b', 'c', 'd', 'e'])
print(df)

Contents of the created dataframe are,

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
c   Shanky   Program Manager     Delhi           2
d   Shreya  Graphic Designer    Mumbai           2
e     Aadi  Data Engineering  New York          11

Now, let’s look at different scenarios in which we could select rows based on the column values.

Select DaraFrame Rows based on a specific value(s)

In this scenario, let’s say, we need to select rows from the dataframe based on some specific column values. For example, we need to select rows of the employees having experience of more than 4 years.

# filter employees having Experience > 4
filterDf = df[df['Experience'] > 4]

print (filterDf)

Output

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
e     Aadi  Data Engineering  New York          11

We have filtered all the three employees having experience more than 4 years. Now, in case, if we need to select the employees having experience of more than 4 years and less than or equal to 10 years, we would need to include multiple conditions as shown below.

# filter employees having Experience > 4 and Experience < 10
filterDf = df[(df['Experience'] > 4) & (df['Experience'] <= 10)]

print (filterDf)

Output

      Name     Designation    City  Experience
a  Shubham  Data Scientist  Sydney           5
b     Riti    Data Analyst   Delhi           7

Note: We could use pipe operator “|” in case of “OR” conditions.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney',   5),
            ('Riti', 'Data Analyst', 'Delhi' ,   7),
            ('Shanky', 'Program Manager', 'Delhi' ,   2),
            ('Shreya', 'Graphic Designer', 'Mumbai' ,   2),
            ('Aadi', 'Data Engineering', 'New York', 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=['a', 'b', 'c', 'd', 'e'])
print(df)

# filter employees having Experience > 4
filterDf = df[df['Experience'] > 4]

print (filterDf)

# filter employees having Experience > 4 and Experience < 10
filterDf = df[(df['Experience'] > 4) & (df['Experience'] <= 10)]

print (filterDf)

Output:

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
c   Shanky   Program Manager     Delhi           2
d   Shreya  Graphic Designer    Mumbai           2
e     Aadi  Data Engineering  New York          11

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
e     Aadi  Data Engineering  New York          11

      Name     Designation    City  Experience
a  Shubham  Data Scientist  Sydney           5
b     Riti    Data Analyst   Delhi           7

Select DataFrame Rows based on a multiple values

In this scenario, we need to select the rows in dataframe based on a list of column values. For example, we need to select the employees in “Delhi” and “Mumbai” cities. One solution could be to use multiple AND/OR operators, but it would get convoluted with long list of values; which is where “isin” operator comes in handy.

# filter employees in "Delhi" and "Mumbai" cities
filterDf = df[df['City'].isin(['Mumbai', 'Delhi'])]

print (filterDf)

Output

     Name       Designation    City  Experience
b    Riti      Data Analyst   Delhi           7
c  Shanky   Program Manager   Delhi           2
d  Shreya  Graphic Designer  Mumbai           2

We can again include multiple “isin” conditions using the AND/OR operators.

Select DataFrame Rows containing partial string or substring

In this scenario, we need to filter the dataframe based on partial string matching. For example, we need to select the employees working in “Data” field (information from the Designation).

# filter employees having "Data" in their designation
filterDf = df[df['Designation'].str.contains("Data")]

print (filterDf)

Output

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
e     Aadi  Data Engineering  New York          11

Select DataFrame Rows using the query method

The newly introduced query method for pandas dataframe is a bonus for all SQL lovers. The syntax is quite similar to standard SQL query, and it is very efficient for the large dataframes. Let’s take a quick example – select all the employees from “Delhi”.

# filter employees from "Delhi"
filterDf = df.query("City == 'Delhi'")

print (filterDf)

Output

     Name      Designation   City  Experience
b    Riti     Data Analyst  Delhi           7
c  Shanky  Program Manager  Delhi           2

Important Note: Please make sure you are using pandas version > 0.25.0 for using the query method.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney',   5),
            ('Riti', 'Data Analyst', 'Delhi' ,   7),
            ('Shanky', 'Program Manager', 'Delhi' ,   2),
            ('Shreya', 'Graphic Designer', 'Mumbai' ,   2),
            ('Aadi', 'Data Engineering', 'New York', 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=['a', 'b', 'c', 'd', 'e'])
print(df)

# filter employees in "Delhi" and "Mumbai" cities
filterDf = df[df['City'].isin(['Mumbai', 'Delhi'])]

print (filterDf)

# filter employees having "Data" in their designation
filterDf = df[df['Designation'].str.contains("Data")]

print (filterDf)

# filter employees from "Delhi"
filterDf = df.query("City == 'Delhi'")

print (filterDf)

Output

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
c   Shanky   Program Manager     Delhi           2
d   Shreya  Graphic Designer    Mumbai           2
e     Aadi  Data Engineering  New York          11

     Name       Designation    City  Experience
b    Riti      Data Analyst   Delhi           7
c  Shanky   Program Manager   Delhi           2
d  Shreya  Graphic Designer  Mumbai           2

      Name       Designation      City  Experience
a  Shubham    Data Scientist    Sydney           5
b     Riti      Data Analyst     Delhi           7
e     Aadi  Data Engineering  New York          11

     Name      Designation   City  Experience
b    Riti     Data Analyst  Delhi           7
c  Shanky  Program Manager  Delhi           2

Summary

Great, you made it! In this article, we have discussed multiple scenarios to select rows from a pandas dataframe based on the column values. Thanks.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top