In this article, we will discuss how to select dataframe columns based on conditions i.e. either a single condition or multiple conditions.

Let’s start this with creating a dataframe first,

import pandas as pd

# List of Tuples
empoyees = [(11,  34, 78,  5,  11, 56),
            (12,  31, 98,  7,  34, 78),
            (13,  16, 11,  11, 56, 41) ,
            (89,  41, 12,  12, 78, 78)]


# Create a DataFrame object
df = pd.DataFrame(  empoyees, 
                    columns=['A', 'B', 'C', 'D', 'E', 'F'])

print("Contents of the Dataframe : ")
print(df)

Output:

Contents of the Dataframe : 
    A   B   C   D   E   F
0  11  34  78   5  11  56
1  12  31  98   7  34  78
2  13  16  11  11  56  41
3  89  41  12  12  78  78

Now we will explore, how to select columns from this dataframe based on conditions on the values of the columns.

Select columns based on conditions in Pandas Dataframe

To select columns based on conditions, we can use the loc[] attribute of the dataframe.

Overview of the loc[]

loc[row_section, column_section]
  • row_section: In the row_section pass ‘:’ to include all rows.
  • column_section: In the column section pass a bool series, which should be of the same size as number of columns of the dataframe. For True values in the bool series, corresponding columns gets selected.

It returns a dataframe with selected rows & columns based on selection criteria passed in the loc[]. Let’s understand with some examples,

Select dataframe columns based on bool series in the column section

We can pass a bool series in the column section of loc[], where each value in the bool series represents a column of the dataframe. For each True value in the series, loc[] selects the corresponding column. Let’s understand this with an example,

# For each True value in the bool list, select the 
# respective column from dataframe
sub_df = df.loc[: , [True, False, True, False, True, False]]

print(sub_df)

Output:

    A   C   E
0  11  78  11
1  12  98  34
2  13  11  56
3  89  12  78

Here, it selected columns ‘A’, ‘C’ and ‘E’ because for those columns corresponding value in the bool sequence was True.

Select dataframe columns which contains the given value

Now, suppose our condition is to select only those columns which has atleast one occurence of 11. To do that we need to create a bool sequence, which should contains the True for columns that has the value 11 and False for others. Then pass that bool sequence to loc[] to select columns which has the value 11 i.e.

# Select columns containing value 11
filter = (df == 11).any()
sub_df = df.loc[: , filter]

print(sub_df)

Output:

    A   C   D   E
0  11  78   5  11
1  12  98   7  34
2  13  11  11  56
3  89  12  12  78

It selected all the columns from dataframe which have the value 11.

Untangling (df == 11).any() :

(df == 11) returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is 11, otherwise it contains False. Then call any() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has one or more 11s.

Then we passed that bool sequence to column section of loc[] to select columns with value 11.

Select dataframe columns based on multiple conditions

Using the logic explained in previous example, we can select columns from a dataframe based on multiple condition. For example,

# Select columns which contains any value between 30 to 40
filter = ((df>=30) & (df<=40)).any()
sub_df = df.loc[: , filter]

print(sub_df)

Output:

    B   E
0  34  11
1  31  34
2  16  56
3  41  78

It selected all the columns from dataframe which have any value between 30 to 40.

The complete example is as follows,

import pandas as pd

# List of Tuples
empoyees = [(11,  34, 78,  5,  11, 56),
            (12,  31, 98,  7,  34, 78),
            (13,  16, 11,  11, 56, 41) ,
            (89,  41, 12,  12, 78, 78)]


# Create a DataFrame object
df = pd.DataFrame(  empoyees, 
                    columns=['A', 'B', 'C', 'D', 'E', 'F'])

print("Contents of the Dataframe : ")
print(df)

# For each True value in the bool list, select the 
# respective column from dataframe
sub_df = df.loc[: , [True, False, True, False, True, False]]

print(sub_df)


# Select columns containing value 11
filter = (df == 11).any()
sub_df = df.loc[: , filter]

print(sub_df)


# Select columns which contains any value between 30 to 40
filter = ((df>=30) & (df<=40)).any()
sub_df = df.loc[: , filter]

print(sub_df)

Output:

Contents of the Dataframe : 
    A   B   C   D   E   F
0  11  34  78   5  11  56
1  12  31  98   7  34  78
2  13  16  11  11  56  41
3  89  41  12  12  78  78
    A   C   E
0  11  78  11
1  12  98  34
2  13  11  56
3  89  12  78
    A   C   D   E
0  11  78   5  11
1  12  98   7  34
2  13  11  11  56
3  89  12  12  78
    B   E
0  34  11
1  31  34
2  16  56
3  41  78

Summary:

We learned how to select columns of a dataframe based on conditions on the column values.