In this article, we will discuss how to select dataframe columns which contains a given string or a sub-string.

Table of Contents

Let’s start this with creating a dataframe first,

import pandas as pd

# List of Tuples
data = [('FA',  'AA', 'SJ',  'AS', 'B1'),
        ('AL',  'BB', 'KL',  'AX', 'AA'),
        ('AS',  'AC', 'AA',  'AY', 'TT'),
        ('TT',  'AB', 'AB' , 'AZ', 'AX')]


# Create a DataFrame object
df = pd.DataFrame(  data, 
                    columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])

print("Contents of the Dataframe : ")
print(df)

Output:

Contents of the Dataframe : 
  COL1 COL2 COL3 COL4 COL5
0   FA   AA   SJ   AS   B1
1   AL   BB   KL   AX   AA
2   AS   AC   AA   AY   TT
3   TT   AB   AB   AZ   AX

Now we will explore, how to select columns from this dataframe which contains a given string or sub-string

To select columns which contains or do not contains a given string, we can use the loc[] attribute of the dataframe.

Overview of the loc[]

loc[row_section, column_section]
  • row_section: In the row_section pass ‘:’ to include all rows.
  • column_section: In the column section pass a bool series, which should be of the same size as number of columns of the dataframe. For True values in the bool series, corresponding columns gets selected.

It returns a dataframe with selected rows & columns based on selection criteria passed in the loc[]. Let’s understand with some examples,

Select columns containing a string in Pandas Dataframe

Here we want to select only those columns from dataframe, which has atleast one occurence of the given string. To do that we need to create a bool sequence, which should contains the True for columns that has the given string and False for others. Then pass that bool sequence to loc[] to select columns which has the given string i.e.

# Select columns that contains the string 'AA'
sub_df = df.loc[: , (df == 'AA').any()]

print(sub_df)

Output:

  COL2 COL3 COL5
0   AA   SJ   B1
1   BB   KL   AA
2   AC   AA   TT
3   AB   AB   AX

It selected all the columns from dataframe which had the given string.

Untangling (df == ‘AA’).any() :

(df == ‘AA’) returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is ‘AA’, otherwise it contains False. Then call any() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has one or more ‘AA’s.

Then we passed that bool sequence to column section of loc[] to select columns with any occurence of ‘AA’.

Pandas: Select dataframe columns not containing the string

Using the logic explained in previous example, we can select columns from a dataframe based which do not contains the given string. For that we just need to negate the filter expression, we created in last example i.e.

# Select columns which do not contains the string 'AA'
sub_df = df.loc[: , (df != 'AA').all()]

print(sub_df)

Output:

  COL1 COL4
0   FA   AS
1   AL   AX
2   AS   AY
3   TT   AZ

It selected all the columns from dataframe which do not have a single occurrence of string ‘AA’.

Select columns a containing sub-string in Pandas Dataframe

To select all those columns from a dataframe which contains a given sub-string, we need to apply a function on each column. Then check if column contains the given sub-string or not, if yes then mark True in the boolean sequence, otherwise False. Then pass this Boolean sequence to loc[] to select all columns containing the given sub-string. For example,

# Get a bool series, where each value represents a columns
# If value is True that means that column contains sub-string X
mask = df.apply(lambda col: col.str.contains('X').any(),
                axis=0)

# Select columns which contains a sub-string 'X'
sub_df = df.loc[: , mask]

print(sub_df)

Output:

  COL4 COL5
0   AS   B1
1   AX   AA
2   AY   TT
3   AZ   AX

It selected all the columns from dataframe which contains a sub-string in values.

The complete example is as follows,

import pandas as pd

# List of Tuples
data = [('FA',  'AA', 'SJ',  'AS', 'B1'),
        ('AL',  'BB', 'KL',  'AX', 'AA'),
        ('AS',  'AC', 'AA',  'AY', 'TT'),
        ('TT',  'AB', 'AB' , 'AZ', 'AX')]


# Create a DataFrame object
df = pd.DataFrame(  data, 
                    columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])

print("Contents of the Dataframe : ")
print(df)

# Select columns that contains the string 'AA'
sub_df = df.loc[: , (df == 'AA').any()]

print(sub_df)


# Select columns which do not contains the string 'AA'
sub_df = df.loc[: , (df != 'AA').all()]

print(sub_df)


# Get a bool series, where each value represents a columns
# If value is True that means that column contains sub-string X
mask = df.apply(lambda col: col.str.contains('X').any(),
                axis=0)

# Select columns which contains a sub-string 'X'
sub_df = df.loc[: , mask]

print(sub_df)

Output:

Contents of the Dataframe : 
  COL1 COL2 COL3 COL4 COL5
0   FA   AA   SJ   AS   B1
1   AL   BB   KL   AX   AA
2   AS   AC   AA   AY   TT
3   TT   AB   AB   AZ   AX
  COL2 COL3 COL5
0   AA   SJ   B1
1   BB   KL   AA
2   AC   AA   TT
3   AB   AB   AX
  COL1 COL4
0   FA   AS
1   AL   AX
2   AS   AY
3   TT   AZ
  COL4 COL5
0   AS   B1
1   AX   AA
2   AY   TT
3   AZ   AX

Summary:

We learned how to select columns of a dataframe which contains a given sub-string.