In this article, we will discuss how to select dataframe columns based on conditions i.e. either a single condition or multiple conditions.
Let’s start this with creating a dataframe first,
import pandas as pd # List of Tuples empoyees = [(11, 34, 78, 5, 11, 56), (12, 31, 98, 7, 34, 78), (13, 16, 11, 11, 56, 41) , (89, 41, 12, 12, 78, 78)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F']) print("Contents of the Dataframe : ") print(df)
Output:
Contents of the Dataframe : A B C D E F 0 11 34 78 5 11 56 1 12 31 98 7 34 78 2 13 16 11 11 56 41 3 89 41 12 12 78 78
Now we will explore, how to select columns from this dataframe based on conditions on the values of the columns.
Select columns based on conditions in Pandas Dataframe
To select columns based on conditions, we can use the loc[] attribute of the dataframe.
Overview of the loc[]
loc[row_section, column_section]
- row_section: In the row_section pass ‘:’ to include all rows.
- column_section: In the column section pass a bool series, which should be of the same size as number of columns of the dataframe. For True values in the bool series, corresponding columns gets selected.
It returns a dataframe with selected rows & columns based on selection criteria passed in the loc[]. Let’s understand with some examples,
Select dataframe columns based on bool series in the column section
We can pass a bool series in the column section of loc[], where each value in the bool series represents a column of the dataframe. For each True value in the series, loc[] selects the corresponding column. Let’s understand this with an example,
# For each True value in the bool list, select the # respective column from dataframe sub_df = df.loc[: , [True, False, True, False, True, False]] print(sub_df)
Output:
A C E 0 11 78 11 1 12 98 34 2 13 11 56 3 89 12 78
Here, it selected columns ‘A’, ‘C’ and ‘E’ because for those columns corresponding value in the bool sequence was True.
Select dataframe columns which contains the given value
Now, suppose our condition is to select only those columns which has atleast one occurence of 11. To do that we need to create a bool sequence, which should contains the True for columns that has the value 11 and False for others. Then pass that bool sequence to loc[] to select columns which has the value 11 i.e.
# Select columns containing value 11 filter = (df == 11).any() sub_df = df.loc[: , filter] print(sub_df)
Output:
A C D E 0 11 78 5 11 1 12 98 7 34 2 13 11 11 56 3 89 12 12 78
It selected all the columns from dataframe which have the value 11.
Untangling (df == 11).any() :
(df == 11) returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is 11, otherwise it contains False. Then call any() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has one or more 11s.
Then we passed that bool sequence to column section of loc[] to select columns with value 11.
Select dataframe columns based on multiple conditions
Using the logic explained in previous example, we can select columns from a dataframe based on multiple condition. For example,
# Select columns which contains any value between 30 to 40 filter = ((df>=30) & (df<=40)).any() sub_df = df.loc[: , filter] print(sub_df)
Output:
B E 0 34 11 1 31 34 2 16 56 3 41 78
It selected all the columns from dataframe which have any value between 30 to 40.
The complete example is as follows,
import pandas as pd # List of Tuples empoyees = [(11, 34, 78, 5, 11, 56), (12, 31, 98, 7, 34, 78), (13, 16, 11, 11, 56, 41) , (89, 41, 12, 12, 78, 78)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F']) print("Contents of the Dataframe : ") print(df) # For each True value in the bool list, select the # respective column from dataframe sub_df = df.loc[: , [True, False, True, False, True, False]] print(sub_df) # Select columns containing value 11 filter = (df == 11).any() sub_df = df.loc[: , filter] print(sub_df) # Select columns which contains any value between 30 to 40 filter = ((df>=30) & (df<=40)).any() sub_df = df.loc[: , filter] print(sub_df)
Output:
Contents of the Dataframe : A B C D E F 0 11 34 78 5 11 56 1 12 31 98 7 34 78 2 13 16 11 11 56 41 3 89 41 12 12 78 78 A C E 0 11 78 11 1 12 98 34 2 13 11 56 3 89 12 78 A C D E 0 11 78 5 11 1 12 98 7 34 2 13 11 11 56 3 89 12 12 78 B E 0 34 11 1 31 34 2 16 56 3 41 78
Summary:
We learned how to select columns of a dataframe based on conditions on the column values.
Pandas Tutorials -Learn Data Analysis with Python
-
Pandas Tutorial Part #1 - Introduction to Data Analysis with Python
-
Pandas Tutorial Part #2 - Basics of Pandas Series
-
Pandas Tutorial Part #3 - Get & Set Series values
-
Pandas Tutorial Part #4 - Attributes & methods of Pandas Series
-
Pandas Tutorial Part #5 - Add or Remove Pandas Series elements
-
Pandas Tutorial Part #6 - Introduction to DataFrame
-
Pandas Tutorial Part #7 - DataFrame.loc[] - Select Rows / Columns by Indexing
-
Pandas Tutorial Part #8 - DataFrame.iloc[] - Select Rows / Columns by Label Names
-
Pandas Tutorial Part #9 - Filter DataFrame Rows
-
Pandas Tutorial Part #10 - Add/Remove DataFrame Rows & Columns
-
Pandas Tutorial Part #11 - DataFrame attributes & methods
-
Pandas Tutorial Part #12 - Handling Missing Data or NaN values
-
Pandas Tutorial Part #13 - Iterate over Rows & Columns of DataFrame
-
Pandas Tutorial Part #14 - Sorting DataFrame by Rows or Columns
-
Pandas Tutorial Part #15 - Merging or Concatenating DataFrames
-
Pandas Tutorial Part #16 - DataFrame GroupBy explained with examples
Are you looking to make a career in Data Science with Python?
Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.
Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.
Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.