In this article, we will discuss how to select dataframe columns which contains the NaN values (any, all or None).
Table of contents:
- Select dataframe columns with any NaN values.
- Select dataframe columns with all NaN values.
- Select dataframe columns without a NaN value.
Let’s first create a dataframe and then we will see how to select columns from it based on the NaN values,
import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', np.NaN, 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', np.NaN, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df)
Output:
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti NaN 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London NaN 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0
Now let’s see how to select columns from the above dataframe containing NaN values,
Pandas: Find Dataframe columns with any NaN value
To select the columns with any NaN value, use the loc[] attribute of the dataframe i.e.
loc[row_section, column_section]
- row_section: In the row_section pass ‘:’ to include all rows.
- column_section: In the column section pass a bool series, which should be of same size as number of columns of the dataframe. If any value in bool series is True then it means that column contains any NaN value.
For example,
# Select dataframe columns with any NaN value subset_df = df.loc[:, df.isnull().any()] print(subset_df)
Output
B E G H I 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN 3.0 NaN 3 NaN NaN NaN 11.0 1.0
It returned a dataframe with only those columns from the original dataframe, which contains any NaN value.
This one-liner solution seems a little complex. So, let’s break this down a little to understand how it is works.
Why did we passed df.isnull().any() in the column section of loc[]?
df.isnull().any() returns a bool series and each value in the bool series represents a column. If any value in bool series is True then it means that corresponding column has any NaN value in it. By passing this in the column_section of loc[], we made sure that loc[] selects only the columns containing any NaN value.
How df.isnull().any() works?
First of all, call isNull() function on the dataframe object, it returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is NaN, otherwise it contains False. Then call any() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has any NaN value.
Then pass that bool series to the column section of loc[], it selects only those dataframe columns which has any NaN value.
Pandas: Find Dataframe columns with all NaN values.
To select the columns with all NaN value, use the loc[] attribute of the dataframe i.e.
loc[row_section, column_section]
- row_section: In the row_section pass ‘:’ to include all rows.
- column_section: In the column section pass a bool series, which should be of same size as number of columns of the dataframe. If all value in bool series is True then it means, that respective column contains all NaN values.
For example,
# Select dataframe columns with all NaN values subset_df = df.loc[:, df.isnull().all()] print(subset_df)
Output
B E G 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN
It returned a dataframe with only those columns from the original dataframe, which contains only NaN values.
This one-liner solution seems a little complex. So, let’s break this down a little to understand how it is works.
Why did we passed df.isnull().all() in the column section of loc[]?
df.isnull().all() returns a bool series and each value in the bool series represents a column. If all value in bool series is True then it means that corresponding column has all NaN value in it. By passing this in the column_section of loc[], we made sure that loc[] selects only the columns containing only NaN values.
How df.isnull().all() works?
First of all, call isNull() function on the dataframe object, it returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is NaN, otherwise it contains False. Then call all() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has only NaN values in it.
Then pass that bool series to the column section of loc[], it selects only those dataframe columns which has all NaN values.
Pandas – Select Columns Without NaN
To select only those columns from dataframe which do not contain any NaN value, use the loc[] attribute of the dataframe i.e.
loc[row_section, column_section]
- row_section: In the row_section pass ‘:’ to include all rows.
- column_section: In the column section pass a bool series, which should be of same size as number of columns of the dataframe. If all value in bool series is True then it means that column has only non-NaN values.
For example,
# Select columns without a single NaN value subset_df = df.loc[:, ~df.isnull().any()] print(subset_df)
Output
A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12
It returned a dataframe with only those columns from the original dataframe, which contains all non-NaN values.
This one-liner solution seems a little complex. So, let’s break this down a little to understand how it is works.
Why did we passed ~df.isnull().any() in the column section of loc[]?
~df.isnull().any() returns a bool series and each value in the bool series represents a column. If any value in bool series is True then it means that corresponding column has no NaN value in it. By passing this in the column_section of loc[], we made sure that loc[] selects only the columns containing all non NaN values.
How ~df.isnull().any() works?
First of all, call isNull() function on the dataframe object, it returns a same sized dataframe containing only bool values. In this bool dataframe, a cell contains True if the corresponding cell in the original dataframe is NaN, otherwise it contains False. Then call any() function on this Boolean dataframe object. It looks over the column axis and returns a bool series. Each value in the bool series represents a column and if value is True then it means that column has atleast NaN value. Then using the ~ operator, we inverted the bool series and if any value is True in series, then it means that column has no NaN value in it.
Then pass that bool series to the column section of loc[], it selects only those dataframe columns which has all non NaN values.
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', np.NaN, 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', np.NaN, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df) # Select dataframe columns with any NaN value subset_df = df.loc[:, df.isnull().any()] print(subset_df) # Select dataframe columns with all NaN values subset_df = df.loc[:, df.isnull().all()] print(subset_df) # Select columns without a single NaN value subset_df = df.loc[:, ~df.isnull().any()] print(subset_df)
Output:
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti NaN 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London NaN 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 B E G H I 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN 3.0 NaN 3 NaN NaN NaN 11.0 1.0 B E G 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12