In this article, we will discuss how to delete the columns of a dataframe based on NaN percentage, it means by the percentage of missing values the column contains.
For example, deleting dataframe columns where NaN value are either 25% or more than 25%. Similarly we will build a solution to drop columns which contain more than N% of NaN / missing values.
Table of Contents
We are going to use the pandas dropna() function. So, first let’s have a little overview of it,
Overview of dataframe.dropna()function
Pandas provide a function to delete rows or columns from a dataframe based on NaN values it contains.
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Arguments:
- axis: Default – 0
- 0, or ‘index’ : Drop rows which contain NaN values.
- 1, or ‘columns’ : Drop columns which contain NaN value.
- how: Default – ‘any’
- ‘any’ : Drop rows / columns which contain any NaN values.
- ‘all’ : Drop rows / columns which contain all NaN values.
- thresh (int): Optional
- Delete rows/columns which contains less than minimun thresh number of non-NaN values.
- inplace (bool): Default- False
- If True, modifies the calling dataframe object
Returns
Frequently Asked:
- Combine two Series into a DataFrame in Pandas
- Pandas – Find Columns with NaN
- Pandas: Get sum of column values in a Dataframe
- Change Column Names in Pandas Dataframe
- If inplace==True, the return None, else returns a new dataframe by deleting the rows/columns based on NaN values.
Let’s use this to perform our task of deleting columns based on percentage of missing values.
Pandas: Delete dataframe columns based on NaN percentage
Suppose we have a dataframe that contains few columns which has one or more than one NaN values,
A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti NaN 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London NaN 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0
Percentage of NaN values in each columns is as follows,
- Column ‘A’: 0% of NaN values.
- Column ‘B: 100% of NaN values.
- Column ‘C’: 0% of NaN values.
- Column ‘D’: 0% of NaN values.
- Column ‘E’: 100% of NaN values.
- Column ‘F’: 0% of NaN values.
- Column ‘G’: 100% of NaN values.
- Column ‘H’: 50% of NaN values.
- Column ‘I’: 75% of NaN values.
To delete columns based on percentage of NaN values in columns, we can use a pandas dropna() function. It can delete the columns or rows of a dataframe that contains all or few NaN values. As we want to delete the columns that contains either N% or more than N% of NaN values, so we will pass following arguments in it,
perc = 20.0 # Like N % min_count = int(((100-perc)/100)*df.shape[0] + 1) df = df.dropna( axis=1, thresh=min_count)
- axis=1 : Drop columns which contain missing value.
- thresh=min_count : Delete columns which contains less than min_count number of non-NaN values.
- We calculated this min_count based on percentage of NaN values i.e. N%.
It returned a dataframe after deleting the columns containing either N% or more than N% of NaN values and then we assigned that dataframe to the same variable.
Let’s checkout some examples,
Drop Dataframe columns containing either 75% or more than 75% NaN values
import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', 23 , 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', 32, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 75% or more than 75% NaN Values perc = 75.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)
Output:
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F H 0 Jack 34 Sydney 5 NaN 1 Riti 31 Delhi 7 NaN 2 Aadi 16 London 11 3.0 3 Mark 41 Delhi 12 11.0
It deleted columns ‘B’, ‘E’, ‘G’, ‘I’ , because they had either 75% or more than 75% NaN values.
Drop Dataframe columns containing either 90% or more than 90% NaN values
print("Contents of the Dataframe : ") print(df) # Delete columns containing either 90% or more than 90% NaN Values perc = 90.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)
Output:
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A B C D E F H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN 2 Aadi NaN 16 London 32.0 11 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 11.0 1.0
It deleted column ‘G’, because it had more than 90% NaN values.
Drop Dataframe columns containing either 25% or more than 25% NaN values
print("Contents of the Dataframe : ") print(df) # Delete columns containing either 25% or more than 25% NaN Values perc = 25.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)
Output
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12
It deleted columns ‘B’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, because they had either 25% or more than 25% NaN values.
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', 23 , 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', 32, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 75% or more than 75% NaN Values perc = 75.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 90% or more than 90% NaN Values perc = 90.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 25% or more than 25% NaN Values perc = 25.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)
Output:
Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F H 0 Jack 34 Sydney 5 NaN 1 Riti 31 Delhi 7 NaN 2 Aadi 16 London 11 3.0 3 Mark 41 Delhi 12 11.0 Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A B C D E F H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN 2 Aadi NaN 16 London 32.0 11 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 11.0 1.0 Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12