In this article, we will discuss how to delete the columns of a dataframe based on NaN percentage, it means by the percentage of missing values the column contains.

For example, deleting dataframe columns where NaN value are either 25% or more than 25%. Similarly we will build a solution to drop columns which contain more than N% of NaN / missing values.

**Table of Contents**

We are going to use the pandas dropna() function. So, first let’s have a little overview of it,

## Overview of dataframe.dropna()function

Pandas provide a function to delete rows or columns from a dataframe based on NaN values it contains.

**Advertisements**

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

**Arguments:**

**axis**: Default – 0- 0, or â€˜indexâ€™ : Drop rows which contain NaN values.
- 1, or â€˜columnsâ€™ : Drop columns which contain NaN value.

**how**: Default – ‘any’- â€˜anyâ€™ : Drop rows / columns which contain any NaN values.
- â€˜allâ€™ : Drop rows / columns which contain all NaN values.

**thresh**(int): Optional- Delete rows/columns which contains less than minimun thresh number of non-NaN values.

**inplace**(bool): Default- False- If True, modifies the calling dataframe object

**Returns**

- If inplace==True, the return None, else returns a new dataframe by deleting the rows/columns based on NaN values.

Let’s use this to perform our task of deleting columns based on percentage of missing values.

## Pandas: Delete dataframe columns based on NaN percentage

Suppose we have a dataframe that contains few columns which has one or more than one NaN values,

A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti NaN 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London NaN 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0

Percentage of NaN values in each columns is as follows,

**Column ‘A’:**0% of NaN values.**Column ‘B:**100% of NaN values.**Column ‘C’:**0% of NaN values.**Column ‘D’:**0% of NaN values.**Column ‘E’:**100% of NaN values.**Column ‘F’:**0% of NaN values.**Column ‘G’:**100% of NaN values.**Column ‘H’:**50% of NaN values.**Column ‘I’:**75% of NaN values.

To delete columns based on percentage of NaN values in columns, we can use a pandas dropna() function. It can delete the columns or rows of a dataframe that contains all or few NaN values. As we want to delete the columns that contains either N% or more than N% of NaN values, so we will pass following arguments in it,

perc = 20.0 # Like N % min_count = int(((100-perc)/100)*df.shape[0] + 1) df = df.dropna( axis=1, thresh=min_count)

**axis=1 :**Drop columns which contain missing value.**thresh=min_count**: Delete columns which contains less than**min_count**number of non-NaN values.- We calculated this min_count based on percentage of NaN values i.e. N%.

It returned a dataframe after deleting the columns containing either N% or more than N% of NaN values and then we assigned that dataframe to the same variable.

**Let’s checkout some examples,**

## Drop Dataframe columns containing either 75% or more than 75% NaN values

import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', 23 , 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', 32, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 75% or more than 75% NaN Values perc = 75.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)

**Output:**

Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F H 0 Jack 34 Sydney 5 NaN 1 Riti 31 Delhi 7 NaN 2 Aadi 16 London 11 3.0 3 Mark 41 Delhi 12 11.0

It deleted columns ‘B’, ‘E’, ‘G’, ‘I’ , because they had either 75% or more than 75% NaN values.

## Drop Dataframe columns containing either 90% or more than 90% NaN values

print("Contents of the Dataframe : ") print(df) # Delete columns containing either 90% or more than 90% NaN Values perc = 90.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)

Output:

Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A B C D E F H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN 2 Aadi NaN 16 London 32.0 11 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 11.0 1.0

It deleted column ‘G’, because it had more than 90% NaN values.

## Drop Dataframe columns containing either 25% or more than 25% NaN values

print("Contents of the Dataframe : ") print(df) # Delete columns containing either 25% or more than 25% NaN Values perc = 25.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)

**Output**

Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12

It deleted columns ‘B’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, because they had either 25% or more than 25% NaN values.

**The complete example is as follows,**

import pandas as pd import numpy as np # List of Tuples empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN, 5, np.NaN, np.NaN, np.NaN), ('Riti', 23 , 31, 'Delhi' , np.NaN, 7, np.NaN, np.NaN, np.NaN), ('Aadi', np.NaN, 16, 'London', 32, 11, np.NaN, 3, np.NaN), ('Mark', np.NaN, 41, 'Delhi' , np.NaN, 12, np.NaN, 11, 1)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 75% or more than 75% NaN Values perc = 75.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 90% or more than 90% NaN Values perc = 90.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df) print("Contents of the Dataframe : ") print(df) # Delete columns containing either 25% or more than 25% NaN Values perc = 25.0 min_count = int(((100-perc)/100)*df.shape[0] + 1) mod_df = df.dropna( axis=1, thresh=min_count) print("Modified Dataframe : ") print(mod_df)

**Output:**

Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F H 0 Jack 34 Sydney 5 NaN 1 Riti 31 Delhi 7 NaN 2 Aadi 16 London 11 3.0 3 Mark 41 Delhi 12 11.0 Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A B C D E F H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN 2 Aadi NaN 16 London 32.0 11 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 11.0 1.0 Contents of the Dataframe : A B C D E F G H I 0 Jack NaN 34 Sydney NaN 5 NaN NaN NaN 1 Riti 23.0 31 Delhi NaN 7 NaN NaN NaN 2 Aadi NaN 16 London 32.0 11 NaN 3.0 NaN 3 Mark NaN 41 Delhi NaN 12 NaN 11.0 1.0 Modified Dataframe : A C D F 0 Jack 34 Sydney 5 1 Riti 31 Delhi 7 2 Aadi 16 London 11 3 Mark 41 Delhi 12

**Advertisements**

Thanks for reading.