Pandas: Drop dataframe columns based on NaN percentage

In this article, we will discuss how to delete the columns of a dataframe based on NaN percentage, it means by the percentage of missing values the column contains.

For example, deleting dataframe columns where NaN value are either 25% or more than 25%. Similarly we will build a solution to drop columns which contain more than N% of NaN / missing values.

Table of Contents

We are going to use the pandas dropna() function. So, first let’s have a little overview of it,

Overview of dataframe.dropna()function

Pandas provide a function to delete rows or columns from a dataframe based on NaN values it contains.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Arguments:

  • axis: Default – 0
    • 0, or ‘index’ : Drop rows which contain NaN values.
    • 1, or ‘columns’ : Drop columns which contain NaN value.
  • how: Default – ‘any’
    • ‘any’ : Drop rows / columns which contain any NaN values.
    • ‘all’ : Drop rows / columns which contain all NaN values.
  • thresh (int): Optional
    • Delete rows/columns which contains less than minimun thresh number of non-NaN values.
  • inplace (bool): Default- False
    • If True, modifies the calling dataframe object

Returns

  • If inplace==True, the return None, else returns a new dataframe by deleting the rows/columns based on NaN values.

Let’s use this to perform our task of deleting columns based on percentage of missing values.

Pandas: Delete dataframe columns based on NaN percentage

Suppose we have a dataframe that contains few columns which has one or more than one NaN values,

      A   B   C       D   E   F   G     H    I
0  Jack NaN  34  Sydney NaN   5 NaN   NaN  NaN
1  Riti NaN  31   Delhi NaN   7 NaN   NaN  NaN
2  Aadi NaN  16  London NaN  11 NaN   3.0  NaN
3  Mark NaN  41   Delhi NaN  12 NaN  11.0  1.0

Percentage of NaN values in each columns is as follows,

  • Column ‘A’: 0% of NaN values.
  • Column ‘B: 100% of NaN values.
  • Column ‘C’: 0% of NaN values.
  • Column ‘D’: 0% of NaN values.
  • Column ‘E’: 100% of NaN values.
  • Column ‘F’: 0% of NaN values.
  • Column ‘G’: 100% of NaN values.
  • Column ‘H’: 50% of NaN values.
  • Column ‘I’: 75% of NaN values.

To delete columns based on percentage of NaN values in columns, we can use a pandas dropna() function. It can delete the columns or rows of a dataframe that contains all or few NaN values. As we want to delete the columns that contains either N% or more than N% of NaN values, so we will pass following arguments in it,

perc = 20.0 # Like N %
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
df = df.dropna( axis=1, 
                thresh=min_count)
  • axis=1 : Drop columns which contain missing value.
  • thresh=min_count : Delete columns which contains less than min_count number of non-NaN values.
    • We calculated this min_count based on percentage of NaN values i.e. N%.

It returned a dataframe after deleting the columns containing either N% or more than N% of NaN values and then we assigned that dataframe to the same variable.

Let’s checkout some examples,

Drop Dataframe columns containing either 75% or more than 75% NaN values

import pandas as pd
import numpy as np

# List of Tuples
empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN,  5,  np.NaN, np.NaN, np.NaN),
            ('Riti', 23    , 31, 'Delhi' , np.NaN,  7,  np.NaN, np.NaN, np.NaN),
            ('Aadi', np.NaN, 16, 'London', 32,      11, np.NaN, 3, np.NaN),
            ('Mark', np.NaN, 41, 'Delhi' , np.NaN,  12, np.NaN, 11, 1)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees,
                    columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 75% or more than 75% NaN Values
perc = 75.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F     H
0  Jack  34  Sydney   5   NaN
1  Riti  31   Delhi   7   NaN
2  Aadi  16  London  11   3.0
3  Mark  41   Delhi  12  11.0

It deleted columns ‘B’, ‘E’, ‘G’, ‘I’ , because they had either 75% or more than 75% NaN values.

Drop Dataframe columns containing either 90% or more than 90% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 90% or more than 90% NaN Values
perc = 90.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A     B   C       D     E   F     H    I
0  Jack   NaN  34  Sydney   NaN   5   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7   NaN  NaN
2  Aadi   NaN  16  London  32.0  11   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12  11.0  1.0

It deleted column ‘G’, because it had more than 90% NaN values.

Drop Dataframe columns containing either 25% or more than 25% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 25% or more than 25% NaN Values
perc = 25.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F
0  Jack  34  Sydney   5
1  Riti  31   Delhi   7
2  Aadi  16  London  11
3  Mark  41   Delhi  12

It deleted columns ‘B’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, because they had either 25% or more than 25% NaN values.

The complete example is as follows,

import pandas as pd
import numpy as np

# List of Tuples
empoyees = [('Jack', np.NaN, 34, 'Sydney', np.NaN,  5,  np.NaN, np.NaN, np.NaN),
            ('Riti', 23    , 31, 'Delhi' , np.NaN,  7,  np.NaN, np.NaN, np.NaN),
            ('Aadi', np.NaN, 16, 'London', 32,      11, np.NaN, 3, np.NaN),
            ('Mark', np.NaN, 41, 'Delhi' , np.NaN,  12, np.NaN, 11, 1)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees,
                    columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 75% or more than 75% NaN Values
perc = 75.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 90% or more than 90% NaN Values
perc = 90.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 25% or more than 25% NaN Values
perc = 25.0
min_count =  int(((100-perc)/100)*df.shape[0] + 1)
mod_df = df.dropna( axis=1, 
                thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F     H
0  Jack  34  Sydney   5   NaN
1  Riti  31   Delhi   7   NaN
2  Aadi  16  London  11   3.0
3  Mark  41   Delhi  12  11.0
Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A     B   C       D     E   F     H    I
0  Jack   NaN  34  Sydney   NaN   5   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7   NaN  NaN
2  Aadi   NaN  16  London  32.0  11   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12  11.0  1.0
Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F
0  Jack  34  Sydney   5
1  Riti  31   Delhi   7
2  Aadi  16  London  11
3  Mark  41   Delhi  12

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top