Pandas: Drop dataframe columns based on NaN percentage

In this article, we will discuss how to delete the columns of a dataframe based on their NaN percentage, i.e. the percentage of missing values each column contains.

For example, deleting dataframe columns where 25% or more of the values are NaN. More generally, we will build a solution that drops columns containing N% or more NaN / missing values.


We are going to use the pandas dropna() function, so let’s first have a quick overview of it,

Overview of the dataframe.dropna() function

Pandas provides a function to delete rows or columns from a dataframe based on the NaN values it contains.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Arguments:

  • axis: Default – 0
    • 0, or ‘index’ : Drop rows which contain NaN values.
    • 1, or ‘columns’ : Drop columns which contain NaN values.
  • how: Default – ‘any’
    • ‘any’ : Drop rows / columns which contain any NaN values.
    • ‘all’ : Drop rows / columns where all values are NaN.
  • thresh (int): Optional
    • Keep only rows / columns that contain at least thresh non-NaN values; delete those with fewer non-NaN values.
  • inplace (bool): Default – False
    • If True, modifies the calling dataframe object

Returns

  • If inplace==True, it returns None, else it returns a new dataframe created by deleting the rows/columns based on NaN values.
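
For instance, here is a minimal sketch (using a small made-up dataframe and the name df_demo, just for illustration) of how axis, how and thresh behave,

import pandas as pd
import numpy as np

# Small illustrative dataframe (hypothetical values)
df_demo = pd.DataFrame({'X': [1, 2, 3],
                        'Y': [np.nan, np.nan, np.nan],
                        'Z': [4, np.nan, 6]})

# Drop columns where every value is NaN -> only column 'Y' is removed
print(df_demo.dropna(axis=1, how='all'))

# Keep only columns that have at least 3 non-NaN values -> 'Y' and 'Z' are removed
print(df_demo.dropna(axis=1, thresh=3))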

Let’s use this to perform our task of deleting columns based on the percentage of missing values.

Pandas: Delete dataframe columns based on NaN percentage

Suppose we have a dataframe in which several columns contain one or more NaN values,

      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0

The percentage of NaN values in each column is as follows,

  • Column ‘A’: 0% of NaN values.
  • Column ‘B’: 75% of NaN values.
  • Column ‘C’: 0% of NaN values.
  • Column ‘D’: 0% of NaN values.
  • Column ‘E’: 75% of NaN values.
  • Column ‘F’: 0% of NaN values.
  • Column ‘G’: 100% of NaN values.
  • Column ‘H’: 50% of NaN values.
  • Column ‘I’: 75% of NaN values.
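
These percentages can be cross-checked directly. Assuming df already holds the dataframe shown above, isna().mean() gives the fraction of NaN values per column,

# Percentage of NaN values in each column (a quick cross-check)
print(df.isna().mean() * 100)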

To delete columns based on the percentage of NaN values they contain, we can use the pandas dropna() function. It can delete the columns or rows of a dataframe depending on how many NaN values they contain. As we want to delete the columns that contain N% or more NaN values, we will pass the following arguments to it,

perc = 20.0 # Like N %
min_count = int(((100-perc)/100) * df.shape[0] + 1)
df = df.dropna(axis=1, thresh=min_count)
  • axis=1 : Drop columns which contain missing values.
  • thresh=min_count : Delete columns which contain fewer than min_count non-NaN values.
    • We calculated min_count from the percentage of NaN values, i.e. N%.

It returns a dataframe with the columns containing N% or more NaN values removed, and we assign that dataframe back to the same variable.
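
For example, with perc = 75.0 and the 4-row dataframe above, the min_count calculation works out like this (a small sanity-check sketch),

perc = 75.0
min_count = int(((100 - perc) / 100) * df.shape[0] + 1)
# int((25 / 100) * 4 + 1) = int(2.0) = 2
# So any column with fewer than 2 non-NaN values (i.e. 75% or more NaN) gets dropped
print(min_count)   # 2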

Let’s check out some examples,

Drop Dataframe columns containing either 75% or more than 75% NaN values

import pandas as pd
import numpy as np

# List of Tuples
employees = [('Jack', np.nan, 34, 'Sydney', np.nan,  5, np.nan, np.nan, np.nan),
             ('Riti', 23,     31, 'Delhi' , np.nan,  7, np.nan, np.nan, np.nan),
             ('Aadi', np.nan, 16, 'London', 32,     11, np.nan, 3,      np.nan),
             ('Mark', np.nan, 41, 'Delhi' , np.nan, 12, np.nan, 11,     1)]

# Create a DataFrame object
df = pd.DataFrame(employees,
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 75% or more than 75% NaN Values
perc = 75.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F     H
0  Jack  34  Sydney   5   NaN
1  Riti  31   Delhi   7   NaN
2  Aadi  16  London  11   3.0
3  Mark  41   Delhi  12  11.0

It deleted columns ‘B’, ‘E’, ‘G’ and ‘I’, because each of them had 75% or more NaN values.
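
If you prefer to modify the dataframe object directly instead of assigning the result to a new variable, a minimal sketch using the inplace argument described above (applied to a copy here, so df stays unchanged for the next examples) would be,

# Same 75% rule, but modifying the dataframe object in place
tmp_df = df.copy()
perc = 75.0
min_count = int(((100 - perc) / 100) * tmp_df.shape[0] + 1)
tmp_df.dropna(axis=1, thresh=min_count, inplace=True)
print(tmp_df)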

Drop Dataframe columns containing either 90% or more than 90% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 90% or more than 90% NaN Values
perc = 90.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A     B   C       D     E   F     H    I
0  Jack   NaN  34  Sydney   NaN   5   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7   NaN  NaN
2  Aadi   NaN  16  London  32.0  11   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12  11.0  1.0

It deleted only column ‘G’, because it was the only column with 90% or more NaN values (it is 100% NaN).

Drop Dataframe columns containing either 25% or more than 25% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 25% or more than 25% NaN Values
perc = 25.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F
0  Jack  34  Sydney   5
1  Riti  31   Delhi   7
2  Aadi  16  London  11
3  Mark  41   Delhi  12

It deleted columns ‘B’, ‘E’, ‘G’, ‘H’ and ‘I’, because each of them had 25% or more NaN values.
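
For completeness, the same result can also be obtained without thresh, by filtering on the NaN percentage directly. This is an alternative sketch, not the dropna() based approach used above; it reproduces the 25% example,

# Keep only columns whose NaN percentage is strictly below perc
perc = 25.0
mod_df = df.loc[:, df.isna().mean() * 100 < perc]
print(mod_df)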

The complete example is as follows,

import pandas as pd
import numpy as np

# List of Tuples
employees = [('Jack', np.nan, 34, 'Sydney', np.nan,  5, np.nan, np.nan, np.nan),
             ('Riti', 23,     31, 'Delhi' , np.nan,  7, np.nan, np.nan, np.nan),
             ('Aadi', np.nan, 16, 'London', 32,     11, np.nan, 3,      np.nan),
             ('Mark', np.nan, 41, 'Delhi' , np.nan, 12, np.nan, 11,     1)]

# Create a DataFrame object
df = pd.DataFrame(employees,
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'])

print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 75% or more than 75% NaN Values
perc = 75.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 90% or more than 90% NaN Values
perc = 90.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete columns containing either 25% or more than 25% NaN Values
perc = 25.0
min_count = int(((100-perc)/100) * df.shape[0] + 1)
mod_df = df.dropna(axis=1, thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F     H
0  Jack  34  Sydney   5   NaN
1  Riti  31   Delhi   7   NaN
2  Aadi  16  London  11   3.0
3  Mark  41   Delhi  12  11.0
Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A     B   C       D     E   F     H    I
0  Jack   NaN  34  Sydney   NaN   5   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7   NaN  NaN
2  Aadi   NaN  16  London  32.0  11   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12  11.0  1.0
Contents of the Dataframe :
      A     B   C       D     E   F   G     H    I
0  Jack   NaN  34  Sydney   NaN   5 NaN   NaN  NaN
1  Riti  23.0  31   Delhi   NaN   7 NaN   NaN  NaN
2  Aadi   NaN  16  London  32.0  11 NaN   3.0  NaN
3  Mark   NaN  41   Delhi   NaN  12 NaN  11.0  1.0
Modified Dataframe :
      A   C       D   F
0  Jack  34  Sydney   5
1  Riti  31   Delhi   7
2  Aadi  16  London  11
3  Mark  41   Delhi  12
