Pandas: Drop dataframe rows based on NaN percentage

In this article, we will discuss how to delete the rows of a dataframe based on NaN percentage, it means by the percentage of missing values the rows contains.

For example, deleting dataframe rows where NaN value are either 25% or more than 25%. Similarly we will build a solution to drop rows which contain more than N% of NaN / missing values.

Table of Contents

We are going to use the pandas dropna() function. So, first let’s have a little overview of it,

Overview of dataframe.dropna()function

Pandas provide a function to delete rows or columns from a dataframe based on NaN values it contains.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Arguments:

  • axis: Default – 0
    • 0, or ‘index’ : Drop rows which contain NaN values.
    • 1, or ‘columns’ : Drop columns which contain NaN value.
  • how: Default – ‘any’
    • ‘any’ : Drop rows / columns which contain any NaN values.
    • ‘all’ : Drop rows / columns which contain all NaN values.
  • thresh (int): Optional
    • Delete rows/columns which contains less than minimun thresh number of non-NaN values.
  • inplace (bool): Default- False
    • If True, modifies the calling dataframe object

Returns

  • If inplace==True, the return None, else returns a new dataframe by deleting the rows/columns based on NaN values.

Let’s use this to perform our task of deleting rows based on percentage of missing values.

Pandas: Delete dataframe rows based on NaN percentage

Suppose we have a dataframe that contains few rows which has one or more NaN values,

      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN

Percentage of NaN values in each row is as follows,

  • Row ‘0’: 0% of NaN values.
  • Row ‘1: 25% of NaN values.
  • Row ‘2’: 100% of NaN values.
  • Row ‘3’: 0% of NaN values.
  • Row ‘4’: 50% of NaN values.
  • Row ‘5’: 50% of NaN values.
  • Row ‘6’: 75% of NaN values.
  • Row ‘7’: 100% of NaN values.
  • Row ‘8’: 100% of NaN values.

To delete rows based on percentage of NaN values in rows, we can use a pandas dropna() function. It can delete the columns or rows of a dataframe that contains all or few NaN values. As we want to delete the rows that contains either N% or more than N% of NaN values, so we will pass following arguments in it,

# Delete rows containing either 75% or more than 75% NaN Values
perc = 75.0 # Here N is 75
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)
  • axis=0 : Drop rows which contain missing value.
  • thresh=min_count : Delete rows which contains less than min_count number of non-NaN values.
    • We calculated this min_count based on percentage of NaN values i.e. N%.

It returned a dataframe after deleting the rows containing either N% or more than N% of NaN values and then we assigned that dataframe to the same variable.

Let’s checkout some examples,

Drop Dataframe rows containing either 75% or more than 75% NaN values

import pandas as pd
import numpy as np

# List of Tuples
empoyees = [('Jack', 34,    'Sydney', 5) ,
            ('Riti', 31,    'Delhi' , np.NaN) ,
            (np.NaN, np.NaN, np.NaN , np.NaN),
            ('Aadi', 16,    'London', 11) ,
            ('John', 31,     np.NaN , np.NaN) ,
            ('Kate', 31,     np.NaN , np.NaN) ,
            ('Mark', np.NaN, np.NaN , np.NaN),
            (np.NaN, np.NaN, np.NaN , np.NaN),
            (np.NaN, np.NaN, np.NaN , np.NaN)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees)

print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 75% or more than 75% NaN Values
perc = 75.0 
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN

It deleted rows with index value ‘2’, ‘6’, ‘7’, ‘8’, because they had either 75% or more than 75% NaN values.

Drop Dataframe rows containing either 90% or more than 90% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 90% or more than 90% NaN Values
perc = 90.0
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN

It deleted rows with index value ‘2’, ‘7’ and ‘8’, because they had more than 90% NaN values.

Drop Dataframe rows containing either 25% or more than 25% NaN values

print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 25% or more than 25% NaN Values
perc = 25.0
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output

Contents of the Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
3  Aadi  16.0  London  11.0

It deleted rows with index value ‘1’, ‘2’, ‘4’, ‘5’, ‘6’, ‘7’ and ‘8’, because they had more either 25% or more than 25% NaN values.

The complete example is as follows,

import pandas as pd
import numpy as np

# List of Tuples
empoyees = [('Jack', 34,    'Sydney', 5) ,
            ('Riti', 31,    'Delhi' , np.NaN) ,
            (np.NaN, np.NaN, np.NaN , np.NaN),
            ('Aadi', 16,    'London', 11) ,
            ('John', 31,     np.NaN , np.NaN) ,
            ('Kate', 31,     np.NaN , np.NaN) ,
            ('Mark', np.NaN, np.NaN , np.NaN),
            (np.NaN, np.NaN, np.NaN , np.NaN),
            (np.NaN, np.NaN, np.NaN , np.NaN)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees)

print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 75% or more than 75% NaN Values
perc = 75.0 
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 90% or more than 90% NaN Values
perc = 90.0
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)


print("Contents of the Dataframe : ")
print(df)

# Delete rows containing either 25% or more than 25% NaN Values
perc = 25.0
min_count =  int(((100-perc)/100)*df.shape[1] + 1)
mod_df = df.dropna( axis=0, 
                    thresh=min_count)

print("Modified Dataframe : ")
print(mod_df)

Output:

Contents of the Dataframe : 
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
Contents of the Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
Contents of the Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
1  Riti  31.0   Delhi   NaN
2   NaN   NaN     NaN   NaN
3  Aadi  16.0  London  11.0
4  John  31.0     NaN   NaN
5  Kate  31.0     NaN   NaN
6  Mark   NaN     NaN   NaN
7   NaN   NaN     NaN   NaN
8   NaN   NaN     NaN   NaN
Modified Dataframe :
      0     1       2     3
0  Jack  34.0  Sydney   5.0
3  Aadi  16.0  London  11.0

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top