Handling missing values, or NaN (Not a Number), is one of the most important steps in any data analysis project. In this article, we will discuss how to check whether a pandas DataFrame contains any missing values and how to count them.
To get started quickly, let's create a sample DataFrame with missing values to experiment with. We'll use the pandas and numpy libraries with some sample data.
import pandas as pd
import numpy as np

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney', 5),
             ('Riti', 'Data Analyst', np.nan, 7),
             ('Shanky', 'Program Manager', 'Delhi', 2),
             ('Shreya', 'Graphic Designer', np.nan, 2),
             ('Aadi', 'Data Engineering', 'New York', np.nan)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=[0, 1, 2, 3, 4])

print(df)
Contents of the created dataframe are,
      Name       Designation      City  Experience
0  Shubham    Data Scientist    Sydney         5.0
1     Riti      Data Analyst       NaN         7.0
2   Shanky   Program Manager     Delhi         2.0
3   Shreya  Graphic Designer       NaN         2.0
4     Aadi  Data Engineering  New York         NaN
Now, let’s look at different ways in which we could check for missing values.
Check for missing values in the complete DataFrame
There are multiple ways to check if the DataFrame contains any missing values. Let’s quickly explore some of those methods.
# using isnull method
print(df.isnull().any().any())
Output
True
Another method,
# using isna method
print(df.isna().any(axis=None))
Output
True
Both methods, isnull() and isna(), look for NaN (np.nan) values in the entire DataFrame. Make sure you don't cast the column to a string, as that would convert the missing values into the literal string "NaN", which can no longer be detected as a missing value.
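For instance, here is a minimal sketch of that pitfall, using the df created above (the variable name city_as_str is just for illustration):

# casting to string turns NaN into the literal text 'nan', which is no longer a missing value
city_as_str = df['City'].astype(str)

print(df['City'].isnull().any())   # True  - the original column still has real NaN values
print(city_as_str.isnull().any())  # False - the string column has no detectable missing values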
The complete example is as follows,
import pandas as pd
import numpy as np

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney', 5),
             ('Riti', 'Data Analyst', np.nan, 7),
             ('Shanky', 'Program Manager', 'Delhi', 2),
             ('Shreya', 'Graphic Designer', np.nan, 2),
             ('Aadi', 'Data Engineering', 'New York', np.nan)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=[0, 1, 2, 3, 4])

print(df)

# using isnull method
print(df.isnull().any().any())

# using isna method
print(df.isna().any(axis=None))
Output
      Name       Designation      City  Experience
0  Shubham    Data Scientist    Sydney         5.0
1     Riti      Data Analyst       NaN         7.0
2   Shanky   Program Manager     Delhi         2.0
3   Shreya  Graphic Designer       NaN         2.0
4     Aadi  Data Engineering  New York         NaN
True
True
Check for missing values in a DataFrame column
If we need to check for missing values at the column level, or in one specific column, we can again use the same methods.
# using isnull method
print(df.isnull().any())

# or you can use the isna method
print(df.isna().any())
Output
Name           False
Designation    False
City            True
Experience      True
dtype: bool
Name           False
Designation    False
City            True
Experience      True
dtype: bool
Both methods generate the same output. Now, if we need to check for missing values in one specific column, here is what we can do.
# feel free to use isna as well
print(df['City'].isnull().any())

# using the hasnans attribute
print(df['City'].hasnans)
Output
True
True
Now, within the same column, let's check exactly which rows contain a missing value.
# feel free to use isna as well
print(df['City'].isnull())
Output
0    False
1     True
2    False
3     True
4    False
Name: City, dtype: bool
We observe that row indices 1 and 3 contain missing values.
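As a side note, this boolean Series can be used directly as a filter to pull out those rows; a minimal sketch on the df created above (the variable name rows_with_missing_city is just for illustration):

# keep only the rows where 'City' is missing
rows_with_missing_city = df[df['City'].isnull()]
print(rows_with_missing_city)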
The complete example is as follows,
import pandas as pd
import numpy as np

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney', 5),
             ('Riti', 'Data Analyst', np.nan, 7),
             ('Shanky', 'Program Manager', 'Delhi', 2),
             ('Shreya', 'Graphic Designer', np.nan, 2),
             ('Aadi', 'Data Engineering', 'New York', np.nan)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=[0, 1, 2, 3, 4])

print(df)

# using isnull method
print(df.isnull().any())

# or you can use the isna method
print(df.isna().any())

# feel free to use isna as well
print(df['City'].isnull().any())

# using the hasnans attribute
print(df['City'].hasnans)

# feel free to use isna as well
print(df['City'].isnull())
Output:
      Name       Designation      City  Experience
0  Shubham    Data Scientist    Sydney         5.0
1     Riti      Data Analyst       NaN         7.0
2   Shanky   Program Manager     Delhi         2.0
3   Shreya  Graphic Designer       NaN         2.0
4     Aadi  Data Engineering  New York         NaN
Name           False
Designation    False
City            True
Experience      True
dtype: bool
Name           False
Designation    False
City            True
Experience      True
dtype: bool
True
True
0    False
1     True
2    False
3     True
4    False
Name: City, dtype: bool
Count missing values in a Pandas DataFrame
For a complete data analysis, knowing whether the DataFrame contains any missing values is only the first part of the problem. The second part is understanding how much of the data is missing. Let's count the number of missing values in the DataFrame.
# total missing values in the DataFrame
print(df.isnull().sum().sum())
Output
3
As expected, there are three NaN values in the DataFrame. Again, we could use the isna() method here as well.
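For reference, a small sketch of the equivalent isna() call on the df created above; isna() is an alias of isnull(), so it returns the same count:

# isna() is an alias of isnull(), so the total count is identical
print(df.isna().sum().sum())   # 3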
Count missing values in a DataFrame column
We can also break down the missing-value count at the column level. Let's take a quick look at the code below.
# total missing values (column-wise)
print(df.isnull().sum())
Output
Name           0
Designation    0
City           2
Experience     1
dtype: int64
So the column “City” contains 2 missing values and the column “Experience” contains 1 missing value. We can also convert this to percentages for better interpretation.
# % missing values (column-wise)
print(df.isnull().sum() / df.shape[0])
Output
Name           0.0
Designation    0.0
City           0.4
Experience     0.2
dtype: float64
This is easier to interpret: the column "City" has 40% missing records and the column "Experience" has 20% missing. We can run the same calculation for a specific column, as shown below.
# % missing values for the 'City' column
print(df['City'].isnull().sum() / df.shape[0])
Output
0.4
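As a side note, taking the mean of the boolean mask gives the same fraction directly, because True counts as 1 and False as 0; a small sketch on the df created above:

# mean of the boolean mask = fraction of missing values
print(df['City'].isnull().mean())   # 0.4

# the same idea works column-wise for the whole DataFrame
print(df.isnull().mean())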
The complete example of counting missing values is as follows,
import pandas as pd
import numpy as np

# List of Tuples
employees = [('Shubham', 'Data Scientist', 'Sydney', 5),
             ('Riti', 'Data Analyst', np.nan, 7),
             ('Shanky', 'Program Manager', 'Delhi', 2),
             ('Shreya', 'Graphic Designer', np.nan, 2),
             ('Aadi', 'Data Engineering', 'New York', np.nan)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience'],
                  index=[0, 1, 2, 3, 4])

print(df)

# total missing values in the DataFrame
print(df.isnull().sum().sum())

# total missing values (column-wise)
print(df.isnull().sum())

# % missing values (column-wise)
print(df.isnull().sum() / df.shape[0])

# % missing values for the 'City' column
print(df['City'].isnull().sum() / df.shape[0])
Output:
      Name       Designation      City  Experience
0  Shubham    Data Scientist    Sydney         5.0
1     Riti      Data Analyst       NaN         7.0
2   Shanky   Program Manager     Delhi         2.0
3   Shreya  Graphic Designer       NaN         2.0
4     Aadi  Data Engineering  New York         NaN
3
Name           0
Designation    0
City           2
Experience     1
dtype: int64
Name           0.0
Designation    0.0
City           0.4
Experience     0.2
dtype: float64
0.4
Summary
Great, you made it! In this article, we discussed multiple ways to check for and count missing values (NaN) in a pandas DataFrame. Thanks.