In this article, we will discuss how to replace NaN values with the mean of the values in a column or row using the fillna() and mean() methods.
In data analytics we sometimes must fill the missing values using the column mean or row mean to conduct our analysis. Pandas provides built-in methods to handle missing or ‘NaN’ values and clean the data set. These methods are:
DataFrame.fillna()
The fillna() method is used to replace the ‘NaN’ in the dataframe. We have discussed the arguments of fillna() in detail in another article.
The mean() method:
mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Parameters:
- axis: {index (0), columns (1)}. Axis for the function to be applied on.
- skipna: bool, default True. Exclude NA/null values when computing the result.
- level: int or level name, default None. If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series. This argument was removed in pandas 2.0.
- numeric_only: bool, default None. Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs: Additional keyword arguments to be passed to the function.
We will be using the default values of the arguments of the mean() method in this article.
Returns:
- It returns the average or mean of the values.
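To make the axis and skipna parameters concrete, here is a minimal sketch. The frame demo and its values are made up purely for illustration and are not part of the article's data set.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the article's data)
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, 6.0]})

# Default axis: one mean per column, NaN values are skipped
col_means = demo.mean()

# axis=1: one mean per row, NaN values are skipped
row_means = demo.mean(axis=1)

# skipna=False: a NaN anywhere in a column makes that column's mean NaN
strict_means = demo.mean(skipna=False)

print(col_means)
print(row_means)
print(strict_means)
```

Because skipna defaults to True, the means used in the rest of this article are computed over the non-missing values only.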
Now let’s look at some examples of fillna() along with mean().
Pandas: Replace NaN with column mean
We can replace the NaN values in a complete dataframe or a particular column with a mean of values in a specific column.
Suppose we have a dataframe that contains the marks of 4 students, S1 to S4, in different subjects:
import numpy as np
import pandas as pd

# A dictionary with lists as values
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
sample_dict = {
    'S1': [10, 20, np.nan, np.nan],
    'S2': [5, np.nan, np.nan, 29],
    'S3': [15, np.nan, np.nan, 11],
    'S4': [21, 22, 23, 25],
    'Subjects': ['Maths', 'Finance', 'History', 'Geography']}

# Create a DataFrame from the dictionary
df = pd.DataFrame(sample_dict)

# Set column 'Subjects' as the index of the DataFrame
df = df.set_index('Subjects')
print(df)
This is the DataFrame that we have created:
             S1    S2    S3  S4
Subjects
Maths      10.0   5.0  15.0  21
Finance    20.0   NaN   NaN  22
History     NaN   NaN   NaN  23
Geography   NaN  29.0  11.0  25
If we calculate the mean of the values in the ‘S2’ column, a single value of float type is returned:
# Get the mean of values in column S2
mean_value = df['S2'].mean()
print('Mean of values in column S2:')
print(mean_value)
Output:
Mean of values in column S2:
17.0
Replace NaN values in a column with mean of column values
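Now let’s replace the NaN values in column S2 with the mean of the values in the same column, i.e. S2:
# Replace NaNs in column S2 with the
# mean of values in the same column.
# The result is assigned back, because calling fillna(inplace=True)
# on a selected column may not update df in recent pandas versions.
df['S2'] = df['S2'].fillna(value=df['S2'].mean())
print('Updated Dataframe:')
print(df)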
Output:
Updated Dataframe:
             S1    S2    S3  S4
Subjects
Maths      10.0   5.0  15.0  21
Finance    20.0  17.0   NaN  22
History     NaN  17.0   NaN  23
Geography   NaN  29.0  11.0  25
Since mean() is called on the ‘S2’ column, the ‘value’ argument receives the mean of the ‘S2’ column values. The ‘NaN’ values in the ‘S2’ column are then replaced with that value, i.e. the mean of the ‘S2’ column.
Replace all NaN values in a Dataframe with mean of column values
Now if we want to replace all the NaN values in the DataFrame with the mean of ‘S2’, we can simply call fillna() on the entire dataframe instead of on a particular column. For example:
# Replace all NaNs in a dataframe with the
# mean of values in a column
df.fillna(value=df['S2'].mean(), inplace=True)
print('Updated Dataframe:')
print(df)
Output:
Updated Dataframe:
             S1    S2    S3  S4
Subjects
Maths      10.0   5.0  15.0  21
Finance    20.0  17.0  17.0  22
History    17.0  17.0  17.0  23
Geography  17.0  29.0  11.0  25
Notice that all the NaN values are replaced with the mean of the ‘S2’ column values. Here ‘inplace=True’ makes the change directly in the existing dataframe instead of returning a new one.
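We can also use the update() method to make the necessary updates:
df.update(df['S2'].fillna(value=df['S2'].mean()))
The above line replaces the NaNs in column S2 with the mean of the values in column S2. Note that fillna() is called without ‘inplace=True’ here, because with it fillna() returns None and update() would receive nothing to apply.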
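As a self-contained sketch of the update() approach, the snippet below rebuilds a two-column version of the sample data under the hypothetical name df_demo. fillna() is called without inplace=True so that it returns a Series, which update() then aligns on the index and on the Series name.

```python
import numpy as np
import pandas as pd

# Two columns of the article's sample data, rebuilt so this runs on its own
df_demo = pd.DataFrame(
    {'S1': [10, 20, np.nan, np.nan], 'S2': [5, np.nan, np.nan, 29]},
    index=['Maths', 'Finance', 'History', 'Geography'])

# fillna() without inplace returns a new Series that keeps the name 'S2'
filled_s2 = df_demo['S2'].fillna(df_demo['S2'].mean())

# update() matches on index and column name and modifies df_demo in place
df_demo.update(filled_s2)
print(df_demo)
```

Only column S2 is touched; the NaN values in S1 stay as they are.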
Pandas: Replace NaNs with mean of multiple columns
Let’s reinitialize our dataframe with NaN values:
# Create a DataFrame from the dictionary
df = pd.DataFrame(sample_dict)

# Set column 'Subjects' as the index of the DataFrame
df = df.set_index('Subjects')

# Dataframe with NaNs
print(df)
Output:
             S1    S2    S3  S4
Subjects
Maths      10.0   5.0  15.0  21
Finance    20.0   NaN   NaN  22
History     NaN   NaN   NaN  23
Geography   NaN  29.0  11.0  25
Now if we want to work on multiple columns together, we can just specify the list of columns while calling the mean() method:
# Mean of values in columns S2 & S3
mean_values = df[['S2', 'S3']].mean()
print(mean_values)
Output:
S2    17.0
S3    13.0
dtype: float64
It returned a series containing 2 values i.e. mean of values in column S2 & S3.
Now let’s replace the NaN values in the columns ‘S2’ and ‘S3’ with the mean of the values in ‘S2’ and ‘S3’ as returned by the mean() method. The ‘value’ argument receives a Series of 2 mean values, which fill the NaN values in the ‘S2’ and ‘S3’ columns respectively. Here ‘value’ is of type ‘Series’:
# Replace the NaNs in columns S2 & S3 with the mean of values
# in columns S2 & S3 respectively
df[['S2', 'S3']] = df[['S2', 'S3']].fillna(value=df[['S2', 'S3']].mean())
print('Updated Dataframe:')
print(df)
Output:
Updated Dataframe:
             S1    S2    S3  S4
Subjects
Maths      10.0   5.0  15.0  21
Finance    20.0  17.0  13.0  22
History     NaN  17.0  13.0  23
Geography   NaN  29.0  11.0  25
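The same idea extends to every column at once. The following is a minimal sketch, not something the steps above require: it rebuilds the sample data under the hypothetical name df_demo and passes the per-column means of the whole frame to fillna(), which matches them to columns by label.

```python
import numpy as np
import pandas as pd

# The article's sample data, rebuilt so this runs on its own
df_demo = pd.DataFrame({
    'S1': [10, 20, np.nan, np.nan],
    'S2': [5, np.nan, np.nan, 29],
    'S3': [15, np.nan, np.nan, 11],
    'S4': [21, 22, 23, 25]},
    index=['Maths', 'Finance', 'History', 'Geography'])

# df_demo.mean() gives one mean per column; fillna() uses each
# mean only for the NaNs in the column with the same label
filled = df_demo.fillna(df_demo.mean())
print(filled)
```

Each column is filled with its own mean here, unlike the earlier example where the mean of ‘S2’ was used for the whole dataframe.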
Pandas: Replace NaNs with row mean
We can fill the NaN values with the row mean as well. Here the NaN value in the ‘History’ row will be replaced with the mean of the values in the ‘History’ row. For this we use .loc['index name'] to access a row and then use the fillna() and mean() methods. Here the ‘value’ argument contains only 1 value, i.e. the mean of the values in the ‘History’ row, and is of type ‘float’:
# Replace the NaNs in the 'History' row with the mean of that row
df.loc['History'] = df.loc['History'].fillna(value=df.loc['History'].mean())
print('Updated Dataframe:')
print(df)
Output:
Updated Dataframe:
                  S1    S2    S3    S4
Subjects
Maths      10.000000   5.0  15.0  21.0
Finance    20.000000  17.0  13.0  22.0
History    17.666667  17.0  13.0  23.0
Geography        NaN  29.0  11.0  25.0
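To fill every row with its own mean in one step, one possible approach (an assumption of this sketch, not the only way) is to transpose the frame, because fillna() aligns a Series on column labels. The names df_demo, row_means and filled_rows are illustrative, and the data is the original sample data with all its NaNs.

```python
import numpy as np
import pandas as pd

# The article's sample data, rebuilt so this runs on its own
df_demo = pd.DataFrame({
    'S1': [10, 20, np.nan, np.nan],
    'S2': [5, np.nan, np.nan, 29],
    'S3': [15, np.nan, np.nan, 11],
    'S4': [21, 22, 23, 25]},
    index=['Maths', 'Finance', 'History', 'Geography'])

# Mean of each row, ignoring NaN values
row_means = df_demo.mean(axis=1)

# After transposing, the subjects become column labels, so fillna()
# can match each row mean to its subject; transpose back afterwards
filled_rows = df_demo.T.fillna(row_means).T
print(filled_rows)
```

Note that transposing a frame with mixed integer and float columns converts all values to float.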
Conclusion:
So, these were different ways to replace NaN values in a column, row or complete dataframe with mean or average values.