In this article, we will look at different ways to get the column average/mean in the pandas DataFrame. The pandas.DataFrame.mean() function has simplified the things, however, we will cover different scenarios in this tutorial.
Table of Contents
To quickly get started, let’s create a sample dataframe for experimentation. We’ll use the pandas library with some random data, say, the sales for each store at the month level.
import pandas as pd import numpy as np # List of Tuples employees= [('Store1', 20, 10, 5, 5), ('Store2', 69, 47, 37, 32), ('Store3', 33, 100, 121, 30), ('Store4', 47, 31, 22, 22), ('Store5', 89, 90, 11, 15)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Jan', 'Feb', 'March', 'April'], index=[0, 1, 2, 3, 4]) print(df)
Contents of the created dataframe are,
Name Jan Feb March April 0 Store1 20 10 5 5 1 Store2 69 47 37 32 2 Store3 33 100 121 30 3 Store4 47 31 22 22 4 Store5 89 90 11 15
Now, let’s look at different ways in which we could calculate the average/mean of the column.
Get average for all the columns in a DataFrame
To calculate the average for all the columns, we simply need to use the pd.DataFrame.mean() method. Let’s take a look at the code below.
# average for all the columns print (df.mean(axis = 0))
Output
Frequently Asked:
Jan 51.6 Feb 55.6 March 39.2 April 20.8 dtype: float64
Here you go, we have the average of all the columns displayed using the mean() function. We have used axis=0 to get the column-wise mean, for row-wise mean, we can use axis=1.
Get average for a specific column in DataFrame
In cases, where we need to get the average of just a specific column, we will again use the pd.column_name.mean() function as shown below.
# mean for Jan column print (df.Jan.mean())
Output
51.6
Here you go, we have the mean displayed for the Jan column. If we want to include say “Feb” column in this, we can do that by subsetting both the columns together.
# mean for Jan and Feb column print (df[['Jan', 'Feb']].mean())
Output
Jan 51.6 Feb 55.6 dtype: float64
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples employees= [('Store1', 20, 10, 5, 5), ('Store2', 69, 47, 37, 32), ('Store3', 33, 100, 121, 30), ('Store4', 47, 31, 22, 22), ('Store5', 89, 90, 11, 15)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Jan', 'Feb', 'March', 'April'], index=[0, 1, 2, 3, 4]) print(df) print('** Mean of All Columns **') # average for all the columns print (df.mean(axis = 0)) print('** Mean of a Column **') # mean for Jan column print (df.Jan.mean()) print('** Mean of multiple Columns **') # mean for Jan and Feb column print (df[['Jan', 'Feb']].mean())
Output:
Name Jan Feb March April 0 Store1 20 10 5 5 1 Store2 69 47 37 32 2 Store3 33 100 121 30 3 Store4 47 31 22 22 4 Store5 89 90 11 15 ** Mean of All Columns ** Jan 51.6 Feb 55.6 March 39.2 April 20.8 dtype: float64 ** Mean of a Column ** 51.6 ** Mean of multiple Columns ** Jan 51.6 Feb 55.6 dtype: float64
Get average for column containing missing values
By default, pandas.DataFrame.mean() ignores the missing values (or NaNs) while calculating the averages across the row/column. In case, we want to change the default settings, we can use the skipna argument.
Let’s quickly add some missing values in the DataFrame to experiment.
import pandas as pd import numpy as np # List of Tuples employees= [('Store1', 20, 10, 5, np.NaN), ('Store2', 69, 47, 37, 32), ('Store3', 33, 100, np.NaN, 30), ('Store4', 47, 31, 22, 22), ('Store5', 89, 90, np.NaN, 15)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Jan', 'Feb', 'March', 'April'], index=[0, 1, 2, 3, 4]) print(df)
Output
Name Jan Feb March April 0 Store1 20 10 5.0 NaN 1 Store2 69 47 37.0 32.0 2 Store3 33 100 NaN 30.0 3 Store4 47 31 22.0 22.0 4 Store5 89 90 NaN 15.0
Let’s first try getting the average for all the columns as done previously.
# average for all the columns print (df.mean(axis = 0))
Output
Jan 51.600000 Feb 55.600000 March 21.333333 April 24.750000 dtype: float64
In case, we don’t want to ignore NAs, we will use the skipna argument as below.
# average for all the columns without ignoring NAs print (df.mean(axis = 0, skipna=False))
Output
Jan 51.6 Feb 55.6 March NaN April NaN dtype: float64
As observed, the average for the column “March” and “April” is NaN now, since it contains missing values.
The complete example is as follows,
import pandas as pd import numpy as np # List of Tuples employees= [('Store1', 20, 10, 5, np.NaN), ('Store2', 69, 47, 37, 32), ('Store3', 33, 100, np.NaN, 30), ('Store4', 47, 31, 22, 22), ('Store5', 89, 90, np.NaN, 15)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Jan', 'Feb', 'March', 'April'], index=[0, 1, 2, 3, 4]) print(df) print('** Get average of all columns and skip NaN values **') # average for all the columns print (df.mean(axis = 0)) print('** Get average of all columns without skiping NaN values **') # average for all the columns without ignoring NAs print (df.mean(axis = 0, skipna=False))
Output:
Name Jan Feb March April 0 Store1 20 10 5.0 NaN 1 Store2 69 47 37.0 32.0 2 Store3 33 100 NaN 30.0 3 Store4 47 31 22.0 22.0 4 Store5 89 90 NaN 15.0 ** Get average of all columns and skip NaN values ** Jan 51.600000 Feb 55.600000 March 21.333333 April 24.750000 dtype: float64 ** Get average of all columns without skiping NaN values ** Jan 51.6 Feb 55.6 March NaN April NaN dtype: float64
Summary
Great, you made it! In this article, we have discussed multiple ways to get the column average/mean in the pandas DataFrame. Thanks.