In this article, we discuss how to use the sum() function of a Pandas Dataframe to add up the values along either axis. We also go through each parameter of the sum() function in detail.
In Pandas, the Dataframe provides a member function sum() that returns the sum of values along the requested axis, i.e. the total of each column or the total of each row.
Let’s look at this function in more detail.
Syntax of Dataframe.sum()
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Parameters:
- axis: The axis along which the sum of values is calculated.
  - 0: Sum along the index/rows, giving one total per column.
  - 1: Sum along the columns, giving one total per row.
- skipna: bool. NaNs are skipped by default (True).
  - If True, skip NaNs while calculating the sum.
- level: int or level name. The default value is None. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0.
  - If the axis is a Multi-Index, add up the items within the given level only.
- numeric_only: bool. The default value is None.
  - If True, include only int, float, or Boolean columns.
- min_count: int. The default value is 0.
  - Compute a sum only when the number of non-NaN values is equal to or greater than min_count; otherwise the result is NaN.
Returns:
- If no level is provided, or the Dataframe has a single-level index, sum() returns a Series containing the sum of values along the given axis. If the Dataframe has a Multi-Index and a level is provided, sum() returns a Dataframe.
Let’s understand this with some examples.
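The numeric_only parameter described above is not covered by the examples in this article, so here is a minimal sketch of it. The frame mixed_df and its values are made up for illustration. Passing numeric_only=True explicitly keeps the result the same regardless of which default a given pandas version uses.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric columns
mixed_df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi'],
    'Jan': [2000, 3000, 4022],
    'Feb': [2010.0, 3022.0, 2500.0],
})

# Restrict the sum to numeric columns; 'City' is left out of the result
numeric_totals = mixed_df.sum(numeric_only=True)
print(numeric_totals)
```

The returned Series has one entry per numeric column ('Jan' and 'Feb'), and the text column 'City' does not appear in it.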
Example 1: Pandas Dataframe.sum() without any parameter
Suppose we have a Dataframe,
    import pandas as pd
    import numpy as np

    # List of tuples: (name, Jan, Feb, March, April, May)
    # np.nan marks a missing salary (the np.NaN alias was removed in NumPy 2.0)
    empSalary = [('jack',  2000, 2010,   2050,   2134, 2111),
                 ('Riti',  3000, 3022,   3456,   3111, 2109),
                 ('Aadi',  4022, np.nan, 2077,   2134, 3122),
                 ('Mohit', 3012, 3050,   2010,   2122, 1111),
                 ('Veena', 2023, 2232,   np.nan, 2112, 1099),
                 ('Shaun', 2123, 2510,   3050,   3134, 2122),
                 ('Mark',  4000, 2000,   2050,   2122, 2111)]

    # Create a DataFrame object and use the names as its index
    emp_salary_df = pd.DataFrame(empSalary,
                                 columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May'])
    emp_salary_df.set_index('Name', inplace=True)

    print('Dataframe Contents:')
    print(emp_salary_df)
If we call the sum() function on this Dataframe without an axis argument, the axis defaults to 0. The function then adds up the values along the index axis, i.e. it totals each column, and returns a Series of these totals:
    # Get the sum of values along the default axis i.e. index/rows
    result = emp_salary_df.sum()

    print('Series containing sum of values in each column:')
    print(result)
Output:
    Series containing sum of values in each column:
    Jan      20180.0
    Feb      14824.0
    March    14693.0
    April    16869.0
    May      13785.0
    dtype: float64
Because the values were summed along the index axis, i.e. down the rows, sum() returned a Series in which each value is the total of one column and the index holds the corresponding column names.
Example 2: Dataframe.sum() with axis value 1
If we pass the axis value 1, sum() adds up the values along the column axis, i.e. it totals each row, and returns a Series of these totals:
    # Get the sum of values along axis 1 i.e. columns
    result = emp_salary_df.sum(axis=1)

    print('Series containing sum of values in each row:')
    print(result)
Output:
    Series containing sum of values in each row:
    Name
    jack     10305.0
    Riti     14698.0
    Aadi     11355.0
    Mohit    11305.0
    Veena     7466.0
    Shaun    12939.0
    Mark     12283.0
    dtype: float64
Because the values were summed along axis 1, i.e. across the columns, sum() returned a Series in which each value is the total of one row and the index holds the corresponding row labels of the Dataframe.
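The axis argument can also be given by name instead of by number, which some readers find easier to remember. The sketch below uses a small made-up frame (small_df is not part of the examples in this article) to show that 'index' behaves like 0 and 'columns' behaves like 1.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
small_df = pd.DataFrame({'Jan': [2000, 3000], 'Feb': [2010, 3022]},
                        index=['jack', 'Riti'])

# axis='index' is the named form of axis=0: one total per column
column_totals = small_df.sum(axis='index')

# axis='columns' is the named form of axis=1: one total per row
row_totals = small_df.sum(axis='columns')

print(column_totals)
print(row_totals)
```

A handy way to read the argument: the named axis is the one that gets collapsed, so axis='index' collapses the rows and leaves one value per column.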
Example 3: Dataframe.sum() without skipping NaN
The default value of the skipna parameter is True, so if we call sum() without it, all NaN values are skipped. If we don’t want to skip NaNs, we can pass skipna as False:
    # Get the sum of values along the default axis (index/rows)
    # without skipping NaNs
    result = emp_salary_df.sum(skipna=False)

    print('Series containing sum of values in each column:')
    print(result)
Output:
    Series containing sum of values in each column:
    Jan      20180.0
    Feb          NaN
    March        NaN
    April    16869.0
    May      13785.0
    dtype: float64
It returned a Series containing the sum of values in each column, but any column that contains a NaN gets NaN as its total. In the example above, the ‘Feb’ and ‘March’ columns contain NaN values and skipna is False, so the sums for these columns are NaN too.
Example 4: Dataframe.sum() with min_count
If min_count is provided, sum() totals a column or row only if it contains at least min_count non-NaN values; otherwise the result for that column or row is NaN. For example,
    # Sum each column only if it has at least 7 non-NaN values
    result = emp_salary_df.sum(min_count=7)

    print('Series containing sum of values in each column:')
    print(result)
Output:
    Series containing sum of values in each column:
    Jan      20180.0
    Feb          NaN
    March        NaN
    April    16869.0
    May      13785.0
    dtype: float64
Here, the ‘Feb’ and ‘March’ columns have only 6 non-NaN values each, so they do not meet the minimum of 7. The sum for these columns was therefore not calculated, and NaN was returned in its place.
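A common use of min_count is to tell an empty column apart from a column that really sums to zero. The sketch below uses a made-up frame (gap_df) in which one column holds only NaNs: with the default min_count=0 the skipped NaNs leave an empty sum of 0.0, while min_count=1 reports NaN instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the 'Feb' column has no valid values at all
gap_df = pd.DataFrame({'Jan': [2000.0, 3000.0],
                       'Feb': [np.nan, np.nan]})

# Default min_count=0: every NaN is skipped, so 'Feb' sums to 0.0
default_totals = gap_df.sum()

# min_count=1: at least one non-NaN value is required, so 'Feb' becomes NaN
strict_totals = gap_df.sum(min_count=1)

print(default_totals)
print(strict_totals)
```

The 'Jan' total is 5000.0 in both results; only the all-NaN column changes.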
Example 5: Dataframe.sum() with a specific level in Multi-Index Dataframe
Suppose we have a Multi-Index Dataframe,
    # List of tuples: (name, city, Jan, Feb, March, April, May)
    empSalary = [('jack',  'Delhi',  2000, 2010,   2050,   2134, 2111),
                 ('Riti',  'Mumbai', 3000, 3022,   3456,   3111, 2109),
                 ('Aadi',  'Delhi',  4022, np.nan, 2077,   2134, 3122),
                 ('Mohit', 'Mumbai', 3012, 3050,   2010,   2122, 1111),
                 ('Veena', 'Delhi',  2023, 2232,   np.nan, 2112, 1099),
                 ('Shaun', 'Mumbai', 2123, 2510,   3050,   3134, 2122),
                 ('Mark',  'Mumbai', 4000, 2000,   2050,   2122, 2111)]

    # Create a DataFrame object with a two-level index (Name, City)
    emp_salary_df = pd.DataFrame(empSalary,
                                 columns=['Name', 'City', 'Jan', 'Feb', 'March', 'April', 'May'])
    emp_salary_df.set_index(['Name', 'City'], inplace=True)

    print(emp_salary_df)
Output:
                   Jan     Feb   March  April   May
    Name  City
    jack  Delhi   2000  2010.0  2050.0   2134  2111
    Riti  Mumbai  3000  3022.0  3456.0   3111  2109
    Aadi  Delhi   4022     NaN  2077.0   2134  3122
    Mohit Mumbai  3012  3050.0  2010.0   2122  1111
    Veena Delhi   2023  2232.0     NaN   2112  1099
    Shaun Mumbai  2123  2510.0  3050.0   3134  2122
    Mark  Mumbai  4000  2000.0  2050.0   2122  2111
Now, if we provide the level parameter, sum() adds up the values within that particular level only. For example,
    # Get the sum of values for the level 'City' only
    # Note: the level argument requires pandas older than 2.0, where it was removed
    df = emp_salary_df.sum(level='City')

    print('Summed up values for level "City": ')
    print(df)
Output:
    Summed up values for level "City":
              Jan      Feb    March  April   May
    City
    Delhi    8045   4242.0   4127.0   6380  6332
    Mumbai  12135  10582.0  10566.0  10489  7453
Our Multi-Index Dataframe had two index levels, ‘Name’ and ‘City’. We wanted the sum of values along the index/rows, but for one level only, ‘City’. So we passed ‘City’ as the level parameter, and sum() returned a Dataframe whose index contains the unique values of the ‘City’ level and whose columns contain the per-city totals of each original column.
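Since the level argument of sum() was deprecated in pandas 1.3 and removed in pandas 2.0, the same per-level totals can be obtained there by grouping on the index level first. The following is a minimal sketch of that alternative, not the method used in the example above. It uses a shortened, made-up version of the salary data (city_df) so that it runs on its own.

```python
import pandas as pd

# Shortened, hypothetical version of the salary data with a (Name, City) index
rows = [('jack', 'Delhi',  2000, 2134),
        ('Riti', 'Mumbai', 3000, 3111),
        ('Aadi', 'Delhi',  4022, 2134)]
city_df = pd.DataFrame(rows, columns=['Name', 'City', 'Jan', 'April'])
city_df = city_df.set_index(['Name', 'City'])

# Group the rows by the 'City' index level, then sum within each group
city_totals = city_df.groupby(level='City').sum()
print(city_totals)
```

The result is a Dataframe indexed by the unique ‘City’ values, with one total per column, which is the same shape that sum(level='City') produced in older pandas versions.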
Conclusion:
We can use Dataframe.sum() to add up the values in a Dataframe along either axis, or within a level of a Multi-Index. The other parameters of sum(), such as skipna, numeric_only, and min_count, control how NaNs and non-numeric columns are handled.