In most of the data analysis processes, we need to slice and dice in multiple ways to build data understanding and generate insights. Here, the groupby() function comes in handy to slice the data and generate statistical summary such as mean, count, etc. for other columns. In this article, we will try to cover the different ways to get statistics for each group using the groupby function.
Table of Contents
- Intrduction
- Get statistics for each group based on a single column
- Get statistics for each group based on multiple columns
- Get statistics for each group using multiple aggregations
- Get statistics for each group using the apply() method
- Get statistics for each group using the transform() method
- Summary
Intrduction
To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.
import pandas as pd # List of Tuples employees = [('Shubham', 'Tech', 'India', 25, 5), ('Riti', 'Tech', 'USA', 28 , 7), ('Shanky', 'Business', 'USA', 36 , 2), ('Shreya', 'Business', 'India', 25 , 2), ('Aadi', 'Tech', 'USA', 30, 11)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Team', 'Country' ,'Age', 'Experience'], index=[0, 1, 2, 3, 4]) print(df)
Contents of the created dataframe are,
Name Team Country Age Experience 0 Shubham Tech India 25 5 1 Riti Tech USA 28 7 2 Shanky Business USA 36 2 3 Shreya Business India 25 2 4 Aadi Tech USA 30 11
Now, let’s look at different ways in which we could generate statistics for each group using the groupby function.
Get statistics for each group based on a single column
Let’s start with the basics to get the statistics of a single group. For instance, we need to get the average experience for each team. Here, we would need to groupby on “Team” column and summarize the “Experience” column.
# get avg. experience by team print (df.groupby("Team")["Experience"].mean())
Output
Team Business 2.000000 Tech 7.666667 Name: Experience, dtype: float64
The output shows the average experience of the Business team is 2 years and the Tech team is 7.6 years. Now instead of the mean, we could get all the statistics using the describe() method as shown below.
# get avg. experience by team print (df.groupby("Team")["Experience"].describe())
Output
count mean std min 25% 50% 75% max Team Business 2.0 2.000000 0.00000 2.0 2.0 2.0 2.0 2.0 Tech 3.0 7.666667 3.05505 5.0 6.0 7.0 9.0 11.0
Here, we have got the basic distribution of the Experience column by Team. This is very useful in comparing the groups by mean, median and different percentiles to understand the data.
The complete example is as follows,
import pandas as pd # List of Tuples employees = [('Shubham', 'Tech', 'India', 25, 5), ('Riti', 'Tech', 'USA', 28 , 7), ('Shanky', 'Business', 'USA', 36 , 2), ('Shreya', 'Business', 'India', 25 , 2), ('Aadi', 'Tech', 'USA', 30, 11)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Team', 'Country' ,'Age', 'Experience'], index=[0, 1, 2, 3, 4]) print(df) # get avg. experience by team print (df.groupby("Team")["Experience"].mean()) # get avg. experience by team print (df.groupby("Team")["Experience"].describe())
Output:
Name Team Country Age Experience 0 Shubham Tech India 25 5 1 Riti Tech USA 28 7 2 Shanky Business USA 36 2 3 Shreya Business India 25 2 4 Aadi Tech USA 30 11 Team Business 2.000000 Tech 7.666667 Name: Experience, dtype: float64 count mean std min 25% 50% 75% max Team Business 2.0 2.000000 0.00000 2.0 2.0 2.0 2.0 2.0 Tech 3.0 7.666667 3.05505 5.0 6.0 7.0 9.0 11.0
Get statistics for each group based on multiple columns
For the data analysis purpose, we need to sometimes slice the data by multiple groups. To achieve that, we simply need to mention the column names as list in the groupby method. For instance, let’s group the DataFrame by Team and Country and summarize the Experience across groups.
# get avg. experience by team print (df.groupby(["Team", "Country"])["Experience"].describe())
Output
count mean std min 25% 50% 75% max Team Country Business India 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 USA 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 Tech India 1.0 5.0 NaN 5.0 5.0 5.0 5.0 5.0 USA 2.0 9.0 2.828427 7.0 8.0 9.0 10.0 11.0
Note that the std is coming out as NaN since we don’t have more than one data point across that group.
Now instead of summarizing a single column, we want to summarize statistics for multiple columns. To achieve that, we can include the columns in a list as shown below.
# get avg. experience and age by team print (df.groupby("Team")[["Experience", "Age"]].mean())
Output
Experience Age Team Business 2.000000 30.500000 Tech 7.666667 27.666667
We have the average of both Experience and Age columns by the Team groups.
The complete example is as follows,
import pandas as pd # List of Tuples employees = [('Shubham', 'Tech', 'India', 25, 5), ('Riti', 'Tech', 'USA', 28 , 7), ('Shanky', 'Business', 'USA', 36 , 2), ('Shreya', 'Business', 'India', 25 , 2), ('Aadi', 'Tech', 'USA', 30, 11)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Team', 'Country' ,'Age', 'Experience'], index=[0, 1, 2, 3, 4]) print(df) # get avg. experience by team print (df.groupby(["Team", "Country"])["Experience"].describe()) # get avg. experience and age by team print (df.groupby("Team")[["Experience", "Age"]].mean())
Output:
Name Team Country Age Experience 0 Shubham Tech India 25 5 1 Riti Tech USA 28 7 2 Shanky Business USA 36 2 3 Shreya Business India 25 2 4 Aadi Tech USA 30 11 count mean std min 25% 50% 75% max Team Country Business India 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 USA 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 Tech India 1.0 5.0 NaN 5.0 5.0 5.0 5.0 5.0 USA 2.0 9.0 2.828427 7.0 8.0 9.0 10.0 11.0 Experience Age Team Business 2.000000 30.500000 Tech 7.666667 27.666667
Get statistics for each group using multiple aggregations
The above example to include multiple columns was simple since the aggregation we wanted to do was same for both the columns. But in case, let’s say, we wanted to take the average of Experience but the median of the Age column by Team. In such cases, the agg function comes in handy.
# get avg. experience and median age by team print (df.groupby("Team").agg({ 'Experience':'mean', 'Age': 'median' }))
Output
Experience Age Team Business 2.000000 30.5 Tech 7.666667 28.0
We could also use multiple aggregations for the same column. For example, let’s calculate the both mean and medium for the Experience column.
# get avg. and median experience, and median age by team print (df.groupby("Team").agg({ 'Experience':['mean', 'median'], 'Age': 'median' }))
Output
Experience Age mean median median Team Business 2.000000 2 30.5 Tech 7.666667 7 28.0
The output is a multi-index column header, which we could convert back to DataFrame in case we need to analyze further.
The complete example is as follows,
import pandas as pd # List of Tuples employees = [('Shubham', 'Tech', 'India', 25, 5), ('Riti', 'Tech', 'USA', 28 , 7), ('Shanky', 'Business', 'USA', 36 , 2), ('Shreya', 'Business', 'India', 25 , 2), ('Aadi', 'Tech', 'USA', 30, 11)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Team', 'Country' ,'Age', 'Experience'], index=[0, 1, 2, 3, 4]) print(df) # get avg. experience and median age by team print (df.groupby("Team").agg({ 'Experience':'mean', 'Age': 'median' })) # get avg. and median experience, and median age by team print (df.groupby("Team").agg({ 'Experience':['mean', 'median'], 'Age': 'median' }))
Output:
Name Team Country Age Experience 0 Shubham Tech India 25 5 1 Riti Tech USA 28 7 2 Shanky Business USA 36 2 3 Shreya Business India 25 2 4 Aadi Tech USA 30 11 Experience Age Team Business 2.000000 30.5 Tech 7.666667 28.0 Experience Age mean median median Team Business 2.000000 2.0 30.5 Tech 7.666667 7.0 28.0
Get statistics for each group using the apply() method
The apply method is also similar to the agg method, but the apply function provides more flexibility to write custom functions to get the statistics by each group. For example, let’s create a custom function to get the mean of Experience for each Team group.
# using the apply method def get_mean(x): return x.mean() print (df.groupby('Team')['Experience'].apply(get_mean))
Output
Team Business 2.000000 Tech 7.666667 Name: Experience, dtype: float64
We could define more complicated functions to calculate statistics that are not readily available as an agg function.
Get statistics for each group using the transform() method
Another interesting way to calculate statistics for each group is using the transform method. The advantage of using this method calculates the statistic across each group and then populates it back to all the rows in the order of the original row indices. This is extremely useful when we want to store the statistic in the same DataFrame, as it avoids an additional step to join the statistics back to the original DataFrame.
For example, let’s again take the average of Experience column by Team and store it as a new column “avg_age_by_team” in the DataFrame.
# using the transform method df['avg_age_by_team'] = df.groupby('Team')['Age'].transform('mean') print (df)
Output
Name Team Country Age Experience avg_age_by_team 0 Shubham Tech India 25 5 27.666667 1 Riti Tech USA 28 7 27.666667 2 Shanky Business USA 36 2 30.500000 3 Shreya Business India 25 2 30.500000 4 Aadi Tech USA 30 11 27.666667
Summary
Great, you made it! In this article, we have discussed multiple ways to get the statistics of each group using pandas Groupby. Thanks.