Get statistics for each group using Pandas GroupBy

In most of the data analysis processes, we need to slice and dice in multiple ways to build data understanding and generate insights. Here, the groupby() function comes in handy to slice the data and generate statistical summary such as mean, count, etc. for other columns. In this article, we will try to cover the different ways to get statistics for each group using the groupby function.

Table of Contents

Intrduction

To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Tech', 'India', 25, 5),
            ('Riti', 'Tech', 'USA', 28 ,   7),
            ('Shanky', 'Business', 'USA', 36 ,   2),
            ('Shreya', 'Business', 'India', 25 ,   2),
            ('Aadi', 'Tech', 'USA', 30, 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Team', 'Country' ,'Age', 'Experience'],
                  index=[0, 1, 2, 3, 4])
print(df)

Contents of the created dataframe are,

      Name      Team Country  Age  Experience
0  Shubham      Tech   India   25           5
1     Riti      Tech     USA   28           7
2   Shanky  Business     USA   36           2
3   Shreya  Business   India   25           2
4     Aadi      Tech     USA   30          11

Now, let’s look at different ways in which we could generate statistics for each group using the groupby function.

Get statistics for each group based on a single column

Let’s start with the basics to get the statistics of a single group. For instance, we need to get the average experience for each team. Here, we would need to groupby on “Team” column and summarize the “Experience” column.

# get avg. experience by team
print (df.groupby("Team")["Experience"].mean())

Output

Team
Business    2.000000
Tech        7.666667
Name: Experience, dtype: float64

The output shows the average experience of the Business team is 2 years and the Tech team is 7.6 years. Now instead of the mean, we could get all the statistics using the describe() method as shown below.

# get avg. experience by team
print (df.groupby("Team")["Experience"].describe())

Output

          count      mean      std  min  25%  50%  75%   max
Team                                                        
Business    2.0  2.000000  0.00000  2.0  2.0  2.0  2.0   2.0
Tech        3.0  7.666667  3.05505  5.0  6.0  7.0  9.0  11.0

Here, we have got the basic distribution of the Experience column by Team. This is very useful in comparing the groups by mean, median and different percentiles to understand the data.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Tech', 'India', 25, 5),
            ('Riti', 'Tech', 'USA', 28 ,   7),
            ('Shanky', 'Business', 'USA', 36 ,   2),
            ('Shreya', 'Business', 'India', 25 ,   2),
            ('Aadi', 'Tech', 'USA', 30, 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Team', 'Country' ,'Age', 'Experience'],
                  index=[0, 1, 2, 3, 4])
print(df)

# get avg. experience by team
print (df.groupby("Team")["Experience"].mean())

# get avg. experience by team
print (df.groupby("Team")["Experience"].describe())

Output:

      Name      Team Country  Age  Experience
0  Shubham      Tech   India   25           5
1     Riti      Tech     USA   28           7
2   Shanky  Business     USA   36           2
3   Shreya  Business   India   25           2
4     Aadi      Tech     USA   30          11
Team
Business    2.000000
Tech        7.666667
Name: Experience, dtype: float64
          count      mean      std  min  25%  50%  75%   max
Team                                                        
Business    2.0  2.000000  0.00000  2.0  2.0  2.0  2.0   2.0
Tech        3.0  7.666667  3.05505  5.0  6.0  7.0  9.0  11.0

Get statistics for each group based on multiple columns

For the data analysis purpose, we need to sometimes slice the data by multiple groups. To achieve that, we simply need to mention the column names as list in the groupby method. For instance, let’s group the DataFrame by Team and Country and summarize the Experience across groups.

# get avg. experience by team
print (df.groupby(["Team", "Country"])["Experience"].describe())

Output

                  count  mean       std  min  25%  50%   75%   max
Team     Country                                                  
Business India      1.0   2.0       NaN  2.0  2.0  2.0   2.0   2.0
         USA        1.0   2.0       NaN  2.0  2.0  2.0   2.0   2.0
Tech     India      1.0   5.0       NaN  5.0  5.0  5.0   5.0   5.0
         USA        2.0   9.0  2.828427  7.0  8.0  9.0  10.0  11.0

Note that the std is coming out as NaN since we don’t have more than one data point across that group.

Now instead of summarizing a single column, we want to summarize statistics for multiple columns. To achieve that, we can include the columns in a list as shown below.

# get avg. experience and age by team
print (df.groupby("Team")[["Experience", "Age"]].mean())

Output

          Experience        Age
Team                           
Business    2.000000  30.500000
Tech        7.666667  27.666667

We have the average of both Experience and Age columns by the Team groups.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Tech', 'India', 25, 5),
            ('Riti', 'Tech', 'USA', 28 ,   7),
            ('Shanky', 'Business', 'USA', 36 ,   2),
            ('Shreya', 'Business', 'India', 25 ,   2),
            ('Aadi', 'Tech', 'USA', 30, 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Team', 'Country' ,'Age', 'Experience'],
                  index=[0, 1, 2, 3, 4])
print(df)

# get avg. experience by team
print (df.groupby(["Team", "Country"])["Experience"].describe())

# get avg. experience and age by team
print (df.groupby("Team")[["Experience", "Age"]].mean())

Output:

      Name      Team Country  Age  Experience
0  Shubham      Tech   India   25           5
1     Riti      Tech     USA   28           7
2   Shanky  Business     USA   36           2
3   Shreya  Business   India   25           2
4     Aadi      Tech     USA   30          11

                  count  mean       std  min  25%  50%   75%   max
Team     Country                                                  
Business India      1.0   2.0       NaN  2.0  2.0  2.0   2.0   2.0
         USA        1.0   2.0       NaN  2.0  2.0  2.0   2.0   2.0
Tech     India      1.0   5.0       NaN  5.0  5.0  5.0   5.0   5.0
         USA        2.0   9.0  2.828427  7.0  8.0  9.0  10.0  11.0

          Experience        Age
Team                           
Business    2.000000  30.500000
Tech        7.666667  27.666667

Get statistics for each group using multiple aggregations

The above example to include multiple columns was simple since the aggregation we wanted to do was same for both the columns. But in case, let’s say, we wanted to take the average of Experience but the median of the Age column by Team. In such cases, the agg function comes in handy.

# get avg. experience and median age by team
print (df.groupby("Team").agg({
    'Experience':'mean',
    'Age': 'median'
}))

Output

          Experience   Age
Team                      
Business    2.000000  30.5
Tech        7.666667  28.0

We could also use multiple aggregations for the same column. For example, let’s calculate the both mean and medium for the Experience column.

# get avg. and median experience, and median age by team
print (df.groupby("Team").agg({
    'Experience':['mean', 'median'],
    'Age': 'median'
}))

Output

         Experience           Age
               mean median median
Team                             
Business   2.000000      2   30.5
Tech       7.666667      7   28.0

The output is a multi-index column header, which we could convert back to DataFrame in case we need to analyze further.

The complete example is as follows,

import pandas as pd

# List of Tuples
employees = [('Shubham', 'Tech', 'India', 25, 5),
            ('Riti', 'Tech', 'USA', 28 ,   7),
            ('Shanky', 'Business', 'USA', 36 ,   2),
            ('Shreya', 'Business', 'India', 25 ,   2),
            ('Aadi', 'Tech', 'USA', 30, 11)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Team', 'Country' ,'Age', 'Experience'],
                  index=[0, 1, 2, 3, 4])
print(df)

# get avg. experience and median age by team
print (df.groupby("Team").agg({
    'Experience':'mean',
    'Age': 'median'
}))

# get avg. and median experience, and median age by team
print (df.groupby("Team").agg({
    'Experience':['mean', 'median'],
    'Age': 'median'
}))

Output:

      Name      Team Country  Age  Experience
0  Shubham      Tech   India   25           5
1     Riti      Tech     USA   28           7
2   Shanky  Business     USA   36           2
3   Shreya  Business   India   25           2
4     Aadi      Tech     USA   30          11
          Experience   Age
Team                      
Business    2.000000  30.5
Tech        7.666667  28.0
         Experience           Age
               mean median median
Team                             
Business   2.000000    2.0   30.5
Tech       7.666667    7.0   28.0

Get statistics for each group using the apply() method

The apply method is also similar to the agg method, but the apply function provides more flexibility to write custom functions to get the statistics by each group. For example, let’s create a custom function to get the mean of Experience for each Team group.

# using the apply method
def get_mean(x):
    return x.mean()

print (df.groupby('Team')['Experience'].apply(get_mean))

Output

Team
Business    2.000000
Tech        7.666667
Name: Experience, dtype: float64

We could define more complicated functions to calculate statistics that are not readily available as an agg function.

Get statistics for each group using the transform() method

Another interesting way to calculate statistics for each group is using the transform method. The advantage of using this method calculates the statistic across each group and then populates it back to all the rows in the order of the original row indices. This is extremely useful when we want to store the statistic in the same DataFrame, as it avoids an additional step to join the statistics back to the original DataFrame.

For example, let’s again take the average of Experience column by Team and store it as a new column “avg_age_by_team” in the DataFrame.

# using the transform method
df['avg_age_by_team'] = df.groupby('Team')['Age'].transform('mean')

print (df)

Output

      Name      Team Country  Age  Experience  avg_age_by_team
0  Shubham      Tech   India   25           5        27.666667
1     Riti      Tech     USA   28           7        27.666667
2   Shanky  Business     USA   36           2        30.500000
3   Shreya  Business   India   25           2        30.500000
4     Aadi      Tech     USA   30          11        27.666667

Summary

Great, you made it! In this article, we have discussed multiple ways to get the statistics of each group using pandas Groupby. Thanks.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top