Pandas Tutorial #16 – DataFrame GroupBy

This tutorial will discuss the Group By functionality of DataFrames in Pandas.

Table Of Contents

The Group By mechanism in the Pandas provides a way to break a DataFrame into different groups or chunks based on the values of single or multiple columns.

Let’s understand with some examples. Assume we have a DataFrame,

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has a column ‘City’ which has three unique values like, “Delhi”, “Mumbai” and “Sydney”. We want to create different groups out of this DataFrame based on the column “City” values. As this column has only three unique values, so there will be three different groups,

Group 1 will contain all the rows for which column “City” has the value “Delhi” i.e.

     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11

Group 2 will contain all the rows for which column “City” has the value “Mumbai” i.e.

       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney

Group 3 will contain all the rows for which column “City” has the value “Sydney” i.e.

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

The groupby() method of DataFrame, gives us an iterable object of group Name and contents. We can also select individual groups too. It also provides a way to group large amounts of data and compute operations on these groups. For example, by using the GroupBy mechanism for the above DataFrame, we can get the,

  • Mean Experience of employees for each Group.
  • Mean Age of employees for each Group.
  • Maximum Experience of an employee for each Group.
  • Minimum Experience of an employee for each Group.
  • Maximum Age of an employee for each Group.
  • Minimum Age of an employee for each Group.
  • and many more things….

Let’s see some code examples,

First of all, we will create a DataFrame from a list of tuples,

import pandas as pd

# List of Tuples
empoyees = [(11, 'Jack',    44, 'Sydney',   19) ,
            (12, 'Riti',    41, 'Delhi' ,   17) ,
            (13, 'Aadi',    46, 'Mumbai',   11) ,
            (14, 'Mohit',   45, 'Delhi' ,   15) ,
            (15, 'Veena',   43, 'Delhi' ,   14) ,
            (16, 'Shaunak', 42, 'Mumbai',   17 ),
            (17, 'Manik',   42, 'Sydney',   14 ),
            (18, 'Vikas',   42, 'Delhi',   11 ),
            (19, 'Samir',   42, 'Mumbai',   15 ),
            (20, 'Shobhit', 40, 'Sydney',   12)]

# Create a DataFrame object
df = pd.DataFrame(  empoyees,
                    columns=['ID', 'Name', 'Age', 'City', 'Experience'])

df = df.set_index('ID')

# Display the DataFrame
print(df)

Output:

       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
12     Riti   41   Delhi          17
13     Aadi   46  Mumbai          11
14    Mohit   45   Delhi          15
15    Veena   43   Delhi          14
16  Shaunak   42  Mumbai          17
17    Manik   42  Sydney          14
18    Vikas   42   Delhi          11
19    Samir   42  Mumbai          15
20  Shobhit   40  Sydney          12

This DataFrame has ten rows and four columns.

DataFrame.groupby() method

DataFrame’s groupby() method accepts column names as arguments. Based on the column values, it creates several groups and returns a DataFrameGroupBy object that contains information about these groups.

For example, let’s create groups based on the column “City”,

# Create Groups based on values in column 'city'
groupObj = df.groupby('City')

print(groupObj)

Output

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>

The groupby() function created three groups because column ‘City’ has three unique values. It returned a DataFrameGroupBy object with information regarding all three groups.

Iterate over all the DataFrame Groups

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy is an iterable object. It means using a for loop, we can iterate over all the created Groups,

# Iterate over all the groups
for grpName, rows in df.groupby('City'):
    print("Group Name: ", grpName)
    print('Group Content: ')
    print(rows)

Output:

Group Name:  Delhi
Group Content: 
     Name  Age   City  Experience
ID
12   Riti   41  Delhi          17
14  Mohit   45  Delhi          15
15  Veena   43  Delhi          14
18  Vikas   42  Delhi          11
Group Name:  Mumbai
Group Content: 
       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15
Group Name:  Sydney
Group Content: 
       Name  Age    City  Experience
ID
11     Jack   44  Sydney          19
17    Manik   42  Sydney          14
20  Shobhit   40  Sydney          12

We iterated over the DataFrameGroupBy object using a for loop. For each of the groups, it returns a tuple. This tuple contains the group Name and a DataFrame containing the rows in that Group. In our example, the group names were the unique values of “City” Column i.e. “Delhi”, “Mumbai” and “Sydney”. So, basically, a group was created for each city, and the group name is the corresponding city name.

Get first row of each Group

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function first(), and it returns a DataFrame containing the first row of each of the Group. For example,

# Get first row of each group
firstRowDf = df.groupby('City').first()

print(firstRowDf)

Output:

        Name  Age  Experience
City
Delhi   Riti   41          17
Mumbai  Aadi   46          11
Sydney  Jack   44          19

There were three unique values in the column “City”, therefore 3 groups were created. The first() function fetched the first row of each of the Group and returned a DataFrame populated with that. The returned DataFrame has a row for each of the city and it is the first row from each of the city groups.

Get the count of number of DataFrame Groups

The DataFrameGroupBy object also provides a function size(), and it returns the count of rows in each of the groups created by the groupby() function. For example,

# Get the size of DataFrame groups
print(df.groupby('City').size())

Output:

Delhi     4
Mumbai    3
Sydney    3
dtype: int64

As there were three unique values in the column “City”, therefore 3 groups were created by groupby() function. The size() function returned a Series containing the count of number of rows for each of the group.

Get a specific DataFrame Group by the group name

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function get_group(). It accepts a group name as an argument and returns a DataFrame containing only rows for that group. For example, for our DataFrame, the groupby(“City”) function created three objects and returned a DataFrameGroupBy object. The group names were the unique values in column “City” i.e. “Delhi”, “Mumbai” and “Sydney”. So, let’s select the rows of Group named “Mumbai”,

# Get a specific group by the group name
specificGroup = df.groupby('City').get_group('Mumbai')

print(specificGroup)

Output:

       Name  Age    City  Experience
ID
13     Aadi   46  Mumbai          11
16  Shaunak   42  Mumbai          17
19    Samir   42  Mumbai          15

It returned a DataFrame containing all the rows for “Mumbai” group.

Statistical operations on the DataFrame GroupBy object

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function mean(). It returns the mean values of all numeric columns for each Group. For example,

# Get the mean of all numeric columns in each of the group
meanValues = df.groupby('City').mean()

print(meanValues)

Output:

              Age  Experience
City
Delhi   42.750000   14.250000
Mumbai  43.333333   14.333333
Sydney  42.000000   15.000000

Our DataFrame has two numeric columns ‘Age’ and ‘Experience’. The mean() function calculated the mean values for these columns in each of the groups and returned a DataFrame. Basically, this DataFrame contains the mean of employees’ age and Experience of employees in each of the three cities.

If we are interested in mean values of a single column only, then we can first select the column and later call the mean() function. For example,

# Get the mean Age of employees 
# in each of the three cities
meanAge = df.groupby('City')['Age'].mean()

print(meanAge)

Output:

City
Delhi     42.750000
Mumbai    43.333333
Sydney    42.000000
Name: Age, dtype: float64

It returned the mean Age of employees in each of the three cities. Apart from mean() function, the DataFrameGroupBy object also provides many other functions for aggregations.

DataFrame GroupBy and agg() method

DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function agg(), and it returns aggregated values based on the list, dict or callable objects. For example,

Let’s get the mean for all values in column “Age” and the sum of all values in column “Experience” for each of the Group created by the groupby(‘City’) method,

# Get the mean Age column values in each group and 
# Sum of Experience column values in each group
values = df.groupby('City').agg({'Age': 'mean',
                                 'Experience': 'sum'})

print(values)

Output:

              Age  Experience
City
Delhi   42.750000          57
Mumbai  43.333333          43
Sydney  42.000000          45

It returned a DataFrame, which contains the mean Age of all employees in each of the Group and some of the Experience of employees for each of the Group i.e. in each of the unique cities.

We can also provide a list of aggregation functions if we want to perform them on each of the numeric columns for each Group. For example,

# Get the size, sum and mean of values
# of each numeric column in each group
values = df.groupby('City').agg(['size', 'sum', 'mean'])

print(values)

Output:

        Age                 Experience
       size  sum       mean       size sum       mean
City
Delhi     4  171  42.750000          4  57  14.250000
Mumbai    3  130  43.333333          3  43  14.333333
Sydney    3  126  42.000000          3  45  15.000000

It returned a DataFrame, which contains the size, sum and mean of “Age” and “Experience” columns for each of the Group.

We can also pass our custom functions in the list instead of standard aggregation functions like mean() or sum() etc. For example, we want to create a category for each of the DataFrame groups. We grouped by our DataFrame based on column “City,” and now we want to take the mean Experience of each employee in these groups. If the mean of employee’s Experience in Group is greater than or equal to 15, then that Group can be considered as ‘Senior’ group; otherwise, the group’s category will be ‘Junior’. Let’s see a practical example of this,

def group_category(x):
    m = x.mean()
    return 'Seniors' if m >= 15 else 'Juniors'


# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(['mean', group_category])

print(values)

Output:

             mean group_category
City
Delhi   14.250000        Juniors
Mumbai  14.333333        Juniors
Sydney  15.000000        Seniors

The mean Experience of employees from Delhi & Mumbai is less than 15; therefore, the category for these groups is ‘Junior’. Whereas the mean Experience of employees from ‘Sydney’ is greater than or equal to 15, the category for this Group is ‘Seniors’. We can also use the lambda function instead of a normal function. For example,

# Get the mean of Experience column values in each group and 
# employee category for each group based on the mean experience value
values = df.groupby('City')['Experience'].agg(
                                    ['mean',
                                    lambda x:   'Seniors'
                                                 if x.mean() >= 15
                                                 else 'Juniors'])

print(values)

Output:

             mean <lambda_0>
City
Delhi   14.250000    Juniors
Mumbai  14.333333    Juniors
Sydney  15.000000    Seniors

The output for this is the same as the previous example. The only difference is that we used a lambda function here to get the category for each of the Group.

Summary:

We learned about the usage of groupby() method in the DataFrame.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top