This tutorial will discuss the Group By functionality of DataFrames in Pandas.
Table Of Contents
- DataFrame.groupby() method
- Iterate over all the DataFrame Groups
- Get first row of each Group
- Get the count of number of DataFrame Groups
- Get a specific DataFrame Group by the group name
- Statistical operations on the DataFrame GroupBy object
- DataFrame GroupBy and agg() method
The Group By mechanism in the Pandas provides a way to break a DataFrame into different groups or chunks based on the values of single or multiple columns.
Let’s understand with some examples. Assume we have a DataFrame,
Name Age City Experience ID 11 Jack 44 Sydney 19 12 Riti 41 Delhi 17 13 Aadi 46 Mumbai 11 14 Mohit 45 Delhi 15 15 Veena 43 Delhi 14 16 Shaunak 42 Mumbai 17 17 Manik 42 Sydney 14 18 Vikas 42 Delhi 11 19 Samir 42 Mumbai 15 20 Shobhit 40 Sydney 12
This DataFrame has a column ‘City’ which has three unique values like, “Delhi”, “Mumbai” and “Sydney”. We want to create different groups out of this DataFrame based on the column “City” values. As this column has only three unique values, so there will be three different groups,
Group 1 will contain all the rows for which column “City” has the value “Delhi” i.e.
Name Age City Experience ID 12 Riti 41 Delhi 17 14 Mohit 45 Delhi 15 15 Veena 43 Delhi 14 18 Vikas 42 Delhi 11
Group 2 will contain all the rows for which column “City” has the value “Mumbai” i.e.
Frequently Asked:
- How to compare two columns in a pandas DataFrame?
- Export Pandas Dataframe to JSON
- Find max column value & return corresponding rows in Pandas
- Replace NaN values with empty string in Pandas
Name Age City Experience ID 13 Aadi 46 Mumbai 11 16 Shaunak 42 Mumbai 17 19 Samir 42 Mumbai 15 Group Name: Sydney
Group 3 will contain all the rows for which column “City” has the value “Sydney” i.e.
Name Age City Experience ID 11 Jack 44 Sydney 19 17 Manik 42 Sydney 14 20 Shobhit 40 Sydney 12
The groupby() method of DataFrame, gives us an iterable object of group Name and contents. We can also select individual groups too. It also provides a way to group large amounts of data and compute operations on these groups. For example, by using the GroupBy mechanism for the above DataFrame, we can get the,
- Mean Experience of employees for each Group.
- Mean Age of employees for each Group.
- Maximum Experience of an employee for each Group.
- Minimum Experience of an employee for each Group.
- Maximum Age of an employee for each Group.
- Minimum Age of an employee for each Group.
- and many more things….
Let’s see some code examples,
First of all, we will create a DataFrame from a list of tuples,
import pandas as pd # List of Tuples empoyees = [(11, 'Jack', 44, 'Sydney', 19) , (12, 'Riti', 41, 'Delhi' , 17) , (13, 'Aadi', 46, 'Mumbai', 11) , (14, 'Mohit', 45, 'Delhi' , 15) , (15, 'Veena', 43, 'Delhi' , 14) , (16, 'Shaunak', 42, 'Mumbai', 17 ), (17, 'Manik', 42, 'Sydney', 14 ), (18, 'Vikas', 42, 'Delhi', 11 ), (19, 'Samir', 42, 'Mumbai', 15 ), (20, 'Shobhit', 40, 'Sydney', 12)] # Create a DataFrame object df = pd.DataFrame( empoyees, columns=['ID', 'Name', 'Age', 'City', 'Experience']) df = df.set_index('ID') # Display the DataFrame print(df)
Output:
Name Age City Experience ID 11 Jack 44 Sydney 19 12 Riti 41 Delhi 17 13 Aadi 46 Mumbai 11 14 Mohit 45 Delhi 15 15 Veena 43 Delhi 14 16 Shaunak 42 Mumbai 17 17 Manik 42 Sydney 14 18 Vikas 42 Delhi 11 19 Samir 42 Mumbai 15 20 Shobhit 40 Sydney 12
This DataFrame has ten rows and four columns.
DataFrame.groupby() method
DataFrame’s groupby() method accepts column names as arguments. Based on the column values, it creates several groups and returns a DataFrameGroupBy object that contains information about these groups.
For example, let’s create groups based on the column “City”,
# Create Groups based on values in column 'city' groupObj = df.groupby('City') print(groupObj)
Output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002895CA14048>
The groupby() function created three groups because column ‘City’ has three unique values. It returned a DataFrameGroupBy object with information regarding all three groups.
Iterate over all the DataFrame Groups
DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy is an iterable object. It means using a for loop, we can iterate over all the created Groups,
# Iterate over all the groups for grpName, rows in df.groupby('City'): print("Group Name: ", grpName) print('Group Content: ') print(rows)
Output:
Group Name: Delhi Group Content: Name Age City Experience ID 12 Riti 41 Delhi 17 14 Mohit 45 Delhi 15 15 Veena 43 Delhi 14 18 Vikas 42 Delhi 11 Group Name: Mumbai Group Content: Name Age City Experience ID 13 Aadi 46 Mumbai 11 16 Shaunak 42 Mumbai 17 19 Samir 42 Mumbai 15 Group Name: Sydney Group Content: Name Age City Experience ID 11 Jack 44 Sydney 19 17 Manik 42 Sydney 14 20 Shobhit 40 Sydney 12
We iterated over the DataFrameGroupBy object using a for loop. For each of the groups, it returns a tuple. This tuple contains the group Name and a DataFrame containing the rows in that Group. In our example, the group names were the unique values of “City” Column i.e. “Delhi”, “Mumbai” and “Sydney”. So, basically, a group was created for each city, and the group name is the corresponding city name.
Get first row of each Group
DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function first(), and it returns a DataFrame containing the first row of each of the Group. For example,
# Get first row of each group firstRowDf = df.groupby('City').first() print(firstRowDf)
Output:
Name Age Experience City Delhi Riti 41 17 Mumbai Aadi 46 11 Sydney Jack 44 19
There were three unique values in the column “City”, therefore 3 groups were created. The first() function fetched the first row of each of the Group and returned a DataFrame populated with that. The returned DataFrame has a row for each of the city and it is the first row from each of the city groups.
Get the count of number of DataFrame Groups
The DataFrameGroupBy object also provides a function size(), and it returns the count of rows in each of the groups created by the groupby() function. For example,
# Get the size of DataFrame groups print(df.groupby('City').size())
Output:
Delhi 4 Mumbai 3 Sydney 3 dtype: int64
As there were three unique values in the column “City”, therefore 3 groups were created by groupby() function. The size() function returned a Series containing the count of number of rows for each of the group.
Get a specific DataFrame Group by the group name
DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function get_group(). It accepts a group name as an argument and returns a DataFrame containing only rows for that group. For example, for our DataFrame, the groupby(“City”) function created three objects and returned a DataFrameGroupBy object. The group names were the unique values in column “City” i.e. “Delhi”, “Mumbai” and “Sydney”. So, let’s select the rows of Group named “Mumbai”,
# Get a specific group by the group name specificGroup = df.groupby('City').get_group('Mumbai') print(specificGroup)
Output:
Name Age City Experience ID 13 Aadi 46 Mumbai 11 16 Shaunak 42 Mumbai 17 19 Samir 42 Mumbai 15
It returned a DataFrame containing all the rows for “Mumbai” group.
Statistical operations on the DataFrame GroupBy object
DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function mean(). It returns the mean values of all numeric columns for each Group. For example,
# Get the mean of all numeric columns in each of the group meanValues = df.groupby('City').mean() print(meanValues)
Output:
Age Experience City Delhi 42.750000 14.250000 Mumbai 43.333333 14.333333 Sydney 42.000000 15.000000
Our DataFrame has two numeric columns ‘Age’ and ‘Experience’. The mean() function calculated the mean values for these columns in each of the groups and returned a DataFrame. Basically, this DataFrame contains the mean of employees’ age and Experience of employees in each of the three cities.
If we are interested in mean values of a single column only, then we can first select the column and later call the mean() function. For example,
# Get the mean Age of employees # in each of the three cities meanAge = df.groupby('City')['Age'].mean() print(meanAge)
Output:
City Delhi 42.750000 Mumbai 43.333333 Sydney 42.000000 Name: Age, dtype: float64
It returned the mean Age of employees in each of the three cities. Apart from mean() function, the DataFrameGroupBy object also provides many other functions for aggregations.
DataFrame GroupBy and agg() method
DataFrame’s groupby() function returns a DataFrameGroupBy object, which contains the information of all the groups. The DataFrameGroupBy object also provides a function agg(), and it returns aggregated values based on the list, dict or callable objects. For example,
Let’s get the mean for all values in column “Age” and the sum of all values in column “Experience” for each of the Group created by the groupby(‘City’) method,
# Get the mean Age column values in each group and # Sum of Experience column values in each group values = df.groupby('City').agg({'Age': 'mean', 'Experience': 'sum'}) print(values)
Output:
Age Experience City Delhi 42.750000 57 Mumbai 43.333333 43 Sydney 42.000000 45
It returned a DataFrame, which contains the mean Age of all employees in each of the Group and some of the Experience of employees for each of the Group i.e. in each of the unique cities.
We can also provide a list of aggregation functions if we want to perform them on each of the numeric columns for each Group. For example,
# Get the size, sum and mean of values # of each numeric column in each group values = df.groupby('City').agg(['size', 'sum', 'mean']) print(values)
Output:
Age Experience size sum mean size sum mean City Delhi 4 171 42.750000 4 57 14.250000 Mumbai 3 130 43.333333 3 43 14.333333 Sydney 3 126 42.000000 3 45 15.000000
It returned a DataFrame, which contains the size, sum and mean of “Age” and “Experience” columns for each of the Group.
We can also pass our custom functions in the list instead of standard aggregation functions like mean() or sum() etc. For example, we want to create a category for each of the DataFrame groups. We grouped by our DataFrame based on column “City,” and now we want to take the mean Experience of each employee in these groups. If the mean of employee’s Experience in Group is greater than or equal to 15, then that Group can be considered as ‘Senior’ group; otherwise, the group’s category will be ‘Junior’. Let’s see a practical example of this,
def group_category(x): m = x.mean() return 'Seniors' if m >= 15 else 'Juniors' # Get the mean of Experience column values in each group and # employee category for each group based on the mean experience value values = df.groupby('City')['Experience'].agg(['mean', group_category]) print(values)
Output:
mean group_category City Delhi 14.250000 Juniors Mumbai 14.333333 Juniors Sydney 15.000000 Seniors
The mean Experience of employees from Delhi & Mumbai is less than 15; therefore, the category for these groups is ‘Junior’. Whereas the mean Experience of employees from ‘Sydney’ is greater than or equal to 15, the category for this Group is ‘Seniors’. We can also use the lambda function instead of a normal function. For example,
# Get the mean of Experience column values in each group and # employee category for each group based on the mean experience value values = df.groupby('City')['Experience'].agg( ['mean', lambda x: 'Seniors' if x.mean() >= 15 else 'Juniors']) print(values)
Output:
mean <lambda_0> City Delhi 14.250000 Juniors Mumbai 14.333333 Juniors Sydney 15.000000 Seniors
The output for this is the same as the previous example. The only difference is that we used a lambda function here to get the category for each of the Group.
Summary:
We learned about the usage of groupby() method in the DataFrame.