pandas groupby: Split, Apply, Combine
The pandas groupby() operation is one of the most powerful features for data analysis. It lets you split data into groups based on some criteria, apply a function to each group independently, and combine the results back together. This “split-apply-combine” pattern is fundamental to data analysis.
This guide walks you through everything you need to know about groupby in pandas.
Installing and Importing pandas
If you haven’t installed pandas yet:
pip install pandas
Import pandas in your code:
import pandas as pd
import numpy as np
Understanding the Split-Apply-Combine Pattern
The groupby operation follows three steps:
- Split — Break data into groups based on some criteria
- Apply — Run calculations on each group independently
- Combine — Merge results back into a single DataFrame
Here’s a simple example:
import pandas as pd
# Sample data
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "salary": [55000, 72000, 85000, 90000, 50000]
})
print(df)
#     department employee  salary
# 0        Sales    Alice   55000
# 1        Sales      Bob   72000
# 2  Engineering  Charlie   85000
# 3  Engineering    Diana   90000
# 4           HR      Eve   50000
# Group by department
grouped = df.groupby("department")
# Apply mean salary to each group
print(grouped["salary"].mean())
# department
# Engineering    87500.0
# HR             50000.0
# Sales          63500.0
# Name: salary, dtype: float64
Basic GroupBy Operations
Grouping by One Column
The simplest form of groupby:
import pandas as pd
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50]
})
# Group by category
grouped = df.groupby("category")
# Different aggregation functions
print(grouped["value"].sum()) # Sum per group
print(grouped["value"].mean()) # Mean per group
print(grouped["value"].count()) # Count per group
print(grouped["value"].min()) # Min per group
print(grouped["value"].max()) # Max per group
print(grouped["value"].std()) # Standard deviation
print(grouped["value"].first()) # First value
print(grouped["value"].last()) # Last value
Grouping by Multiple Columns
Pass multiple column names to group by combinations:
import pandas as pd
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "product": ["X", "Y", "X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180, 120, 190]
})
# Group by region AND product
grouped = df.groupby(["region", "product"])
print(grouped["sales"].sum())
# region  product
# North   X          220
#         Y          150
# South   X          200
#         Y          370
# Name: sales, dtype: int64
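If you would rather see the two-level result as a table, one option (sketched here as a follow-up, not something groupby requires) is to unstack the inner index level into columns:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "product": ["X", "Y", "X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180, 120, 190]
})

# Sum sales per (region, product), then pivot the product level into columns
table = df.groupby(["region", "product"])["sales"].sum().unstack()
print(table)
# product    X    Y
# region
# North    220  150
# South    200  370
```

This gives one row per region and one column per product, which is often easier to read than the stacked Series.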
Using as_index=False
Keep the grouped columns as regular columns:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})
# as_index=False keeps department as a column
result = df.groupby("department", as_index=False)["salary"].mean()
print(result)
#     department   salary
# 0  Engineering  87500.0
# 1        Sales  63500.0
The agg() Method
The agg() method lets you apply multiple aggregation functions at once:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "salary": [55000, 72000, 85000, 90000, 50000],
    "bonus": [5000, 7000, 8000, 9000, 4000]
})
# Single function as string
print(df.groupby("department")["salary"].agg("mean"))
# Multiple functions as list
print(df.groupby("department")["salary"].agg(["mean", "sum", "count"]))
# Different functions per column
print(df.groupby("department").agg({
    "salary": ["mean", "max"],
    "bonus": ["sum", "min"]
}))
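Passing a dict with lists of functions produces a two-level column index. A common follow-up (one approach among several, shown here as a sketch) is to flatten those columns into single names:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "salary": [55000, 72000, 85000, 90000, 50000],
    "bonus": [5000, 7000, 8000, 9000, 4000]
})

result = df.groupby("department").agg({
    "salary": ["mean", "max"],
    "bonus": ["sum", "min"]
})

# Columns are a MultiIndex like ("salary", "mean"); join the levels with "_"
result.columns = ["_".join(col) for col in result.columns]
print(result.columns.tolist())
# ['salary_mean', 'salary_max', 'bonus_sum', 'bonus_min']
```

Flat column names make the result easier to work with in later steps (e.g., merging or plotting).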
Named Aggregations
Create descriptive column names:
import pandas as pd
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40]
})
result = df.groupby("category").agg(
    total_value=("value", "sum"),
    average_value=("value", "mean"),
    count=("value", "count")
)
print(result)
Transform Operations
The transform() method returns a DataFrame with the same shape as the original, with values transformed within groups:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})
# Calculate mean salary within each department
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
# Calculate difference from group mean
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]
print(df)
#     department  salary  dept_avg_salary  salary_vs_dept_avg
# 0        Sales   55000          63500.0             -8500.0
# 1        Sales   72000          63500.0              8500.0
# 2  Engineering   85000          87500.0             -2500.0
# 3  Engineering   90000          87500.0              2500.0
Filter Operations
The filter() method lets you keep or remove groups based on a condition:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})
# Keep only departments with average salary > 60000
filtered = df.groupby("department").filter(lambda x: x["salary"].mean() > 60000)
print(filtered)
Apply Custom Functions
The apply() method runs any function on each group:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})
# Custom function to calculate percentage of total
def pct_of_total(group):
    total = group["salary"].sum()
    group["pct_of_total"] = group["salary"] / total * 100
    return group
result = df.groupby("department", group_keys=False).apply(pct_of_total)
print(result)
Practical Example: Normalize Within Groups
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})
# Normalize values within each category to 0-1 range
def normalize(group):
    min_val = group["value"].min()
    max_val = group["value"].max()
    group["normalized"] = (group["value"] - min_val) / (max_val - min_val)
    return group
result = df.groupby("category", group_keys=False).apply(normalize)
print(result)
Iteration Over Groups
You can iterate through groups directly:
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [55000, 72000, 85000, 90000]
})
# Iterate over groups
for dept, group in df.groupby("department"):
    print(f"\n=== {dept} ===")
    print(group)
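Related to iteration: when you only need one group's rows, get_group() retrieves them directly without looping. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [55000, 72000, 85000, 90000]
})

# Fetch a single group's rows by its key
sales = df.groupby("department").get_group("Sales")
print(sales)
#   department employee  salary
# 0      Sales    Alice   55000
# 1      Sales      Bob   72000
```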
Working with Indexes
Setting and Resetting Index
import pandas as pd
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180]
})
# Group by with MultiIndex
result = df.groupby(["region", "product"])["sales"].sum()
print(result)
print(type(result)) # Series with MultiIndex
# Reset to DataFrame
result_df = result.reset_index()
print(result_df)
Using index in groupby
import pandas as pd
df = pd.DataFrame({
    "value": [10, 20, 30, 40]
}, index=["a", "a", "b", "b"])
# Group by index
result = df.groupby(level=0)["value"].sum()
print(result)
Common Use Cases
Counting Values
import pandas as pd
df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A"]
})
# Count occurrences of each category
print(df["category"].value_counts())
# Using groupby
print(df.groupby("category").size())
Ranking Within Groups
import pandas as pd
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})
# Rank employees within department by salary
df["dept_rank"] = df.groupby("department")["salary"].rank(ascending=False)
print(df)
Cumulative Sum Within Groups
import pandas as pd
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})
# Cumulative sum within each category
df["cumsum"] = df.groupby("category")["value"].cumsum()
print(df)
Wrapping Up
The groupby operation is essential for data analysis. Remember these key points:
- Use groupby(column) to split data by one or more columns
- Apply aggregation functions like sum(), mean(), count() to get summaries
- Use agg() for multiple functions or custom names
- Use transform() when you need to return values aligned with the original DataFrame
- Use filter() to select groups based on group-level conditions
- Use apply() for arbitrary custom operations on groups
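As a quick recap, here is a small sketch that exercises several of these tools on one DataFrame (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "score": [10, 30, 20, 40, 5]
})

# agg: one summary row per team, with named columns
summary = df.groupby("team").agg(total=("score", "sum"), avg=("score", "mean"))

# transform: one value per original row, aligned with df
df["team_avg"] = df.groupby("team")["score"].transform("mean")

# filter: keep only teams whose average score exceeds 15
big_teams = df.groupby("team").filter(lambda g: g["score"].mean() > 15)

print(summary)
print(big_teams)
```

Teams A and B survive the filter (averages 20 and 30); team C (average 5) is dropped.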
From here, explore pandas DataFrames for more data manipulation techniques, and learn about data cleaning to prepare your datasets.
See Also
- pandas DataFrames Explained — Master the DataFrame structure
- Getting Started with pandas — Installation and first steps
- Data Cleaning with pandas — Clean and prepare your data