pandas groupby: Split, Apply, Combine

· 6 min read · Updated March 14, 2026 · intermediate
python pandas groupby data

The pandas groupby() operation is one of the most powerful features for data analysis. It lets you split data into groups based on some criteria, apply a function to each group independently, and combine the results back together. This “split-apply-combine” pattern is fundamental to data analysis.

This guide walks you through everything you need to know about groupby in pandas.

Installing and Importing pandas

If you haven’t installed pandas yet:

pip install pandas

Import pandas in your code:

import pandas as pd
import numpy as np

Understanding the Split-Apply-Combine Pattern

The groupby operation follows three steps:

  1. Split — Break data into groups based on some criteria
  2. Apply — Run calculations on each group independently
  3. Combine — Merge results back into a single DataFrame

Here’s a simple example:

import pandas as pd

# Sample data
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "salary": [55000, 72000, 85000, 90000, 50000]
})

print(df)
#     department employee  salary
# 0        Sales    Alice   55000
# 1        Sales      Bob   72000
# 2  Engineering  Charlie   85000
# 3  Engineering    Diana   90000
# 4           HR      Eve   50000

# Group by department
grouped = df.groupby("department")

# Compute the mean salary for each group
print(grouped["salary"].mean())
# department
# Engineering    87500.0
# HR             50000.0
# Sales          63500.0
# Name: salary, dtype: float64

Basic GroupBy Operations

Grouping by One Column

The simplest form of groupby:

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50]
})

# Group by category
grouped = df.groupby("category")

# Different aggregation functions
print(grouped["value"].sum())    # Sum per group
print(grouped["value"].mean())   # Mean per group
print(grouped["value"].count())  # Count per group
print(grouped["value"].min())    # Min per group
print(grouped["value"].max())    # Max per group
print(grouped["value"].std())    # Standard deviation
print(grouped["value"].first())  # First value
print(grouped["value"].last())   # Last value
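
Beyond the individual aggregators above, describe() is a quick way to see several statistics per group at once. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50],
})

# describe() produces count, mean, std, min, quartiles, and max per group
summary = df.groupby("category")["value"].describe()
print(summary)
```

Each row of the result is one group, each column one statistic, which makes it handy for a first look at grouped data.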

Grouping by Multiple Columns

Pass multiple column names to group by combinations:

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "product": ["X", "Y", "X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180, 120, 190]
})

# Group by region AND product
grouped = df.groupby(["region", "product"])
print(grouped["sales"].sum())
# region  product
# North   X          220
#         Y          150
# South   X          200
#         Y          370
# Name: sales, dtype: int64

Using as_index=False

Keep the grouped columns as regular columns:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# as_index=False keeps department as a column
result = df.groupby("department", as_index=False)["salary"].mean()
print(result)
#     department   salary
# 0  Engineering  87500.0
# 1        Sales  63500.0
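
as_index=False is roughly equivalent to aggregating first and then moving the index back into a column with reset_index(). A sketch of that alternative:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000],
})

# Aggregate normally, then turn the department index back into a column
result = df.groupby("department")["salary"].mean().reset_index()
print(result)
```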

The agg() Method

The agg() method lets you apply multiple aggregation functions at once:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "salary": [55000, 72000, 85000, 90000, 50000],
    "bonus": [5000, 7000, 8000, 9000, 4000]
})

# Single function as string
print(df.groupby("department")["salary"].agg("mean"))

# Multiple functions as list
print(df.groupby("department")["salary"].agg(["mean", "sum", "count"]))

# Different functions per column
print(df.groupby("department").agg({
    "salary": ["mean", "max"],
    "bonus": ["sum", "min"]
}))
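
When agg() is given a list of functions, the result has MultiIndex columns. One common follow-up, shown here as a sketch, is to flatten them into single-level names:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering"],
    "salary": [55000, 72000, 85000],
    "bonus": [5000, 7000, 8000],
})

result = df.groupby("department").agg({
    "salary": ["mean", "max"],
    "bonus": ["sum"],
})

# Join the two header levels, e.g. ("salary", "mean") -> "salary_mean"
result.columns = ["_".join(col) for col in result.columns]
print(result)
```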

Named Aggregations

Create descriptive column names:

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40]
})

result = df.groupby("category").agg(
    total_value=("value", "sum"),
    average_value=("value", "mean"),
    count=("value", "count")
)
print(result)

Transform Operations

The transform() method returns a result aligned with the original DataFrame's index (same number of rows), with values computed within each group. That makes it ideal for assigning group-level statistics back as new columns:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# Calculate mean salary within each department
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")

# Calculate difference from group mean
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]

print(df)
#     department  salary  dept_avg_salary  salary_vs_dept_avg
# 0        Sales   55000         63500.0             -8500.0
# 1        Sales   72000         63500.0              8500.0
# 2  Engineering   85000         87500.0             -2500.0
# 3  Engineering   90000         87500.0              2500.0
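
transform() also accepts a callable, which is useful for per-group standardization. For example, a per-department z-score:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000],
})

# Subtract each group's mean and divide by its standard deviation
df["salary_z"] = df.groupby("department")["salary"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```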

Filter Operations

The filter() method lets you keep or remove groups based on a condition:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})

# Keep only departments with average salary > 60000
filtered = df.groupby("department").filter(lambda x: x["salary"].mean() > 60000)
print(filtered)
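
The same result can be expressed with transform() and a boolean mask, an alternative sketch that keeps the group statistic vectorized instead of calling a Python function per group:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000],
})

# Broadcast each group's mean back to every row, then filter with a mask
mask = df.groupby("department")["salary"].transform("mean") > 60000
filtered = df[mask]
print(filtered)
```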

Apply Custom Functions

The apply() method runs any function on each group:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# Custom function to calculate percentage of total
def pct_of_total(group):
    total = group["salary"].sum()
    group["pct_of_total"] = group["salary"] / total * 100
    return group

result = df.groupby("department", group_keys=False).apply(pct_of_total)
print(result)
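
For this particular calculation, transform() offers a more concise alternative to apply(), since the group sums align directly with the original rows:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000],
})

# transform("sum") broadcasts each department's total to its rows
df["pct_of_total"] = df["salary"] / df.groupby("department")["salary"].transform("sum") * 100
print(df)
```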

Practical Example: Normalize Within Groups

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})

# Normalize values within each category to 0-1 range
def normalize(group):
    min_val = group["value"].min()
    max_val = group["value"].max()
    group["normalized"] = (group["value"] - min_val) / (max_val - min_val)
    return group

result = df.groupby("category", group_keys=False).apply(normalize)
print(result)

Iteration Over Groups

You can iterate through groups directly:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [55000, 72000, 85000, 90000]
})

# Iterate over groups
for dept, group in df.groupby("department"):
    print(f"\n=== {dept} ===")
    print(group)
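
If you only need one group, get_group() pulls it out directly without a loop:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [55000, 72000, 85000, 90000],
})

# Extract a single group as a regular DataFrame
sales = df.groupby("department").get_group("Sales")
print(sales)
```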

Working with Indexes

Setting and Resetting Index

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180]
})

# Group by with MultiIndex
result = df.groupby(["region", "product"])["sales"].sum()
print(result)
print(type(result))  # Series with MultiIndex

# Reset to DataFrame
result_df = result.reset_index()
print(result_df)
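
A MultiIndex Series like this can also be pivoted into a two-dimensional table with unstack():

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180],
})

# Move the inner index level (product) into columns
table = df.groupby(["region", "product"])["sales"].sum().unstack()
print(table)
```

This gives one row per region and one column per product, which is often easier to read than a stacked Series.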

Grouping by the Index

import pandas as pd

df = pd.DataFrame({
    "value": [10, 20, 30, 40]
}, index=["a", "a", "b", "b"])

# Group by index
result = df.groupby(level=0)["value"].sum()
print(result)
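
groupby() also accepts a dict or function that maps index labels to group keys. A sketch using a dict (the mapping names here are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "a", "b", "b"])

# Each index label is looked up in the dict to find its group
mapping = {"a": "first", "b": "second"}
result = df.groupby(mapping)["value"].sum()
print(result)
```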

Common Use Cases

Counting Values

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A"]
})

# Count occurrences of each category
print(df["category"].value_counts())

# Using groupby
print(df.groupby("category").size())
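
Note that size() and count() differ when missing values are present: size() counts rows, while count() counts only non-missing values. A small illustration (assuming NaN in the data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [1.0, np.nan, 3.0, 4.0],
})

# size() counts all rows per group; count() skips NaN
sizes = df.groupby("category").size()
counts = df.groupby("category")["value"].count()
print(sizes)
print(counts)
```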

Ranking Within Groups

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})

# Rank employees within department by salary
df["dept_rank"] = df.groupby("department")["salary"].rank(ascending=False)
print(df)
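
rank() supports several tie-breaking methods; for example, method="dense" gives tied values the same rank with no gaps (illustrated here with tied salaries):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "salary": [72000, 72000, 48000, 85000, 90000],
})

# Tied salaries share a rank, and the next rank is not skipped
df["dense_rank"] = df.groupby("department")["salary"].rank(
    ascending=False, method="dense"
)
print(df)
```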

Cumulative Sum Within Groups

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})

# Cumulative sum within each category
df["cumsum"] = df.groupby("category")["value"].cumsum()
print(df)
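
Related cumulative helpers include cumcount(), which numbers rows within each group, and cummax(), which tracks a running maximum. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300],
})

# cumcount() numbers rows within each group, starting at 0
df["row_in_group"] = df.groupby("category").cumcount()
# cummax() is the running maximum within each group
df["running_max"] = df.groupby("category")["value"].cummax()
print(df)
```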

Key Takeaways

The groupby operation is essential for data analysis. Remember these key points:

  • Use groupby(column) to split data by one or more columns
  • Apply aggregation functions like sum(), mean(), count() to get summaries
  • Use agg() for multiple functions or custom names
  • Use transform() when you need to return values aligned with the original DataFrame
  • Use filter() to select groups based on group-level conditions
  • Use apply() for arbitrary custom operations on groups

From here, explore pandas DataFrames for more data manipulation techniques, and learn about data cleaning to prepare your datasets.

See Also