pyguides

pandas groupby: Split, Apply, Combine

The pandas groupby() operation is one of the most powerful features for data analysis. It lets you split data into groups based on some criteria, apply a function to each group independently, and combine the results back together. This “split-apply-combine” pattern is fundamental to data analysis.

This guide walks you through everything you need to know about groupby in pandas.

Installing and Importing pandas

If you haven’t installed pandas yet:

pip install pandas

Import pandas in your code:

import pandas as pd
import numpy as np

Understanding the Split-Apply-Combine Pattern

The groupby operation follows three steps:

  1. Split — Break data into groups based on some criteria
  2. Apply — Run calculations on each group independently
  3. Combine — Merge results back into a single DataFrame

Here’s a simple example:

import pandas as pd

# Sample data
df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "salary": [55000, 72000, 85000, 90000, 50000]
})

print(df)
#     department employee  salary
# 0        Sales    Alice   55000
# 1        Sales      Bob   72000
# 2  Engineering  Charlie   85000
# 3  Engineering    Diana   90000
# 4           HR      Eve   50000

# Group by department
grouped = df.groupby("department")

# Apply mean salary to each group
print(grouped["salary"].mean())
# department
# Engineering    87500.0
# HR             50000.0
# Sales          63500.0
# Name: salary, dtype: float64

Basic GroupBy Operations

Grouping by One Column

The simplest form of groupby:

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "value": [10, 20, 30, 40, 50]
})

# Group by category
grouped = df.groupby("category")

# Different aggregation functions
print(grouped["value"].sum())    # Sum per group
print(grouped["value"].mean())   # Mean per group
print(grouped["value"].count())  # Count per group
print(grouped["value"].min())    # Min per group
print(grouped["value"].max())    # Max per group
print(grouped["value"].std())    # Standard deviation
print(grouped["value"].first())  # First value
print(grouped["value"].last())   # Last value

Grouping by Multiple Columns

Pass multiple column names to group by combinations:

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "product": ["X", "Y", "X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180, 120, 190]
})

# Group by region AND product
grouped = df.groupby(["region", "product"])
print(grouped["sales"].sum())
# region  product
# North   X          220
#         Y          150
# South   X          200
#         Y          370
# Name: sales, dtype: int64

Using as_index=False

Keep the grouped columns as regular columns:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# as_index=False keeps department as a column
result = df.groupby("department", as_index=False)["salary"].mean()
print(result)
#     department    salary
# 0  Engineering  87500.0
# 1        Sales  63500.0

The agg() Method

The agg() method lets you apply multiple aggregation functions at once:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "salary": [55000, 72000, 85000, 90000, 50000],
    "bonus": [5000, 7000, 8000, 9000, 4000]
})

# Single function as string
print(df.groupby("department")["salary"].agg("mean"))

# Multiple functions as list
print(df.groupby("department")["salary"].agg(["mean", "sum", "count"]))

# Different functions per column
print(df.groupby("department").agg({
    "salary": ["mean", "max"],
    "bonus": ["sum", "min"]
}))

Named Aggregations

Create descriptive column names:

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40]
})

result = df.groupby("category").agg(
    total_value=("value", "sum"),
    average_value=("value", "mean"),
    count=("value", "count")
)
print(result)

Transform Operations

The transform() method returns a DataFrame with the same shape as the original, with values transformed within groups:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# Calculate mean salary within each department
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")

# Calculate difference from group mean
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]

print(df)
#     department  salary  dept_avg_salary  salary_vs_dept_avg
# 0        Sales   55000         63500.0             -8500.0
# 1        Sales   72000         63500.0              8500.0
# 2  Engineering   85000         87500.0             -2500.0
# 3  Engineering   90000         87500.0              2500.0

Filter Operations

The filter() method lets you keep or remove groups based on a condition:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})

# Keep only departments with average salary > 60000
filtered = df.groupby("department").filter(lambda x: x["salary"].mean() > 60000)
print(filtered)

Apply Custom Functions

The apply() method runs any function on each group:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [55000, 72000, 85000, 90000]
})

# Custom function to calculate percentage of total
def pct_of_total(group):
    total = group["salary"].sum()
    group["pct_of_total"] = group["salary"] / total * 100
    return group

result = df.groupby("department", group_keys=False).apply(pct_of_total)
print(result)

Practical Example: Normalize Within Groups

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})

# Normalize values within each category to 0-1 range
def normalize(group):
    min_val = group["value"].min()
    max_val = group["value"].max()
    group["normalized"] = (group["value"] - min_val) / (max_val - min_val)
    return group

result = df.groupby("category", group_keys=False).apply(normalize)
print(result)

Iteration Over Groups

You can iterate through groups directly:

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [55000, 72000, 85000, 90000]
})

# Iterate over groups
for dept, group in df.groupby("department"):
    print(f"\n=== {dept} ===")
    print(group)

Working with Indexes

Setting and Resetting Index

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["X", "Y", "X", "Y"],
    "sales": [100, 150, 200, 180]
})

# Group by with MultiIndex
result = df.groupby(["region", "product"])["sales"].sum()
print(result)
print(type(result))  # Series with MultiIndex

# Reset to DataFrame
result_df = result.reset_index()
print(result_df)

Using index in groupby

import pandas as pd

df = pd.DataFrame({
    "value": [10, 20, 30, 40]
}, index=["a", "a", "b", "b"])

# Group by index
result = df.groupby(level=0)["value"].sum()
print(result)

Common Use Cases

Counting Values

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A"]
})

# Count occurrences of each category
print(df["category"].value_counts())

# Using groupby
print(df.groupby("category").size())

Ranking Within Groups

import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [55000, 72000, 48000, 85000, 90000]
})

# Rank employees within department by salary
df["dept_rank"] = df.groupby("department")["salary"].rank(ascending=False)
print(df)

Cumulative Sum Within Groups

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 30, 100, 200, 300]
})

# Cumulative sum within each category
df["cumsum"] = df.groupby("category")["value"].cumsum()
print(df)

Getting Started

The groupby operation is essential for data analysis. Remember these key points:

  • Use groupby(column) to split data by one or more columns
  • Apply aggregation functions like sum(), mean(), count() to get summaries
  • Use agg() for multiple functions or custom names
  • Use transform() when you need to return values aligned with the original DataFrame
  • Use filter() to select groups based on group-level conditions
  • Use apply() for arbitrary custom operations on groups

From here, explore pandas DataFrames for more data manipulation techniques, and learn about data cleaning to prepare your datasets.

See Also