Data Cleaning with pandas

· 5 min read · Updated March 13, 2026 · intermediate
pandas data-cleaning data-analysis dataframes intermediate

Real-world data is messy. You will rarely download a perfectly formatted dataset ready for analysis. Missing values, inconsistent formatting, duplicate rows, and messy strings are the norm rather than the exception. Data cleaning is the process of fixing these issues, and pandas provides powerful tools to handle them efficiently.

This tutorial covers the essential data cleaning techniques you will use in almost every data analysis project. By the end, you will be able to confidently transform raw, messy data into analysis-ready form.

Setting Up

Import pandas and create a sample dirty dataset to practice with:

import pandas as pd
import numpy as np

# Sample messy data
data = {
    "name": ["  Alice  ", "Bob", "ALICE", None, "diana"],
    "email": ["alice@email.com", "bob@email.com", "alice@email.com", "invalid-email", "diana@EMAIL.COM"],
    "age": [25, 30, 35, None, "unknown"],
    "salary": [50000, 60000, None, 55000, 58000],
    "department": ["Sales", "sales ", "SALES", "Marketing", "marketing"],
    "join_date": ["2020-01-15", "2020/02/20", "2020.03.10", "2020-04-05", None]
}

df = pd.DataFrame(data)
print(df)

This small dataset contains many common data quality issues you will encounter in practice.
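Before cleaning, it helps to see how these issues show up in pandas itself. A quick inspection (a minimal sketch using a subset of the columns above) reveals that mixed types force columns to the generic object dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, None, "unknown"],
    "salary": [50000, 60000, None, 55000, 58000],
})

# Mixed numbers and strings force "age" to dtype object
print(df.dtypes)

# info() summarizes non-null counts and dtypes in one view
df.info()
```

Columns that should be numeric but show up as object are usually the first sign of cleaning work ahead.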

Handling Missing Data

Missing data is ubiquitous in real datasets. pandas represents missing values as NaN (Not a Number) in numeric columns, NaT in datetime columns, and None in object columns; the detection methods below handle all of them.

Detecting Missing Values

# Check for missing values in entire DataFrame
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())

# Get percentage of missing values
print(df.isnull().sum() / len(df) * 100)

Dropping Missing Values

The simplest approach is to remove rows or columns with missing values:

# Drop rows with any missing values
df_clean = df.dropna()
print(f"Dropped {len(df) - len(df_clean)} rows")

# Drop rows where a specific column is missing
df_clean = df.dropna(subset=["email"])

Dropping is appropriate when missing values are rare or when the missingness carries no information. However, every dropped row discards real data from the remaining columns, which can shrink your sample and bias your analysis.
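dropna also accepts how and thresh parameters for finer-grained control over when a row is dropped; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, None, None],
    "b": [2, None, 3],
    "c": [4, None, None],
})

# how="all" drops only rows where every value is missing
kept_all = df.dropna(how="all")
print(len(kept_all))  # 2

# thresh=2 keeps rows with at least 2 non-null values
kept_thresh = df.dropna(thresh=2)
print(len(kept_thresh))  # 1
```

thresh is useful when a row with one or two gaps is still worth keeping but a mostly-empty row is not.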

Filling Missing Values

Often you want to preserve data by filling missing values with something meaningful. The examples below are alternative strategies; pick the one that suits your data rather than applying them in sequence:

# Fill with a constant value
df["age"] = df["age"].fillna(0)

# Fill with the mean (for numerical columns)
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Fill with forward fill (use previous value)
df["salary"] = df["salary"].ffill()

# Fill with backward fill (use next value)
df["salary"] = df["salary"].bfill()

# Fill with interpolation
df["salary"] = df["salary"].interpolate()

For categorical data, filling with the mode (most frequent value) is often sensible:

df["department"] = df["department"].fillna(df["department"].mode()[0])

Removing Duplicates

Duplicate rows can skew your analysis by giving extra weight to certain observations.

Finding Duplicates

# Check for duplicate rows
print(df.duplicated())

# Count duplicates
print(df.duplicated().sum())

Removing Duplicates

# Drop duplicate rows
df_clean = df.drop_duplicates()

# Keep the last occurrence instead of first
df_clean = df.drop_duplicates(keep="last")

# Drop duplicates based on specific columns
df_clean = df.drop_duplicates(subset=["email"])
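Note that duplicates are matched on exact values, so "alice@email.com" and "ALICE@EMAIL.COM" count as distinct rows. One approach (a sketch) is to normalize case before deduplicating:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["alice@email.com", "ALICE@EMAIL.COM", "bob@email.com"],
})

# Normalize case first so duplicates match regardless of capitalization
df["email"] = df["email"].str.lower()
deduped = df.drop_duplicates(subset=["email"])
print(len(deduped))  # 2
```

This is one reason string cleaning (covered next) often comes before deduplication in a pipeline.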

String Cleaning

String data often has inconsistent formatting that needs standardization.

Stripping Whitespace

# Strip leading/trailing whitespace from string columns
df["name"] = df["name"].str.strip()
df["department"] = df["department"].str.strip()

Case Normalization

# Convert to lowercase
df["name"] = df["name"].str.lower()
df["department"] = df["department"].str.lower()

# Or uppercase
df["name"] = df["name"].str.upper()

Removing Unwanted Characters

# Remove specific characters
df["department"] = df["department"].str.replace(" ", "", regex=False)

# Remove all digits
df["name"] = df["name"].str.replace(r"\d", "", regex=True)

# Remove all non-alphabetic characters
df["name"] = df["name"].str.replace(r"[^a-zA-Z]", "", regex=True)
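The sample data also contains an invalid email ("invalid-email"). A simple pattern check with str.match can flag such rows; the regex below is intentionally minimal (something@something.something), not a full email validator:

```python
import pandas as pd

emails = pd.Series(["alice@email.com", "invalid-email", "bob@email.com"])

# Flag values that don't look like a basic email address
valid = emails.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print(valid.tolist())  # [True, False, True]
```

You can then drop or quarantine the flagged rows, e.g. emails[valid].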

Data Type Conversion

Ensuring correct data types prevents errors and enables proper analysis.

Converting Numerical Types

# Convert string to numeric
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Convert to integer
df["age"] = df["age"].astype("Int64")

# Convert to float
df["salary"] = df["salary"].astype("float64")

The errors="coerce" parameter converts invalid values to NaN instead of raising an error.
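A quick illustration of this coercion on a standalone Series:

```python
import pandas as pd

s = pd.Series(["25", "30", "unknown"])

# With errors="coerce", unparseable strings become NaN
numeric = pd.to_numeric(s, errors="coerce")
print(numeric)
```

The parseable values convert cleanly while "unknown" becomes NaN, which you can then fill or drop with the techniques above.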

Parsing Dates

# Parse date strings; the sample mixes formats ("2020/02/20", "2020.03.10"),
# so pandas 2.x needs format="mixed" to infer the format per element
df["join_date"] = pd.to_datetime(df["join_date"], format="mixed", errors="coerce")

# Extract date components
df["join_year"] = df["join_date"].dt.year
df["join_month"] = df["join_date"].dt.month

print(df)

The datetime accessor (.dt) provides numerous date-related operations.

Handling Outliers

Outliers are extreme values that can skew statistical analysis.

Detecting Outliers with IQR

# Calculate IQR
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(outliers)

# Remove outliers
df_clean = df[(df["salary"] >= lower_bound) & (df["salary"] <= upper_bound)]

Capping and Replacing Outliers

# Cap outliers at boundaries
df["salary"] = df["salary"].clip(lower=lower_bound, upper=upper_bound)

# Replace outliers with median
median = df["salary"].median()
df.loc[df["salary"] > upper_bound, "salary"] = median

Column and Index Operations

Renaming Columns

# Rename specific columns
df = df.rename(columns={"name": "Name", "email": "Email"})

# Rename all columns to lowercase
df.columns = df.columns.str.lower()

Resetting Index

When you drop rows, the index has gaps:

# Reset index after dropping rows
df = df.reset_index(drop=True)

Reordering Columns

# Reorder columns
df = df[["name", "email", "age", "department", "salary", "join_date"]]

Chaining Operations

pandas allows you to chain operations for clean, readable code:

df_clean = (
    df
    .dropna(subset=["name", "email"])
    .drop_duplicates(subset=["email"])
    .assign(
        name=lambda x: x["name"].str.strip().str.lower(),
        department=lambda x: x["department"].str.strip().str.lower(),
        age=lambda x: pd.to_numeric(x["age"], errors="coerce"),
        salary=lambda x: pd.to_numeric(x["salary"], errors="coerce")
    )
    .reset_index(drop=True)
)

print(df_clean)

This chaining pattern is common in data analysis workflows.
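For steps that don't map onto a single built-in method, .pipe lets you slot a custom function into the same chain. Here standardize_strings is a hypothetical helper written for this sketch, not a pandas built-in:

```python
import pandas as pd

def standardize_strings(df, cols):
    """Strip whitespace and lowercase the given columns."""
    df = df.copy()
    for col in cols:
        df[col] = df[col].str.strip().str.lower()
    return df

df = pd.DataFrame({"name": ["  Alice  ", "BOB"], "dept": ["Sales ", "SALES"]})

clean = (
    df
    .pipe(standardize_strings, cols=["name", "dept"])
    .drop_duplicates(subset=["dept"])
)
print(clean)
```

.pipe keeps the whole pipeline readable top-to-bottom instead of nesting function calls around the DataFrame.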

A Complete Example

Putting it all together, here is a typical cleaning pipeline:

import pandas as pd
import numpy as np

def clean_employee_data(df):
    """Clean a DataFrame of employee data."""
    
    # Make a copy to avoid modifying original
    df = df.copy()
    
    # Drop rows with missing essential fields
    df = df.dropna(subset=["name", "email"])
    
    # Remove duplicates
    df = df.drop_duplicates(subset=["email"], keep="first")
    
    # Clean strings
    df["name"] = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    df["department"] = df["department"].str.strip().str.lower()
    
    # Convert types
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
    df["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")
    
    # Fill missing numerical values with median
    for col in ["age", "salary"]:
        df[col] = df[col].fillna(df[col].median())
    
    # Fill missing categorical values with mode
    df["department"] = df["department"].fillna(df["department"].mode()[0])
    
    # Reset index
    df = df.reset_index(drop=True)
    
    return df

# Example usage
df = pd.DataFrame({
    "name": ["  John  ", "Jane", "JOHN", None, "Alice"],
    "email": ["john@email.com", "jane@email.com", "john@email.com", "invalid", "alice@EMAIL.COM"],
    "age": [25, 30, None, 35, "unknown"],
    "salary": [50000, 60000, 55000, None, 58000],
    "department": ["Sales", "sales ", None, "Marketing", "marketing"],
    "join_date": ["2020-01-15", "2020-02-20", "2020-03-10", "2020-04-05", "2020-05-01"]
})

cleaned = clean_employee_data(df)
print(cleaned)

Common Pitfalls

Watch out for these common mistakes:

  1. Calling methods like dropna() or fillna() without assigning the result (they return a new DataFrame by default)
  2. Not handling mixed-type columns before conversion
  3. Using mean for skewed data when median is more appropriate
  4. Not checking for hidden whitespace in strings
  5. Losing data by dropping rows without considering the impact

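The first pitfall is worth seeing concretely: most pandas methods return a new object and leave the original untouched unless you assign the result back:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3]})

# Pitfall: the result of dropna() is discarded, df is unchanged
df.dropna()
print(len(df))  # 3

# Fix: assign the result back
df = df.dropna()
print(len(df))  # 2
```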
See Also