Data Cleaning with pandas
Real-world data is messy. You will rarely download a perfectly formatted dataset ready for analysis. Missing values, inconsistent formatting, duplicate rows, and messy strings are the norm rather than the exception. Data cleaning is the process of fixing these issues, and pandas provides powerful tools to handle them efficiently.
This tutorial covers the essential data cleaning techniques you will use in almost every data analysis project. By the end, you will be able to confidently transform raw, messy data into analysis-ready form.
Setting Up
Import pandas and create a sample dirty dataset to practice with:
import pandas as pd
import numpy as np
# Sample messy data
data = {
"name": [" Alice ", "Bob", "ALICE", None, "diana"],
"email": ["alice@email.com", "bob@email.com", "alice@email.com", "invalid-email", "diana@EMAIL.COM"],
"age": [25, 30, 35, None, "unknown"],
"salary": [50000, 60000, None, 55000, 58000],
"department": ["Sales", "sales ", "SALES", "Marketing", "marketing"],
"join_date": ["2020-01-15", "2020/02/20", "2020.03.10", "2020-04-05", None]
}
df = pd.DataFrame(data)
print(df)
This small dataset contains many common data quality issues you will encounter in practice.
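Before fixing anything, it helps to get a quick overview of what you are dealing with. A short sketch using the same kind of sample data, relying on the standard info() and dtypes inspection tools:

```python
import pandas as pd

df = pd.DataFrame({
    "name": [" Alice ", "Bob", "ALICE", None, "diana"],
    "age": [25, 30, 35, None, "unknown"],
})

# info() summarizes column dtypes and non-null counts in one call
df.info()

# dtypes reveals that "age" was read as object (a mix of ints and strings)
print(df.dtypes)
```

Spotting object-typed columns that should be numeric is usually the first clue about what cleaning is needed.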
Handling Missing Data
Missing data is ubiquitous in real datasets. pandas represents missing values as NaN (Not a Number).
Detecting Missing Values
# Check for missing values in entire DataFrame
print(df.isnull())
# Count missing values per column
print(df.isnull().sum())
# Get percentage of missing values
print(df.isnull().sum() / len(df) * 100)
Dropping Missing Values
The simplest approach is to remove rows or columns with missing values:
# Drop rows with any missing values
df_clean = df.dropna()
print(f"Dropped {len(df) - len(df_clean)} rows")
# Drop rows where a specific column is missing
df_clean = df.dropna(subset=["email"])
Dropping is appropriate when missing values are rare or when the missingness carries no information. The trade-off is data loss: every dropped row also discards the valid values it contained, which can bias or weaken your analysis.
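A middle ground between dropping any row with a gap and keeping everything is dropna's thresh parameter, which keeps only rows with at least a given number of non-null values. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 4, None],
    "b": [2, None, None],
    "c": [3, 5, None],
})

# Keep rows that have at least 2 non-null values;
# the mostly-empty last row is dropped, the partial second row survives
df_thresh = df.dropna(thresh=2)
print(df_thresh)
```

This lets you discard rows that are mostly empty while preserving rows with only an occasional gap.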
Filling Missing Values
Often you want to preserve data by filling missing values with something meaningful:
# Fill with a constant value
df["age"] = df["age"].fillna(0)
# Fill with the mean (for numerical columns)
df["salary"] = df["salary"].fillna(df["salary"].mean())
# Alternatively, forward fill (propagate the previous value)
df["salary"] = df["salary"].ffill()
# Or backward fill (use the next value)
df["salary"] = df["salary"].bfill()
# Or interpolate linearly between neighboring values
df["salary"] = df["salary"].interpolate()
For categorical data, filling with the mode (most frequent value) is often sensible:
df["department"] = df["department"].fillna(df["department"].mode()[0])
Removing Duplicates
Duplicate rows can skew your analysis by giving extra weight to certain observations.
Finding Duplicates
# Check for duplicate rows
print(df.duplicated())
# Count duplicates
print(df.duplicated().sum())
Removing Duplicates
# Drop duplicate rows
df_clean = df.drop_duplicates()
# Keep the last occurrence instead of first
df_clean = df.drop_duplicates(keep="last")
# Drop duplicates based on specific columns
df_clean = df.drop_duplicates(subset=["email"])
String Cleaning
String data often has inconsistent formatting that needs standardization.
Stripping Whitespace
# Strip leading/trailing whitespace from all string columns
df["name"] = df["name"].str.strip()
df["department"] = df["department"].str.strip()
Case Normalization
# Convert to lowercase
df["name"] = df["name"].str.lower()
df["department"] = df["department"].str.lower()
# Or uppercase
df["name"] = df["name"].str.upper()
Removing Unwanted Characters
# Remove specific characters
df["department"] = df["department"].str.replace(" ", "", regex=False)
# Remove all digits
df["name"] = df["name"].str.replace(r"\d", "", regex=True)
# Remove all non-alphabetic characters
df["name"] = df["name"].str.replace(r"[^a-zA-Z]", "", regex=True)
Data Type Conversion
Ensuring correct data types prevents errors and enables proper analysis.
Converting Numerical Types
# Convert string to numeric
df["age"] = pd.to_numeric(df["age"], errors="coerce")
# Convert to integer
df["age"] = df["age"].astype("Int64")
# Convert to float
df["salary"] = df["salary"].astype("float64")
The errors="coerce" parameter converts invalid values to NaN instead of raising an error.
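A quick demonstration of that behavior on a column like the "age" one in our sample data, where one entry is the string "unknown":

```python
import pandas as pd

s = pd.Series(["25", "30", "unknown"])

# errors="coerce" turns unparseable entries into NaN
# instead of raising a ValueError
nums = pd.to_numeric(s, errors="coerce")
print(nums)
```

After coercion, the NaN can be handled with the missing-data techniques covered earlier (fillna, dropna, and so on).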
Parsing Dates
# Parse date strings
df["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")
# Extract date components
df["join_year"] = df["join_date"].dt.year
df["join_month"] = df["join_date"].dt.month
print(df)
The datetime accessor (.dt) provides numerous date-related operations.
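Beyond year and month, a few more .dt operations are worth knowing. A small sketch on two fixed dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2020-01-15", "2020-02-29"]))

# A sample of the many attributes and methods .dt exposes
print(dates.dt.day_name())      # weekday names
print(dates.dt.quarter)         # calendar quarter (1-4)
print(dates.dt.is_leap_year)    # whether the year is a leap year
```

These are vectorized, so they apply to the whole column at once without an explicit loop.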
Handling Outliers
Outliers are extreme values that can skew statistical analysis.
Detecting Outliers with IQR
# Calculate IQR
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(outliers)
# Remove outliers
df_clean = df[(df["salary"] >= lower_bound) & (df["salary"] <= upper_bound)]
Capping or Replacing Outliers
# Option 1: cap outliers at the IQR boundaries
df["salary"] = df["salary"].clip(lower=lower_bound, upper=upper_bound)
# Option 2: replace high outliers with the median
median = df["salary"].median()
df.loc[df["salary"] > upper_bound, "salary"] = median
Column and Index Operations
Renaming Columns
# Rename specific columns
df = df.rename(columns={"name": "Name", "email": "Email"})
# Rename all columns to lowercase
df.columns = df.columns.str.lower()
Resetting Index
When you drop rows, the index has gaps:
# Reset index after dropping rows
df = df.reset_index(drop=True)
Reordering Columns
# Reorder columns
df = df[["name", "email", "age", "department", "salary", "join_date"]]
Chaining Operations
pandas allows you to chain operations for clean, readable code:
df_clean = (
df
.dropna(subset=["name", "email"])
.drop_duplicates(subset=["email"])
.assign(
name=lambda x: x["name"].str.strip().str.lower(),
department=lambda x: x["department"].str.strip().str.lower(),
age=lambda x: pd.to_numeric(x["age"], errors="coerce"),
salary=lambda x: pd.to_numeric(x["salary"], errors="coerce")
)
.reset_index(drop=True)
)
print(df_clean)
This chaining pattern is common in data analysis workflows.
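When a cleaning step doesn't map onto a built-in method, .pipe lets you slot a plain function into the same chain. A minimal sketch, where normalize_names is a hypothetical helper written for this example:

```python
import pandas as pd

def normalize_names(df, col):
    # Hypothetical helper: strip whitespace and lowercase one column
    df = df.copy()
    df[col] = df[col].str.strip().str.lower()
    return df

df = pd.DataFrame({"name": [" Alice ", "BOB", "alice"]})

df_clean = (
    df
    .pipe(normalize_names, "name")   # custom step, still chainable
    .drop_duplicates()
    .reset_index(drop=True)
)
print(df_clean)
```

This keeps custom logic readable within the chain instead of breaking the pipeline into intermediate variables.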
A Complete Example
Putting it all together, here is a typical cleaning pipeline:
import pandas as pd
import numpy as np
def clean_employee_data(df):
"""Clean a DataFrame of employee data."""
# Make a copy to avoid modifying original
df = df.copy()
# Drop rows with missing essential fields
df = df.dropna(subset=["name", "email"])
# Remove duplicates
df = df.drop_duplicates(subset=["email"], keep="first")
# Clean strings
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["department"] = df["department"].str.strip().str.lower()
# Convert types
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")
# Fill missing numerical values with median
for col in ["age", "salary"]:
df[col] = df[col].fillna(df[col].median())
# Fill missing categorical values with mode
df["department"] = df["department"].fillna(df["department"].mode()[0])
# Reset index
df = df.reset_index(drop=True)
return df
# Example usage
df = pd.DataFrame({
"name": [" John ", "Jane", "JOHN", None, "Alice"],
"email": ["john@email.com", "jane@email.com", "john@email.com", "invalid", "alice@EMAIL.COM"],
"age": [25, 30, None, 35, "unknown"],
"salary": [50000, 60000, 55000, None, 58000],
"department": ["Sales", "sales ", None, "Marketing", "marketing"],
"join_date": ["2020-01-15", "2020-02-20", "2020-03-10", "2020-04-05", "2020-05-01"]
})
cleaned = clean_employee_data(df)
print(cleaned)
Common Pitfalls
Watch out for these common mistakes:
- Calling a method like dropna() without assigning the result back (most pandas methods return a new object rather than modifying in place)
- Not handling mixed-type columns before conversion
- Using mean for skewed data when median is more appropriate
- Not checking for hidden whitespace in strings
- Losing data by dropping rows without considering the impact
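The first pitfall is worth seeing in action, since it fails silently:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, None, 3]})

# Pitfall: this computes a cleaned DataFrame and throws it away
df.dropna()
print(len(df))   # still 3 rows; df is unchanged

# Fix: assign the result back
df = df.dropna()
print(len(df))   # now 2 rows
```

No error is raised in the broken version, which is exactly why this mistake is easy to miss.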
See Also
- Getting Started with pandas — Learn pandas fundamentals
- Getting Started with NumPy — Understand NumPy arrays
- CSV Files to Data Analysis — Load and analyze CSV data