Processing CSV Files in Python
CSV (Comma-Separated Values) files are everywhere. They are the lowest common denominator for data exchange — spreadsheets, databases, and APIs all speak CSV. Python gives you multiple ways to work with them, from the simple csv module to the powerful pandas library.
The CSV Format
A CSV file is plain text with rows separated by newlines and columns separated by commas:
name,age,city
Alice,30,New York
Bob,25,San Francisco
Charlie,35,Chicago
But CSVs are trickier than they look. What if a value contains a comma? The standard solution is wrapping fields in quotes:
name,city,interest
Alice,New York,"reading, coding"
Bob,Chicago,sports
Python’s csv module handles these edge cases automatically.
Reading CSV Files with the csv Module
The csv module is part of Python’s standard library:
import csv
with open("data.csv", "r", newline="") as f:
reader = csv.reader(f)
for row in reader:
print(row)
Output:
['name', 'age', 'city']
['Alice', '30', 'New York']
['Bob', '25', 'San Francisco']
Always use newline="" when opening CSV files. It prevents issues with line endings across different operating systems.
Reading CSV Files into Dictionaries
Reading into dictionaries gives you named fields instead of positional indices:
import csv
with open("data.csv", "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
print(f"{row['name']} is {row['age']} years old")
Output:
Alice is 30 years old
Bob is 25 years old
The first row is automatically used as the header. This makes your code more readable and robust.
Writing CSV Files
Writing is just as straightforward:
import csv
data = [
["name", "age", "city"],
["Alice", 30, "New York"],
["Bob", 25, "San Francisco"]
]
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(data)
For dictionary-based data:
import csv
data = [
{"name": "Alice", "age": 30, "city": "New York"},
{"name": "Bob", "age": 25, "city": "San Francisco"}
]
fieldnames = ["name", "age", "city"]
with open("output.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
Handling Different Delimiters
CSV files do not always use commas. Tab-separated files (TSV) are common:
import csv
with open("data.tsv", "r", newline="") as f:
reader = csv.reader(f, delimiter="\t")
for row in reader:
print(row)
Other delimiters include:
- Pipe (
|) - Semicolon (
;) - Single tab for TSV files
Reading Large Files
Loading a massive CSV into memory can crash your program. Process it line by line instead:
import csv
total = 0
count = 0
with open("large_file.csv", "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
total += int(row["amount"])
count += 1
print(f"Average: {total / count}")
This approach uses constant memory regardless of file size.
Using Pandas for CSV Files
For analysis and data manipulation, pandas is the standard choice:
import pandas as pd
# Read CSV into a DataFrame
df = pd.read_csv("data.csv")
print(df.head())
print(df.shape)
Pandas handles millions of rows, missing values, and automatic type inference.
Selecting Columns
import pandas as pd
df = pd.read_csv("data.csv")
# Single column
names = df["name"]
# Multiple columns
subset = df[["name", "age"]]
Filtering Rows
import pandas as pd
df = pd.read_csv("data.csv")
# Filter by condition
adults = df[df["age"] >= 18]
# Multiple conditions
young_in_ny = df[(df["age"] < 30) & (df["city"] == "New York")]
Grouping and Aggregating
import pandas as pd
df = pd.read_csv("sales.csv")
# Group by city and calculate mean sales
city_stats = df.groupby("city")["sales"].mean()
# Multiple aggregations
summary = df.groupby("category").agg({
"sales": ["sum", "mean", "count"],
"quantity": "sum"
})
Writing to CSV
import pandas as pd
df = pd.read_csv("input.csv")
# Filter and transform
result = df[df["active"] == True]
# Write to CSV
result.to_csv("output.csv", index=False)
The index=False prevents pandas from writing row indices as a column.
Common Pitfalls
Missing newline parameter
# Wrong - causes double newlines on Windows
with open("data.csv", "w") as f:
writer = csv.writer(f)
writer.writerow(["a", "b"])
# Correct
with open("data.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["a", "b"])
Not handling missing values
import pandas as pd
# Default behavior: missing values become NaN
df = pd.read_csv("data.csv")
# Fill missing values
df = df.fillna(0)
# Drop rows with missing values
df = df.dropna()
Encoding issues
import pandas as pd
# Specify encoding if the file is not UTF-8
df = pd.read_csv("data.csv", encoding="latin-1")
# Common encodings: utf-8, latin-1, cp1252, iso-8859-1
Quick Reference
| Task | csv Module | Pandas |
|---|---|---|
| Read file | csv.reader(f) | pd.read_csv() |
| Write file | csv.writer(f) | df.to_csv() |
| Read as dicts | csv.DictReader(f) | pd.read_csv() then to_dict() |
| Large files | Line-by-line processing | chunksize parameter |
| Filtering | Manual loops | df[df[col] > val] |
See Also
- csv-module — Complete reference for the csv module
- reading-json — Working with JSON files
- pandas-getting-started — Introduction to pandas