Build a CSV Data Analyzer

· 6 min read · Updated March 25, 2026 · beginner

Python’s csv module is part of the standard library, which means you can start analyzing CSV files right now without installing anything. This guide walks through reading, filtering, aggregating, and writing CSV data step by step.

Reading CSV Files

The csv module ships with Python, so importing it is all you need to get started:

import csv

There are two main ways to read a CSV. The first uses csv.reader(), which gives you a list for each row:

import csv
from io import StringIO

# Simulated CSV data — StringIO lets you run this without a real file
csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
# output: ['name', 'score', 'grade']
# output: ['Alice', '95', 'A']
# output: ['Bob', '78', 'C']
# output: ['Carol', '88', 'B']

The positional index makes you write row[0], row[1] everywhere, which gets old fast. A better approach is csv.DictReader(), which uses the header row to give each column a name:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], "scored", row['score'])
# output: Alice scored 95
# output: Bob scored 78
# output: Carol scored 88

DictReader cuts down on mistakes. Use it by default.

Why newline='' Matters

When you open a CSV file, always pass newline='':

with open('data.csv', 'r', newline='', encoding='utf-8') as f:

Without it, the csv module may produce phantom blank rows on Windows. The module does its own newline translation internally, so you need to hand it a raw file handle.

Filtering Rows

Once you have rows, you probably want to filter them. A list comprehension with a condition handles most cases:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = csv.DictReader(f)
    high_scorers = [row for row in reader if int(row['score']) >= 85]

for row in high_scorers:
    print(row['name'], row['score'])
# output: Alice 95
# output: Carol 88

The int() conversion is necessary because the csv module always returns strings, even for numeric columns. Converting once inside the filter condition is cleaner than scattering conversions throughout your code.

If a row might have a missing or malformed value, wrap the conversion in a try/except:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,,C\nCarol,invalid,B\n"

with StringIO(csv_data) as f:
    reader = csv.DictReader(f)
    valid = []
    for row in reader:
        try:
            score = int(row['score'])
            if score >= 85:
                valid.append(row)
        except (ValueError, TypeError):
            continue  # '' or 'invalid' raises ValueError; a short row gives None, raising TypeError

for row in valid:
    print(row['name'], row['score'])
# output: Alice 95

Computing Aggregates

Aggregation is where CSV analysis gets useful. You can compute sums, averages, min, and max with simple loops:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\nDave,73,D\n"

with StringIO(csv_data) as f:
    reader = csv.DictReader(f)
    scores = []
    for row in reader:
        scores.append(int(row['score']))

print(f"Count:  {len(scores)}")
print(f"Sum:    {sum(scores)}")
print(f"Average: {sum(scores) / len(scores):.1f}")
print(f"Min:    {min(scores)}")
print(f"Max:    {max(scores)}")
# output: Count:  4
# output: Sum:    334
# output: Average: 83.5
# output: Min:    73
# output: Max:    95
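
Per-group aggregates, the stdlib counterpart to a groupby, take one extra dictionary. This sketch (using the same illustrative data) collects scores into lists keyed by grade, then averages each group:

```python
import csv
from collections import defaultdict
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\nDave,73,D\n"

by_grade = defaultdict(list)  # grade -> list of scores
with StringIO(csv_data) as f:
    for row in csv.DictReader(f):
        by_grade[row['grade']].append(int(row['score']))

for grade in sorted(by_grade):
    scores = by_grade[grade]
    print(f"{grade}: {sum(scores) / len(scores):.1f}")
# output: A: 95.0
# output: B: 88.0
# output: C: 78.0
# output: D: 73.0
```

defaultdict(list) saves the "create the list if the key is new" boilerplate; a plain dict with setdefault works just as well.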

You can also find the row with the highest value by combining max() with a key function:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = list(csv.DictReader(f))

top_row = max(reader, key=lambda r: int(r['score']))
print(f"Top scorer: {top_row['name']} with {top_row['score']}")
# output: Top scorer: Alice with 95

Note that list(reader) loads the entire file into memory. For files larger than your available RAM, iterate one row at a time instead:

# Memory-efficient: never loads the whole file at once
import csv

running_total = 0
count = 0
with open('data.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        running_total += int(row['score'])
        count += 1

print(f"Average: {running_total / count:.1f}")

Sorting Results

Python’s built-in sorted() works on the list of rows. Use the key parameter to specify which column to sort by:

import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = list(csv.DictReader(f))

by_score = sorted(reader, key=lambda r: int(r['score']), reverse=True)

for row in by_score:
    print(row['name'], row['score'])
# output: Alice 95
# output: Carol 88
# output: Bob 78

Sort ascending by name instead:

by_name = sorted(reader, key=lambda r: r['name'])
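
Sorting on two keys works the same way: return a tuple from the key function. This sketch (with an extra illustrative row) sorts by grade ascending, then score descending within each grade — negating the score reverses the order for a numeric column:

```python
import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nDan,91,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    rows = list(csv.DictReader(f))

# grade ascending, then score descending within each grade
by_grade_then_score = sorted(rows, key=lambda r: (r['grade'], -int(r['score'])))

for row in by_grade_then_score:
    print(row['grade'], row['name'], row['score'])
# output: A Alice 95
# output: A Dan 91
# output: B Carol 88
# output: C Bob 78
```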

Writing CSV Output

When you find what you’re looking for, you often want to save it. Use csv.writer() for lists or csv.DictWriter() for dictionaries:

import csv
from io import StringIO

output = StringIO()
writer = csv.writer(output)

writer.writerow(['name', 'score', 'grade'])   # header
writer.writerow(['Alice', 95, 'A'])            # single row
writer.writerows([['Bob', 78, 'C'], ['Carol', 88, 'B']])  # multiple rows

print(output.getvalue())
# output: name,score,grade
# output: Alice,95,A
# output: Bob,78,C
# output: Carol,88,B

For writing rows that came from a DictReader, use DictWriter:

import csv
from io import StringIO

fieldnames = ['name', 'score', 'grade']

output = StringIO()
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({'name': 'Alice', 'score': 95, 'grade': 'A'})
writer.writerow({'name': 'Carol', 'score': 88, 'grade': 'B'})

print(output.getvalue())
# output: name,score,grade
# output: Alice,95,A
# output: Carol,88,B

When writing to a real file, pass newline='' to open() for both writers. The StringIO examples above do not need it, but be careful not to forget it when you switch to disk files.
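
Reading, filtering, and writing compose naturally into a small pipeline. This sketch reads with DictReader, keeps the high scorers, and writes them back out with DictWriter, reusing the input header; StringIO stands in for real files:

```python
import csv
from io import StringIO

csv_data = "name,score,grade\nAlice,95,A\nBob,78,C\nCarol,88,B\n"

with StringIO(csv_data) as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames  # reuse the input header for the output
    high_scorers = [row for row in reader if int(row['score']) >= 85]

output = StringIO()
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(high_scorers)

print(output.getvalue())
# output: name,score,grade
# output: Alice,95,A
# output: Carol,88,B
```

Swap the two StringIO objects for open(...) calls (with newline='') and the same code processes files on disk.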

Handling Encoding

Always specify the encoding explicitly. UTF-8 is the standard for most modern CSV files:

with open('data.csv', 'r', encoding='utf-8', newline='') as f:

If a file came from Excel on Windows, it might have a BOM. Use utf-8-sig in that case:

with open('data.csv', 'r', encoding='utf-8-sig', newline='') as f:

If you encounter UnicodeDecodeError with UTF-8, the file is probably in an older 8-bit encoding such as latin-1 (or its Windows cousin, cp1252). Fall back to latin-1:

with open('data.csv', 'r', encoding='latin-1', newline='') as f:
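
When the encoding is unknown, one common pattern is to try candidates in order of likelihood and fall back on decode failure. A minimal sketch; the helper name and the encoding list are assumptions, so adjust them to your data sources:

```python
import csv

def read_rows(path, encodings=('utf-8-sig', 'utf-8', 'latin-1')):
    """Try each encoding in turn; return rows from the first that decodes."""
    for enc in encodings:
        try:
            with open(path, 'r', newline='', encoding=enc) as f:
                return list(csv.DictReader(f))
        except UnicodeDecodeError:
            continue  # this encoding failed; try the next one
    raise ValueError(f"could not decode {path} with {encodings}")
```

Note that latin-1 maps every possible byte to a character, so it never raises UnicodeDecodeError: keep it last, because it always "succeeds" even when the decoded text is wrong.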

When to Use Pandas Instead

The csv module handles most everyday CSV tasks without dependencies. But for large files (roughly 50 MB and up), or when you need operations like grouping, merging, or pivot tables, pandas is a better fit:

import pandas as pd

df = pd.read_csv('data.csv')
df[df['score'] > 80]                     # filter
df.groupby('grade')['score'].mean()      # groupby aggregate
df.sort_values('score', ascending=False) # sort
df.to_csv('output.csv', index=False)     # write

pandas loads the entire file into memory as a DataFrame, which makes those high-level operations convenient but costly for huge files. For quick scripts and moderate-sized data, the csv module is all you need.

Common Mistakes

No newline='' when writing. On Windows, open('out.csv', 'w') produces double-spaced rows. Pass newline=''.

Forgetting type conversion. Every value from csv.reader or DictReader is a string. Call int(row['score']) before doing any math.

Loading giant files into memory. Calling list(reader) on a 2 GB CSV will exhaust your RAM. Iterate through rows one at a time instead.

Manual comma splitting. Never do row = line.split(','). The csv module handles quoted fields with embedded commas correctly. split(',') does not.
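
The quoting pitfall is easy to demonstrate. Given a quoted field containing a comma, csv.reader parses three fields while str.split produces four broken pieces:

```python
import csv
from io import StringIO

line = '"Smith, John",95,A\n'

# csv.reader respects the quotes: three fields
with StringIO(line) as f:
    print(next(csv.reader(f)))
# output: ['Smith, John', '95', 'A']

# naive splitting breaks the quoted field apart
print(line.strip().split(','))
# output: ['"Smith', ' John"', '95', 'A']
```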

Conclusion

The csv module gives you everything you need to read, filter, aggregate, and write tabular data without installing any packages. DictReader keeps column access readable. with open() plus newline='' prevents cross-platform newline headaches. For anything beyond basic analysis, pandas is a natural next step.
