CSV Files to Data Analysis
CSV files are everywhere. Every spreadsheet can export to CSV. Every database can dump to CSV. Every data export comes as CSV. If you work with data, you’ll deal with CSV files constantly.
This tutorial teaches you how to read CSV files in Python, parse them into usable data, and perform basic analysis. You’ll learn two approaches: the built-in csv module for simple tasks, and pandas for powerful data analysis.
What is CSV?
CSV stands for Comma-Separated Values. It’s a plain text format where each line represents a row, and values are separated by commas. Here’s a simple example:
name,age,city
Alice,30,London
Bob,25,Manchester
Charlie,35,Bristol
The first line is typically the header row, containing column names. The CSV module in Python’s standard library handles reading and writing these files.
Reading CSV Files with the csv Module
Python’s csv module is built into the standard library. No installation needed. Start by opening the file and creating a reader object:
import csv
with open('data.csv', 'r', newline='') as file:
reader = csv.reader(file)
# Read the header first
headers = next(reader)
print(f"Columns: {headers}")
# Read each data row
for row in reader:
print(row)
The newline='' parameter is important on Windows - it prevents extra blank lines between rows.
Reading into a List of Dictionaries
Working with column names is easier. Use csv.DictReader to get dictionaries instead of lists:
import csv
with open('data.csv', 'r', newline='') as file:
reader = csv.DictReader(file)
for row in reader:
print(f"{row['name']} is {row['age']} years old")
Each row becomes a dictionary where keys are the column headers. This makes your code more readable.
Writing CSV Files
Creating CSV files is just as straightforward:
import csv
data = [
['name', 'age', 'city'],
['Alice', '30', 'London'],
['Bob', '25', 'Manchester'],
]
with open('output.csv', 'w', newline='') as file:
writer = csv.writer(file)
writer.writerows(data)
For dictionaries, use DictWriter:
import csv
data = [
{'name': 'Alice', 'age': 30, 'city': 'London'},
{'name': 'Bob', 'age': 25, 'city': 'Manchester'},
]
with open('output.csv', 'w', newline='') as file:
fieldnames = ['name', 'age', 'city']
writer = csv.DictWriter(file, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
Introduction to Pandas for CSV Analysis
The csv module works for simple tasks. When you need real data analysis, pandas is the tool. Install it first:
pip install pandas
Pandas reads CSV files into DataFrames - essentially tables that you can query, filter, and analyze:
import pandas as pd
# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# See the first few rows
print(df.head())
# Get basic info
print(df.info())
# Calculate statistics
print(df.describe())
The describe() method gives you count, mean, std, min, max, and quartile values for numeric columns automatically.
Filtering and Selecting Data
One of pandas’ strongest features is easy filtering. Select columns by name:
import pandas as pd
df = pd.read_csv('data.csv')
# Select a single column
names = df['name']
# Select multiple columns
subset = df[['name', 'age']]
Filter rows based on conditions:
import pandas as pd
df = pd.read_csv('data.csv')
# Filter where age is greater than 30
older = df[df['age'] > 30]
# Filter by city
londoners = df[df['city'] == 'London']
# Combine conditions with & (and) | (or)
adults_in_london = df[(df['age'] >= 18) & (df['city'] == 'London')]
Sorting and Aggregating Data
Sort DataFrames by any column:
import pandas as pd
df = pd.read_csv('data.csv')
# Sort by age, ascending
by_age = df.sort_values('age')
# Sort by age, descending
by_age_desc = df.sort_values('age', ascending=False)
# Sort by multiple columns
sorted_df = df.sort_values(['city', 'age'])
Aggregate to summarize data:
import pandas as pd
df = pd.read_csv('sales.csv')
# Sum a column
total_sales = df['sales'].sum()
# Group by category and sum
by_category = df.groupby('category')['sales'].sum()
# Multiple aggregations at once
summary = df.groupby('category').agg({
'sales': ['sum', 'mean', 'count'],
'quantity': ['sum', 'mean']
})
A Complete Example
Let’s put this together. Here’s a script that analyzes sales data:
import pandas as pd
# Load the data
df = pd.read_csv('sales.csv')
print(f"Total records: {len(df)}")
print(f"Total sales: ${df['amount'].sum():,.2f}")
print(f"Average sale: ${df['amount'].mean():,.2f}")
# Top customers by total spending
top_customers = df.groupby('customer')['amount'].sum().sort_values(ascending=False)
print("\nTop customers:")
print(top_customers.head(10))
# Sales by month
df['month'] = pd.to_datetime(df['date']).dt.month
monthly = df.groupby('month')['amount'].sum()
print("\nMonthly sales:")
print(monthly)
This demonstrates the typical data analysis workflow: load, explore, summarize, and drill down into specific insights.
When to Use Which Tool
Use the csv module when:
- You’re processing simple files with predictable structure
- You need fine-grained control over parsing
- You want minimal dependencies
Use pandas when:
- You need to filter, sort, or aggregate data
- You’re working with multiple files
- You need statistical analysis
- Your data has missing values or mixed types
Summary
CSV files are the universal data format. Python gives you two ways to work with them. The csv module handles simple reading and writing with no dependencies. Pandas provides powerful data analysis with filtering, sorting, grouping, and statistics.
Start with the csv module for basic tasks. Move to pandas when you need real analysis capabilities. Both tools will serve you throughout your data work.