pandas Intro: DataFrames and Series
pandas is the standard library for working with tabular data in Python. It provides two core structures: the DataFrame, a table with labeled rows and columns, and the Series, a single column of values with an index. If you work with CSV files, databases, or spreadsheet data, pandas is the usual place to start.
Installing pandas
pip install pandas
Creating a DataFrame
The most common way to create a DataFrame is from a dictionary:
import pandas as pd
data = {
"name": ["Alice", "Bob", "Carol"],
"score": [92, 87, 95],
"city": ["New York", "London", "Tokyo"]
}
df = pd.DataFrame(data)
print(df)
    name  score      city
0  Alice     92  New York
1    Bob     87    London
2  Carol     95     Tokyo
The leftmost values (0, 1, 2…) are the index: row labels that pandas generates by default, not a data column. Each column is a Series.
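To check the structure described above, you can inspect the DataFrame's shape, column labels, and index. This is a minimal sketch that rebuilds the same data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

print(df.shape)                           # (3, 3): three rows, three columns
print(list(df.columns))                   # ['name', 'score', 'city']
print(list(df.index))                     # [0, 1, 2]
print(isinstance(df["score"], pd.Series)) # True: a single column is a Series
```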
What is a Series?
A Series is a single column of data with an index:
scores = pd.Series([92, 87, 95], index=["Alice", "Bob", "Carol"])
print(scores)
Alice    92
Bob      87
Carol    95
dtype: int64
A DataFrame behaves like a dictionary of Series, one per column, all sharing the same index.
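One way to see the relationship is to build a DataFrame from Series directly. In this sketch the `cities` Series and the `people` name are made up for illustration; the scores match the example above:

```python
import pandas as pd

labels = ["Alice", "Bob", "Carol"]
scores = pd.Series([92, 87, 95], index=labels)
cities = pd.Series(["New York", "London", "Tokyo"], index=labels)  # hypothetical

# Each dictionary key becomes a column label; the Series share one index.
people = pd.DataFrame({"score": scores, "city": cities})

print(list(people.columns))         # ['score', 'city']
print(people.loc["Bob", "score"])   # 87
print(people.loc["Carol", "city"])  # Tokyo
```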
Selecting Columns
Use bracket notation to grab a column:
df["name"] # returns a Series
df[["name", "score"]] # returns a DataFrame
Passing a list of column names always returns a DataFrame, even when the list holds a single name.
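The difference between a string key and a list key is easy to verify. A short sketch, using a two-row version of the example data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [92, 87]})

as_series = df["score"]     # a string key gives a Series
as_frame = df[["score"]]    # a one-element list still gives a DataFrame

print(as_series.ndim)   # 1
print(as_frame.shape)   # (2, 1)
```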
Selecting Rows
By position with .iloc
.iloc selects by integer position:
df.iloc[0] # first row
df.iloc[1:3] # second and third rows (end position 3 is excluded)
By label with .loc
.loc selects by index label:
df.loc[0] # first row (label 0)
df.loc[1:2] # rows with labels 1 and 2 (both ends included)
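With the default integer index, .iloc and .loc can look interchangeable. The difference is clearer when the labels are not positions. This sketch assumes a hypothetical index of lowercase names:

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [92, 87, 95]},
    index=["alice", "bob", "carol"],   # hypothetical string labels
)

by_position = df.iloc[0:2]           # positions 0 and 1; the end is excluded
by_label = df.loc["alice":"carol"]   # labels alice through carol; the end is included

print(list(by_position.index))    # ['alice', 'bob']
print(list(by_label.index))       # ['alice', 'bob', 'carol']
print(df.loc["carol", "score"])   # 95
```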
With boolean conditions
df[df["score"] > 90] # rows where score > 90
df[df["city"] == "London"] # rows where city == London
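Conditions can be combined with & (and) and | (or). A minimal sketch on the same data; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

# Parentheses are required: & and | bind more tightly than comparisons.
high_not_tokyo = df[(df["score"] > 90) & (df["city"] != "Tokyo")]
london_or_top = df[(df["city"] == "London") | (df["score"] >= 95)]

print(list(high_not_tokyo["name"]))   # ['Alice']
print(list(london_or_top["name"]))    # ['Bob', 'Carol']
```

Python's `and` and `or` keywords do not work here, because they try to reduce a whole Series to a single True or False.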
Adding and Modifying Columns
df["pass"] = df["score"] >= 90 # boolean column
df["score_adj"] = df["score"] * 1.1 # curved score
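Assigning to a column name creates the column if it does not exist and overwrites it if it does. Either way, the DataFrame is modified in place.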
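If you would rather leave the original DataFrame untouched, assign() returns a new DataFrame with the extra column. In this sketch the column is called `passed` rather than `pass`, because `pass` is a Python keyword and cannot be used as a keyword argument:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "score": [92, 87, 95]})

# assign() builds a new DataFrame; df itself keeps its two columns.
with_pass = df.assign(passed=df["score"] >= 90)

print(list(df.columns))            # ['name', 'score']
print(list(with_pass.columns))     # ['name', 'score', 'passed']
print(list(with_pass["passed"]))   # [True, False, True]
```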
Handling Missing Data
Real data has gaps. pandas uses NaN (Not a Number) to represent missing values:
import numpy as np
df = pd.DataFrame({
"a": [1, 2, np.nan, 4],
"b": [5, np.nan, 7, 8]
})
df.isnull() # True where a value is missing, False elsewhere
df.dropna() # returns a copy without rows that contain any NaN
df.fillna(0) # returns a copy with NaN replaced by 0
df["a"].fillna(df["a"].mean()) # returns column a with NaN replaced by its mean
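None of these calls change df; each returns a new object, so assign the result if you want to keep it. A short sketch on the same data shows what each call produces:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [5, np.nan, 7, 8],
})

print(int(df.isnull().sum().sum()))   # 2 missing values in total
print(len(df.dropna()))               # 2 rows have no NaN

# The mean skips NaN: (1 + 2 + 4) / 3
filled = df["a"].fillna(df["a"].mean())
print(round(float(filled.iloc[2]), 2))   # 2.33

print(int(df["a"].isnull().sum()))    # 1: the original column is unchanged
```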
Reading from CSV
Loading a CSV file is one of the most common pandas operations:
df = pd.read_csv("sales.csv")
# Or from a URL
df = pd.read_csv("https://example.com/data.csv")
read_csv infers column types and reads the header row by default. Use head() to preview:
df.head() # first 5 rows
df.tail() # last 5 rows
df.info() # column types and memory usage
df.describe() # summary stats for numeric columns
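To try read_csv without a file on disk, you can pass it a file-like object. The CSV text below is invented for illustration and merely stands in for a file such as sales.csv:

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line is the header row.
csv_text = """region,units,price
north,10,2.5
south,4,3.0
east,7,2.75
"""

sales = pd.read_csv(io.StringIO(csv_text))

print(sales.shape)                # (3, 3)
print(list(sales.columns))        # ['region', 'units', 'price']
print(int(sales["units"].sum()))  # 21: units was inferred as an integer column
print(sales.head(2))              # first two rows
```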
Writing to CSV
df.to_csv("output.csv", index=False) # index=False skips the row index
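The effect of index=False is easiest to see in the header line. When to_csv is called without a path it returns the CSV text as a string, which this sketch uses to compare the two forms:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [92, 87]})

with_index = df.to_csv()               # no path: returns the CSV as a string
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])      # ,name,score  (unnamed index column first)
print(without_index.splitlines()[0])   # name,score

# Reading the index-free text back recovers the same columns and values.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip["score"].tolist())    # [92, 87]
```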
Basic Operations
These examples assume df is the name/score/city DataFrame from the first section:
df["score"].mean() # average score
df["score"].median() # median score
df["score"].min() # minimum
df["score"].max() # maximum
df["score"].std() # standard deviation
df["score"].sum() # total
df["score"].value_counts() # frequency of each value
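value_counts is most useful on columns with repeated values; on the three distinct scores above, every count would be 1. This sketch uses a made-up Series of city names instead:

```python
import pandas as pd

# Hypothetical data with repeats.
cities = pd.Series(["London", "Tokyo", "London", "New York", "London"])

counts = cities.value_counts()      # sorted with the most frequent value first
print(counts.index[0])              # London
print(int(counts.iloc[0]))          # 3

shares = cities.value_counts(normalize=True)   # fractions instead of counts
print(float(shares["London"]))      # 0.6
```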
Sorting
df.sort_values("score") # ascending by score
df.sort_values("score", ascending=False) # descending
df.sort_index() # by row index
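sort_values also accepts a list of columns, which is handy for breaking ties. In this sketch a fourth row ("Dan") is added purely for illustration so that two scores are equal:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],   # "Dan" is a hypothetical extra row
    "score": [92, 87, 95, 87],
})

# Highest score first; ties are ordered alphabetically by name.
ranked = df.sort_values(["score", "name"], ascending=[False, True])

print(list(ranked["name"]))   # ['Carol', 'Alice', 'Bob', 'Dan']
print(list(ranked.index))     # [2, 0, 1, 3]: original labels travel with the rows
print(list(df["name"]))       # ['Alice', 'Bob', 'Carol', 'Dan']: df is unchanged
```

Sorting returns a new DataFrame, so assign the result if you want to keep the new order.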
Grouping and Aggregating
Split data into groups, then compute statistics:
df = pd.DataFrame({
"team": ["A", "B", "A", "B"],
"points": [10, 20, 15, 25]
})
df.groupby("team").sum()
#       points
# team
# A         25
# B         45
df.groupby("team").mean()
#       points
# team
# A       12.5
# B       22.5
Other aggregation functions: min, max, count, std, var.
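Several of these can be computed in one pass by handing agg a list of function names. A minimal sketch on the same team data:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [10, 20, 15, 25],
})

# One row per team, one column per aggregation.
summary = df.groupby("team")["points"].agg(["min", "max", "count"])

print(list(summary.columns))          # ['min', 'max', 'count']
print(int(summary.loc["A", "max"]))   # 15
print(int(summary.loc["B", "min"]))   # 20
```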
Chaining Operations
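pandas shines when you chain operations. This example uses the name/score/city DataFrame from the first section, because df was reassigned to the team data above: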
result = (
df[df["score"] > 85]
.groupby("city")
.agg({"score": "mean", "name": "count"})
.sort_values("score", ascending=False)
)
This filters, groups, aggregates, and sorts in one readable chain. In the result, the name column holds the number of rows per city rather than any names.
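A variant of the same chain uses named aggregation, so the output columns say what they contain. The names `mean_score` and `n_people` are chosen here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

result = (
    df[df["score"] > 85]                          # keep scores above 85
    .groupby("city")                              # one group per city
    .agg(
        mean_score=("score", "mean"),             # output name = (column, function)
        n_people=("name", "count"),
    )
    .sort_values("mean_score", ascending=False)   # highest mean first
)

print(list(result.index))         # ['Tokyo', 'New York', 'London']
print(list(result["n_people"]))   # [1, 1, 1]
```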
What’s Next
This tutorial is part of the scientific-python series. In the next installment, you’ll work with real datasets and learn about data cleaning, merging, and visualization.
See Also
- /guides/ds-numpy-basics/ — NumPy arrays as the foundation for pandas
- /tutorials/python-and-data/csv-to-analysis/ — reading and analyzing CSV data with pandas
- /guides/pandas-dataframes/ — deeper coverage of DataFrame operations