pyguides

pandas Intro: DataFrames and Series

pandas is the backbone of data manipulation in Python. It gives you DataFrames — tabular data with labeled rows and columns — and Series — single-column containers with an index. If you’re working with CSV files, databases, or spreadsheet data, pandas is where you start.

Installing pandas

pip install pandas

Creating a DataFrame

The most common way to create a DataFrame is from a dictionary:

import pandas as pd

data = {
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"]
}

df = pd.DataFrame(data)
print(df)
    name  score     city
0  Alice     92  New York
1    Bob     87    London
2  Carol     95     Tokyo

The unlabeled numbers on the left (0, 1, 2…) are the index: row labels that pandas generates by default, not a data column. Each named column is a Series.
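As a quick check of that claim, the sketch below rebuilds the same DataFrame and inspects its index and one of its columns. The variable names (index_labels, column_names, score_column) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

# The index holds the row labels; it is separate from the data columns.
index_labels = list(df.index)
column_names = list(df.columns)

# Pulling out one column gives a Series that carries the same index.
score_column = df["score"]

print(index_labels)                 # [0, 1, 2]
print(column_names)                 # ['name', 'score', 'city']
print(type(score_column).__name__)  # Series
print(df.shape)                     # (3, 3): the index is not counted as a column
```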

What is a Series?

A Series is a single column of data with an index:

scores = pd.Series([92, 87, 95], index=["Alice", "Bob", "Carol"])
print(scores)
Alice    92
Bob      87
Carol    95
dtype: int64

A DataFrame behaves like a dictionary of Series that share one index, with one Series per column.
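One way to see the dictionary-of-Series idea is to build a DataFrame from Series directly. In this sketch the cities Series is an invented example that deliberately lacks an entry for Alice, to show that pandas lines the columns up by index label and leaves a missing value where a label is absent.

```python
import pandas as pd

scores = pd.Series([92, 87, 95], index=["Alice", "Bob", "Carol"])
# Hypothetical second column with no entry for "Alice".
cities = pd.Series(["London", "Tokyo"], index=["Bob", "Carol"])

# Each dictionary key becomes a column; rows are aligned on the index labels.
people = pd.DataFrame({"score": scores, "city": cities})
print(people)

# Alice has a score but no city, so that cell is missing.
alice_city_missing = pd.isna(people.loc["Alice", "city"])
print(alice_city_missing)  # True
```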

Selecting Columns

Use bracket notation to grab a column:

df["name"]        # returns a Series
df[["name", "score"]]  # returns a DataFrame

Passing a list of column names always returns a DataFrame, even when the list contains a single name.
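The difference between the two bracket forms is easiest to see by checking the returned types. This is a small sketch using the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

single = df["name"]              # a string key gives a Series
single_as_frame = df[["name"]]   # a one-item list still gives a DataFrame
pair = df[["name", "score"]]     # a list of names gives a DataFrame

print(type(single).__name__)           # Series
print(type(single_as_frame).__name__)  # DataFrame
print(pair.shape)                      # (3, 2)
```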

Selecting Rows

By position with .iloc

.iloc selects by integer position:

df.iloc[0]     # first row
df.iloc[1:3]   # second and third rows (the end position is excluded)

By label with .loc

.loc selects by index label:

df.loc[0]      # first row (label 0)
df.loc[1:2]    # rows with labels 1 and 2 (label slices include both ends)
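With the default 0, 1, 2 index, positions and labels coincide, which hides the difference between .iloc and .loc. The sketch below assumes a DataFrame indexed by lowercase names (an invented index, not part of the earlier example) so the two can be told apart.

```python
import pandas as pd

# Hypothetical string index so that labels and positions differ.
df = pd.DataFrame(
    {"score": [92, 87, 95]},
    index=["alice", "bob", "carol"],
)

first_by_position = df.iloc[0]   # row at position 0
first_by_label = df.loc["alice"] # row labeled "alice"

# Position slices exclude the end; label slices include it.
by_position = df.iloc[0:2]           # positions 0 and 1
by_label = df.loc["alice":"carol"]   # "alice" through "carol", inclusive

print(list(by_position.index))  # ['alice', 'bob']
print(list(by_label.index))     # ['alice', 'bob', 'carol']
```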

With boolean conditions

df[df["score"] > 90]       # rows where score > 90
df[df["city"] == "London"] # rows where city == London
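Conditions can also be combined. A minimal sketch on the same sample data: pandas uses & for "and", | for "or", and ~ for "not", and each comparison needs its own parentheses because these operators bind more tightly than > or ==.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
    "city": ["New York", "London", "Tokyo"],
})

# The comparison itself produces a boolean Series (a "mask").
mask = df["score"] > 90
print(mask.tolist())  # [True, False, True]

# Combine masks element-wise; wrap each comparison in parentheses.
high_outside_london = df[(df["score"] > 90) & (df["city"] != "London")]
london_or_top = df[(df["city"] == "London") | (df["score"] >= 95)]
not_high = df[~mask]

print(high_outside_london["name"].tolist())  # ['Alice', 'Carol']
print(london_or_top["name"].tolist())        # ['Bob', 'Carol']
print(not_high["name"].tolist())             # ['Bob']
```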

Adding and Modifying Columns

df["pass"] = df["score"] >= 90     # boolean column
df["score_adj"] = df["score"] * 1.1 # curved score

Assigning to df["column"] creates the column if it does not exist and overwrites it if it does. Either way, df itself is modified.
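If you would rather leave the original DataFrame untouched, DataFrame.assign is one alternative: it returns a new DataFrame with the extra column. The sketch below contrasts the two approaches; the name curved is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
})

# Bracket assignment modifies df directly.
df["pass"] = df["score"] >= 90

# assign() returns a new DataFrame and leaves df as it was.
curved = df.assign(score_adj=df["score"] * 1.1)

print(list(df.columns))      # ['name', 'score', 'pass']
print(list(curved.columns))  # ['name', 'score', 'pass', 'score_adj']
```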

Handling Missing Data

Real data has gaps. pandas represents missing numeric values as NaN (Not a Number):

import numpy as np

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [5, np.nan, 7, 8]
})

df.isnull()          # True where a value is missing, False elsewhere
df.dropna()          # returns a copy without rows that contain any NaN
df.fillna(0)         # returns a copy with NaN replaced by 0
df["a"].fillna(df["a"].mean())  # returns column "a" with NaN filled by its mean
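One detail that is easy to miss: dropna and fillna return new objects and leave df unchanged unless you assign the result. The sketch below shows this on the same data and also counts missing values per column. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [5, np.nan, 7, 8],
})

# Summing the True/False mask counts the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column.to_dict())  # {'a': 1, 'b': 1}

# Both calls return new DataFrames; df still contains its NaN values.
cleaned = df.dropna()
filled = df.fillna(df.mean())  # fill each column with its own mean

print(list(cleaned.index))           # [0, 3]
print(int(df.isnull().sum().sum()))  # 2: the original is untouched
print(int(filled.isnull().sum().sum()))  # 0
```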

Reading from CSV

Loading a CSV file is one of the most common pandas operations:

df = pd.read_csv("sales.csv")

# Or from a URL
df = pd.read_csv("https://example.com/data.csv")

read_csv infers column types and treats the first line as the header by default. To inspect what was loaded:

df.head()     # first 5 rows
df.tail()     # last 5 rows
df.info()     # column types, non-null counts, and memory usage
df.describe() # summary stats for numeric columns
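To try read_csv without a file on disk, you can pass it any file-like object. This sketch wraps an invented CSV string in io.StringIO; the contents of csv_text are made up for illustration and stand in for a file such as sales.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = (
    "name,score,city\n"
    "Alice,92,New York\n"
    "Bob,87,London\n"
    "Carol,95,Tokyo\n"
)

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first two rows
print(df.dtypes)    # score is inferred as an integer column
print(df.shape)     # (3, 3)
```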

Writing to CSV

df.to_csv("output.csv", index=False)  # index=False skips the row index
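To see what index=False changes, note that to_csv returns the CSV text as a string when it is called without a path. The sketch below compares the header line with and without the index, using a small invented DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [92, 87]})

# Without a path, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# With the index, each row starts with its label and the header
# starts with an empty cell for the unnamed index.
print(with_index.splitlines()[0])     # ,name,score
print(without_index.splitlines()[0])  # name,score
print(without_index.splitlines()[1])  # Alice,92
```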

Basic Operations

df["score"].mean()          # average score
df["score"].median()        # median score
df["score"].min()           # minimum
df["score"].max()           # maximum
df["score"].std()           # sample standard deviation
df["score"].sum()           # total
df["score"].value_counts()  # frequency of each value

Sorting

df.sort_values("score")              # ascending by score
df.sort_values("score", ascending=False)  # descending
df.sort_index()                       # by row index
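Sorting returns a new DataFrame, and the rows keep their original index labels. A short sketch on the sample data shows this, along with reset_index(drop=True) as one way to renumber the rows afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [92, 87, 95],
})

# sort_values returns a sorted copy; df keeps its original order.
by_score = df.sort_values("score", ascending=False)
print(by_score["name"].tolist())  # ['Carol', 'Alice', 'Bob']
print(list(by_score.index))       # [2, 0, 1]: labels travel with the rows

# sort_index puts the rows back in label order.
restored = by_score.sort_index()
print(list(restored.index))       # [0, 1, 2]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = by_score.reset_index(drop=True)
print(list(renumbered.index))     # [0, 1, 2]
```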

Grouping and Aggregating

Split data into groups, then compute statistics:

df = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [10, 20, 15, 25]
})

df.groupby("team").sum()
#          points
# team
# A            25
# B            45

df.groupby("team").mean()
#          points
# team
# A          12.5
# B          22.5

Other aggregation functions: min, max, count, std, var.
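Several of these can be computed in one pass by handing agg a list of function names. A minimal sketch on the same team data; selecting the points column first keeps the result flat, with one column per statistic.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [10, 20, 15, 25],
})

# One row per team, one column per aggregation.
summary = df.groupby("team")["points"].agg(["min", "max", "count"])
print(summary)
#       min  max  count
# team
# A      10   15      2
# B      20   25      2
```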

Chaining Operations

pandas shines when you chain operations:

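# df here is the name/score/city DataFrame from "Creating a DataFrame",
# not the team/points DataFrame from the previous section.
result = (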
    df[df["score"] > 85]
    .groupby("city")
    .agg({"score": "mean", "name": "count"})
    .sort_values("score", ascending=False)
)

This filters, groups, aggregates, and sorts — all in one readable chain.
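For a version you can run end to end, the sketch below applies the same chain to a slightly larger sample. The rows for Dave and Eve are invented so that some cities appear more than once and the grouping has something to do.

```python
import pandas as pd

# Sample data; Dave and Eve are hypothetical additions.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "score": [92, 87, 95, 80, 89],
    "city": ["New York", "London", "Tokyo", "London", "New York"],
})

result = (
    df[df["score"] > 85]                      # drops Dave (80)
    .groupby("city")                          # one group per city
    .agg({"score": "mean", "name": "count"})  # mean score, number of people
    .sort_values("score", ascending=False)    # best city first
)
print(result)
#           score  name
# city
# Tokyo      95.0     1
# New York   90.5     2
# London     87.0     1
```

Note that the name column in the result holds a count, not names; renaming it after the chain may make the output easier to read.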

What’s Next

This tutorial is part of the scientific-python series. In the next installment, you’ll work with real datasets and learn about data cleaning, merging, and visualization.
