Getting Started with pandas

· 4 min read · Updated March 13, 2026 · beginner
pandas data-analysis dataframes beginner

pandas is the go-to library for data analysis in Python. If you are working with tabular data (spreadsheets, SQL exports, CSV files), pandas makes it easy to load, explore, transform, and analyze that data. It builds on NumPy and is the backbone of the Python data science ecosystem. Whether you are a beginner learning data analysis or an experienced developer working with large datasets, pandas provides the tools you need to get the job done efficiently.

What is pandas?

pandas provides two main data structures: Series and DataFrame. A Series is a single column of data, while a DataFrame is a table with rows and columns. Think of a DataFrame as Python's answer to a spreadsheet or SQL table.

The library excels at:

  • Loading data from CSV, Excel, JSON, and SQL databases
  • Cleaning and transforming messy data
  • Aggregating and summarizing data
  • Handling missing data gracefully
  • Time series analysis

Understanding these core data structures is essential because all pandas operations build upon them. Once you master Series and DataFrames, you can tackle virtually any data analysis task.

Installing pandas

Install pandas using pip:

pip install pandas

Or with conda if you prefer the Anaconda ecosystem:

conda install pandas

Once installed, import pandas with its conventional alias:

import pandas as pd

The pd alias appears throughout the pandas documentation and virtually every tutorial, so use it consistently to make your code familiar to others.

Creating DataFrames

From a Dictionary

The most common way to create a DataFrame is from a dictionary:

import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "San Francisco", "Chicago", "Seattle"]
}

df = pd.DataFrame(data)
print(df)

The dictionary keys become column names, and each value list becomes a column. The number of elements in each list must match.

From CSV

Loading data from a file is just as easy:

df = pd.read_csv("data.csv")

pandas can also read Excel, JSON, and HTML files, as well as SQL query results, directly. This flexibility makes it simple to work with data from virtually any source.
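As a self-contained sketch, JSON reads work the same way as CSV; here the data comes from an in-memory buffer instead of a file so you can run it without any files on disk (the names and values are made up for illustration):

```python
import io

import pandas as pd

# read_json accepts a file path or any file-like object;
# StringIO lets us feed it a literal JSON string.
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df = pd.read_json(io.StringIO(json_text))

print(df.shape)  # (2, 2)
```

Reading a real file is identical: pass the path instead of the buffer.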

Exploring Data

Once you have a DataFrame, you will want to understand its structure:

# First few rows
print(df.head())

# Last few rows
print(df.tail())

# Data types
print(df.dtypes)

# Basic statistics
print(df.describe())

# Shape (rows, columns)
print(df.shape)

# Column names
print(df.columns)

These methods give you a quick overview of your data. The describe method is particularly useful for understanding the distribution of numerical columns.
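One more overview method worth knowing is info(), which combines column names, dtypes, non-null counts, and memory usage in a single summary. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
})

# Prints one line per column: name, non-null count, and dtype,
# plus the index type and approximate memory usage.
df.info()
```

Because it reports non-null counts, info() is also a fast first check for missing data.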

Selecting Data

Selecting Columns

Access a single column (dot notation works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method):

df["name"]
df.name

Access multiple columns:

df[["name", "age"]]

Selecting Rows

Filter rows with boolean indexing:

# Find people over 30
df[df["age"] > 30]

# Multiple conditions
df[(df["age"] > 25) & (df["city"] == "New York")]

Use .loc for label-based selection:

df.loc[0]
df.loc[0:2, ["name", "city"]]

Use .iloc for integer-based selection:

df.iloc[0:2, 0:2]
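A key difference worth remembering: .loc slices are inclusive of the end label, while .iloc follows normal Python slicing and excludes the end position. A quick check, using a DataFrame like the one created earlier:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
})

# .loc uses labels: 0, 1, AND 2 are included -> 3 rows
print(len(df.loc[0:2]))   # 3

# .iloc uses positions: 0 and 1 only -> 2 rows
print(len(df.iloc[0:2]))  # 2
```

With the default integer index the labels happen to look like positions, which is exactly why this distinction trips people up.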

Adding and Modifying Data

Adding Columns

df["country"] = "USA"
df["age_in_5_years"] = df["age"] + 5

Modifying Values

df.loc[0, "age"] = 26

Deleting Columns

df.drop("country", axis=1, inplace=True)
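Note that inplace=True is increasingly discouraged in modern pandas; the same deletion can be written by assigning the result, which also chains more cleanly with other operations. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice"],
    "age": [25],
    "country": ["USA"],
})

# drop(columns=...) returns a new DataFrame; reassigning replaces the old one.
df = df.drop(columns=["country"])

print(df.columns.tolist())  # ['name', 'age']
```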

Handling Missing Data

Real-world data often has missing values. pandas uses NaN to represent them:

import numpy as np

data = {"name": ["Alice", "Bob", "Charlie"], 
        "age": [25, np.nan, 35]}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

# Fill missing values (assign the result; inplace fillna on a
# single column is deprecated in recent pandas versions)
df["age"] = df["age"].fillna(0)

# Drop rows with missing values
df.dropna(inplace=True)

Understanding how to handle missing data is crucial because real datasets almost always have gaps.
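In practice, a per-column count of missing values is often the first diagnostic you run; chaining isnull() with sum() gives exactly that. A sketch using the same data as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, np.nan, 35],
})

# isnull() returns a boolean DataFrame; summing counts True values per column.
missing = df.isnull().sum()

print(missing["age"])   # 1
print(missing["name"])  # 0
```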

Grouping and Aggregating

Group data and calculate aggregates:

data = {"department": ["Sales", "Sales", "Engineering", "Engineering"],
        "salary": [50000, 60000, 80000, 90000]}
df = pd.DataFrame(data)

# Group by department
grouped = df.groupby("department")

# Calculate mean salary by department
print(grouped["salary"].mean())

groupby is one of the most powerful pandas features for summarizing data.
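Beyond a single statistic, the agg method computes several aggregates in one pass. A sketch building on the salary data above:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering"],
    "salary": [50000, 60000, 80000, 90000],
})

# One row per department, one column per aggregate function.
summary = df.groupby("department")["salary"].agg(["mean", "min", "max"])

print(summary.loc["Sales", "mean"])  # 55000.0
```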

Saving Data

Write data back to files:

df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)
df.to_json("output.json", orient="records")

The index=False parameter prevents pandas from writing the row index to the file. Note that to_excel requires an engine such as openpyxl to be installed.

Common Pitfalls

Be aware of these common mistakes:

  1. Assuming chained indexing returns a copy when it may return a view (the source of SettingWithCopyWarning)
  2. Not handling missing data before analysis
  3. Using loops instead of vectorized operations
  4. Forgetting that most methods return a new DataFrame rather than modifying the original in place
