Getting Started with Polars
Polars is a blazingly fast DataFrame library written in Rust with first-class Python bindings. If you have used pandas, Polars will feel familiar — but with significantly better performance and a more intuitive API for common operations.
Why Polars?
Polars was built from the ground up for speed. Unlike pandas, which originated in Python and inherited some inefficiencies, Polars uses:
- Rust for the core computation engine
- Apache Arrow for memory-efficient columnar data representation
- Parallel execution by default across all available CPU cores
In benchmark tests, Polars frequently outperforms pandas by 5-10x on typical DataFrame operations, and the gap widens for larger datasets. But performance is not the only reason to switch:
- Eager and Lazy modes — Polars gives you a query optimizer that can dramatically speed up complex pipelines
- Cleaner API — method chaining feels natural and reduces intermediate variables and nested loops
- Better type handling — Polars is stricter about data types, catching errors earlier
- No GIL bottleneck — Rust handles parallelism outside Python's Global Interpreter Lock
Installation
Install Polars with pip:
pip install polars
Or with conda:
conda install polars -c conda-forge
There are alternative builds, such as polars-lts-cpu for older CPUs that lack modern SIMD instruction sets. Most users want the standard polars package.
Creating DataFrames
Start by importing Polars:
import polars as pl
Create a DataFrame from a dictionary:
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["London", "Paris", "Berlin"]
})
print(df)
Basic Operations
Selecting Columns
Use select() to choose columns:
df.select(["name", "age"])
Or index with square brackets to get a single column back as a Series:
df["name"]
Filtering Rows
Filter with filter():
df.filter(pl.col("age") > 28)
Adding New Columns
Use with_columns() to add or transform columns:
df.with_columns(
    (pl.col("age") + 1).alias("age_next_year"),
    (pl.col("age") * 2).alias("age_doubled")
)
Aggregations with GroupBy
Group and aggregate:
df.group_by("city").agg(
pl.col("age").mean().alias("avg_age"),
pl.col("name").count().alias("count")
)
Lazy vs Eager Execution
Polars has two execution modes:
Eager — executes immediately
Lazy — builds a query plan and optimizes before execution
Switch to lazy mode with .lazy():
lazy_df = df.lazy()
result = (
lazy_df
.filter(pl.col("age") > 25)
.select(["name", "age"])
.collect()
)
For large datasets or complex pipelines, prefer lazy mode: the optimizer can push filters down, prune unused columns, and skip work that eager execution would perform eagerly.
Practical Examples
Reading a CSV and Computing Statistics
import polars as pl
df = pl.read_csv("sales.csv")
summary = (
df.lazy()
.group_by("product_category")
.agg([
pl.col("revenue").sum().alias("total_revenue"),
pl.col("quantity").mean().alias("avg_quantity"),
pl.col("id").n_unique().alias("num_products")
])
.sort("total_revenue", descending=True)
.collect()
)
print(summary)
Handling Missing Data
Polars represents missing values as null:
df = pl.DataFrame({
"a": [1, 2, None, 4],
"b": ["x", None, "z", "w"]
})
# Drop rows with any nulls
df.drop_nulls()
# Fill nulls with a value
df.fill_null(0)
See Also
- pandas DataFrames — comparison guide for pandas users
- NumPy Arrays — foundational array library for numerical Python
- Data Processing — working with CSV files in Python