# Polars: Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.
## When to Use

- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
## Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| Eager | `read_csv()` | Immediately | Small data, exploration |
| Lazy | `scan_csv()` | On `.collect()` | Large data, pipelines |

Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
## Core Operations

### Data Selection

| Operation | Purpose |
|-----------|---------|
| `select()` | Choose columns |
| `filter()` | Choose rows by condition |
| `with_columns()` | Add/modify columns |
| `drop()` | Remove columns |
| `head(n)` / `tail(n)` | First/last n rows |
### Aggregation

| Operation | Purpose |
|-----------|---------|
| `group_by().agg()` | Group and aggregate |
| `pivot()` | Reshape to wide |
| `melt()` | Reshape to long (renamed `unpivot()` in recent versions) |
| `unique()` | Distinct rows |
### Joins

| Join Type | Description |
|-----------|-------------|
| `inner` | Matching rows only |
| `left` | All left + matching right |
| `outer` | All rows from both (called `full` in recent versions) |
| `cross` | Cartesian product |
| `semi` | Left rows with a match |
| `anti` | Left rows without a match |
## Expression API

Key concept: Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.
### Common Expressions

| Expression | Purpose |
|------------|---------|
| `pl.col("name")` | Reference a column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude(...)` | All columns except |
### Expression Methods

| Category | Methods |
|----------|---------|
| Aggregation | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` |
| String | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| DateTime | `.dt.year()`, `.dt.month()`, `.dt.day()` |
| Conditional | `.when().then().otherwise()` |
| Window | `.over()`, `.rolling_mean()`, `.shift()` |
## Pandas Migration

| Pandas | Polars |
|--------|--------|
| `df['col']` | `df.select('col')` |
| `df[df['col'] > 5]` | `df.filter(pl.col('col') > 5)` |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()` | `df.group_by('col').agg(pl.all().mean())` |
| `df.apply(func)` | `df.map_rows(func)` (avoid if possible) |

Key concept: Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.
## File I/O

| Format | Read | Write | Notes |
|--------|------|-------|-------|
| CSV | `read_csv()` / `scan_csv()` | `write_csv()` | Human readable |
| Parquet | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed |
| JSON | `read_json()` / `scan_ndjson()` | `write_json()` | Newline-delimited |
| IPC/Arrow | `read_ipc()` / `scan_ipc()` | `write_ipc()` | Zero-copy |

Key concept: Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.
## Performance Tips

| Tip | Why |
|-----|-----|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
## vs Alternatives

| Tool | Best For | Limitations |
|------|----------|-------------|
| Polars | 1-100GB, speed critical | Must fit in RAM |
| Pandas | Small data, ecosystem | Slow, memory hungry |
| Dask | Larger than RAM | More complex API |
| Spark | Cluster computing | Infrastructure overhead |
| DuckDB | SQL interface | Different API style |
## Resources

- Docs: https://pola.rs/
- User Guide: https://docs.pola.rs/user-guide/