# Polars

## Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Use Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
## Quick Start

### Installation and Basic Usage

Install Polars:

```bash
pip install polars
```
Basic DataFrame creation and operations:

```python
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"],
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(age_plus_10=pl.col("age") + 10)
```
## Core Concepts

### Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:

- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (`select`, `with_columns`, `filter`, `group_by`)
Example:

```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months"),
)
```
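Because expressions are plain Python objects until they run in a context, they can also be stored and reused. A minimal sketch (the `is_adult` name and threshold are illustrative):

```python
import polars as pl

# Build expressions once, reuse them in several contexts
age = pl.col("age")
is_adult = age >= 18

df = pl.DataFrame({"name": ["Alice", "Bob"], "age": [17, 30]})
df.filter(is_adult)                       # rows where age >= 18
df.with_columns(is_adult.alias("adult"))  # add a boolean column
```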
### Lazy vs Eager Evaluation

**Eager (`DataFrame`)**: Operations execute immediately.

```python
df = pl.read_csv("file.csv")            # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

**Lazy (`LazyFrame`)**: Operations build a query plan that is optimized before execution.

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()         # Now executes the optimized query
```
When to use lazy:

- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

Benefits of lazy evaluation (see the sketch below for how to inspect them):

- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
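You can verify these optimizations on a concrete query with `LazyFrame.explain()`, which prints the optimized plan. A minimal sketch, assuming a local `file.csv` with `name` and `age` columns:

```python
import polars as pl

lf = (
    pl.scan_csv("file.csv")
    .filter(pl.col("age") > 25)
    .select("name", "age")
)
# The plan shows the filter pushed into the CSV scan (predicate pushdown)
# and only the two requested columns being read (projection pushdown).
print(lf.explain())
df = lf.collect()
```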
For detailed concepts, load `references/core_concepts.md`.
## Common Operations

### Select

Select and manipulate columns:

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age"),
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter

Filter rows by conditions:

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than chaining with &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY",
)

# Complex conditions
df.filter((pl.col("age") > 25) | (pl.col("city") == "LA"))
```
### With Columns

Add or modify columns while preserving existing ones:

```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase(),
)

# Parallel computation (all columns computed in parallel);
# each expression needs a distinct output name
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```
### Group By and Aggregations

Group data and compute aggregations:

```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count"),
)

# Multiple group keys
df.group_by("city", "department").agg(pl.col("salary").sum())

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30"),
)
```
For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions

### Aggregation Functions

Common aggregations within the `group_by` context (combined in the sketch after this list):

- `pl.len()`: count rows
- `pl.col("x").sum()`: sum values
- `pl.col("x").mean()`: average
- `pl.col("x").min()` / `pl.col("x").max()`: extremes
- `pl.first()` / `pl.last()`: first/last values
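A minimal sketch combining these in one `agg` call; the data is illustrative:

```python
import polars as pl

df = pl.DataFrame({
    "city": ["NY", "NY", "LA"],
    "age": [25, 30, 35],
})

df.group_by("city").agg(
    pl.len().alias("count"),
    pl.col("age").sum().alias("age_sum"),
    pl.col("age").mean().alias("age_avg"),
    pl.col("age").min().alias("youngest"),
    pl.col("age").max().alias("oldest"),
    pl.first("age").alias("first_age"),
)
```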
### Window Functions with `over()`

Apply aggregations while preserving the row count:

```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city"),
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region"),
)
```
Mapping strategies (the `mapping_strategy` argument to `over`; see the sketch below):

- `group_to_rows` (default): Preserves original row order
- `explode`: Faster, but groups rows together
- `join`: Creates list columns
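A sketch of how the non-default strategies change the output shape; the data is illustrative, and the exact row order under `explode` follows the group order rather than the input order:

```python
import polars as pl

df = pl.DataFrame({"grp": ["a", "b", "a", "b"], "x": [3, 1, 4, 2]})

# Default group_to_rows: results land back on their original rows
df.with_columns(rank=pl.col("x").rank().over("grp"))

# explode: faster, but output rows come back grouped by "grp",
# so explode every selected column to keep them aligned
df.select(
    pl.col("grp", "x").sort_by("x").over("grp", mapping_strategy="explode"),
)

# join: each row gets the whole group's values as a list column
df.with_columns(
    group_xs=pl.col("x").over("grp", mapping_strategy="join"),
)
```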
## Data I/O

### Supported Formats

Polars supports reading and writing:

- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations

CSV:

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

Parquet (recommended for performance):

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

JSON:

```python
df = pl.read_json("file.json")
df.write_json("output.json")
```
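Multiple/partitioned files and cloud storage use the same functions. A sketch with hypothetical paths; reading from S3 assumes credentials are configured in the environment:

```python
import polars as pl

# Glob over many Parquet files and scan them lazily as one table
lf = pl.scan_parquet("data/*.parquet")
df = lf.filter(pl.col("year") == 2024).collect()

# Cloud storage paths work as URLs
df_s3 = pl.read_parquet("s3://my-bucket/data.parquet")
```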
For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations

### Joins

Combine DataFrames:

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
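A self-contained sketch of these patterns with illustrative data:

```python
import polars as pl

users = pl.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
orders = pl.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 20.0, 15.0]})

# Every user, with order amounts where they exist (nulls otherwise)
users.join(orders, left_on="id", right_on="user_id", how="left")
```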
### Concatenation

Stack DataFrames:

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot

Reshape data:

```python
# Pivot (long to wide)
df.pivot(on="product", index="date", values="sales")

# Unpivot (wide to long)
df.unpivot(index="id", on=["col1", "col2"])
```
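A round-trip sketch with illustrative data (argument names follow the Polars 1.x API, where `on` names the column whose values become the new columns):

```python
import polars as pl

sales = pl.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02"],
    "product": ["A", "B", "A"],
    "sales": [10, 20, 30],
})

wide = sales.pivot(on="product", index="date", values="sales")
long = wide.unpivot(index="date", variable_name="product", value_name="sales")
```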
For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration

Polars offers significant performance improvements over pandas with a cleaner API. Key differences:

### Conceptual Differences

- No index: Polars uses integer positions only
- Strict typing: No silent type conversions
- Lazy evaluation: Available via `LazyFrame`
- Parallel by default: Operations are parallelized automatically
### Common Operation Mappings

| Operation | Pandas | Polars |
| --- | --- | --- |
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(expr.over("col"))` |
### Key Syntax Patterns

Pandas sequential (slow):

```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100,
)
```

Polars parallel (fast):

```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```
For a comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices

### Performance Optimization

Use lazy evaluation for large datasets:

```python
lf = pl.scan_csv("large.csv")  # Don't use read_csv
result = lf.filter(...).select(...).collect()
```
Avoid Python functions in hot paths:

- Stay within the expression API for parallelization
- Use `.map_elements()` only when necessary
- Prefer native Polars operations
Use streaming for very large data:

```python
lf.collect(streaming=True)
```
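A sketch of a streaming aggregation over a file that may not fit in memory; the path and columns are hypothetical, and `streaming=True` follows the API used above (newer Polars releases expose the same behavior through a different argument, so check your version's `collect` signature):

```python
import polars as pl

lf = pl.scan_parquet("huge.parquet")
out = (
    lf.group_by("key")
    .agg(pl.col("amount").sum())
    .collect(streaming=True)  # processed in chunks, not all at once
)
```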
Select only needed columns early:

```python
# Good: select columns early
lf.select("col1", "col2").filter(...)

# Bad: filter on all columns first
lf.filter(...).select("col1", "col2")
```
Use appropriate data types (see the casting sketch below):

- Categorical for low-cardinality strings
- Appropriate integer sizes (i32 vs i64)
- Date types for temporal data
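A sketch of tightening dtypes on a hypothetical schema:

```python
import polars as pl

df = pl.DataFrame({
    "city": ["NY", "LA", "NY"],
    "age": [25, 30, 35],
    "signup": ["2024-01-05", "2024-02-10", "2024-03-20"],
})

df = df.with_columns(
    pl.col("city").cast(pl.Categorical),       # low-cardinality string
    pl.col("age").cast(pl.Int32),              # i32 is plenty here
    pl.col("signup").str.to_date("%Y-%m-%d"),  # proper Date dtype
)
```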
### Expression Patterns

Conditional operations:

```python
pl.when(condition).then(value).otherwise(other_value)
```
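A concrete sketch of the `when/then/otherwise` chain; the column and labels are illustrative:

```python
import polars as pl

df = pl.DataFrame({"age": [25, 30, 35]})
df.with_columns(
    bracket=pl.when(pl.col("age") < 30)
    .then(pl.lit("under 30"))
    .otherwise(pl.lit("30 and over")),
)
```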
Column operations across multiple columns:

```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern
```
Null handling:

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```
For additional best practices and patterns, load `references/best_practices.md`.
## Resources

This skill includes comprehensive reference documentation in `references/`:

- `core_concepts.md`: Detailed explanations of expressions, lazy evaluation, and the type system
- `operations.md`: Comprehensive guide to all common operations, with examples
- `pandas_migration.md`: Complete migration guide from pandas to Polars
- `io_guide.md`: Data I/O operations for all supported formats
- `transformations.md`: Joins, concatenation, pivots, and reshaping operations
- `best_practices.md`: Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.