data-engineering-storage-remote-access-integrations-polars

Polars Integration with Remote Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "data-engineering-storage-remote-access-integrations-polars" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-polars

Polars Integration with Remote Storage

Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.

Native Cloud Access (object_store)

Polars uses the Rust object_store crate internally for direct cloud URI access:

import polars as pl

Read from cloud URIs directly (s3://, gs://, az://)

df = pl.read_parquet("s3://bucket/data/file.parquet") df = pl.read_parquet("gs://bucket/data/file.parquet") df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

Lazy scanning with predicate and column pushdown

lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")

result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed to the storage layer
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)

Write to cloud storage

df.write_parquet("s3://bucket/output/data.parquet")

Partitioned write (Hive-style)

df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
    use_pyarrow=True,  # Requires PyArrow
)
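
The Hive-style write above produces one directory level per partition column. A sketch of the resulting layout; the year/month values and file names are illustrative:

s3://bucket/output/
    year=2023/
        month=12/
            <part files>.parquet
    year=2024/
        month=1/
            <part files>.parquet
        month=2/
            <part files>.parquet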

Supported protocols: s3://, gs://, az://, file://

Via fsspec

Use fsspec for broader compatibility and protocol chaining:

import fsspec
import polars as pl

Create fsspec filesystem

fs = fsspec.filesystem("s3", config_kwargs={"region": "us-east-1"})

Open file through fsspec

with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

Use fsspec caching wrapper

cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)

with cached_fs.open("bucket/cached.parquet") as f:
    df = pl.read_parquet(f)

# Alternatively, pass a chained URL and let fsspec handle the caching
df = pl.read_parquet("simplecache::s3://bucket/cached.parquet")

Via PyArrow Dataset (Advanced)

For Hive-partitioned datasets with complex pushdown:

import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

Load partitioned dataset

dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

Convert to Polars lazy frame

lazy_df = pl.scan_pyarrow_dataset(dataset)

Query with full pushdown

result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)

Authentication

Native Polars cloud access inherits credentials from:

  • AWS: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, IAM roles

  • GCP: GOOGLE_APPLICATION_CREDENTIALS, gcloud CLI, metadata server

  • Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY, managed identity

For explicit credentials, use fsspec or PyArrow filesystem constructors.
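
For the native readers, explicit credentials can also be passed with the storage_options argument, which Polars forwards to object_store. A minimal sketch, assuming object_store-style S3 key names (verify the exact keys against your Polars version), alongside the fsspec equivalent:

import os

import fsspec
import polars as pl

# Native reader: credentials forwarded to object_store via storage_options
df = pl.read_parquet(
    "s3://bucket/data/file.parquet",
    storage_options={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_region": "us-east-1",
    },
)

# fsspec route: explicit key/secret on the s3fs filesystem
fs = fsspec.filesystem("s3", key="<access key>", secret="<secret key>")
with fs.open("bucket/data/file.parquet") as f:
    df = pl.read_parquet(f)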

Performance Tips

  • ✅ Use native s3:// URIs for best performance (direct object_store usage)

  • ✅ Lazy evaluation with predicates for pushdown

  • ✅ Partitioned writes for large datasets (avoid huge single files)

  • ✅ Column selection in lazy queries to read only needed data

  • ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors

  • ⚠️ For caching, use fsspec's simplecache:: or filecache:: wrappers
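
A minimal sketch of the filecache wrapper mentioned above, assuming a local cache directory of your choosing; repeated reads of the same object are then served from disk instead of the object store:

import fsspec
import polars as pl

cached = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    target_options={"anon": False},
    cache_storage="/tmp/polars-cache",  # hypothetical local cache directory
)

with cached.open("bucket/large.parquet") as f:
    df = pl.read_parquet(f)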

Common Patterns

Incremental Load from Partitioned Data

Only read recent partitions

lazy_df = pl.scan_parquet("s3://bucket/events/") last_month = datetime.now() - timedelta(days=30)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
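
If the events directory is Hive-partitioned on the filter column (for example date=YYYY-MM-DD subdirectories, which is an assumption here), enabling hive_partitioning lets the predicate skip whole partitions rather than filtering rows after the read:

# Assumes a layout like s3://bucket/events/date=2024-05-01/part-0.parquet
lazy_df = pl.scan_parquet("s3://bucket/events/", hive_partitioning=True)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month.date().isoformat())  # prunes partitions
    .collect()
)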

Cross-Cloud Copy

Read from S3, write to GCS (Polars doesn't support mixed URIs directly)

Use PyArrow bridge:

import pyarrow.fs as fs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as gcs_file:
    pq.write_table(table, gcs_file)

References

  • Polars Cloud Storage Guide

  • Polars File System Backends

  • @data-engineering-storage-remote-access/libraries/fsspec (fsspec usage)

  • @data-engineering-storage-remote-access/libraries/pyarrow-fs (PyArrow filesystem)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • data-science-eda

  • data-science-visualization

  • data-engineering-core