# Polars Integration with Remote Storage
Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.
## Native Cloud Access (object_store)
Polars uses the Rust object_store crate internally for direct cloud URI access:
```python
import polars as pl

# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

# Lazy scanning with predicate and column pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed down to the storage layer
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)

# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")

# Partitioned write (Hive-style)
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
    use_pyarrow=True,  # Requires PyArrow
)
```
Supported protocols: `s3://`, `gs://`, `az://`, `file://`
## Via fsspec
Use fsspec for broader compatibility and protocol chaining:
```python
import fsspec
import polars as pl

# Create an fsspec filesystem
fs = fsspec.filesystem("s3", config_kwargs={"region": "us-east-1"})

# Open a file through fsspec
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

# Use fsspec's caching wrapper
cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)
df = pl.read_parquet("simplecache::s3://bucket/cached.parquet")
```
## Via PyArrow Dataset (Advanced)
For Hive-partitioned datasets with complex pushdown:
```python
import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

# Load a Hive-partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Convert to a Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)

# Query with full predicate and projection pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)
```
## Authentication

Native Polars cloud access inherits credentials from:

- **AWS:** environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), `~/.aws/credentials`, IAM roles
- **GCP:** `GOOGLE_APPLICATION_CREDENTIALS`, the gcloud CLI, the metadata server
- **Azure:** `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`, managed identity
For explicit credentials, use fsspec or PyArrow filesystem constructors.
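Recent Polars versions also accept a `storage_options` mapping on the native readers, forwarded to the underlying object_store client. The key names below follow object_store's AWS configuration and may vary across Polars versions, so treat this as a sketch with placeholder credentials:

```python
import polars as pl

# Sketch only: key names and credentials are placeholders
df = pl.read_parquet(
    "s3://bucket/data/file.parquet",
    storage_options={
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
        "aws_region": "us-east-1",
    },
)
```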
## Performance Tips

- ✅ Use native `s3://` URIs for the best performance (direct `object_store` access)
- ✅ Use lazy evaluation with filters so predicates are pushed down
- ✅ Write partitioned output for large datasets (avoid huge single files)
- ✅ Select only the columns you need in lazy queries
- ⚠️ For complex authentication (SSO, temporary credentials), use fsspec/PyArrow filesystem constructors
- ⚠️ For caching, use fsspec's `simplecache::` or `filecache::` wrappers
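As an illustration of the `filecache::` wrapper, the sketch below chains it over the local `file://` protocol instead of `s3://`; in practice you would replace the target with a cloud URL (`filecache::s3://...`):

```python
import os
import tempfile

import fsspec

# A local file stands in for a remote object
src_dir = tempfile.mkdtemp()
cache_dir = tempfile.mkdtemp()
path = os.path.join(src_dir, "data.csv")
with open(path, "w") as f:
    f.write("id,value\n1,10\n")

# filecache:: keeps a persistent on-disk copy between runs
# (simplecache:: is per-session); per-protocol kwargs are passed
# as keyword arguments named after each protocol in the chain
with fsspec.open(f"filecache::file://{path}",
                 filecache={"cache_storage": cache_dir}) as f:
    data = f.read()
print(data)
```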
## Common Patterns

### Incremental Load from Partitioned Data
```python
from datetime import datetime, timedelta

import polars as pl

# Only read recent partitions
lazy_df = pl.scan_parquet("s3://bucket/events/")
last_month = datetime.now() - timedelta(days=30)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
```
### Cross-Cloud Copy
Read from S3, write to GCS. Polars doesn't support mixed URIs in a single call, so use PyArrow as a bridge:

```python
import pyarrow.fs as fs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

# With an explicit filesystem, paths are given without the scheme prefix
dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as gcs_file:
    pq.write_table(table, gcs_file)
```
## References

- Polars Cloud Storage Guide
- Polars File System Backends
- @data-engineering-storage-remote-access/libraries/fsspec (fsspec usage)
- @data-engineering-storage-remote-access/libraries/pyarrow-fs (PyArrow filesystem)