data-engineering-storage-remote-access-integrations-polars

Polars Integration with Remote Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "data-engineering-storage-remote-access-integrations-polars" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-polars

Polars Integration with Remote Storage

Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.

Native Cloud Access (object_store)

Polars uses the Rust object_store crate internally for direct cloud URI access:

import polars as pl

Read from cloud URIs directly (s3://, gs://, az://)

df = pl.read_parquet("s3://bucket/data/file.parquet") df = pl.read_parquet("gs://bucket/data/file.parquet") df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

Lazy scanning with predicate and column pushdown

lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")

result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed to the storage layer
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)

Write to cloud storage

df.write_parquet("s3://bucket/output/data.parquet")

Partitioned write (Hive-style)

df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
    use_pyarrow=True,  # Requires PyArrow
)
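
The Hive-style write above produces one directory level per partition column. A sketch of the resulting layout; the year/month values and file names are illustrative:

s3://bucket/output/
    year=2023/
        month=12/
            <part files>.parquet
    year=2024/
        month=1/
            <part files>.parquet
        month=2/
            <part files>.parquet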

Supported protocols: s3://, gs://, az://, file://

Via fsspec

Use fsspec for broader compatibility and protocol chaining:

import fsspec
import polars as pl

Create fsspec filesystem

fs = fsspec.filesystem("s3", config_kwargs={"region": "us-east-1"})

Open file through fsspec

with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

Use fsspec caching wrapper

cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)

with cached_fs.open("bucket/cached.parquet") as f:
    df = pl.read_parquet(f)

# Alternatively, pass a chained URL and let fsspec handle the caching
df = pl.read_parquet("simplecache::s3://bucket/cached.parquet")

Via PyArrow Dataset (Advanced)

For Hive-partitioned datasets with complex pushdown:

import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

Load partitioned dataset

dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

Convert to Polars lazy frame

lazy_df = pl.scan_pyarrow_dataset(dataset)

Query with full pushdown

result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)

Authentication

Native Polars cloud access inherits credentials from:

  • AWS: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, IAM roles

  • GCP: GOOGLE_APPLICATION_CREDENTIALS, gcloud CLI, metadata server

  • Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY, managed identity

For explicit credentials, use fsspec or PyArrow filesystem constructors.
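
For the native readers, explicit credentials can also be passed with the storage_options argument, which Polars forwards to object_store. A minimal sketch, assuming object_store-style S3 key names (verify the exact keys against your Polars version), alongside the fsspec equivalent:

import os

import fsspec
import polars as pl

# Native reader: credentials forwarded to object_store via storage_options
df = pl.read_parquet(
    "s3://bucket/data/file.parquet",
    storage_options={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_region": "us-east-1",
    },
)

# fsspec route: explicit key/secret on the s3fs filesystem
fs = fsspec.filesystem("s3", key="<access key>", secret="<secret key>")
with fs.open("bucket/data/file.parquet") as f:
    df = pl.read_parquet(f)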

Performance Tips

  • ✅ Use native s3:// URIs for best performance (direct object_store usage)

  • ✅ Lazy evaluation with predicates for pushdown

  • ✅ Partitioned writes for large datasets (avoid huge single files)

  • ✅ Column selection in lazy queries to read only needed data

  • ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors

  • ⚠️ For caching, use fsspec's simplecache:: or filecache:: wrappers
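
A minimal sketch of the filecache wrapper mentioned above, assuming a local cache directory of your choosing; repeated reads of the same object are then served from disk instead of the object store:

import fsspec
import polars as pl

cached = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    target_options={"anon": False},
    cache_storage="/tmp/polars-cache",  # hypothetical local cache directory
)

with cached.open("bucket/large.parquet") as f:
    df = pl.read_parquet(f)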

Common Patterns

Incremental Load from Partitioned Data

Only read recent partitions

lazy_df = pl.scan_parquet("s3://bucket/events/") last_month = datetime.now() - timedelta(days=30)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
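
If the events directory is Hive-partitioned on the filter column (for example date=YYYY-MM-DD subdirectories, which is an assumption here), enabling hive_partitioning lets the predicate skip whole partitions rather than filtering rows after the read:

# Assumes a layout like s3://bucket/events/date=2024-05-01/part-0.parquet
lazy_df = pl.scan_parquet("s3://bucket/events/", hive_partitioning=True)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month.date().isoformat())  # prunes partitions
    .collect()
)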

Cross-Cloud Copy

Read from S3, write to GCS (Polars doesn't support mixed URIs directly)

Use PyArrow bridge:

import pyarrow.fs as fs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as gcs_file:
    pq.write_table(table, gcs_file)

References

  • Polars Cloud Storage Guide

  • Polars File System Backends

  • @data-engineering-storage-remote-access/libraries/fsspec (fsspec usage)

  • @data-engineering-storage-remote-access/libraries/pyarrow-fs (PyArrow filesystem)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • data-science-eda

  • data-science-visualization

  • data-engineering-core