
PyArrow Remote Storage Integration

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "data-engineering-storage-remote-access-integrations-pyarrow" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-pyarrow


PyArrow's pyarrow.parquet and pyarrow.dataset modules work with cloud storage through PyArrow's native filesystem abstraction and through fsspec compatibility.

Native PyArrow Filesystem

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Create S3 filesystem
s3_fs = fs.S3FileSystem(region="us-east-1")

# Read a single file with column filtering
table = pq.read_table(
    "bucket/file.parquet",  # Note: no s3:// prefix
    filesystem=s3_fs,
    columns=["id", "value"],  # Column pruning
)

# Dataset with filtering and partitioning
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Filter pushdown (only reads matching files/row groups)
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value", "timestamp"],
)

# Batch scanning for large datasets
scanner = dataset.scanner(
    filter=ds.field("value") > 0,
    batch_size=65536,
    use_threads=True,
)
for batch in scanner.to_batches():
    process(batch)
```

fsspec Integration

PyArrow automatically bridges to fsspec filesystems for Parquet files:

```python
import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem("s3")

# Open via fsspec
with fs.open("s3://bucket/file.parquet", "rb") as f:
    table = pq.read_table(f)

# Or use the URI directly (fsspec is used as a fallback, if installed,
# for schemes PyArrow does not support natively)
table = pq.read_table("s3://bucket/file.parquet")
```

obstore fsspec Wrapper

Use obstore's high-performance fsspec wrapper for concurrent operations:

```python
from obstore.fsspec import FsspecStore
import pyarrow.parquet as pq

# Create obstore-backed fsspec filesystem
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")

# Use with PyArrow
table = pq.read_table("data/file.parquet", filesystem=fs)
```

Dataset Scanning Patterns

See @data-engineering-storage-remote-access/patterns.md for advanced patterns including:

  • Incremental loading with checkpoint tracking

  • Partitioned writes with Hive partitioning

  • Cross-cloud copying

  • Performance optimizations (predicate pushdown, column pruning)

Authentication

See @data-engineering-storage-authentication for S3, GCS, Azure credential configuration with PyArrow filesystems.

Performance Tips

  • Column pruning: Always specify columns=[...] to reduce data transfer

  • Filter pushdown: Use dataset.scanner(filter=...) for predicate pushdown

  • Row group pruning: Parquet row groups enable partial file reads

  • Threading: Enable use_threads=True in scanner for CPU-bound ops

  • Batch size: Tune batch_size based on downstream processing needs

  • File format: Prefer Parquet over CSV/JSON for compression and pushdown

References

  • PyArrow Filesystems Guide

  • PyArrow Dataset Guide

  • @data-engineering-storage-remote-access/libraries/pyarrow-fs — PyArrow.fs library details
