# Pandas Integration with Remote Storage

Pandas uses fsspec under the hood for cloud storage access (`s3://`, `gs://`, etc.), which makes reading from and writing to cloud storage straightforward.
## Auto-Detection (Simplest)

Pandas automatically uses fsspec for cloud URIs:
```python
import pandas as pd

# Read CSV/Parquet/JSON directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected from the file extension
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```
Note: Auto-detection uses default credentials. For explicit auth, see below.
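Credentials can also be supplied inline through pandas' `storage_options` parameter, which is forwarded to the fsspec filesystem constructor. The sketch below shows the S3 form in a comment and, to stay runnable without credentials, demonstrates the same code path with fsspec's in-memory filesystem (the `demo/data.csv` path is illustrative):

```python
import fsspec
import pandas as pd

# pandas forwards storage_options to the fsspec filesystem constructor.
# For S3 this might look like:
#   pd.read_csv("s3://bucket/data.csv",
#               storage_options={"key": "...", "secret": "..."})
# The demo below uses fsspec's in-memory filesystem so it runs anywhere.
fs = fsspec.filesystem("memory")
with fs.open("demo/data.csv", "w") as f:
    f.write("id,value\n1,100\n2,200\n")

df = pd.read_csv("memory://demo/data.csv")
print(int(df["value"].sum()))  # 300
```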
## Explicit Filesystem (More Control)
```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses default credential chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                 # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row group filtering
)
```
## PyArrow Filesystem Backend

For better Arrow integration and zero-copy transfers:
```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using a pyarrow filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)

# Write to cloud storage (again, no s3:// prefix)
df.to_parquet(
    "bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
```
## Partitioned Writes

Write partitioned datasets efficiently:
```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0],
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)
```
Output structure: `s3://bucket/output/year=2024/month=1/part-0.parquet`
## Authentication

- Auto-detection: uses the default credential chain (`AWS_PROFILE`, `~/.aws/credentials`, IAM role)
- Explicit: pass `key=` and `secret=` to the `fsspec.filesystem()` constructor
- S3-compatible stores (MinIO, Ceph): point at a custom endpoint:

  ```python
  fs = fsspec.filesystem(
      "s3",
      client_kwargs={"endpoint_url": "http://minio.local:9000"},
  )
  ```
See @data-engineering-storage-authentication for detailed patterns.
## Performance Tips

- Column pruning: `pd.read_parquet(..., columns=[...])` reads only the needed columns
- Row group filtering: use the `filters=` parameter for partitioned data
- Cache results: chain `simplecache::` or `filecache::` onto the URL:

  ```python
  df = pd.read_parquet("simplecache::s3://bucket/data.parquet")
  ```

- Use Parquet, not CSV: Parquet supports predicate pushdown, compression, and typed storage
- For large datasets: consider PySpark or Dask instead of pandas (pandas loads everything into memory)
## Limitations

- pandas loads the entire DataFrame into memory, so it is not suitable for datasets larger than RAM
- For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)
- Multi-file reads require manual iteration (use `fs.glob()` plus a list comprehension)
## Alternatives

- Polars (@data-engineering-core): faster, memory-mapped, lazy evaluation
- Dask: parallel pandas for out-of-core computation
- PySpark: distributed processing for big data
## References

- pandas I/O documentation
- fsspec documentation
- @data-engineering-storage-remote-access/libraries/fsspec