# fsspec: Universal Filesystem Interface
fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.
## Installation

Core only (no remote support):

```bash
pip install fsspec
```

With specific backends:

```bash
pip install "fsspec[s3]"           # S3 via s3fs
pip install "fsspec[gcs]"          # GCS via gcsfs
pip install "fsspec[s3,gcs,azure]" # Multiple backends
```

Or install the backends directly:

```bash
pip install s3fs gcsfs adlfs
```
## Basic Usage

```python
import fsspec
import pandas as pd

# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)  # Uses the boto3 credential chain
gcs_fs = fsspec.filesystem('gcs')            # Uses GCP credentials

# Basic operations
s3_fs.ls('my-bucket/data/')              # List files
s3_fs.exists('my-bucket/data/file.csv')  # Check existence
s3_fs.mkdir('my-bucket/new-folder')      # Create directory

# Read a file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Read a gzipped CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
```
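Because every backend shares the same `AbstractFileSystem` interface, the calls above can be tried without any cloud credentials by swapping in the built-in in-memory backend (the `demo/` path below is arbitrary):

```python
import fsspec

# "memory" implements the same API as s3/gcs, but needs no credentials
fs = fsspec.filesystem("memory")

with fs.open("demo/hello.txt", "w") as f:
    f.write("hello fsspec")

print(fs.ls("demo"))                # lists the file we just wrote
print(fs.exists("demo/hello.txt"))  # True

with fs.open("demo/hello.txt", "r") as f:
    content = f.read()
```

The same script works against S3 or GCS by changing only the `fsspec.filesystem(...)` call.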
## Protocol Chaining & Caching

`simplecache::` caches remote files locally for faster repeated access:

```python
import fsspec

# First read downloads the file; subsequent reads hit the local cache
cached_file = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache'}
)
```

Protocols can be chained. For example, read from HTTPS with a local cache, decompressing on the fly:

```python
# compression='gzip' decompresses the .gz payload as it is read
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)
```

Other useful options:

- `"filecache::"` - persistent disk cache
- `compression='gzip'` (or `'infer'`) - transparent decompression
- `"zip::"` - access files inside a zip archive
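Caching also works over the local `file://` protocol, which makes the mechanics easy to verify end to end. A minimal sketch (temp paths below are generated, not fsspec conventions): the per-protocol options dict keyed by protocol name (`simplecache={...}`) is how chained URLs receive their configuration.

```python
import os
import tempfile
import fsspec

src_dir = tempfile.mkdtemp()
cache_dir = tempfile.mkdtemp()

# Create a "remote" source file on local disk
path = os.path.join(src_dir, "data.txt")
with open(path, "w") as f:
    f.write("hello fsspec")

# First open copies the file into cache_storage; later opens reuse it
with fsspec.open(f"simplecache::file://{path}", "rt",
                 simplecache={"cache_storage": cache_dir}) as f:
    content = f.read()

print(content)            # hello fsspec
print(os.listdir(cache_dir))  # the locally cached copy
```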
## Advanced S3 Features

```python
import s3fs

# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',  # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True  # Don't reuse a cached filesystem instance
)
```

Async operations:

```python
import asyncio
import s3fs

async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish the async session

    # Concurrent reads (underscore-prefixed methods are the async API)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data

# data = asyncio.run(read_multiple())
```

S3-specific features:

```python
fs.find('my-bucket', prefix='data/2024')  # List with key prefix
fs.du('my-bucket/data')                   # Total size in bytes
fs.rm('my-bucket/temp/', recursive=True)  # Recursive delete
```
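`find`, `du`, and recursive `rm` come from the generic fsspec API (only the `prefix=` argument is s3fs-specific), so they can be sanity-checked against the in-memory backend; the bucket and file names below are made up:

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.pipe("bucket/data/2024/a.bin", b"x" * 10)  # write bytes in one call
fs.pipe("bucket/data/2024/b.bin", b"y" * 20)

files = fs.find("bucket")     # recursive file listing -> 2 files
total = fs.du("bucket/data")  # total size in bytes -> 30
fs.rm("bucket/data", recursive=True)
```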
## Authentication

fsspec backends follow the standard cloud credential chains:

- Explicit credentials (passed to the constructor)
- Environment variables (`AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`, etc.)
- Config files (`~/.aws/credentials`, gcloud CLI)
- IAM roles / managed identities
See @data-engineering-storage-authentication for detailed patterns.
## When to Use fsspec

Choose fsspec when:

- You need broad ecosystem compatibility (pandas, xarray, Dask)
- You work with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet
## Performance Considerations

- ✅ Use `filecache::` instead of `simplecache::` for a persistent cache across sessions
- ✅ Increase `max_pool_connections` for high concurrency
- ✅ Use the async API for many concurrent small-file operations
- ⚠️ For Parquet-only, high-throughput workflows, consider `pyarrow.fs` instead
- ⚠️ For maximum performance on large concurrent operations, consider `obstore`
## Integration with Data Engineering Tools

- Polars: `pl.read_parquet("s3://bucket/file.parquet", storage_options={...})`
- DuckDB: `duckdb.register_filesystem(fsspec.filesystem('s3'))`
- pandas: `pd.read_csv("s3://bucket/file.csv")` (pandas hands URL-like paths to fsspec automatically)
- PyArrow: wrap an fsspec filesystem with `pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs))`
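As a credential-free sketch of the pandas integration, the same URL dispatch works with fsspec's in-memory backend (the `itest/` path is arbitrary):

```python
import fsspec
import pandas as pd

# Stage a small CSV in fsspec's in-memory filesystem
with fsspec.open("memory://itest/data.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# pandas routes the memory:// URL through fsspec, same as s3:// or gcs://
df = pd.read_csv("memory://itest/data.csv")
print(df.shape)  # (2, 2)
```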
For detailed integration patterns, see:

- @data-engineering-storage-remote-access/integrations/polars
- @data-engineering-storage-remote-access/integrations/duckdb
- @data-engineering-storage-remote-access/integrations/pandas
## References

- [fsspec documentation](https://filesystem-spec.readthedocs.io/)
- [s3fs documentation](https://s3fs.readthedocs.io/)
- [gcsfs documentation](https://gcsfs.readthedocs.io/)
- [adlfs documentation](https://github.com/fsspec/adlfs)