# Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud object storage (S3, GCS, Azure) using the pure-Python `deltalake` package.
## Installation

```bash
pip install deltalake pyarrow
```
## Configuration Patterns
### Method 1: `storage_options` (Recommended)

The simplest approach uses dictionary-based configuration:
```python
import os

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1",
}

# Alternatively, set environment variables (preferred for production):
# os.environ["AWS_ACCESS_KEY_ID"], etc.

# Write a Delta table (pa_table: a pyarrow.Table holding your data)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"],
)

# Read a Delta table
dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
df = dt.to_pandas()
```
GCS configuration:

```python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT_KEY_JSON": "/path/to/key.json"
    # Or set the GOOGLE_APPLICATION_CREDENTIALS environment variable
}
```
Azure configuration:

```python
storage_options = {
    "AZURE_STORAGE_CONNECTION_STRING": "...",
    # OR: "AZURE_STORAGE_ACCOUNT_NAME" + "AZURE_STORAGE_ACCOUNT_KEY"
}
```
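In production, `storage_options` can be assembled from environment variables rather than hardcoded. A minimal sketch (the helper name `storage_options_from_env` is illustrative, not part of the `deltalake` API; key names follow the S3 configuration above):

```python
import os

def storage_options_from_env(
    keys=("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"),
):
    """Build a storage_options dict from whichever credentials are set."""
    return {k: os.environ[k] for k in keys if k in os.environ}

os.environ["AWS_REGION"] = "us-east-1"  # example value for illustration
opts = storage_options_from_env()
```

Only the variables that are actually set are passed through, so the same code works whether credentials come from the environment or from an IAM role.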
### Method 2: PyArrow Filesystem (Advanced)

PyArrow filesystem objects give more control. Note that the `filesystem` argument is tied to the PyArrow engine, and its availability depends on your `deltalake` version:
```python
import pyarrow.fs as fs
from deltalake import DeltaTable, write_deltalake

# Create a filesystem rooted at the bucket path
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (path is relative to the filesystem root)
write_deltalake(
    "delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append",
)

# Read
dt = DeltaTable("delta-table", filesystem=filesystem)
```
## Time Travel
```python
import pandas as pd
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load a specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp (RFC 3339 string)
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: history() returns a list of dicts, one per commit
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
```
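`load_with_datetime` expects an RFC 3339 / ISO 8601 timestamp string. One way to build that string from a Python `datetime` (the `rfc3339` helper is a convenience sketch, not part of the `deltalake` API):

```python
from datetime import datetime, timezone

def rfc3339(dt: datetime) -> str:
    """Format an aware datetime as the RFC 3339 string load_with_datetime expects."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

ts = rfc3339(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
# ts == "2024-01-01T12:00:00Z"
```

Normalizing to UTC before formatting avoids ambiguity when the input datetime carries a non-UTC offset.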
## Maintenance Operations
```python
# Vacuum old files (retention in hours). Note: vacuum() defaults to
# dry_run=True; pass dry_run=False to actually delete. Retentions shorter
# than the table default (168h) also require enforce_retention_duration=False.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Optimize compaction (combine small files)
dt.optimize.compact()

# Get the file list
files = dt.files()
print(files)  # relative paths of the Parquet files in the table

# Get metadata
print(dt.metadata())
```
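For a partitioned table, the paths returned by `dt.files()` encode Hive-style partition values, which you can group on directly. A sketch using a hypothetical file list (the paths below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical output of dt.files() for a table partitioned by "date"
files = [
    "date=2024-01-01/part-00000.parquet",
    "date=2024-01-01/part-00001.parquet",
    "date=2024-01-02/part-00000.parquet",
]

# Group file names by their partition directory
by_partition = defaultdict(list)
for path in files:
    prefix, _, name = path.partition("/")
    by_partition[prefix].append(name)

print(dict(by_partition))
```

Many small files per partition is the signal that a compaction pass is due.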
## Incremental Processing

For change data capture (CDC) style patterns:
```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since the last checkpoint
last_version = get_checkpoint()  # your checkpoint tracking

# history() returns a list of dicts; keep only commits after the checkpoint
changes = [
    entry for entry in dt.history()
    if entry["version"] > last_version
]

# Or read the full snapshot and compare
df = dt.to_pandas()
# ... compare with the previous snapshot ...

# Update the checkpoint
save_checkpoint(dt.version())
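The `get_checkpoint()` / `save_checkpoint()` helpers above are left to the application. A minimal file-based sketch (the `checkpoint.json` path and JSON layout are illustrative, not part of `deltalake`):

```python
import json
from pathlib import Path

CHECKPOINT_PATH = Path("checkpoint.json")  # illustrative location

def save_checkpoint(version: int) -> None:
    """Persist the last processed Delta table version."""
    CHECKPOINT_PATH.write_text(json.dumps({"last_version": version}))

def get_checkpoint(default: int = -1) -> int:
    """Return the last processed version, or `default` on the first run."""
    if not CHECKPOINT_PATH.exists():
        return default  # first run: process everything
    return json.loads(CHECKPOINT_PATH.read_text())["last_version"]

save_checkpoint(7)
print(get_checkpoint())
```

In production you would typically store the checkpoint next to your job state (a database row, a small object in the same bucket) rather than a local file.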
## Best Practices
- ✅ Use environment variables for credentials in production (never hardcode)
- ✅ Partition tables by date/region for efficient querying
- ✅ Vacuum regularly to clean up old files (but retain enough history for your time-travel needs)
- ✅ Optimize periodically to compact small files
- ✅ Track versions for incremental processing using `dt.version()` and `dt.history()`
- ⚠️ Don't disable vacuum entirely: unreferenced files accumulate and storage costs grow
- ⚠️ Don't vacuum too aggressively: you'll lose time-travel capability
## Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.
For S3:

- Environment: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`
- IAM roles (EC2, ECS, Lambda) are used automatically when no explicit credentials are set
- For S3-compatible stores (MinIO): set `AWS_ENDPOINT_URL` in the environment or in `storage_options`
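For an S3-compatible endpoint such as MinIO, the endpoint URL goes into `storage_options` alongside the keys. A sketch (host, port, and credentials below are placeholders):

```python
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",            # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",  # MinIO endpoint
    "AWS_ALLOW_HTTP": "true",                     # needed for plain-HTTP endpoints
    "AWS_REGION": "us-east-1",
}
```

Without `AWS_ALLOW_HTTP`, the underlying object store typically rejects non-HTTPS endpoints.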
## Related

- @data-engineering-storage-lakehouse/delta-lake: Delta Lake concepts and API
- @data-engineering-core: Using Delta with DuckDB
- @data-engineering-storage-lakehouse: Comparisons with Iceberg, Hudi
## References

- deltalake Python API
- Delta Lake Documentation