# Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud object storage (S3, GCS, Azure) using the pure-Python `deltalake` package.
## Installation

```bash
pip install deltalake pyarrow
```
## Configuration Patterns
### Method 1: `storage_options` (Recommended)

The simplest approach uses dictionary-based configuration:
```python
import os

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1",
}

# Alternatively, set environment variables (preferred for production):
# os.environ["AWS_ACCESS_KEY_ID"], etc.

# Write a Delta table (pa_table: a pyarrow.Table holding your data)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"],
)

# Read a Delta table
dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
df = dt.to_pandas()
```
GCS configuration:

```python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT_KEY_JSON": "/path/to/key.json"
    # Or set the GOOGLE_APPLICATION_CREDENTIALS environment variable
}
```
Azure configuration:

```python
storage_options = {
    "AZURE_STORAGE_CONNECTION_STRING": "...",
    # OR: "AZURE_STORAGE_ACCOUNT_NAME" + "AZURE_STORAGE_ACCOUNT_KEY"
}
```
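In production, `storage_options` can be assembled from environment variables rather than hardcoded. A minimal sketch (the helper name `storage_options_from_env` is illustrative, not part of the `deltalake` API; key names follow the S3 configuration above):

```python
import os

def storage_options_from_env(
    keys=("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"),
):
    """Build a storage_options dict from whichever credentials are set."""
    return {k: os.environ[k] for k in keys if k in os.environ}

os.environ["AWS_REGION"] = "us-east-1"  # example value for illustration
opts = storage_options_from_env()
```

Only the variables that are actually set are passed through, so the same code works whether credentials come from the environment or from an IAM role.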
### Method 2: PyArrow Filesystem (Advanced)

PyArrow filesystem objects give more control. Note that the `filesystem` argument is tied to the PyArrow engine, and its availability depends on your `deltalake` version:
```python
import pyarrow.fs as fs
from deltalake import DeltaTable, write_deltalake

# Create a filesystem rooted at the bucket path
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (path is relative to the filesystem root)
write_deltalake(
    "delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append",
)

# Read
dt = DeltaTable("delta-table", filesystem=filesystem)
```
## Time Travel
```python
import pandas as pd
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load a specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp (RFC 3339 string)
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: history() returns a list of dicts, one per commit
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
```
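`load_with_datetime` expects an RFC 3339 / ISO 8601 timestamp string. One way to build that string from a Python `datetime` (the `rfc3339` helper is a convenience sketch, not part of the `deltalake` API):

```python
from datetime import datetime, timezone

def rfc3339(dt: datetime) -> str:
    """Format an aware datetime as the RFC 3339 string load_with_datetime expects."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

ts = rfc3339(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
# ts == "2024-01-01T12:00:00Z"
```

Normalizing to UTC before formatting avoids ambiguity when the input datetime carries a non-UTC offset.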
## Maintenance Operations
```python
# Vacuum old files (retention in hours). Note: vacuum() defaults to
# dry_run=True; pass dry_run=False to actually delete. Retentions shorter
# than the table default (168h) also require enforce_retention_duration=False.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Optimize compaction (combine small files)
dt.optimize.compact()

# Get the file list
files = dt.files()
print(files)  # relative paths of the Parquet files in the table

# Get metadata
print(dt.metadata())
```
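For a partitioned table, the paths returned by `dt.files()` encode Hive-style partition values, which you can group on directly. A sketch using a hypothetical file list (the paths below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical output of dt.files() for a table partitioned by "date"
files = [
    "date=2024-01-01/part-00000.parquet",
    "date=2024-01-01/part-00001.parquet",
    "date=2024-01-02/part-00000.parquet",
]

# Group file names by their partition directory
by_partition = defaultdict(list)
for path in files:
    prefix, _, name = path.partition("/")
    by_partition[prefix].append(name)

print(dict(by_partition))
```

Many small files per partition is the signal that a compaction pass is due.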
## Incremental Processing

For change data capture (CDC) style patterns:
```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since the last checkpoint
last_version = get_checkpoint()  # your checkpoint tracking

# history() returns a list of dicts; keep only commits after the checkpoint
changes = [
    entry for entry in dt.history()
    if entry["version"] > last_version
]

# Or read the full snapshot and compare
df = dt.to_pandas()
# ... compare with the previous snapshot ...

# Update the checkpoint
save_checkpoint(dt.version())
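The `get_checkpoint()` / `save_checkpoint()` helpers above are left to the application. A minimal file-based sketch (the `checkpoint.json` path and JSON layout are illustrative, not part of `deltalake`):

```python
import json
from pathlib import Path

CHECKPOINT_PATH = Path("checkpoint.json")  # illustrative location

def save_checkpoint(version: int) -> None:
    """Persist the last processed Delta table version."""
    CHECKPOINT_PATH.write_text(json.dumps({"last_version": version}))

def get_checkpoint(default: int = -1) -> int:
    """Return the last processed version, or `default` on the first run."""
    if not CHECKPOINT_PATH.exists():
        return default  # first run: process everything
    return json.loads(CHECKPOINT_PATH.read_text())["last_version"]

save_checkpoint(7)
print(get_checkpoint())
```

In production you would typically store the checkpoint next to your job state (a database row, a small object in the same bucket) rather than a local file.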
## Best Practices
- ✅ Use environment variables for credentials in production (never hardcode)
- ✅ Partition tables by date/region for efficient querying
- ✅ Vacuum regularly to clean up old files (but retain enough history for your time-travel needs)
- ✅ Optimize periodically to compact small files
- ✅ Track versions for incremental processing using `dt.version()` and `dt.history()`
- ⚠️ Don't disable vacuum entirely: unreferenced files accumulate and storage costs grow
- ⚠️ Don't vacuum too aggressively: you'll lose time-travel capability
## Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.
For S3:

- Environment: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`
- IAM roles (EC2, ECS, Lambda) are used automatically when no explicit credentials are set
- For S3-compatible stores (MinIO): set `AWS_ENDPOINT_URL` in the environment or in `storage_options`
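For an S3-compatible endpoint such as MinIO, the endpoint URL goes into `storage_options` alongside the keys. A sketch (host, port, and credentials below are placeholders):

```python
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",            # placeholder credentials
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",  # MinIO endpoint
    "AWS_ALLOW_HTTP": "true",                     # needed for plain-HTTP endpoints
    "AWS_REGION": "us-east-1",
}
```

Without `AWS_ALLOW_HTTP`, the underlying object store typically rejects non-HTTPS endpoints.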
## Related

- @data-engineering-storage-lakehouse/delta-lake: Delta Lake concepts and API
- @data-engineering-core: Using Delta with DuckDB
- @data-engineering-storage-lakehouse: Comparisons with Iceberg, Hudi
## References

- deltalake Python API
- Delta Lake Documentation