
Delta Lake on Cloud Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install this skill with:

```shell
npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-delta-lake
```


Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python deltalake package.

Installation

```shell
pip install deltalake pyarrow
```

Configuration Patterns

Method 1: storage_options (Recommended)

The simplest approach using dictionary-based configuration:

```python
import os

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1",
}
# Alternatively, use environment variables (preferred for production):
# os.environ["AWS_ACCESS_KEY_ID"], etc.

# Write Delta table (pa_table: a pyarrow.Table holding the data to write)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"],
)

# Read Delta table
dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
df = dt.to_pandas()
```
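For the environment-variable route mentioned above, a small helper can assemble `storage_options` from the standard AWS variables and fail fast if any are missing. A minimal sketch; `s3_storage_options_from_env` is a hypothetical name, not part of the deltalake API:

```python
import os


def s3_storage_options_from_env() -> dict:
    """Build a deltalake storage_options dict from standard AWS env variables."""
    required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing AWS credentials in environment: {missing}")
    return {name: os.environ[name] for name in required}
```

Pass the returned dict as `storage_options=` to `write_deltalake` or `DeltaTable`, which keeps credentials out of source code.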

GCS configuration:

```python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT_KEY_JSON": "/path/to/key.json",
    # Or use the env var GOOGLE_APPLICATION_CREDENTIALS
}
```

Azure configuration:

```python
storage_options = {
    "AZURE_STORAGE_CONNECTION_STRING": "...",
    # OR: "AZURE_STORAGE_ACCOUNT_NAME" + "AZURE_STORAGE_ACCOUNT_KEY"
}
```

Method 2: PyArrow Filesystem (Advanced)

Use PyArrow filesystem objects for more control:

```python
import pyarrow.fs as fs
from deltalake import DeltaTable, write_deltalake

# Create a filesystem rooted at the bucket; table paths are then
# relative to that root
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (path is relative to the filesystem root)
write_deltalake(
    "delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append",
)

# Read: the Delta log is resolved from the table URI, while data scans
# can be routed through the PyArrow filesystem
dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
table = dt.to_pyarrow_table(filesystem=filesystem)
```

Time Travel

```python
import pandas as pd
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load a specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load the version that was current at a given timestamp
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: history() returns a list of commit-info dicts (newest first)
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
```
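Time travel also lets you see what a commit physically changed: capture `dt.files()` at two versions and diff the sets. The helper below is illustrative, not a deltalake API:

```python
def diff_files(old_files: list[str], new_files: list[str]) -> tuple[list[str], list[str]]:
    """Return (added, removed) Parquet file paths between two table versions."""
    old, new = set(old_files), set(new_files)
    return sorted(new - old), sorted(old - new)


# Typical use with time travel (sketch):
# dt.load_version(4); files_v4 = dt.files()
# dt.load_version(5); files_v5 = dt.files()
# added, removed = diff_files(files_v4, files_v5)
```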

Maintenance Operations

```python
# Vacuum old files (retention in hours). Note: vacuum() dry-runs by default
# and refuses retention below 168h (7 days) unless explicitly overridden.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Optimize compaction (combine small files)
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # Parquet files backing the current table version

# Get metadata
print(dt.metadata())
```
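To decide when compaction is worth running, you can look at file sizes before calling `dt.optimize.compact()`. A hedged sketch, assuming you can obtain file sizes in bytes (for example from your object store's listing API); the helper name and 128 MB threshold are assumptions:

```python
def count_small_files(file_sizes: list[int], threshold_bytes: int = 128 * 1024 * 1024) -> int:
    """Count files below a size threshold; many small files suggest compaction."""
    return sum(1 for size in file_sizes if size < threshold_bytes)
```

If the count is large relative to the total file count, a compaction pass will likely improve scan performance.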

Incremental Processing

For change data capture (CDC) patterns:

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since last checkpoint
last_version = get_checkpoint()  # your checkpoint tracking

# Inspect commits newer than the checkpoint
# (history() returns a list of commit-info dicts)
changes = [c for c in dt.history() if c["version"] > last_version]

# Or read the full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())
```
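The `get_checkpoint` / `save_checkpoint` helpers above are left undefined. One minimal local implementation, using a JSON file as the checkpoint store (the file name and helper names are illustrative), might look like:

```python
import json
from pathlib import Path

CHECKPOINT_PATH = Path("delta_checkpoint.json")


def get_checkpoint(default: int = -1) -> int:
    """Return the last processed table version, or `default` if none saved."""
    if CHECKPOINT_PATH.exists():
        return json.loads(CHECKPOINT_PATH.read_text())["version"]
    return default


def save_checkpoint(version: int) -> None:
    """Persist the last processed table version."""
    CHECKPOINT_PATH.write_text(json.dumps({"version": version}))
```

In production you would likely store the checkpoint somewhere durable and shared (a database row, an object in the same bucket) rather than a local file.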

Best Practices

  • ✅ Use environment variables for credentials in production (never hardcode)

  • ✅ Partition tables by date/region for efficient querying

  • ✅ Vacuum regularly to clean up old files (but retain enough for your time travel needs)

  • ✅ Optimize periodically to compact small files

  • ✅ Track versions for incremental processing using dt.version() and dt.history()

  • ⚠️ Don't skip vacuuming entirely - stale files accumulate and bloat storage

  • ⚠️ Don't vacuum too aggressively - you'll lose time travel capability
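The last two cautions can be encoded directly: derive `retention_hours` from the time-travel window you want to guarantee, plus a safety buffer. A sketch (the helper name and default buffer are assumptions):

```python
from datetime import timedelta


def retention_hours_for(time_travel_window: timedelta,
                        buffer: timedelta = timedelta(hours=24)) -> int:
    """Retention to pass to dt.vacuum() so the given window stays queryable."""
    return int((time_travel_window + buffer).total_seconds() // 3600)
```

For example, `dt.vacuum(retention_hours=retention_hours_for(timedelta(days=7)))` keeps at least a full week of versions queryable.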

Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.

For S3:

  • Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION

  • IAM roles (EC2, ECS, Lambda) override env vars

  • For S3-compatible (MinIO): AWS_ENDPOINT_URL or in storage_options
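For MinIO, the endpoint typically goes into `storage_options` alongside the credentials. A sketch; the example endpoint and the `AWS_ALLOW_HTTP` flag should be verified against your deltalake/object_store version:

```python
# MinIO / S3-compatible endpoint via storage_options
storage_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ALLOW_HTTP": "true",   # required for plain-HTTP endpoints
    "AWS_REGION": "us-east-1",  # MinIO accepts an arbitrary region string
}
```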

Related

  • @data-engineering-storage-lakehouse/delta-lake: Delta Lake concepts and API

  • @data-engineering-core: Using Delta with DuckDB

  • @data-engineering-storage-lakehouse: Comparisons with Iceberg, Hudi

References

  • deltalake Python API

  • Delta Lake Documentation
