data-engineering-storage-remote-access-integrations-iceberg

Apache Iceberg with Cloud Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant:

Install skill "data-engineering-storage-remote-access-integrations-iceberg" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-iceberg

Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.

Installation

pip install "pyiceberg[pyarrow,pandas,aws]"  # AWS backend (quotes keep the shell from expanding the brackets)

or

pip install "pyiceberg[pyarrow,rest]"  # REST catalog

Catalog Configuration

AWS Glue Catalog

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",   # optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",
    },
)

Credentials are read from environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or IAM roles by default. Pass credentials explicitly only when necessary.

REST Catalog (Tabular, custom REST service)

catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    },
)

Hive Metastore

catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    },
)

Local Development (No Catalog)

from pyiceberg.catalog import InMemoryCatalog

catalog = InMemoryCatalog("local")

An in-memory catalog holds table metadata only for the lifetime of the process (data files go to the configured warehouse path); use it for tests and experimentation, not for anything that must persist.

Table Operations

Load existing table

table = catalog.load_table("db.my_table")

Scan with filter pushdown

scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp"),
)
df = scan.to_pandas()  # or .to_arrow(), .to_polars()

Append data

import pyarrow as pa

new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024],
})
table.append(new_data)

Overwrite (replaces entire table)

table.overwrite(new_data)

Schema Evolution

Add column (non-breaking)

from pyiceberg.types import StringType

with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)

Upgrade column type (e.g., int → long)

from pyiceberg.types import LongType

with table.update_schema() as update:
    # int → long is an allowed (lossless) type promotion in Iceberg
    update.update_column("population", field_type=LongType())

Cloud Storage Authentication

See @data-engineering-storage-authentication for:

  • AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, IAM roles

  • GCS: GOOGLE_APPLICATION_CREDENTIALS

  • Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY

PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.

Best Practices

  • ✅ Use a catalog - Never manage Iceberg tables without catalog metadata

  • ✅ Leverage partition evolution - Change partition specs without rewriting data

  • ✅ Archive old snapshots - Run expire_snapshots() to limit metadata growth

  • ✅ Schema evolution over schema enforcement - Iceberg is designed for evolving schemas

  • ⚠️ Monitor table metadata size - Large histories slow operations

  • ⚠️ Don't use local filesystem for production - Use a shared catalog (Glue, Hive, REST)

Performance

  • ✅ Predicate pushdown: Use row_filter in scan() to skip irrelevant files

  • ✅ Column pruning: Use selected_fields to read only needed columns

  • ✅ Batch operations: Append multiple records at once for better throughput

  • ✅ PyArrow backend: Use PyArrow tables (not pandas) for zero-copy operations

Related Skills

  • @data-engineering-storage-lakehouse/iceberg.md

  • Iceberg concepts and detailed API

  • @data-engineering-storage-lakehouse

  • Delta Lake vs Iceberg comparison

  • @data-engineering-storage-remote-access/libraries/pyarrow-fs

  • PyArrow filesystem for direct S3/GCS access

References

  • PyIceberg Documentation

  • Apache Iceberg Specification

  • Iceberg Catalog Configurations
