# Apache Iceberg with Cloud Storage

Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.
## Installation

```bash
pip install "pyiceberg[pyarrow,pandas,glue]"   # AWS Glue catalog backend
# or
pip install "pyiceberg[pyarrow,pandas]"        # REST catalog (built into the core package)
```
## Catalog Configuration
### AWS Glue Catalog

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",    # optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",    # optional: uses env/IAM if omitted
    },
)
```
Credentials are read from environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or IAM roles by default. Pass them explicitly only when necessary.
### REST Catalog (Tabular, custom REST service)

```python
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    },
)
```
### Hive Metastore

```python
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    },
)
```
### Local Development (No Catalog)

```python
from pyiceberg.catalog import InMemoryCatalog

catalog = InMemoryCatalog("local")
```

The in-memory catalog keeps all metadata in process memory, so tables disappear when the process exits; use it only for tests and quick experiments.
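For a local setup that persists across runs, a SQLite-backed `SqlCatalog` is a common alternative. A minimal sketch, assuming the `sql-sqlite` extra is installed and using placeholder paths:

```python
from pyiceberg.catalog.sql import SqlCatalog

# Catalog metadata lives in a local SQLite file; table data is written
# under `warehouse`. Both paths below are placeholders.
catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:///pyiceberg_catalog.db",
        "warehouse": "file:///tmp/warehouse",
    },
)
```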
## Table Operations
### Load existing table

```python
table = catalog.load_table("db.my_table")
```
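If the table does not exist yet, it can be created through the catalog. A minimal sketch, assuming a `db` namespace and an illustrative schema matching the examples below:

```python
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, LongType, NestedField

# Illustrative schema; field IDs are assigned explicitly and must be unique.
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="value", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="year", field_type=IntegerType(), required=False),
)

catalog.create_namespace("db")  # one-time setup; raises if the namespace already exists
table = catalog.create_table("db.my_table", schema=schema)
```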
### Scan with filter pushdown

```python
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp"),
)
df = scan.to_pandas()  # or .to_arrow(), .to_polars()
```
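The string filter is parsed into the same expression tree that can be built programmatically, which avoids quoting and escaping issues; for example:

```python
from pyiceberg.expressions import And, EqualTo

# Equivalent to row_filter="year = 2024 AND country = 'USA'"
scan = table.scan(
    row_filter=And(EqualTo("year", 2024), EqualTo("country", "USA")),
    selected_fields=("id", "value", "timestamp"),
)
```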
### Append data

```python
import pyarrow as pa

new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024],
})
table.append(new_data)
```
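Each `append()` call commits a snapshot of its own. To group several writes into one atomic commit, recent PyIceberg versions expose a transaction API; a sketch, where `more_data` stands in for another Arrow table with the same schema:

```python
# Both appends become visible together when the block exits; readers
# never observe a partially applied state.
with table.transaction() as tx:
    tx.append(new_data)
    tx.append(more_data)  # assumed: another pyarrow.Table with the same schema
```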
### Overwrite (replaces entire table)

```python
table.overwrite(new_data)
```
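To remove only the rows matching a predicate instead of replacing the whole table, recent PyIceberg releases also provide a row-level delete; a brief sketch:

```python
# Rewrites only the data files that contain matching rows (copy-on-write).
table.delete(delete_filter="year < 2024")
```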
## Schema Evolution
### Add column (non-breaking)

```python
from pyiceberg.types import StringType

with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)
```
### Upgrade column type (e.g., int → long)

```python
from pyiceberg.types import LongType

with table.update_schema() as update:
    update.update_column("population", field_type=LongType())
```
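The same context manager covers other evolutions; because Iceberg tracks columns by field ID, a rename requires no data rewrite. A short sketch with illustrative column names:

```python
with table.update_schema() as update:
    update.rename_column("value", "amount")  # metadata-only: field IDs stay stable
    update.delete_column("country")          # breaking for readers that expect the column
```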
## Cloud Storage Authentication
See @data-engineering-storage-authentication for:

- AWS: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, IAM roles
- GCS: `GOOGLE_APPLICATION_CREDENTIALS`
- Azure: `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`
PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.
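In practice that means credentials can stay out of code entirely. A minimal sketch for Glue on S3, assuming the variables (or an IAM role) are already set in the environment:

```python
from pyiceberg.catalog import load_catalog

# No keys in code: the S3 FileIO resolves AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY, or falls back to the ambient IAM role.
catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})
```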
## Best Practices

- ✅ **Use a catalog** - Never manage Iceberg tables without catalog metadata
- ✅ **Leverage partition evolution** - Change partition specs without rewriting data
- ✅ **Archive old snapshots** - Run `expire_snapshots()` to limit metadata growth
- ✅ **Schema evolution over schema enforcement** - Iceberg is designed for evolving schemas
- ⚠️ **Monitor table metadata size** - Large snapshot histories slow scan planning and commits (see the inspection sketch after this list)
- ⚠️ **Don't use local filesystem for production** - Use a shared catalog (Glue, Hive, REST)
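For the metadata-size point above, snapshot growth can be checked directly from Python; a sketch, assuming a PyIceberg version that ships the `table.inspect` metadata tables:

```python
# One row per retained snapshot (committed_at, snapshot_id, operation, ...).
snapshots = table.inspect.snapshots()
print(f"{snapshots.num_rows} snapshots retained")
```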
## Performance

- ✅ **Predicate pushdown**: Use `row_filter` in `scan()` to skip irrelevant files
- ✅ **Column pruning**: Use `selected_fields` to read only needed columns
- ✅ **Batch operations**: Append multiple records at once for better throughput (see the sketch after this list)
- ✅ **PyArrow backend**: Use PyArrow tables (not pandas) for zero-copy operations
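To make the batching point concrete: collect rows into a single Arrow table and commit once, rather than paying one snapshot per record. A sketch with made-up rows:

```python
import pyarrow as pa

# Accumulate rows in memory, then write them in a single commit.
rows = [
    {"id": 6, "value": 600.0, "year": 2024},
    {"id": 7, "value": 700.0, "year": 2024},
]
table.append(pa.Table.from_pylist(rows))  # one snapshot, not len(rows) snapshots
```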
## Related Skills

- @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
- @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access
## References

- [PyIceberg Documentation](https://py.iceberg.apache.org/)
- [Apache Iceberg Specification](https://iceberg.apache.org/spec/)
- [Iceberg Catalog Configurations](https://py.iceberg.apache.org/configuration/)