# Apache Iceberg with Cloud Storage

Configuring PyIceberg catalogs to store Iceberg tables on S3, GCS, or Azure Blob Storage.
## Installation

```bash
pip install "pyiceberg[pyarrow,pandas,glue]"   # AWS Glue catalog backend
# or
pip install "pyiceberg[pyarrow,pandas]"        # REST catalog (built into the core package)
```
## Catalog Configuration
### AWS Glue Catalog

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-east-1",
        "s3.access-key-id": "AKIA...",    # optional: uses env/IAM if omitted
        "s3.secret-access-key": "...",    # optional: uses env/IAM if omitted
    },
)
```
Credentials are read from environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or IAM roles by default. Pass them explicitly only when necessary.
### REST Catalog (Tabular, custom REST service)

```python
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://iceberg-catalog.example.com",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    },
)
```
### Hive Metastore

```python
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://localhost:9083",
        "s3.endpoint": "http://minio:9000",
    },
)
```
### Local Development (No Catalog)

```python
from pyiceberg.catalog import InMemoryCatalog

catalog = InMemoryCatalog("local")
```

The in-memory catalog keeps all metadata in process memory, so tables disappear when the process exits; use it only for tests and quick experiments.
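For a local setup that persists across runs, a SQLite-backed `SqlCatalog` is a common alternative. A minimal sketch, assuming the `sql-sqlite` extra is installed and using placeholder paths:

```python
from pyiceberg.catalog.sql import SqlCatalog

# Catalog metadata lives in a local SQLite file; table data is written
# under `warehouse`. Both paths below are placeholders.
catalog = SqlCatalog(
    "local",
    **{
        "uri": "sqlite:///pyiceberg_catalog.db",
        "warehouse": "file:///tmp/warehouse",
    },
)
```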
## Table Operations
### Load existing table

```python
table = catalog.load_table("db.my_table")
```
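If the table does not exist yet, it can be created through the catalog. A minimal sketch, assuming a `db` namespace and an illustrative schema matching the examples below:

```python
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, LongType, NestedField

# Illustrative schema; field IDs are assigned explicitly and must be unique.
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="value", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="year", field_type=IntegerType(), required=False),
)

catalog.create_namespace("db")  # one-time setup; raises if the namespace already exists
table = catalog.create_table("db.my_table", schema=schema)
```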
### Scan with filter pushdown

```python
scan = table.scan(
    row_filter="year = 2024 AND country = 'USA'",
    selected_fields=("id", "value", "timestamp"),
)
df = scan.to_pandas()  # or .to_arrow(), .to_polars()
```
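The string filter is parsed into the same expression tree that can be built programmatically, which avoids quoting and escaping issues; for example:

```python
from pyiceberg.expressions import And, EqualTo

# Equivalent to row_filter="year = 2024 AND country = 'USA'"
scan = table.scan(
    row_filter=And(EqualTo("year", 2024), EqualTo("country", "USA")),
    selected_fields=("id", "value", "timestamp"),
)
```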
### Append data

```python
import pyarrow as pa

new_data = pa.table({
    "id": [4, 5],
    "value": [400.0, 500.0],
    "year": [2024, 2024],
})
table.append(new_data)
```
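Each `append()` call commits a snapshot of its own. To group several writes into one atomic commit, recent PyIceberg versions expose a transaction API; a sketch, where `more_data` stands in for another Arrow table with the same schema:

```python
# Both appends become visible together when the block exits; readers
# never observe a partially applied state.
with table.transaction() as tx:
    tx.append(new_data)
    tx.append(more_data)  # assumed: another pyarrow.Table with the same schema
```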
### Overwrite (replaces entire table)

```python
table.overwrite(new_data)
```
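To remove only the rows matching a predicate instead of replacing the whole table, recent PyIceberg releases also provide a row-level delete; a brief sketch:

```python
# Rewrites only the data files that contain matching rows (copy-on-write).
table.delete(delete_filter="year < 2024")
```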
## Schema Evolution
### Add column (non-breaking)

```python
from pyiceberg.types import StringType

with table.update_schema() as update:
    update.add_column("country", StringType(), required=False)
```
### Upgrade column type (e.g., int → long)

```python
from pyiceberg.types import LongType

with table.update_schema() as update:
    update.update_column("population", field_type=LongType())
```
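The same context manager covers other evolutions; because Iceberg tracks columns by field ID, a rename requires no data rewrite. A short sketch with illustrative column names:

```python
with table.update_schema() as update:
    update.rename_column("value", "amount")  # metadata-only: field IDs stay stable
    update.delete_column("country")          # breaking for readers that expect the column
```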
## Cloud Storage Authentication
See @data-engineering-storage-authentication for:

- AWS: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, IAM roles
- GCS: `GOOGLE_APPLICATION_CREDENTIALS`
- Azure: `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`
PyIceberg catalogs automatically detect these environment variables. Only provide explicit credentials for local development or non-standard setups.
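In practice that means credentials can stay out of code entirely. A minimal sketch for Glue on S3, assuming the variables (or an IAM role) are already set in the environment:

```python
from pyiceberg.catalog import load_catalog

# No keys in code: the S3 FileIO resolves AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY, or falls back to the ambient IAM role.
catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})
```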
## Best Practices

- ✅ **Use a catalog** - Never manage Iceberg tables without catalog metadata
- ✅ **Leverage partition evolution** - Change partition specs without rewriting data
- ✅ **Archive old snapshots** - Run `expire_snapshots()` to limit metadata growth
- ✅ **Schema evolution over schema enforcement** - Iceberg is designed for evolving schemas
- ⚠️ **Monitor table metadata size** - Large snapshot histories slow scan planning and commits (see the inspection sketch after this list)
- ⚠️ **Don't use local filesystem for production** - Use a shared catalog (Glue, Hive, REST)
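For the metadata-size point above, snapshot growth can be checked directly from Python; a sketch, assuming a PyIceberg version that ships the `table.inspect` metadata tables:

```python
# One row per retained snapshot (committed_at, snapshot_id, operation, ...).
snapshots = table.inspect.snapshots()
print(f"{snapshots.num_rows} snapshots retained")
```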
## Performance

- ✅ **Predicate pushdown**: Use `row_filter` in `scan()` to skip irrelevant files
- ✅ **Column pruning**: Use `selected_fields` to read only needed columns
- ✅ **Batch operations**: Append multiple records at once for better throughput (see the sketch after this list)
- ✅ **PyArrow backend**: Use PyArrow tables (not pandas) for zero-copy operations
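To make the batching point concrete: collect rows into a single Arrow table and commit once, rather than paying one snapshot per record. A sketch with made-up rows:

```python
import pyarrow as pa

# Accumulate rows in memory, then write them in a single commit.
rows = [
    {"id": 6, "value": 600.0, "year": 2024},
    {"id": 7, "value": 700.0, "year": 2024},
]
table.append(pa.Table.from_pylist(rows))  # one snapshot, not len(rows) snapshots
```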
## Related Skills

- @data-engineering-storage-lakehouse/iceberg.md - Iceberg concepts and detailed API
- @data-engineering-storage-lakehouse - Delta Lake vs Iceberg comparison
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem for direct S3/GCS access
## References

- [PyIceberg Documentation](https://py.iceberg.apache.org/)
- [Apache Iceberg Specification](https://iceberg.apache.org/spec/)
- [Iceberg Catalog Configurations](https://py.iceberg.apache.org/configuration/)