Databricks Core Workflow A: Delta Lake ETL
Contents

- Overview
- Prerequisites
- Instructions
- Output
- Error Handling
- Examples
- Resources
Overview
Build production Delta Lake ETL pipelines using medallion architecture (Bronze -> Silver -> Gold).
Prerequisites
- Completed databricks-install-auth setup
- Understanding of Delta Lake concepts
- Unity Catalog configured (recommended)
Medallion Architecture
Raw Sources -> Bronze (Raw/Landing) -> Silver (Cleaned/Business Logic) -> Gold (Aggregated/Analytics Ready)
Instructions
Step 1: Bronze Layer - Raw Ingestion
Ingest raw data with metadata columns (`_ingested_at`, `_source_file`). Use the `mergeSchema` option for schema evolution. Use Auto Loader (`cloudFiles`) for streaming ingestion with schema inference.
Step 2: Silver Layer - Data Cleansing
Read from Bronze using Change Data Feed. Apply transformations: trim/lowercase strings, parse timestamps, hash PII, filter nulls, and generate surrogate keys. Merge into Silver with an upsert pattern.
Step 3: Gold Layer - Business Aggregations
Aggregate Silver data by business dimensions and time grain. Use partition-level overwrites for efficient updates.
Step 4: Delta Live Tables (Optional)
Declarative pipeline with `@dlt.table` decorators and data quality expectations (`@dlt.expect_or_drop`).
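A minimal DLT sketch of the Bronze and Silver tables above. Note that `dlt` (and the ambient `spark` session) are only available inside a Databricks DLT pipeline, so this file is attached to a pipeline rather than run locally; the paths and expectation rule are illustrative.

```python
import dlt  # available only inside a Databricks DLT pipeline
from pyspark.sql import functions as F


@dlt.table(comment="Raw orders landed by Auto Loader")
def bronze_orders():
    # `spark` is provided by the DLT runtime.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders/"))


@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows failing the rule
def silver_orders():
    return (dlt.read_stream("bronze_orders")
            .withColumn("order_ts", F.to_timestamp("order_ts")))
```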
See detailed implementation for complete Bronze/Silver/Gold pipeline code, Auto Loader config, DLT pipeline, and orchestration example.
Output
- Bronze layer with raw data and ingestion metadata
- Silver layer with cleansed, deduplicated data
- Gold layer with business aggregations
- Delta Lake tables with ACID transactions
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| Schema mismatch | Source schema changed | Use the `mergeSchema` option |
| Duplicate records | Missing deduplication | Add merge logic with primary keys |
| Null values | Data quality issues | Add expectations/filters in Silver |
| Memory errors | Large aggregations | Increase cluster size or partition data |
Examples
Quick Pipeline Run
Full medallion pipeline
```python
bronze.ingest_to_bronze(spark, "/mnt/landing/orders/", "catalog.bronze.orders")
silver.transform_to_silver(spark, "catalog.bronze.orders", "catalog.silver.orders", primary_keys=["order_id"])
gold.aggregate_to_gold(spark, "catalog.silver.orders", "catalog.gold.metrics", group_by_columns=["region"])
```
Resources
- Delta Lake Guide
- Medallion Architecture
- Delta Live Tables
- Auto Loader
Next Steps
For ML workflows, see databricks-core-workflow-b.