# Databricks Reference Architecture

## Overview

A production-ready lakehouse architecture built on Unity Catalog, Delta Lake, and the medallion pattern. It covers workspace organization, data governance, and CI/CD integration for enterprise data platforms.
## Prerequisites

- Databricks workspace with Unity Catalog enabled
- Understanding of the medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Databricks Asset Bundles for infrastructure
## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                        Unity Catalog                        │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐            │
│   │  Bronze  │     │  Silver  │     │   Gold   │            │
│   │ Catalog  │─▶  │ Catalog  │─▶  │ Catalog  │            │
│   │  (raw)   │     │ (clean)  │     │ (curated)│            │
│   └──────────┘     └──────────┘     └──────────┘            │
│        ▲                                 │                  │
│        │                                 ▼                  │
│   ┌──────────┐                  ┌──────────────┐            │
│   │ Ingestion│                  │  ML Models   │            │
│   │   Jobs   │                  │   (MLflow)   │            │
│   └──────────┘                  └──────────────┘            │
├─────────────────────────────────────────────────────────────┤
│ Compute: Job Clusters │ SQL Warehouses │ Interactive        │
├─────────────────────────────────────────────────────────────┤
│ Governance: Row-Level Security │ Column Masking │ Lineage   │
└─────────────────────────────────────────────────────────────┘
```
## Project Structure

```
databricks-project/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py
│   │   ├── bronze_api_data.py
│   │   └── bronze_file_uploads.py
│   ├── transformation/
│   │   ├── silver_clean_events.py
│   │   ├── silver_deduplicate.py
│   │   └── silver_schema_enforce.py
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py
│   │   ├── gold_user_features.py
│   │   └── gold_ml_features.py
│   └── ml/
│       ├── training/
│       └── inference/
├── tests/
│   ├── unit/
│   └── integration/
├── databricks.yml        # Asset Bundle config
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── conf/
    ├── dev.yml
    ├── staging.yml
    └── prod.yml
```
## Instructions

### Step 1: Configure Unity Catalog Hierarchy

```sql
-- Create one catalog per environment
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Create one schema per medallion layer
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;

-- Grant permissions (backquote principal names that contain hyphens)
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `analysts`;
```
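In practice these grants are usually scripted rather than run by hand. The sketch below generates the GRANT statements for each medallion schema so they can be applied in a setup job via `spark.sql()`; the `LAYER_GRANTS` mapping and the `grant_statements` helper are illustrative assumptions, not a Databricks API.

```python
# Assumed layer -> (principal, privileges) mapping; adjust to your groups.
LAYER_GRANTS = {
    "bronze": [("data-engineers", ["USE SCHEMA", "SELECT", "MODIFY"])],
    "silver": [("data-engineers", ["USE SCHEMA", "SELECT", "MODIFY"])],
    "gold": [
        ("data-engineers", ["USE SCHEMA", "SELECT", "MODIFY"]),
        ("analysts", ["USE SCHEMA", "SELECT"]),
    ],
}

def grant_statements(catalog: str) -> list[str]:
    """Build GRANT statements for every medallion schema in a catalog."""
    stmts = [f"GRANT USE CATALOG ON CATALOG {catalog} TO `data-engineers`"]
    for schema, grants in LAYER_GRANTS.items():
        for principal, privileges in grants:
            for privilege in privileges:
                stmts.append(
                    f"GRANT {privilege} ON SCHEMA {catalog}.{schema} TO `{principal}`"
                )
    return stmts
```

In a Databricks notebook you would then run `for stmt in grant_statements("prod_catalog"): spark.sql(stmt)`.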
### Step 2: Define Asset Bundle Configuration

`databricks.yml`:

```yaml
bundle:
  name: data-platform

workspace:
  host: https://your-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/ingestion/bronze_raw_events.py
          job_cluster_key: etl_cluster
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/transformation/silver_clean_events.py
          job_cluster_key: etl_cluster
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            # num_workers and autoscale are mutually exclusive; use autoscale
            autoscale:
              min_workers: 1
              max_workers: 4

targets:
  dev:
    default: true
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com
```
### Step 3: Implement Medallion Pipeline Pattern

`src/ingestion/bronze_raw_events.py`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion with audit metadata via Auto Loader
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/schema")
    .load("/data/raw/events/")
)

# Read the source path from the _metadata column; input_file_name() is not
# supported on Unity Catalog shared clusters.
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))
)

(bronze_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events")
    .toTable("prod_catalog.bronze.raw_events"))
```
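The essence of the bronze step is that every raw record gains audit columns before landing in the bronze table. A plain-Python model of that enrichment, runnable without a Spark session, looks like this; the real pipeline does it with Spark column expressions rather than per-row loops, and `enrich_bronze` is a hypothetical helper for illustration.

```python
from datetime import datetime, timezone

def enrich_bronze(records: list[dict], source_file: str) -> list[dict]:
    """Attach ingestion metadata to each raw record (bronze-layer pattern)."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {**rec, "_ingested_at": now, "_source_file": source_file}
        for rec in records
    ]
```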
## Error Handling

| Issue | Cause | Solution |
| --- | --- | --- |
| Schema evolution failure | New columns in source | Enable the `mergeSchema` option |
| Job cluster timeout | Long-running tasks | Use autoscaling; increase the job timeout |
| Permission denied | Missing catalog grants | Check Unity Catalog ACLs |
| Delta version conflict | Concurrent writes | Use `MERGE` instead of `INSERT` |
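The last row deserves a concrete illustration: a `MERGE` keyed on a business key updates matching rows and inserts new ones, so retried or concurrent writers do not blindly append duplicates the way repeated `INSERT`s do. The sketch below models `MERGE INTO ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT` semantics in plain Python; `merge_upsert` is an illustrative helper, not a Delta Lake API.

```python
def merge_upsert(
    target: dict[str, dict], updates: list[dict], key: str = "id"
) -> dict[str, dict]:
    """Apply MERGE semantics: update rows on key match, insert otherwise."""
    merged = dict(target)
    for row in updates:
        # Matched keys are updated in place; unmatched keys are inserted.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return merged
```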
## Examples

### Quick Medallion Validation

```python
# Validate that data has flowed through each medallion layer
for layer in ["bronze", "silver", "gold"]:
    count = spark.table(f"prod_catalog.{layer}.events").count()
    print(f"{layer}: {count:,} rows")
```
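Counting rows is only useful if something acts on the counts. Silver and gold normally hold fewer rows than bronze (deduplication, filtering), but a layer that is empty while its upstream layer is not usually indicates a broken job. The check below takes the counts as a plain dict so it runs without a Spark session; the empty-downstream rule is an illustrative heuristic.

```python
def check_layer_counts(counts: dict[str, int]) -> list[str]:
    """Return warnings for empty layers that have a non-empty upstream layer."""
    order = ["bronze", "silver", "gold"]
    warnings = []
    for upstream, downstream in zip(order, order[1:]):
        if counts.get(upstream, 0) > 0 and counts.get(downstream, 0) == 0:
            warnings.append(f"{downstream} is empty but {upstream} has rows")
    return warnings
```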
## Resources

- Databricks Asset Bundles
- Unity Catalog Best Practices
- Medallion Architecture
## Output

- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale