data-architecture

Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.

Install skill "data-architecture" with this command: npx skills add melodic-software/claude-code-plugins/melodic-software-claude-code-plugins-data-architecture

Data Architecture

When to Use This Skill

  • Choosing between data lake, warehouse, and lakehouse

  • Designing a modern data platform

  • Implementing data mesh principles

  • Planning data storage strategy

  • Understanding data architecture trade-offs

Data Architecture Evolution

Generation 1: Data Warehouse (1990s-2000s)

  • Structured data only
  • ETL into warehouse
  • Star/snowflake schemas
  • SQL-based analytics

Generation 2: Data Lake (2010s)

  • All data types (structured, semi, unstructured)
  • Schema-on-read
  • Hadoop/HDFS based
  • Cheap storage, complex processing

Generation 3: Lakehouse (2020s)

  • Best of both: lake flexibility + warehouse features
  • ACID transactions on lake
  • Schema enforcement optional
  • Unified analytics and ML

Architecture Comparison

Data Warehouse

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘

Characteristics:

  • Schema-on-write
  • Optimized for SQL queries
  • Structured data only
  • High data quality
  • Expensive storage

Best for:

  • Business intelligence
  • Financial reporting
  • Structured analytics

Data Lake

┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └─────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     ┌─────────┐      ┌─────────┐      ┌─────────┐
     │   ML    │      │   ETL   │      │  Spark  │
     │ Training│      │  to DW  │      │ Analysis│
     └─────────┘      └─────────┘      └─────────┘

Characteristics:

  • Schema-on-read
  • All data types
  • Cheap storage
  • Flexible processing
  • Risk of "data swamp"

Best for:

  • Data science/ML
  • Unstructured data
  • Experimental analysis
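
The difference between schema-on-write (warehouse) and schema-on-read (lake) can be illustrated with a small sketch. The `SCHEMA` dictionary, the field names, and both helper functions are hypothetical and exist only for this example; real engines enforce schemas very differently.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"order_id": int, "amount": float}

def write_validated(table, record):
    """Schema-on-write: reject a record that does not match SCHEMA before storing it."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: store anything, apply SCHEMA only when the data is read."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        try:
            rows.append({f: t(doc[f]) for f, t in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # unreadable under this schema; the raw line stays in the lake
    return rows

warehouse = []
write_validated(warehouse, {"order_id": 1, "amount": 9.5})

# The lake accepts both lines as-is; only the first one fits SCHEMA at read time.
lake = ['{"order_id": "2", "amount": "4.25"}', '{"note": "free text"}']
parsed = read_with_schema(lake)
```

The second raw line is never rejected at ingestion, which is the flexibility of a lake and also how a "data swamp" can build up unnoticed.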

Data Lakehouse

┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │      Metadata Layer      │   │
                    │  │   (Delta/Iceberg/Hudi)   │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │      Storage Layer       │   │
                    │  │     (Object Storage)     │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                     │
                ┌────────────────────┼────────────────────┐
                ▼                    ▼                    ▼
           ┌─────────┐          ┌─────────┐          ┌─────────┐
           │   SQL   │          │   ML    │          │ Stream  │
           │   BI    │          │ Workload│          │ Process │
           └─────────┘          └─────────┘          └─────────┘

Characteristics:

  • ACID transactions
  • Schema evolution
  • Time travel
  • Unified batch/streaming
  • Open formats

Best for:

  • Unified analytics
  • Both BI and ML
  • Modern data platforms

Architecture Selection Guide

Factor             Warehouse   Lake         Lakehouse
Data types         Structured  All          All
Query performance  Excellent   Poor-Medium  Good
Data quality       High        Variable     Configurable
Cost               High        Low          Medium
ML workloads       Limited     Excellent    Excellent
Real-time          Limited     Good         Good
Governance         Strong      Weak         Strong
Complexity         Low         High         Medium

Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
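
The decision tree can be written as a small function, which makes the order of the questions explicit. The function name and parameter names are illustrative assumptions; the branches follow the tree exactly.

```python
def choose_architecture(mostly_structured_bi: bool,
                        needs_ml_and_bi: bool = False,
                        primarily_ml_unstructured: bool = False) -> str:
    """Walk the decision tree top to bottom and return the suggested architecture."""
    if mostly_structured_bi:
        return "Data Warehouse"
    if needs_ml_and_bi:
        return "Lakehouse"
    if primarily_ml_unstructured:
        return "Data Lake"
    return "Lakehouse"  # default when no earlier branch applies

suggestion = choose_architecture(mostly_structured_bi=False, needs_ml_and_bi=True)
```

Note that the first question wins: a mostly structured, BI-focused workload is routed to a warehouse before the ML questions are asked.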

Lakehouse Technologies

Delta Lake (Databricks)

Features:

  • ACID transactions
  • Time travel (data versioning)
  • Schema enforcement/evolution
  • Unified batch/streaming
  • Optimized performance (Z-ordering, compaction)

File format: Parquet + Delta log
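
To show how an ordered log layered over data files can provide both atomic commits and time travel, here is a toy model. It is a conceptual sketch only: `ToyTableLog` and the file names are invented for this example, and the real Delta transaction protocol has more action types, checkpoints, and concurrency control.

```python
class ToyTableLog:
    """Toy commit log: each commit lists the data files it adds and removes."""

    def __init__(self):
        self.commits = []  # index in this list == table version

    def commit(self, add=(), remove=()):
        """Record one commit and return its version number."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version to get the set of live data files."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live

log = ToyTableLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# A rewrite (for example an update) adds a new file and retires an old one in one commit.
v2 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

latest = log.snapshot()        # current table state
as_of_v0 = log.snapshot(v0)    # "time travel" to the first version
```

Readers only ever see the files named by a complete commit, so a half-written file is invisible until its commit lands.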

Apache Iceberg (Netflix)

Features:

  • ACID transactions
  • Hidden partitioning
  • Schema evolution
  • Time travel
  • Vendor neutral

File format: Parquet/ORC/Avro + metadata
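
Hidden partitioning means the table derives partition values from a source column, so writers and readers work with the column itself rather than a separate partition field. The sketch below models that idea in plain Python; `ToyPartitionedTable`, `day_transform`, and the `event_ts` column are hypothetical names, not Iceberg APIs.

```python
from collections import defaultdict
from datetime import datetime

def day_transform(ts: datetime) -> str:
    """Partition transform: map a timestamp to its calendar day."""
    return ts.strftime("%Y-%m-%d")

class ToyPartitionedTable:
    def __init__(self, source_column, transform):
        self.source_column = source_column
        self.transform = transform
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row):
        # The writer supplies only the source column; the partition is derived.
        self.partitions[self.transform(row[self.source_column])].append(row)

    def scan(self, start, end):
        """Filter on the source column; partitions outside the range are skipped."""
        lo, hi = self.transform(start), self.transform(end)
        touched = [p for p in self.partitions if lo <= p <= hi]
        rows = [r for p in touched for r in self.partitions[p]
                if start <= r[self.source_column] <= end]
        return rows, touched

events = ToyPartitionedTable("event_ts", day_transform)
for ts in (datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 8), datetime(2024, 1, 3, 10)):
    events.append({"event_ts": ts})

rows, touched = events.scan(datetime(2024, 1, 2), datetime(2024, 1, 2, 23, 59))
```

The query mentions only `event_ts`, yet a single day partition is read.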

Apache Hudi (Uber)

Features:

  • ACID transactions
  • Incremental processing
  • Record-level updates
  • Time travel
  • Optimized for streaming

File format: Parquet + Hudi metadata
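
Record-level updates and incremental processing can be pictured as an upsert keyed by record ID plus a query for "everything changed since commit N". This is a conceptual sketch, not Hudi's API; `ToyUpsertTable` and its integer commit times are assumptions made for the example.

```python
class ToyUpsertTable:
    def __init__(self):
        self.rows = {}        # record key -> (commit_time, row)
        self.commit_time = 0  # simple counter standing in for a commit timestamp

    def upsert(self, records, key="id"):
        """Insert new records and overwrite existing ones with the same key."""
        self.commit_time += 1
        for rec in records:
            self.rows[rec[key]] = (self.commit_time, rec)
        return self.commit_time

    def incremental(self, since):
        """Return only the records written after commit `since`."""
        return [row for t, row in self.rows.values() if t > since]

orders = ToyUpsertTable()
c1 = orders.upsert([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
c2 = orders.upsert([{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])

changed = orders.incremental(since=c1)
```

A downstream job that remembers the last commit it processed can pull `changed` instead of rescanning the whole table.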

Technology Comparison

Feature           Delta Lake  Iceberg    Hudi
ACID              Yes         Yes        Yes
Time Travel       Yes         Yes        Yes
Schema Evolution  Good        Excellent  Good
Streaming         Excellent   Good       Excellent
Ecosystem         Databricks  Wide       Wide
Performance       Excellent   Excellent  Good
Community         Large       Growing    Medium

Data Mesh

Principles

Data Mesh = Decentralized data architecture

Four Principles:

  1. Domain Ownership

    • Data owned by domain teams
    • Not centralized data team
  2. Data as a Product

    • Treat data like a product
    • Quality, discoverability, usability
  3. Self-Serve Platform

    • Platform enables domain teams
    • Reduces friction
  4. Federated Governance

    • Global standards
    • Local implementation

Data Products

Data Product = Autonomous unit of data

Components:
┌──────────────────────────────────────┐
│             Data Product             │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │         Quality + SLAs         │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
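
One way to make these components concrete is a descriptor that a platform could validate before a data product is published. The `DataProduct` class, its field names, and the example paths are hypothetical; they simply mirror the five boxes in the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner_domain: str
    tables: list = field(default_factory=list)          # Data
    schema_doc: str = ""                                 # Metadata
    pipeline: str = ""                                   # Code (ETL)
    api_endpoint: str = ""                               # APIs (access layer)
    quality_checks: list = field(default_factory=list)   # Quality
    freshness_sla_hours: int = 0                         # SLA

    def missing_components(self):
        """List the diagram components this product has not filled in yet."""
        checks = {
            "data": bool(self.tables),
            "metadata": bool(self.schema_doc),
            "code": bool(self.pipeline),
            "apis": bool(self.api_endpoint),
            "quality_slas": bool(self.quality_checks) and self.freshness_sla_hours > 0,
        }
        return [name for name, present in checks.items() if not present]

orders = DataProduct(
    name="orders", owner_domain="sales",
    tables=["orders_daily"], schema_doc="docs/orders.md",
    pipeline="pipelines/orders.py", api_endpoint="/data-products/orders",
    quality_checks=["not_null(order_id)"], freshness_sla_hours=24,
)
draft = DataProduct(name="returns", owner_domain="sales", tables=["returns_raw"])
gaps = draft.missing_components()
```

A check like `missing_components()` is the kind of rule federated governance can define globally while each domain team fills in the details.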

Data Mesh vs Centralized

Aspect            Centralized          Data Mesh
Ownership         Central data team    Domain teams
Scaling           Team bottleneck      Scales with org
Domain knowledge  Lost in translation  Preserved
Governance        Centralized          Federated
Implementation    Uniform              Heterogeneous
Complexity        Lower initially      Higher initially

Data Modeling Patterns

Star Schema

                ┌─────────────┐
                │  Dim_Time   │
                └──────┬──────┘
                       │
┌─────────────┐        │        ┌──────────────┐
│ Dim_Product ├────────┼────────┤ Dim_Customer │
└─────────────┘        │        └──────────────┘
                       │
                ┌──────┴──────┐
                │ Fact_Sales  │
                └─────────────┘

Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
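
A star schema query joins the fact table directly to each dimension it needs, one hop per dimension. The sketch below builds the schema from the diagram in an in-memory SQLite database; the column names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount      REAL
);
INSERT INTO dim_time VALUES (1, 2023), (2, 2024);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10.0), (2, 1, 2, 20.0), (2, 2, 2, 5.0), (2, 1, 1, 7.5);
""")

# Sales per year and product: each dimension is a single join away from the fact.
sales_by_year_product = conn.execute("""
    SELECT t.year, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON t.time_id = f.time_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY t.year, p.name
    ORDER BY t.year, p.name
""").fetchall()
```

In a snowflake schema the same question would need extra joins through the normalized dimension tables.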

Snowflake Schema

Normalized dimensions: Dim_Product → Dim_Subcategory → Dim_Category

Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies

Data Vault

Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)

Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
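
The hub/link/satellite split can be sketched with plain Python structures. The hashed keys, the entity names (`customer`, `order`), and the integer `load_ts` are assumptions for this example; hashing business keys is a common Data Vault convention but is not stated in the text above.

```python
import hashlib

def hash_key(*business_keys):
    """Derive a stable surrogate key from one or more business keys (assumed convention)."""
    return hashlib.sha256("|".join(business_keys).encode("utf-8")).hexdigest()[:16]

hub_customer = {}            # Hub: hash key -> business key
hub_order = {}
link_customer_order = set()  # Link: (customer hash key, order hash key)
sat_customer = []            # Satellite: append-only attribute history

def load_customer(customer_no, attributes, load_ts):
    hk = hash_key(customer_no)
    hub_customer.setdefault(hk, customer_no)  # hubs never change once inserted
    sat_customer.append({"hk": hk, "load_ts": load_ts, **attributes})
    return hk

def load_order(order_no, customer_no):
    order_hk, customer_hk = hash_key(order_no), hash_key(customer_no)
    hub_order.setdefault(order_hk, order_no)
    link_customer_order.add((customer_hk, order_hk))
    return order_hk

def current_attributes(hk):
    """Latest satellite row for a hub key; older rows remain for auditing."""
    history = [s for s in sat_customer if s["hk"] == hk]
    return max(history, key=lambda s: s["load_ts"])

customer_hk = load_customer("C-100", {"city": "Oslo"}, load_ts=1)
load_customer("C-100", {"city": "Bergen"}, load_ts=2)  # attribute change, new satellite row
order_hk = load_order("O-1", "C-100")
```

Because attribute changes only append satellite rows, the full history stays queryable, which is where the auditability comes from.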

Storage Layers

Bronze/Silver/Gold (Medallion Architecture)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘

Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
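
The three layers can be sketched as two small transformations over in-memory rows. The field names, the cleaning rules, and the per-region aggregate are illustrative assumptions; a real pipeline would read and write tables rather than Python lists.

```python
# Bronze: raw records exactly as ingested (strings, duplicates, bad values included).
bronze = [
    {"order_id": "1", "region": "eu ", "amount": "10.0"},
    {"order_id": "2", "region": "US", "amount": "bad"},
    {"order_id": "3", "region": "us", "amount": "5.5"},
    {"order_id": "3", "region": "us", "amount": "5.5"},  # duplicate delivery
]

def to_silver(raw_rows):
    """Clean, validate, and conform: cast types, normalize values, drop duplicates."""
    seen, cleaned = set(), []
    for row in raw_rows:
        try:
            rec = {"order_id": int(row["order_id"]),
                   "region": row["region"].strip().upper(),
                   "amount": float(row["amount"])}
        except (KeyError, ValueError):
            continue  # invalid rows stay in bronze only
        if rec["order_id"] not in seen:
            seen.add(rec["order_id"])
            cleaned.append(rec)
    return cleaned

def to_gold(silver_rows):
    """Business-level aggregate: total amount per region."""
    totals = {}
    for rec in silver_rows:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified here, so silver and gold can be rebuilt from it if the cleaning rules change.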

Zones in Data Lake

Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation

Best Practices

Data Quality

Implement quality gates:

  • Schema validation
  • Null checks
  • Range validation
  • Referential integrity
  • Freshness monitoring
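
The five gates above can be combined into one check that returns a list of failures instead of stopping at the first one. This is a minimal sketch: the column names, the amount range, and the 24-hour freshness threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "loaded_at"}

def run_quality_gates(rows, known_customer_ids, now, max_age=timedelta(hours=24)):
    """Return (row_index, failed_check) pairs; an empty list means the batch passes."""
    failures = []
    for i, row in enumerate(rows):
        if set(row) != REQUIRED_COLUMNS:                      # schema validation
            failures.append((i, "schema"))
            continue
        if row["amount"] is None:                             # null check
            failures.append((i, "null"))
        elif not 0 <= row["amount"] <= 1_000_000:             # range validation
            failures.append((i, "range"))
        if row["customer_id"] not in known_customer_ids:      # referential integrity
            failures.append((i, "referential_integrity"))
    newest = max((r["loaded_at"] for r in rows if "loaded_at" in r), default=None)
    if newest is None or now - newest > max_age:              # freshness monitoring
        failures.append((None, "freshness"))
    return failures

now = datetime(2024, 1, 2, 12, 0)
good = {"order_id": 1, "customer_id": "c1", "amount": 10.0,
        "loaded_at": datetime(2024, 1, 2, 11, 0)}
batch = [
    good,
    {"order_id": 2, "customer_id": "c9", "amount": -5.0,
     "loaded_at": datetime(2024, 1, 2, 10, 0)},
    {"order_id": 3},
]
report = run_quality_gates(batch, known_customer_ids={"c1"}, now=now)
```

A pipeline could block promotion to the next layer whenever `report` is non-empty, or quarantine only the failing rows.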

Governance

Key capabilities:

  • Data catalog
  • Lineage tracking
  • Access control
  • Privacy compliance
  • Audit logging

Performance

Optimization techniques:

  • Partitioning (by date, region)
  • Clustering/Z-ordering
  • Compaction
  • Caching
  • Materialized views
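
Compaction addresses the "many small files" problem by rewriting small files into fewer large ones. The sketch below only plans which files to merge, using a simple greedy grouping; the function name, the 128 MB target, and the file sizes are assumptions, and table formats ship their own compaction routines.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy plan: group small files, in order, into batches of at most target_mb."""
    batches, current, current_size = [], [], 0
    for name, size in file_sizes_mb:
        if size >= target_mb:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    # Rewriting a single file gains nothing, so keep only real merges.
    return [batch for batch in batches if len(batch) > 1]

files = [("a", 10), ("b", 200), ("c", 60), ("d", 70), ("e", 20)]
plan = plan_compaction(files)
```

Each batch in `plan` would be rewritten as one file, so five files become three here while the large file `b` is untouched.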

Related Skills

  • etl-elt-patterns: Data transformation

  • stream-processing: Real-time data

  • database-scaling: Database patterns
