data-lake-platform

Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install, copy this command and send it to your AI assistant:

```shell
npx skills add vasilyu1983/ai-agents-public/vasilyu1983-ai-agents-public-data-lake-platform
```

Data Lake Platform


When to Use

  • Design data lake/lakehouse architecture

  • Set up ingestion pipelines (batch, incremental, CDC)

  • Build SQL transformation layers (SQLMesh, dbt)

  • Choose table formats and catalogs (Iceberg, Delta, Hudi)

  • Deploy query/serving engines (Trino, ClickHouse, DuckDB)

  • Implement streaming pipelines (Kafka, Flink)

  • Set up orchestration (Dagster, Airflow, Prefect)

  • Add governance, lineage, data quality, and cost controls

Triage Questions

  • Batch, streaming, or hybrid? What is the freshness SLO?

  • Append-only vs upserts/deletes (CDC)? Is time travel required?

  • Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?

  • PII/compliance: row/column-level access, retention, audit logging?

  • Platform constraints: self-hosted vs cloud, preferred engines, team strengths?
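
If the upsert/delete triage question lands on "yes, we need CDC", the core invariant is that applying the same change events twice must leave the target table unchanged. A minimal in-memory sketch of that idempotent apply logic (the `CdcEvent` shape and the `_lsn` bookkeeping column are illustrative, not from any specific tool; real lakes push this into a MERGE on the table format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    op: str                 # "I" insert, "U" update, "D" delete
    key: str                # primary key of the source row
    lsn: int                # log sequence number from the source database
    row: Optional[dict] = None

def apply_cdc(table: dict, events: list[CdcEvent]) -> dict:
    """Apply CDC events idempotently: only events newer than the stored
    LSN per key take effect, so replaying a batch yields the same table."""
    latest: dict[str, int] = {k: v["_lsn"] for k, v in table.items()}
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= latest.get(ev.key, -1):
            continue                      # stale or replayed event: skip
        if ev.op == "D":
            table.pop(ev.key, None)
        else:
            table[ev.key] = {**(ev.row or {}), "_lsn": ev.lsn}
        latest[ev.key] = ev.lsn
    return table
```

The same dedup-by-LSN idea is what Iceberg/Delta MERGE statements express declaratively.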

Default Baseline (Good Starting Point)

  • Storage: object storage + open table format (usually Iceberg)

  • Catalog: REST/Hive/Glue/Nessie/Unity (match your platform)

  • Transforms: SQLMesh or dbt (pick one and standardize)

  • Lake query: Trino (or Spark for heavy compute/ML workloads)

  • Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI

  • Governance: DataHub/OpenMetadata + OpenLineage

  • Orchestration: Dagster/Airflow/Prefect
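
With the baseline's object storage layer, data typically lands under date-partitioned prefixes. A small sketch of that path convention (the `s3://lake/bronze` bucket and daily `dt=` scheme are assumptions; in practice Iceberg's hidden partitioning manages the layout for you):

```python
from datetime import datetime, timezone

def partition_path(table: str, event_time: datetime,
                   bucket: str = "s3://lake/bronze") -> str:
    """Build a Hive-style daily partition prefix (dt=YYYY-MM-DD) keyed on
    event time, normalized to UTC so writers in any zone agree on the path."""
    ts = event_time.astimezone(timezone.utc)
    return f"{bucket}/{table}/dt={ts:%Y-%m-%d}"
```

Daily partitions on event time are a sane default; revisit the grain once you see real query patterns.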

Workflow

  • Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)

  • Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)

  • Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)

  • Choose lake query vs serving engines: references/query-engine-patterns.md

  • Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md)

  • Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-data-quality-backfill-runbook.md and assets/cross-platform/template-cost-optimization.md)
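
The incremental-ingestion step above usually reduces to a high-watermark pattern: persist the largest cursor value seen, and on the next run emit only rows beyond it. A stdlib-only sketch (the `updated_at` cursor column and JSON state file are illustrative choices; dlt and SQLMesh both have built-in equivalents):

```python
import json
from pathlib import Path

def load_incrementally(rows: list[dict], state_file: Path,
                       cursor: str = "updated_at") -> list[dict]:
    """Select only rows past the stored high watermark, then advance it.
    Re-running with the same input emits nothing new, so retries are safe
    (the sink must still de-duplicate on primary key for exactly-once)."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {"watermark": ""}
    new_rows = [r for r in rows if r[cursor] > state["watermark"]]
    if new_rows:
        state["watermark"] = max(r[cursor] for r in new_rows)
        state_file.write_text(json.dumps(state))
    return new_rows
```

For backfills, resetting the watermark and re-running is the whole recovery story, which is why the Do list below insists on re-runnable pipelines.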

Architecture Patterns

  • Medallion (bronze/silver/gold): references/architecture-patterns.md

  • Data mesh (domain-owned data products): references/architecture-patterns.md

  • Streaming-first (Kappa): references/streaming-patterns.md
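
The medallion pattern can be illustrated with two tiny pure-Python stages; the `order_id`/`country`/`amount` contract is a made-up example, and real pipelines would express these stages as SQL models in SQLMesh or dbt:

```python
def bronze_to_silver(raw: list[dict]) -> list[dict]:
    """Silver: validate and standardize raw bronze records; rows failing
    the contract are dropped here (a real pipeline would quarantine them)."""
    out = []
    for r in raw:
        if not r.get("order_id") or r.get("amount") is None:
            continue
        out.append({"order_id": str(r["order_id"]),
                    "country": (r.get("country") or "unknown").lower(),
                    "amount": float(r["amount"])})
    return out

def silver_to_gold(silver: list[dict]) -> dict[str, float]:
    """Gold: a business-level aggregate (revenue by country)."""
    totals: dict[str, float] = {}
    for r in silver:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals
```

The point of the tiering is that each stage has one job: silver enforces the contract, gold encodes business logic, and bronze stays raw so either can be rebuilt.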

Quick Start

dlt + ClickHouse

```shell
pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
python pipeline.py
```

SQLMesh + DuckDB

```shell
pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run
```

Reliability and Safety

Do

  • Define data contracts and owners up front

  • Add quality gates (freshness, volume, schema, distribution) per tier

  • Make every pipeline idempotent and re-runnable (backfills are normal)

  • Treat access control and audit logging as first-class requirements
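
The quality-gate bullet can be made concrete: run freshness, volume, and schema checks before promoting a batch to the next tier, and promote only when every check passes. A hedged sketch (the `loaded_at` metadata column and the specific thresholds are illustrative; frameworks like Great Expectations formalize this):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def quality_gate(batch: list[dict], schema: set[str], max_age: timedelta,
                 min_rows: int, now: Optional[datetime] = None) -> list[str]:
    """Freshness, volume, and schema checks on a batch; an empty result
    means the gate passes and the batch may be promoted."""
    now = now or datetime.now(timezone.utc)
    failures: list[str] = []
    if len(batch) < min_rows:
        failures.append(f"volume: {len(batch)} rows < {min_rows}")
    for row in batch:
        extra = set(row) - schema - {"loaded_at"}   # loaded_at is pipeline metadata
        if extra:
            failures.append(f"schema: unexpected columns {sorted(extra)}")
            break
    if batch:
        newest = max(datetime.fromisoformat(r["loaded_at"]) for r in batch)
        if now - newest > max_age:
            failures.append(f"freshness: newest record older than {max_age}")
    return failures
```

Wiring the returned failures into the orchestrator (fail the task, page the owner) is what turns checks into an actual gate.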

Avoid

  • Skipping validation to "move fast"

  • Storing PII without access controls

  • Pipelines that can't be re-run safely

  • Manual schema changes without version control

Resources

| Resource | Purpose |
| --- | --- |
| references/architecture-patterns.md | Medallion, data mesh |
| references/ingestion-patterns.md | dlt vs Airbyte, CDC |
| references/transformation-patterns.md | SQLMesh vs dbt |
| references/storage-formats.md | Iceberg vs Delta |
| references/query-engine-patterns.md | ClickHouse, DuckDB |
| references/streaming-patterns.md | Kafka, Flink |
| references/orchestration-patterns.md | Dagster, Airflow |
| references/bi-visualization-patterns.md | Metabase, Superset |
| references/cost-optimization.md | Cost levers and maintenance |
| references/operational-playbook.md | Monitoring and incident response |
| references/governance-catalog.md | Catalog, lineage, access control |
| references/data-mesh-patterns.md | Domain ownership, data products, federated governance |
| references/data-quality-patterns.md | Quality gates, validation frameworks, SLOs, anomaly detection |
| references/security-access-patterns.md | Row/column security, encryption, audit logging, compliance |

Templates

| Template | Purpose |
| --- | --- |
| assets/cross-platform/template-medallion-architecture.md | Baseline bronze/silver/gold plan |
| assets/cross-platform/template-data-pipeline.md | End-to-end pipeline skeleton |
| assets/cross-platform/template-ingestion-governance-checklist.md | Source onboarding checklist |
| assets/cross-platform/template-incremental-loading.md | Incremental + backfill plan |
| assets/cross-platform/template-schema-evolution.md | Schema change rules |
| assets/cross-platform/template-cost-optimization.md | Cost control checklist |
| assets/cross-platform/template-data-quality-governance.md | Quality contracts + SLOs |
| assets/cross-platform/template-data-quality-backfill-runbook.md | Backfill incident/runbook |

Related Skills

| Skill | Purpose |
| --- | --- |
| ai-mlops | ML deployment |
| ai-ml-data-science | Feature engineering |
| data-sql-optimization | OLTP optimization |

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.

  • Prefer primary sources; report source links and dates for volatile information.

  • If web access is unavailable, state the limitation and mark guidance as unverified.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
