data-engineering

Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-engineering" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering

Data Engineering Hub

Skill Map

| Domain | Skills | When to Use |
|---|---|---|
| Core | @data-engineering-core | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| Storage | @data-engineering-storage-lakehouse | Delta Lake, Apache Iceberg, Apache Hudi |
| | @data-engineering-storage-remote-access | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | @data-engineering-storage-authentication | AWS, GCP, Azure auth: IAM roles, managed identity, secrets management |
| | @data-engineering-storage-formats | Parquet optimizations, Lance, Zarr, Avro, ORC |
| Orchestration | @data-engineering-orchestration | Prefect, Dagster, dbt, workflow scheduling |
| Streaming | @data-engineering-streaming | Kafka, MQTT, NATS JetStream for real-time data |
| Quality | @data-engineering-quality | Great Expectations, Pandera for data validation |
| Observability | @data-engineering-observability | OpenTelemetry, Prometheus for pipeline monitoring |
| AI/ML | @data-engineering-ai-ml | Embeddings, vector databases, RAG pipelines |
| Best Practices | @data-engineering-best-practices | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
| Catalogs | @data-engineering-catalogs | Data catalog systems: Iceberg catalogs, DuckDB multi-source, Amundsen/DataHub/OpenMetadata |

Quick Reference: Core Stack

| Task | Recommended Tool |
|---|---|
| DataFrame operations | Polars (often cited as 10-50x faster than pandas; the gain depends on the workload) |
| SQL analytics | DuckDB (embedded OLAP, zero-copy Arrow integration) |
| Data interchange | PyArrow (Arrow format, zero-copy transfers) |
| Cloud storage access | fsspec (universal), pyarrow.fs (Arrow-native), obstore (high-performance) |
| Lakehouse format | Delta Lake (Spark ecosystem), Iceberg (engine-agnostic), Hudi (streaming CDC) |
| Orchestration | Prefect (Pythonic flows), Dagster (asset-based), dbt (SQL transformations) |
| Validation | Pandera (lightweight), Great Expectations (enterprise) |

Getting Started

New to Data Engineering?

Start with @data-engineering-core to learn the foundational libraries and patterns.

Working with Cloud Storage?

Go to @data-engineering-storage-remote-access for fsspec, pyarrow.fs, and obstore.

Building Data Lakes?

Explore @data-engineering-storage-lakehouse for ACID table formats.

Choosing a Data Catalog?

Check @data-engineering-catalogs for Iceberg catalogs, DuckDB multi-source patterns, and tool comparisons.

Production-Grade Pipelines?

Read @data-engineering-best-practices for medallion architecture, partitioning, schema evolution, and testing strategies.

Orchestrating Pipelines?

Check @data-engineering-orchestration for Prefect, Dagster, and dbt.

Production Monitoring?

See @data-engineering-observability for tracing and metrics.

AI/ML Data Pipelines?

Visit @data-engineering-ai-ml for embeddings, vector databases, and RAG.

Principles

  • Lazy evaluation: Use Polars lazy frames and DuckDB query planning for performance

  • Zero-copy data transfer: Leverage Arrow format for memory efficiency

  • Pushdown optimization: Filter at storage layer to minimize data transfer

  • Type safety: Use explicit schemas and type hints

  • Resilience: Implement retries, circuit breakers, and proper error handling

  • Observability: Instrument pipelines with traces and metrics

  • Security: Never hardcode credentials; use IAM roles and environment variables
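The resilience and security principles above can be sketched with the standard library alone. This is a minimal illustration, not the suite's own implementation: the `retry` decorator, the `load_credentials` helper, and the `PIPELINE_ACCESS_KEY` / `PIPELINE_SECRET_KEY` variable names are all hypothetical, chosen only for this example.

```python
import os
import time
from functools import wraps


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Retry a callable with exponential backoff (hypothetical helper)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Delay doubles on each failure: base, 2*base, 4*base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


def load_credentials(env=None):
    """Read credentials from environment variables instead of hardcoding them.

    The variable names are placeholders, not names any real service requires.
    """
    env = os.environ if env is None else env
    required = ("PIPELINE_ACCESS_KEY", "PIPELINE_SECRET_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "access_key": env["PIPELINE_ACCESS_KEY"],
        "secret_key": env["PIPELINE_SECRET_KEY"],
    }
```

Restricting `exceptions` to transient errors (for example `ConnectionError`) keeps the decorator from masking real bugs, and the injectable `sleep` argument makes the backoff schedule easy to test without waiting. A circuit breaker, which the principle also names, is not shown here.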

Migration from Legacy Skills

This restructured suite replaces the previous split organization (data-engineering-* and remote-filesystems-*). All content has been consolidated to eliminate duplication and clarify ownership.

Legacy skill replacements:

  • data-engineering-core → @data-engineering-core (plus specific integrations)

  • data-engineering-lakehouse → @data-engineering-storage-lakehouse

  • data-engineering-orchestration → @data-engineering-orchestration

  • data-engineering-streaming → @data-engineering-streaming

  • data-engineering-quality → @data-engineering-quality

  • data-engineering-observability → @data-engineering-observability

  • data-engineering-llm-pipelines → @data-engineering-ai-ml

  • remote-filesystems-* → @data-engineering-storage-remote-access and integrations

All legacy skills remain functional but are deprecated. New content should be added to the new structure only.
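For tooling that still refers to legacy names, the replacement list above can be expressed as a lookup table. This is a sketch under assumptions: `LEGACY_TO_NEW` and `resolve_skill` are hypothetical names, the wildcard entry maps only to the primary replacement (not the additional integrations the list mentions), and `remote-filesystems-s3` in the usage note is an invented example name.

```python
from fnmatch import fnmatch

# Legacy skill name (or glob pattern) -> replacement skill, per the list above.
LEGACY_TO_NEW = {
    "data-engineering-core": "@data-engineering-core",
    "data-engineering-lakehouse": "@data-engineering-storage-lakehouse",
    "data-engineering-orchestration": "@data-engineering-orchestration",
    "data-engineering-streaming": "@data-engineering-streaming",
    "data-engineering-quality": "@data-engineering-quality",
    "data-engineering-observability": "@data-engineering-observability",
    "data-engineering-llm-pipelines": "@data-engineering-ai-ml",
    "remote-filesystems-*": "@data-engineering-storage-remote-access",
}


def resolve_skill(legacy_name):
    """Return the replacement skill for a legacy name, or None if unknown."""
    # Exact names win over wildcard patterns.
    if legacy_name in LEGACY_TO_NEW:
        return LEGACY_TO_NEW[legacy_name]
    for pattern, replacement in LEGACY_TO_NEW.items():
        if fnmatch(legacy_name, pattern):
            return replacement
    return None
```

With this table, `resolve_skill("remote-filesystems-s3")` falls through to the wildcard entry, while a name that matches nothing returns `None` so the caller can decide how to report it.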

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
