Senior Data Engineer
Core Capabilities
- Batch Pipeline Orchestration - Design and implement production-ready ETL/ELT pipelines with Airflow, intelligent dependency resolution, retry logic, and comprehensive monitoring
- Real-Time Streaming - Build event-driven streaming pipelines on Kafka, Flink, Kinesis, and Spark Streaming, with exactly-once semantics and sub-second latency
- Data Quality Management - Comprehensive batch and streaming data quality validation covering completeness, accuracy, consistency, timeliness, and validity
- Streaming Quality Monitoring - Track consumer lag, data freshness, schema drift, throughput, and dead letter queue rates for streaming pipelines
- Performance Optimization - Analyze and optimize pipeline performance with query optimization, Spark tuning, and cost analysis recommendations
Key Workflows
Workflow 1: Build ETL Pipeline
Time: 2-4 hours
Steps:
- Design pipeline architecture using Lambda, Kappa, or Medallion pattern
- Configure YAML pipeline definition with sources, transformations, targets
- Generate Airflow DAG with pipeline_orchestrator.py
- Define data quality validation rules
- Deploy and configure monitoring/alerting
Expected Output: Production-ready ETL pipeline with 99%+ success rate, automated quality checks, and comprehensive monitoring
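As a sketch of what step 2 might look like, the snippet below builds a minimal pipeline definition and parses it with PyYAML. The key names (`sources`, `transformations`, `targets`, `quality_checks`, `retries`) are illustrative assumptions, not the actual schema accepted by pipeline_orchestrator.py — consult tools.md for the real configuration format.

```python
# Illustrative pipeline definition; the key names below are assumptions,
# not the schema actually required by pipeline_orchestrator.py.
import yaml  # pip install pyyaml

PIPELINE_YAML = """
pipeline:
  name: sales_etl_pipeline
  schedule: "0 * * * *"        # hourly
  sources:
    - name: sales_db
      type: postgresql
      table: public.sales_transactions
  transformations:
    - name: clean_sales
      type: sql
      query: "SELECT * FROM sales_db WHERE amount IS NOT NULL"
  targets:
    - name: warehouse
      type: snowflake
      table: analytics.fct_sales
  quality_checks:
    - column: order_id
      rule: not_null
  retries: 3
"""

config = yaml.safe_load(PIPELINE_YAML)
print(config["pipeline"]["name"], "->", [t["name"] for t in config["pipeline"]["targets"]])
```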
Workflow 2: Build Real-Time Streaming Pipeline
Time: 3-5 days
Steps:
- Select streaming architecture (Kappa vs Lambda) based on requirements
- Configure streaming pipeline YAML (sources, processing, sinks, quality)
- Generate Kafka configurations with kafka_config_generator.py
- Generate Flink/Spark job scaffolding with stream_processor.py
- Deploy and monitor with streaming_quality_validator.py
Expected Output: Streaming pipeline processing 10K+ events/sec with P99 latency < 1s, exactly-once delivery, and real-time quality monitoring
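For the processing step, a minimal Spark Structured Streaming job of the kind the scaffolding targets could look like the sketch below. Topic names, the JSON schema, and the checkpoint path are assumptions; Flink or Kafka Streams are equally valid engines for the same pattern.

```python
# Minimal structured-streaming sketch: Kafka source, event-time window, Kafka sink.
# Topic names, schema, and checkpoint location are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, count, to_json, struct
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("user-events-aggregator").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "user-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Late data beyond the 10-minute watermark is dropped (tune to your SLA).
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "1 minute"), col("event_type"))
          .agg(count("*").alias("events")))

query = (counts.select(to_json(struct("window", "event_type", "events")).alias("value"))
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")
         .option("topic", "processed-events")
         .option("checkpointLocation", "s3://bucket/checkpoints/user-events/")
         .outputMode("update")
         .start())
query.awaitTermination()
```

Checkpointing plus Kafka's idempotent producer is what gives the end-to-end delivery guarantee here; without a durable checkpoint location the job cannot recover its state after a restart.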
World-class data engineering for production-grade data systems, scalable pipelines, and enterprise data platforms.
Overview
This skill provides comprehensive expertise in data engineering fundamentals through advanced production patterns. From designing medallion architectures to implementing real-time streaming pipelines, it covers the full spectrum of modern data engineering including ETL/ELT design, data quality frameworks, pipeline orchestration, and DataOps practices.
What This Skill Provides:
- Production-ready pipeline templates (Airflow, Spark, dbt)
- Comprehensive data quality validation framework
- Performance optimization and cost analysis tools
- Data architecture patterns (Lambda, Kappa, Medallion)
- Complete DataOps CI/CD workflows
Best For:
- Building scalable data pipelines for enterprise systems
- Implementing data quality and governance frameworks
- Optimizing ETL performance and cloud costs
- Designing modern data architectures (lake, warehouse, lakehouse)
- Production ML/AI data infrastructure
Quick Start
Pipeline Orchestration
Generate Airflow DAG from configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/
Validate pipeline configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --validate
Use incremental load template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
Data Quality Validation
Validate CSV file with quality checks
python scripts/data_quality_validator.py --input data/sales.csv --output report.html
Validate database table with custom rules
python scripts/data_quality_validator.py --connection postgresql://user:pass@host/db --table sales_transactions --rules rules/sales_validation.yaml --threshold 0.95
Performance Optimization
Analyze pipeline performance and get recommendations
python scripts/etl_performance_optimizer.py --airflow-db postgresql://host/airflow --dag-id sales_etl_pipeline --days 30 --optimize
Analyze Spark job performance
python scripts/etl_performance_optimizer.py --spark-history-server http://spark-history:18080 --app-id app-20250115-001
Real-Time Streaming
Validate streaming pipeline configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate
Generate Kafka topic and client configurations
python scripts/kafka_config_generator.py --topic user-events --partitions 12 --replication 3 --output kafka/topics/
Generate exactly-once producer configuration
python scripts/kafka_config_generator.py --producer --profile exactly-once --output kafka/producer.properties
Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/
Monitor streaming quality
python scripts/streaming_quality_validator.py --lag --consumer-group events-processor --threshold 10000 --freshness --topic processed-events --max-latency-ms 5000 --output streaming-health-report.html
Core Workflows
- Building Production Data Pipelines
Steps:
- Design Architecture: Choose pattern (Lambda, Kappa, Medallion) based on requirements
- Configure Pipeline: Create YAML configuration with sources, transformations, targets
- Generate DAG: python scripts/pipeline_orchestrator.py --config config.yaml
- Add Quality Checks: Define validation rules for data quality
- Deploy & Monitor: Deploy to Airflow, configure alerts, track metrics
Pipeline Patterns: See frameworks.md for Lambda Architecture, Kappa Architecture, Medallion Architecture (Bronze/Silver/Gold), and Microservices Data patterns.
Templates: See templates.md for complete Airflow DAG templates, Spark job templates, dbt models, and Docker configurations.
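For orientation, the shape of the DAG produced in step 3 is roughly what you would write by hand. The sketch below is a hand-rolled approximation, not the orchestrator's actual output; task callables, connection names, and the schedule are placeholders.

```python
# Hand-written approximation of an extract -> validate -> load DAG;
# the generated DAG's structure and operator choices may differ.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):      # placeholder callables
    ...

def validate(**context):
    ...

def load(**context):
    ...

default_args = {
    "owner": "data-eng",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}

with DAG(
    dag_id="sales_etl_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",      # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_validate = PythonOperator(task_id="validate_quality", python_callable=validate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_validate >> t_load
```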
- Data Quality Management
Steps:
- Define Rules: Create validation rules covering completeness, accuracy, consistency
- Run Validation: python scripts/data_quality_validator.py --rules rules.yaml
- Review Results: Analyze quality scores and failed checks
- Integrate CI/CD: Add validation to pipeline deployment process
- Monitor Trends: Track quality scores over time
Quality Framework: See frameworks.md for complete Data Quality Framework covering all dimensions (completeness, accuracy, consistency, timeliness, validity).
Validation Templates: See templates.md for validation configuration examples and Python API usage.
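For a quick ad-hoc check outside the validator script, a minimal pandas sketch of the core dimensions might look like this. Column names and the 0.95 threshold are illustrative; the script itself (and Great Expectations) cover far more dimensions and reporting.

```python
# Quick ad-hoc quality check in pandas: completeness, validity, uniqueness.
# Column names and the 0.95 pass threshold are illustrative assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    checks = {
        # completeness: share of non-null order_ids
        "order_id_completeness": df["order_id"].notna().mean(),
        # validity: amounts must be positive
        "amount_validity": (df["amount"] > 0).mean(),
        # uniqueness: no duplicate order_ids
        "order_id_uniqueness": 1 - df["order_id"].duplicated().mean(),
    }
    checks["overall_score"] = sum(checks.values()) / len(checks)
    checks["passed"] = checks["overall_score"] >= 0.95
    return checks

if __name__ == "__main__":
    df = pd.read_csv("data/sales.csv")  # path is illustrative
    print(quality_report(df))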
- Data Modeling & Transformation
Steps:
- Choose Modeling Approach: Dimensional (Kimball), Data Vault 2.0, or One Big Table
- Design Schema: Define fact tables, dimensions, and relationships
- Implement with dbt: Create staging, intermediate, and mart models
- Handle SCD: Implement slowly changing dimension logic (Type 1/2/3)
- Test & Deploy: Run dbt tests, generate documentation, deploy
Modeling Patterns: See frameworks.md for Dimensional Modeling (Kimball), Data Vault 2.0, One Big Table (OBT), and SCD implementations.
dbt Templates: See templates.md for complete dbt model templates including staging, intermediate, fact tables, and SCD Type 2 logic.
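As a sketch of the SCD Type 2 step, the two statements below show one common warehouse-side pattern: expire the changed current row, then insert the new version. Table and column names (dim_customer, staging_customer, email, address) are illustrative, and MERGE syntax varies by warehouse, so adapt to your dialect or use the dbt SCD Type 2 templates in templates.md instead.

```python
# Two-step SCD Type 2 pattern: expire changed current rows, then insert new versions.
# Table/column names are illustrative; MERGE syntax differs across warehouses
# (this is a Snowflake/BigQuery-style sketch).
EXPIRE_CHANGED_ROWS = """
MERGE INTO dim_customer AS d
USING staging_customer AS s
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.email <> s.email OR d.address <> s.address) THEN
  UPDATE SET is_current = FALSE, valid_to = CURRENT_TIMESTAMP()
"""

INSERT_NEW_VERSIONS = """
INSERT INTO dim_customer (customer_id, email, address, valid_from, valid_to, is_current)
SELECT s.customer_id, s.email, s.address, CURRENT_TIMESTAMP(), NULL, TRUE
FROM staging_customer AS s
LEFT JOIN dim_customer AS d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL   -- brand-new key, or the current row was just expired
"""

def apply_scd2(conn) -> None:
    """Run both steps; `conn` is any DB-API connection (assumed)."""
    with conn:
        cur = conn.cursor()
        cur.execute(EXPIRE_CHANGED_ROWS)
        cur.execute(INSERT_NEW_VERSIONS)
```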
- Performance Optimization
Steps:
- Profile Pipeline: Run performance analyzer on recent pipeline executions
- Identify Bottlenecks: Review execution time breakdown and slow tasks
- Apply Optimizations: Implement recommendations (partitioning, indexing, batching)
- Tune Spark Jobs: Optimize memory, parallelism, and shuffle settings
- Measure Impact: Compare before/after metrics, track cost savings
Optimization Strategies: See frameworks.md for performance best practices including partitioning strategies, query optimization, and Spark tuning.
Analysis Tools: See tools.md for complete documentation on etl_performance_optimizer.py with query analysis and Spark tuning.
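Spark tuning usually starts from a handful of well-known settings. The values below are illustrative starting points under assumed cluster and data sizes, not the optimizer's actual recommendations; size them to your workload.

```python
# Common Spark tuning knobs; values are illustrative starting points only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sales_etl")
    # Adaptive Query Execution: re-optimizes joins/partitions at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Size shuffle parallelism to the data rather than the default 200
    .config("spark.sql.shuffle.partitions", "400")
    # Broadcast small dimension tables to avoid shuffle joins (bytes)
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    # Executor sizing (cluster-dependent)
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```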
- Building Real-Time Streaming Pipelines
Steps:
- Architecture Selection: Choose Kappa (streaming-only) or Lambda (batch + streaming) architecture
- Configure Pipeline: Create YAML config with sources, processing engine, sinks, quality thresholds
- Generate Kafka Configs: python scripts/kafka_config_generator.py --topic events --partitions 12
- Generate Job Scaffolding: python scripts/stream_processor.py --mode flink --generate
- Deploy Infrastructure: Use Docker Compose for local dev, Kubernetes for production
- Monitor Quality: python scripts/streaming_quality_validator.py --lag --freshness --throughput
Streaming Patterns: See frameworks.md for stateful processing, stream joins, windowing, exactly-once semantics, and CDC patterns.
Templates: See templates.md for Flink DataStream jobs, Kafka Streams applications, PyFlink templates, and Docker Compose configurations.
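On the producer side, exactly-once delivery boils down to idempotence plus transactions. The confluent-kafka sketch below shows the pattern; the broker address, topic, and transactional id are assumptions, and kafka_config_generator.py produces roughly equivalent properties for JVM clients.

```python
# Transactional (exactly-once) producer sketch using confluent-kafka;
# broker address, topic, and transactional.id are illustrative assumptions.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,          # no duplicates on retry
    "acks": "all",                       # wait for all in-sync replicas
    "transactional.id": "producer-001",  # stable id enables transactions
})

producer.init_transactions()
producer.begin_transaction()
try:
    for i in range(100):
        producer.produce("user-events", key=str(i), value=f'{{"event": {i}}}')
    producer.commit_transaction()        # all-or-nothing visibility to consumers
except Exception:
    producer.abort_transaction()
    raise
```

Downstream consumers must read with isolation.level=read_committed for the transaction boundary to hold end to end.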
Python Tools
pipeline_orchestrator.py
Automated Airflow DAG generation with intelligent dependency resolution and monitoring.
Key Features:
- Generate production-ready DAGs from YAML configuration
- Automatic task dependency resolution
- Built-in retry logic and error handling
- Multi-source support (PostgreSQL, S3, BigQuery, Snowflake)
- Integrated quality checks and alerting
Usage:
Basic DAG generation
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/
With validation
python scripts/pipeline_orchestrator.py --config config.yaml --validate
From template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
Complete Documentation: See tools.md for full configuration options, templates, and integration examples.
data_quality_validator.py
Comprehensive data quality validation framework with automated checks and reporting.
Capabilities:
- Multi-dimensional validation (completeness, accuracy, consistency, timeliness, validity)
- Great Expectations integration
- Custom business rule validation
- HTML/PDF report generation
- Anomaly detection
- Historical trend tracking
Usage:
Validate with custom rules
python scripts/data_quality_validator.py --input data/sales.csv --rules rules/sales_validation.yaml --output report.html
Database table validation
python scripts/data_quality_validator.py --connection postgresql://host/db --table sales_transactions --threshold 0.95
Complete Documentation: See tools.md for rule configuration, API usage, and integration patterns.
etl_performance_optimizer.py
Pipeline performance analysis with actionable optimization recommendations.
Capabilities:
- Airflow DAG execution profiling
- Bottleneck detection and analysis
- SQL query optimization suggestions
- Spark job tuning recommendations
- Cost analysis and optimization
- Historical performance trending
Usage:
Analyze Airflow DAG
python scripts/etl_performance_optimizer.py --airflow-db postgresql://host/airflow --dag-id sales_etl_pipeline --days 30 --optimize
Spark job analysis
python scripts/etl_performance_optimizer.py --spark-history-server http://spark-history:18080 --app-id app-20250115-001
Complete Documentation: See tools.md for profiling options, optimization strategies, and cost analysis.
stream_processor.py
Streaming pipeline configuration generator and validator for Kafka, Flink, and Kinesis.
Capabilities:
- Multi-platform support (Kafka, Flink, Kinesis, Spark Streaming)
- Configuration validation with best practice checks
- Flink/Spark job scaffolding generation
- Kafka topic configuration generation
- Docker Compose for local streaming stacks
- Exactly-once semantics configuration
Usage:
Validate configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate
Generate Kafka configurations
python scripts/stream_processor.py --config streaming_config.yaml --mode kafka --generate
Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/
Generate Docker Compose for local development
python scripts/stream_processor.py --config streaming_config.yaml --mode docker --generate
Complete Documentation: See tools.md for configuration format, validation checks, and generated outputs.
streaming_quality_validator.py
Real-time streaming data quality monitoring with comprehensive health scoring.
Capabilities:
- Consumer lag monitoring with thresholds
- Data freshness validation (P50/P95/P99 latency)
- Schema drift detection
- Throughput analysis (events/sec, bytes/sec)
- Dead letter queue rate monitoring
- Overall quality scoring with recommendations
- Prometheus metrics export
Usage:
Monitor consumer lag
python scripts/streaming_quality_validator.py --lag --consumer-group events-processor --threshold 10000
Monitor data freshness
python scripts/streaming_quality_validator.py --freshness --topic processed-events --max-latency-ms 5000
Full quality validation
python scripts/streaming_quality_validator.py --lag --freshness --throughput --dlq --output streaming-health-report.html
Complete Documentation: See tools.md for all monitoring dimensions and integration patterns.
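Under the hood, consumer lag is simply the gap between each partition's high watermark and its committed offset. A minimal stand-alone check of that calculation (broker, group, and topic names assumed) could look like the sketch below; the validator adds thresholds, scoring, and reporting on top.

```python
# Minimal consumer-lag check: high watermark minus committed offset per partition.
# Broker address, group id, and topic name are illustrative assumptions.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "events-processor",
    "enable.auto.commit": False,
})

topic = "user-events"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low   # negative offset means no commit yet
    lag = high - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print("total lag:", total_lag, "(alert if above ~10000)")
consumer.close()
```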
kafka_config_generator.py
Production-grade Kafka configuration generator with performance and security profiles.
Capabilities:
- Topic configuration (partitions, replication, retention, compaction)
- Producer profiles (high-throughput, exactly-once, low-latency, ordered)
- Consumer profiles (exactly-once, high-throughput, batch)
- Kafka Streams configuration with state store tuning
- Security configuration (SASL-PLAIN, SASL-SCRAM, mTLS)
- Kafka Connect source/sink configurations
- Multiple output formats (properties, YAML, JSON)
Usage:
Generate topic configuration
python scripts/kafka_config_generator.py --topic user-events --partitions 12 --replication 3 --retention-hours 168
Generate exactly-once producer
python scripts/kafka_config_generator.py --producer --profile exactly-once --transactional-id producer-001
Generate Kafka Streams config
python scripts/kafka_config_generator.py --streams --application-id events-processor --exactly-once
Complete Documentation: See tools.md for all profiles, security options, and Connect configurations.
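The flags above map to standard Kafka settings. The dict below lists the usual suspects with values matching the example topic command; the generator's exact output format is not shown here, and note that partitions and replication factor are creation-time parameters rather than topic configs.

```python
# Standard Kafka settings behind the generator's flags; values match the
# example CLI above (12 partitions, RF 3, 168h retention) and are illustrative.
topic_settings = {
    "partitions": 12,                      # creation-time parameter: consumer parallelism ceiling
    "replication.factor": 3,               # creation-time parameter: survive broker loss
    "min.insync.replicas": 2,              # required for durability with acks=all producers
    "retention.ms": 168 * 60 * 60 * 1000,  # 7 days
    "cleanup.policy": "delete",            # or "compact" for changelog topics
    "compression.type": "producer",        # keep whatever compression the producer used
}
```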
Reference Documentation
Frameworks (frameworks.md)
Comprehensive data engineering frameworks and patterns:
- Architecture Patterns: Lambda, Kappa, Medallion, Microservices data architecture
- Data Modeling: Dimensional (Kimball), Data Vault 2.0, One Big Table
- ETL/ELT Patterns: Full load, incremental load, CDC, SCD, idempotent pipelines
- Data Quality: Complete framework covering all quality dimensions
- DataOps: CI/CD for data pipelines, testing strategies, monitoring
- Orchestration: Airflow DAG patterns, backfill strategies
- Real-Time Streaming: Stateful processing, stream joins, windowing strategies, exactly-once semantics, event time processing, watermarks, backpressure, Apache Flink patterns, AWS Kinesis patterns, CDC for streaming
- Governance: Data catalog, lineage tracking, access control
Templates (templates.md)
Production-ready code templates and examples:
- Airflow DAGs: Complete ETL DAG, incremental load, dynamic task generation
- Spark Jobs: Batch processing, streaming, optimized configurations
- dbt Models: Staging, intermediate, fact tables, dimensions with SCD Type 2
- SQL Patterns: Incremental merge (upsert), deduplication, date spine, window functions
- Python Pipelines: Data quality validation class, retry decorators, error handling
- Real-Time Streaming: Apache Flink DataStream jobs (Java), Kafka Streams applications, PyFlink jobs, AWS Kinesis consumers, Docker Compose for streaming stack
- Kafka Configs: Producer/consumer properties templates, topic configurations, security configurations
- Docker: Dockerfiles for data pipelines, Docker Compose for local development including streaming stack (Kafka, Flink, Schema Registry)
- Configuration: dbt project config, Spark configuration, Airflow variables, streaming pipeline YAML
- Testing: pytest fixtures, integration tests, data quality tests
Tools (tools.md)
Python automation tool documentation:
- pipeline_orchestrator.py: Complete usage guide, configuration format, DAG templates
- data_quality_validator.py: Validation rules, dimension checks, Great Expectations integration
- etl_performance_optimizer.py: Performance analysis, query optimization, Spark tuning
- stream_processor.py: Streaming pipeline configuration, validation, job scaffolding generation
- streaming_quality_validator.py: Consumer lag, data freshness, schema drift, throughput monitoring
- kafka_config_generator.py: Topic, producer, consumer, Kafka Streams, and Connect configurations
- Integration Patterns: Airflow, dbt, CI/CD, monitoring systems, Prometheus
- Best Practices: Configuration management, error handling, performance, monitoring, streaming quality
Tech Stack
Core Technologies:
- Languages: Python 3.8+, SQL, Scala (Spark), Java (Flink)
- Orchestration: Apache Airflow, Prefect, Dagster
- Batch Processing: Apache Spark, dbt, Pandas
- Stream Processing: Apache Kafka, Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis
- Storage: PostgreSQL, BigQuery, Snowflake, Redshift, S3, GCS
- Schema Management: Confluent Schema Registry, AWS Glue Schema Registry
- Containerization: Docker, Kubernetes
- Monitoring: Datadog, Prometheus, Grafana, Kafka UI
Data Platforms:
- Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- Data Lakes: Delta Lake, Apache Iceberg, Apache Hudi
- Streaming Platforms: Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
- Stream Processing Engines: Apache Flink, Kafka Streams, Spark Structured Streaming
- Workflow: Airflow, Prefect, Dagster
Integration Points
This skill integrates with:
- Orchestration: Airflow, Prefect, Dagster for workflow management
- Transformation: dbt for SQL transformations and testing
- Quality: Great Expectations for data validation
- Monitoring: Datadog, Prometheus for pipeline monitoring
- BI Tools: Looker, Tableau, Power BI for analytics
- ML Platforms: MLflow, Kubeflow for ML pipeline integration
- Version Control: Git for pipeline code and configuration
See tools.md for detailed integration patterns and examples.
Best Practices
Pipeline Design:
- Idempotent operations for safe reruns
- Incremental processing where possible
- Clear data lineage and documentation
- Comprehensive error handling
- Automated recovery mechanisms
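A retry decorator with exponential backoff is the usual building block for automated recovery; templates.md includes retry decorator templates, and a rough stand-alone sketch looks like this.

```python
# Minimal retry decorator with exponential backoff and jitter (a rough sketch;
# see templates.md for the retry decorator templates shipped with this skill).
import functools
import random
import time

def retry(max_attempts: int = 3, base_delay: float = 2.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    # exponential backoff with jitter: ~2s, 4s, 8s, ...
                    time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
        return wrapper
    return decorator

@retry(max_attempts=4)
def load_to_warehouse(batch):
    ...  # transient network/database errors are retried automatically
```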
Data Quality:
- Define quality rules early
- Validate at every pipeline stage
- Automate quality monitoring
- Track quality trends over time
- Block bad data from downstream
Performance:
- Partition large tables by date/region
- Use columnar formats (Parquet, ORC)
- Leverage predicate pushdown
- Optimize for your query patterns
- Monitor and tune regularly
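The first three practices combine naturally: partition on write, store columnar, and let the engine prune. The PySpark sketch below illustrates the idea; paths and column names are assumptions.

```python
# Partitioned Parquet write plus a pruned read; paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

events = spark.read.parquet("s3://lake/bronze/events/")

# Write columnar data partitioned by event_date so each day lands in its own folder.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://lake/silver/events/"))

# Reads filtering on the partition column touch only the matching folders, and
# Parquet column pruning / predicate pushdown skips unneeded data within files.
recent = (spark.read.parquet("s3://lake/silver/events/")
          .filter(col("event_date") >= "2025-01-01")
          .select("user_id", "event_type"))
print(recent.count())
```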
Operations:
- Version control everything
- Automate testing and deployment
- Implement comprehensive monitoring
- Document runbooks for incidents
- Review performance regularly
Performance Targets
Batch Pipeline Execution:
- P50 latency: < 5 minutes (hourly pipelines)
- P95 latency: < 15 minutes
- Success rate: > 99%
- Data freshness: < 1 hour behind source
Streaming Pipeline Execution:
- Throughput: 10K+ events/second sustained
- End-to-end latency: P99 < 1 second
- Consumer lag: < 10K records behind
- Exactly-once delivery: Zero duplicates or losses
Data Quality (Batch):
- Quality score: > 95%
- Completeness: > 99%
- Timeliness: < 2 hours data lag
- Zero critical failures
Streaming Quality:
- Data freshness: P95 < 5 minutes from event generation
- Late data rate: < 5% outside watermark window
- Dead letter queue rate: < 1%
- Schema compatibility: 100% backward/forward compatible changes
Cost Efficiency:
- Cost per GB processed: < $0.10
- Cloud cost trend: Stable or decreasing
- Resource utilization: > 70%
Resources
- Frameworks Guide: references/frameworks.md
- Code Templates: references/templates.md
- Tool Documentation: references/tools.md
- Python Scripts: scripts/ directory
Version: 2.0.0
Last Updated: December 16, 2025
Documentation Structure: Progressive disclosure with comprehensive references
Streaming Enhancement: Task #8 - Real-time streaming capabilities added