Data Science Engineering Suite - Quick Reference
This skill turns raw data and questions into validated, documented models ready for production:
- EDA workflows: Structured exploration with drift detection
- Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity
- Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified
- Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics
- SQL transformation: SQLMesh for staging/intermediate/marts layers
- MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)
- Production patterns: Data contracts, lineage, feedback loops, streaming features
Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|------|----------------|---------|-------------|
| EDA & Profiling | Pandas, Great Expectations | `df.describe()`, `ge.validate()` | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | `df.transform()`, Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | `lgb.train()`, `model.fit()` | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | `optuna.create_study()`, `tune.run()` | Optimizing model parameters |
| SQL Transformation | SQLMesh | `sqlmesh plan`, `sqlmesh run` | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | `mlflow.log_metric()`, `wandb.log()` | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | `metrics.roc_auc_score()`, slice analysis | Validating model performance |
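As a concrete starting point for the EDA & Profiling row, a minimal pandas profiling sketch (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_parquet("events.parquet")  # hypothetical input

# Summary statistics across numeric and non-numeric columns
print(df.describe(include="all"))

# Missingness per column, worst first
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Duplicates and candidate-key uniqueness (hypothetical key column)
print("duplicate rows:", df.duplicated().sum())
print("key is unique:", df["event_id"].is_unique)
```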
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative transformation: dbt (instead of SQLMesh)
- Ingestion: dlt, Airbyte (connectors)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.
Related Skills
For adjacent topics, reference:
- ai-mlops - APIs, batch jobs, monitoring, drift, data ingestion (dlt)
- ai-llm - LLM prompting, fine-tuning, evaluation
- ai-rag - RAG pipelines, chunking, retrieval
- ai-llm-inference - LLM inference optimization, quantization
- ai-ml-timeseries - Time series forecasting, backtesting
- qa-testing-strategy - Test-driven development, coverage
- data-sql-optimization - SQL optimization, index patterns (complements SQLMesh)
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Decision Tree: Choosing Data Science Approach
User needs ML for: [Problem Type]
- Tabular data?
  - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
  - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
  - High-dim sparse (text, counts)? -> Linear models, then shallow NN
- Time series?
  - Seasonality? -> LightGBM, then see ai-ml-timeseries
  - Long-term dependencies? -> Transformers (see ai-ml-timeseries)
- Text or mixed modalities?
  - LLMs/Transformers -> See ai-llm
- SQL transformations?
  - SQLMesh (staging/intermediate/marts layers)
Rule of thumb: for tabular data, tree-based gradient boosting is a strong baseline, but it must be validated against alternatives and constraints, as sketched below.
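A minimal tabular baseline sketch under that rule of thumb, using LightGBM's scikit-learn API on synthetic data (all parameters are illustrative):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic IID data; for temporal or grouped data use time/group splits instead
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05, random_state=42)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop when val AUC plateaus
)
print("val AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```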
Core Concepts (Vendor-Agnostic)
- Problem framing: define success metrics, baselines, and decision thresholds before modeling.
- Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate.
- Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics; see the sketch after this list.
- Reproducibility: version code/data/features, fix seeds, and record the environment.
- Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
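For the uncertainty bullet, a percentile-bootstrap confidence interval over a fixed evaluation set is one common approach; a minimal sketch (the function name is hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_score, metric=roc_auc_score,
                 n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for an evaluation metric."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined on single-class resamples
        stats.append(metric(y_true[idx], y_score[idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```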
Implementation Practices (Tooling Examples)
- Track experiments and artifacts (run id, commit hash, data version); see the tracking sketch below.
- Add data validation gates in pipelines (schema + distribution + freshness).
- Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
- Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
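A minimal tracking sketch for the first practice, using MLflow (the experiment name, data reference, and logged values are hypothetical):

```python
import subprocess

import mlflow

mlflow.set_experiment("churn-baseline")  # hypothetical experiment name

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run() as run:
    mlflow.set_tag("git_commit", commit)
    mlflow.set_tag("data_version", "lakefs://repo/main/abc123")  # hypothetical data ref
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_metric("val_auc", 0.87)
    print("run id:", run.info.run_id)
```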
Do / Avoid
Do
- Do start with baselines and a simple model to expose leakage and data issues early.
- Do run slice analysis and document failure modes before recommending deployment.
- Do keep an immutable eval set; refresh training data without contaminating evaluation.
Avoid
- Avoid random splits for temporal or user-correlated data (see the split sketch below).
- Avoid "metric gaming" (optimizing the number without validating business impact).
- Avoid training on labels created after the prediction timestamp (silent future leakage).
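To make the first "Avoid" concrete, a sketch of leakage-safe splitting with scikit-learn (the grouping key is hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X, y = make_classification(n_samples=1000, random_state=0)
user_id = np.random.default_rng(0).integers(0, 100, size=1000)  # hypothetical groups

# Temporal data: every fold trains on the past and validates on the future
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < val_idx.min()  # no future rows in training

# User-correlated data: a user never appears on both sides of a split
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=user_id):
    assert set(user_id[train_idx]).isdisjoint(user_id[val_idx])
```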
Core Patterns (Overview)
Pattern 1: End-to-End DS Project Lifecycle
Use when: Starting or restructuring any DS/ML project.
Stages:
1. Problem framing - Business objective, success metrics, baseline
2. Data & feasibility - Sources, coverage, granularity, label quality
3. EDA & data quality - Schema, missingness, outliers, leakage checks
4. Feature engineering - Per data type with feature store integration
5. Modelling - Baselines first, then LightGBM, then complexity as needed
6. Evaluation - Offline metrics, slice analysis, error analysis
7. Reporting - Model evaluation report + model card
8. MLOps - CI/CD, CT (continuous training), CM (continuous monitoring)
Detailed guide: EDA Best Practices
Pattern 2: Feature Engineering
Use when: Designing features before modelling or during model improvement.
By data type:
- Numeric: Standardize, handle outliers, transform skew, scale
- Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
- Feature Store Integration: Store encoders, mappings, statistics centrally
- Text: Cleaning, TF-IDF, embeddings, simple stats
- Time: Calendar features, recency, rolling/lag features
Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.
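A minimal leakage-safe sketch for the time bullets above, using pandas (table and column names hypothetical); the inner `shift(1)` keeps every feature strictly backward-looking:

```python
import pandas as pd

# Hypothetical per-user transaction log
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-03",
                          "2026-01-01", "2026-01-02"]),
    "amount": [10.0, 20.0, 15.0, 5.0, 7.0],
}).sort_values(["user_id", "ts"])

grp = df.groupby("user_id")["amount"]
df["amount_lag_1"] = grp.shift(1)  # previous transaction only
df["amount_roll_mean_3"] = grp.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()  # mean of past rows only
)
```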
Detailed guide: Feature Engineering Patterns
Pattern 3: Data Contracts & Lineage
Use when: Building production ML systems with data quality requirements.
Components:
- Contracts: Schema + ranges/nullability + freshness SLAs (see the contract sketch below)
- Lineage: Track source -> feature store -> train -> serve
- Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning
- Schema evolution: Backward/forward-compatible migrations with shadow runs
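A minimal contract sketch with Pandera (one of the validation libraries listed under External Resources); the schema fields and ranges are hypothetical, and freshness SLAs would be checked separately:

```python
import pandas as pd
import pandera as pa

contract = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, nullable=False),
        "amount": pa.Column(float, pa.Check.ge(0), nullable=False),
        "country": pa.Column(str, pa.Check.isin(["US", "DE", "JP"]), nullable=True),
    },
    strict=True,  # reject unexpected columns as a schema-evolution guard
)

batch = pd.DataFrame({"user_id": [1], "amount": [9.99], "country": ["US"]})
validated = contract.validate(batch)  # raises SchemaError on a contract breach
```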
Detailed guide: Data Contracts & Lineage
Pattern 4: Model Selection & Training
Use when: Picking model families and starting experiments.
Decision guide (modern benchmarks):
- Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis
- Baselines: Always implement simple baselines first (majority class, mean, naive forecast)
- Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)
- Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune); see the sketch below
- Overfitting control: Regularization, early stopping, cross-validation
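A minimal Bayesian-tuning sketch with Optuna over a LightGBM classifier (the search space and trial budget are illustrative):

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": 500,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "random_state": 42,
    }
    return cross_val_score(lgb.LGBMClassifier(**params), X, y,
                           cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```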
Detailed guide: Modelling Patterns
Pattern 5: Evaluation & Reporting
Use when: Finalizing a model candidate or handing over to production.
Key components:
- Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
- Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization
- Slice analysis: Performance by geography, user segments, product categories (see the sketch below)
- Error analysis: Collect high-error examples, cluster by error type, identify systematic failures
- Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
- Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
- Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)
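A minimal slice-analysis sketch (the DataFrame and column names are hypothetical); per-slice AUC plus sample counts makes weak segments visible:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def slice_report(eval_df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    rows = []
    for value, grp in eval_df.groupby(slice_col):
        if grp["label"].nunique() < 2:
            continue  # AUC is undefined for single-class slices
        rows.append({slice_col: value, "n": len(grp),
                     "auc": roc_auc_score(grp["label"], grp["score"]),
                     "positive_rate": grp["label"].mean()})
    return pd.DataFrame(rows).sort_values("auc")

rng = np.random.default_rng(0)  # synthetic stand-in for real eval data
eval_df = pd.DataFrame({"label": rng.integers(0, 2, 1000),
                        "score": rng.random(1000),
                        "country": rng.choice(["US", "DE", "JP"], 1000)})
print(slice_report(eval_df, "country"))
```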
Detailed guide: Evaluation Patterns
Pattern 6: Reproducibility & MLOps
Use when: Ensuring experiments are reproducible and production-ready.
Modern MLOps (CI/CD/CT/CM):
- CI (Continuous Integration): Automated testing, data validation, code quality
- CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment
- CT (Continuous Training): Drift-triggered and scheduled retraining
- CM (Continuous Monitoring): Real-time data drift, performance, system health
Versioning:
- Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)
- Seeds (reproducibility; see the helper below), hyperparameters (experiment tracker)
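A minimal seed-fixing helper for the seeds bullet (the PyTorch branch is optional and only runs if it is installed):

```python
import os
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Pin the common sources of nondeterminism for a run."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)  # GPU/cuDNN determinism needs extra flags
    except ImportError:
        pass
```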
Detailed guide: Reproducibility Checklist
Pattern 7: Feature Freshness & Streaming
Use when: Managing real-time features and streaming pipelines.
Components:
- Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches
- Batch + stream parity: Same feature logic across batch/stream, idempotent upserts
- Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback
- Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI; see the PSI sketch below)
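A minimal PSI sketch for the drift gate (assumes a continuous feature; the bin count and thresholds are conventional, not prescriptive):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```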
Detailed guide: Feature Freshness & Streaming
Pattern 8: Production Feedback Loops
Use when: Capturing production signals and implementing continuous improvement.
Components:
- Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII); see the logging sketch below
- Labeling: Route failures/edge cases to human review, create balanced sets
- Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set
- Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency
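A minimal signal-capture sketch (file-based for illustration; a real system would write to a queue or event store, and PII scrubbing is assumed upstream):

```python
import json
import time
import uuid

def log_prediction(features: dict, score: float, model_version: str,
                   path: str = "predictions.jsonl") -> str:
    """Append one prediction event; user feedback joins later on event_id."""
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": model_version,
        "features": features,  # assumed PII-scrubbed before this point
        "score": score,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event["event_id"]
```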
Detailed guide: Production Feedback Loops
Resources (Detailed Guides)
For comprehensive operational patterns and checklists, see:
- EDA Best Practices - Structured workflow for exploratory data analysis
- Feature Engineering Patterns - Operational patterns by data type
- Data Contracts & Lineage - Data quality, versioning, feature store ops
- Modelling Patterns - Model selection, hyperparameter tuning, train/test splits
- Evaluation Patterns - Metrics, slice analysis, evaluation reports, model cards
- Reproducibility Checklist - Experiment tracking, MLOps (CI/CD/CT/CM)
- Feature Freshness & Streaming - Real-time features, schema evolution
- Production Feedback Loops - Online learning, labeling, canary deployment
- Class Imbalance Patterns - Resampling, cost-sensitive learning, threshold tuning, evaluation for skewed datasets
- Hyperparameter Optimization - Bayesian optimization, early stopping, search strategies, budget allocation
- Interpretability & Explainability - SHAP, LIME, feature importance, model cards for regulated domains
Templates
Use these as copy-paste starting points:
Project & Workflow Templates
- Standard DS project template: assets/project/template-standard.md
- Quick DS experiment template: assets/project/template-quick.md
Feature Engineering & EDA
- Feature engineering template: assets/features/template-feature-engineering.md
- EDA checklist & notebook template: assets/eda/template-eda.md
Evaluation & Reporting
- Model evaluation report: assets/evaluation/template-evaluation-report.md
- Model card: assets/evaluation/template-model-card.md
- ML experiment review: assets/review/experiment-review-template.md
SQL Transformation (SQLMesh)
For SQL-based data transformation and feature engineering:
- SQLMesh project setup: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
- SQLMesh model types: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md (FULL, INCREMENTAL, VIEW)
- Incremental models: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
- DAG and dependencies: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
- Testing and data quality: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md
Use SQLMesh when:
- Building SQL-based feature pipelines
- Managing incremental data transformations
- Creating staging/intermediate/marts layers
- Testing SQL logic with unit tests and audits
For data ingestion (loading raw data), use:
- ai-mlops skill (dlt templates for REST APIs, databases, warehouses)
Navigation
Resources
- references/reproducibility-checklist.md
- references/evaluation-patterns.md
- references/feature-engineering-patterns.md
- references/modelling-patterns.md
- references/feature-freshness-streaming.md
- references/eda-best-practices.md
- references/data-contracts-lineage.md
- references/production-feedback-loops.md
- references/class-imbalance-patterns.md
- references/hyperparameter-optimization.md
- references/interpretability-explainability.md
Templates
- assets/project/template-standard.md
- assets/project/template-quick.md
- assets/features/template-feature-engineering.md
- assets/eda/template-eda.md
- assets/evaluation/template-evaluation-report.md
- assets/evaluation/template-model-card.md
- assets/review/experiment-review-template.md
- template-sqlmesh-project.md
- template-sqlmesh-model.md
- template-sqlmesh-incremental.md
- template-sqlmesh-dag.md
- template-sqlmesh-testing.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated foundational and implementation references:
- Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
- Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask
- SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)
- Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)
- Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
- Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash
- MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
- Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
- Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
- Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML
- Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake
Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.