Model Evaluation

Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.

When to use this skill

Model training complete — need performance assessment
Comparing multiple models/algorithms
Diagnosing overfitting/underfitting
Hyperparameter tuning
Production readiness check

Evaluation workflow

Cross-validation strategy

K-fold (default for most cases)
Stratified K-fold (classification with imbalance)
TimeSeriesSplit (temporal data)
GroupKFold (grouped/clustered data)
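
As a rough illustration of why the splitter matters, the sketch below checks the property each strategy is meant to guarantee. The data is synthetic, and the `groups` array (12 hypothetical customers with 10 rows each) is an assumption made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.2).astype(int)   # imbalanced labels (about 20% positive)
groups = np.repeat(np.arange(12), 10)   # hypothetical: 12 customers, 10 rows each

# GroupKFold: no group may appear in both train and test of the same fold.
group_overlaps = [
    len(set(groups[train_idx]) & set(groups[test_idx]))
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups)
]

# TimeSeriesSplit: every training index precedes every test index.
time_ordered = [
    bool(train_idx.max() < test_idx.min())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X)
]

# StratifiedKFold: each test fold keeps roughly the overall positive rate.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_pos_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]

print("group overlaps per fold:", group_overlaps)
print("train precedes test:", time_ordered)
print("positive rate per fold:", np.round(fold_pos_rates, 3))
```

Plain K-fold on the same data would be free to put rows from one customer on both sides of a split, which tends to inflate scores.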

Choose appropriate metrics

Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
Regression: MAE, RMSE, R², MAPE
Ranking: NDCG, MAP
Business: custom metrics tied to outcomes
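
Ranking metrics are less familiar than classification or regression ones, so here is a minimal sketch of NDCG@k. It assumes linear gain (relevance divided by a log2 rank discount); some definitions use an exponential gain instead, and the helper names are made up for the example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_predicted_order, k):
    """DCG of the predicted ordering divided by DCG of the ideal ordering."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: treated as 0 here by convention
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg

perfect = [3, 2, 1, 0]   # most relevant item ranked first
swapped = [0, 2, 1, 3]   # most relevant item ranked last
print(ndcg_at_k(perfect, 4), ndcg_at_k(swapped, 4))
```

The relevance lists are given in the order the model ranked the items, so a perfect ordering scores 1.0 and any misordering scores lower.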

Analyze performance

Cross-validation mean ± std
Validation curve (bias-variance tradeoff)
Learning curves (data sufficiency)
Error analysis by segment
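
One possible way to produce the learning-curve numbers is sketched below. It uses a synthetic dataset and logistic regression purely as stand-ins; the model, sizes, and scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Refit the model on growing fractions of the training folds.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="roc_auc",
)

train_mean = train_scores.mean(axis=1)   # average over CV folds
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean              # large gap suggests overfitting

for n, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={int(n):4d}  train AUC={tr:.3f}  val AUC={va:.3f}  gap={g:.3f}")
```

A validation score that is still rising at the largest size suggests more data may help; a persistent gap between train and validation scores points toward overfitting, while two low, converged curves point toward underfitting.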

Model comparison

Statistical significance (paired t-test, McNemar's test)
Calibration (for probability outputs)
Speed vs accuracy tradeoffs
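
For comparing two classifiers on the same test set, a minimal sketch of the exact (binomial) form of McNemar's test follows. The function name and the toy predictions are made up for the example; in practice a library implementation may be preferable.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under the null hypothesis, discordant outcomes follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data: 20 positive items; A misses the last two, B misses items 10-17 and 19.
y_true = [1] * 20
pred_a = [1] * 18 + [0] * 2
pred_b = [1] * 10 + [0] * 8 + [1] + [0]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"only A right: {b}, only B right: {c}, p = {p_value:.4f}")
```

Only the items where exactly one model is correct enter the test, which is why it suits paired comparisons better than comparing two accuracy numbers directly.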

Quick tool selection

Task | Default choice | Notes
Cross-validation | sklearn.model_selection | Standard CV, stratified, time series
Metrics | sklearn.metrics | Comprehensive metric suite
Hyperparameter tuning | Optuna or Ray Tune | Efficient search algorithms
Model comparison | scikit-learn + statistical tests | Paired comparisons
Experiment tracking | MLflow or Weights & Biases | Track runs, metrics, artifacts

Core implementation rules

Always use proper validation

from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Match metrics to problem

# Classification with imbalance
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
# Focus on F1, precision/recall for minority class

# Regression
# Note: root_mean_squared_error requires scikit-learn 1.4 or later
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
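
To make the imbalance point concrete, the sketch below scores a hypothetical model that always predicts the majority class. The 95/5 class split is an invented example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Invented example: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)  # minority-class recall
f1 = f1_score(y_true, y_pred, zero_division=0)       # minority-class F1

print(f"accuracy={acc:.2f}  recall={rec:.2f}  f1={f1:.2f}")
```

Accuracy looks strong even though the model never finds a single positive, which is why the minority-class metrics are the ones to watch here.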

Analyze errors systematically

# Error by segment
errors = y_pred != y_true
error_df = X_test[errors].copy()  # copy so the new columns don't write into a view of X_test
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Analyze patterns in errors
print(error_df.groupby('category').size())
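
Raw error counts per category can mislead when segments differ in size, so a complementary view is the error rate per segment. The sketch below assumes a small invented results table with a `category` column; the column names are hypothetical.

```python
import pandas as pd

# Invented predictions for three segments of different sizes.
results = pd.DataFrame({
    "category": ["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "y_true":   [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred":   [1, 0, 1, 1, 0, 0, 0, 0, 1, 0],
})
results["is_error"] = results["y_true"] != results["y_pred"]

# Segment size, error count, and error rate, worst segment first.
by_segment = (
    results.groupby("category")["is_error"]
    .agg(n="size", errors="sum", error_rate="mean")
    .sort_values("error_rate", ascending=False)
)
print(by_segment)
```

Keeping the segment size next to the rate helps separate a genuinely weak segment from one that merely has few rows.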

Track experiments

import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics({'auc': auc, 'f1': f1})
    mlflow.sklearn.log_model(model, 'model')

Common anti-patterns

❌ Single train/test split without CV
❌ Optimizing wrong metric (accuracy on imbalanced data)
❌ Data leakage in preprocessing
❌ Not checking calibration for probability outputs
❌ Ignoring inference speed/memory constraints
❌ No error analysis or debugging bad predictions
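
For the preprocessing-leakage item in particular, one common remedy is to wrap preprocessing and model in a single pipeline so the scaler is fit only on each training fold. The sketch below is a minimal example on synthetic data; the choice of scaler and model is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler lives inside the pipeline, so cross_val_score refits it on the
# training portion of every fold and never sees the held-out rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Calling `StandardScaler().fit_transform(X)` on the full dataset before splitting would be the leaky variant, since held-out rows would influence the scaling statistics.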

Progressive disclosure

../references/cross-validation.md — CV strategies for different data types
../references/metrics-guide.md — Choosing and interpreting metrics
../references/hyperparameter-tuning.md — Optuna, Ray Tune patterns
../references/experiment-tracking.md — MLflow, W&B setup

Related skills

@data-science-feature-engineering — Features to evaluate
@data-engineering-orchestration — Production model deployment
@data-engineering-observability — Model monitoring in production

References

sklearn Model Selection
sklearn Metrics
Optuna Documentation
MLflow Tracking

