Model Evaluation
Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.
When to use this skill
- Model training is complete and needs a performance assessment
- Comparing multiple models/algorithms
- Diagnosing overfitting/underfitting
- Hyperparameter tuning
- Production readiness check
Evaluation workflow
Cross-validation strategy
- K-fold (default for most cases)
- Stratified K-fold (classification with imbalance)
- TimeSeriesSplit (temporal data)
- GroupKFold (grouped/clustered data)
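The splitter choices above can be sketched with scikit-learn's splitter classes. This is a minimal illustration on synthetic data; the `groups` array and the 10% positive rate are assumptions made up for the example, not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)      # imbalanced labels (10% positive)
groups = np.repeat(np.arange(20), 5)   # hypothetical: 20 groups of 5 rows each

# Stratified K-fold keeps the class ratio roughly constant across folds.
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test].sum()) for _, test in stratified.split(X, y)]

# TimeSeriesSplit only ever trains on rows that come before the test rows.
time_split = TimeSeriesSplit(n_splits=5)
train_before_test = all(
    train.max() < test.min() for train, test in time_split.split(X)
)

# GroupKFold never places the same group in both train and test.
group_split = GroupKFold(n_splits=5)
group_overlap = [
    len(set(groups[train]) & set(groups[test]))
    for train, test in group_split.split(X, y, groups=groups)
]

print("positives per stratified test fold:", positives_per_fold)
print("time-ordered folds:", train_before_test)
print("shared groups per fold:", group_overlap)
```

Plain `KFold` remains a reasonable default when rows are independent, classes are balanced, and order carries no meaning.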
Choose appropriate metrics
- Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
- Regression: MAE, RMSE, R², MAPE
- Ranking: NDCG, MAP
- Business: custom metrics tied to outcomes
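To see why metric choice matters for classification, the sketch below scores a trivial "always predict the majority class" baseline on a synthetic, imbalanced label vector. The 95/5 class split is an assumption chosen for illustration. Accuracy looks strong while recall and F1 for the minority class are zero, and a constant score yields a ROC-AUC of 0.5 and a PR-AUC equal to the positive rate.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    recall_score,
    roc_auc_score,
)

y_true = np.array([0] * 95 + [1] * 5)     # synthetic labels, 5% positive
y_pred_majority = np.zeros_like(y_true)   # baseline: always predict class 0
constant_scores = np.full(len(y_true), 0.5)  # uninformative probability scores

accuracy = accuracy_score(y_true, y_pred_majority)
minority_recall = recall_score(y_true, y_pred_majority, zero_division=0)
minority_f1 = f1_score(y_true, y_pred_majority, zero_division=0)
baseline_roc_auc = roc_auc_score(y_true, constant_scores)
baseline_pr_auc = average_precision_score(y_true, constant_scores)

print(f"accuracy={accuracy:.2f} recall={minority_recall:.2f} f1={minority_f1:.2f}")
print(f"ROC-AUC={baseline_roc_auc:.2f} PR-AUC={baseline_pr_auc:.2f}")
```

Comparing a real model against these baseline values gives a quick check on whether a reported score reflects actual skill.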
Analyze performance
- Cross-validation mean ± std
- Validation curve (bias-variance tradeoff)
- Learning curves (data sufficiency)
- Error analysis by segment
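The validation and learning curves listed above can be computed with `validation_curve` and `learning_curve` from `sklearn.model_selection`. The following is a rough sketch on synthetic data; the decision tree, the `max_depth` grid, and the training-size fractions are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Validation curve: vary model complexity and compare train vs validation scores.
depths = [1, 3, 5, 10]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)
for depth, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={depth:>2}  train={tr:.3f}  val={va:.3f}")

# Learning curve: vary the amount of training data at fixed complexity.
sizes, lc_train, lc_val = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X,
    y,
    train_sizes=[0.2, 0.6, 1.0],
    cv=5,
    scoring="accuracy",
)
for n, tr, va in zip(sizes, lc_train.mean(axis=1), lc_val.mean(axis=1)):
    print(f"n_train={n:>3}  train={tr:.3f}  val={va:.3f}")
```

A widening gap between training and validation scores as depth grows suggests overfitting. A validation score that is still rising at the largest training size suggests more data could help.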
Model comparison
- Statistical significance (paired t-test, McNemar)
- Calibration (for probability outputs)
- Speed vs accuracy tradeoffs
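One way to apply the paired t-test mentioned above is to score both models on identical cross-validation folds and test the per-fold differences with `scipy.stats.ttest_rel`. This is a sketch under assumptions: the data is synthetic and deliberately easy, and the two models are placeholders. Because folds share training data, the resulting p-value tends to be optimistic, so it is best read as a rough signal rather than a strict test.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

# A fixed random_state makes both models see exactly the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

model_a = LogisticRegression(max_iter=1000)          # placeholder candidate
model_b = DummyClassifier(strategy="most_frequent")  # placeholder baseline

scores_a = cross_val_score(model_a, X, y, cv=cv, scoring="accuracy")
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring="accuracy")

# Paired test on per-fold score differences.
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"A: {scores_a.mean():.3f}  B: {scores_b.mean():.3f}  p={p_value:.4g}")
```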
Quick tool selection
| Task | Default choice | Notes |
| --- | --- | --- |
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series |
| Metrics | sklearn.metrics | Comprehensive metric suite |
| Hyperparameter tuning | Optuna or Ray Tune | Efficient search algorithms |
| Model comparison | scikit-learn + statistical tests | Paired comparisons |
| Experiment tracking | MLflow or Weights & Biases | Track runs, metrics, artifacts |
Core implementation rules
- Always use proper validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
- Match metrics to problem
# Classification with imbalance
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
# Focus on F1 and precision/recall for the minority class

# Regression (root_mean_squared_error requires scikit-learn 1.4 or later)
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
- Analyze errors systematically
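# Error by segment (assumes X_test is a pandas DataFrame with a 'category' column)
errors = y_pred != y_true
error_df = X_test[errors].copy()  # copy so new columns are not written to a view of X_test
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Analyze patterns in errors
print(error_df.groupby('category').size())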
- Track experiments
import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics({'auc': auc, 'f1': f1})
    mlflow.sklearn.log_model(model, 'model')
Common anti-patterns
- ❌ Single train/test split without CV
- ❌ Optimizing wrong metric (accuracy on imbalanced data)
- ❌ Data leakage in preprocessing
- ❌ Not checking calibration for probability outputs
- ❌ Ignoring inference speed/memory constraints
- ❌ No error analysis or debugging bad predictions
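A common way to avoid the preprocessing leakage listed above is to wrap preprocessing and model in a scikit-learn `Pipeline`, so that the scaler is fit only on each fold's training portion. The sketch below assumes synthetic data, with a `StandardScaler` plus logistic regression as a stand-in for a real preprocessing chain.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Leaky pattern (avoid): StandardScaler().fit_transform(X) before splitting
# lets statistics from the test folds influence training.

# Safer pattern: the pipeline refits the scaler inside every training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")

print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

The same idea extends to imputers, encoders, and feature selectors: anything that learns from data belongs inside the pipeline.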
Progressive disclosure
- ../references/cross-validation.md: CV strategies for different data types
- ../references/metrics-guide.md: Choosing and interpreting metrics
- ../references/hyperparameter-tuning.md: Optuna, Ray Tune patterns
- ../references/experiment-tracking.md: MLflow, W&B setup
Related skills
- @data-science-feature-engineering: Features to evaluate
- @data-engineering-orchestration: Production model deployment
- @data-engineering-observability: Model monitoring in production
References
- sklearn Model Selection
- sklearn Metrics
- Optuna Documentation
- MLflow Tracking