Scikit-learn Debugging Guide
This guide provides a systematic approach to debugging Scikit-learn machine learning code. Follow these phases to identify and resolve issues efficiently.
Common Error Patterns
- ValueError: Shapes Not Aligned
Error: ValueError: shapes (100,5) and (4,) not aligned
Cause: Feature count mismatch between train and test data
Debug steps:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Feature names train: {X_train.columns.tolist() if hasattr(X_train, 'columns') else 'N/A'}")
print(f"Feature names test: {X_test.columns.tolist() if hasattr(X_test, 'columns') else 'N/A'}")
Common fixes:
1. Ensure same preprocessing on train and test
2. Use Pipeline to encapsulate all transformations
3. Check for columns dropped during one-hot encoding
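Fix 2 can be sketched as follows. The toy DataFrame and column names are made up for illustration; the key points are that the same fitted transformer is applied to both splits, and `handle_unknown='ignore'` keeps the encoded feature count identical on train and test:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): the test split does not contain every training category
X_train = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"], "income": [50, 60, 70, 55]})
X_test = pd.DataFrame({"city": ["NY", "LA"], "income": [52, 58]})
y_train = [0, 1, 1, 0]

# handle_unknown='ignore' keeps the encoded width identical on train and test
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)  # same transformer applied to both splits, no shape mismatch
```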
- NotFittedError
Error: NotFittedError: This StandardScaler instance is not fitted yet
Cause: Calling transform() or predict() before fit()
from sklearn.utils.validation import check_is_fitted
Check if model is fitted:
try:
    check_is_fitted(model)
    print("Model is fitted")
except Exception as e:
    print(f"Model not fitted: {e}")
Debug fitted attributes:
print(f"Model attributes: {[a for a in dir(model) if a.endswith('_') and not a.startswith('__')]}")
Common fixes:
1. Call fit() before transform() or predict()
2. Use fit_transform() for training data
3. Ensure Pipeline is fitted before prediction
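The correct fit/transform order from fixes 1 and 2 looks like this on a toy array (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])  # toy values
X_test = np.array([[2.5]])

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit AND transform on training data
X_test_s = scaler.transform(X_test)        # transform only on test data

# After fitting, trailing-underscore attributes exist and check_is_fitted passes
print(scaler.mean_)
```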
- NaN Values in Input
Error: ValueError: Input contains NaN, infinity or a value too large
Cause: Missing or infinite values in data
import numpy as np
import pandas as pd
Diagnose NaN issues:
def diagnose_nan_issues(X, name="X"):
    if isinstance(X, pd.DataFrame):
        nan_counts = X.isna().sum()
        print(f"{name} NaN counts per column:\n{nan_counts[nan_counts > 0]}")
        print(f"{name} total NaN: {X.isna().sum().sum()}")
    else:
        print(f"{name} contains NaN: {np.isnan(X).any()}")
        print(f"{name} contains inf: {np.isinf(X).any()}")
        print(f"{name} NaN count: {np.isnan(X).sum()}")
diagnose_nan_issues(X_train, "X_train")
diagnose_nan_issues(X_test, "X_test")
Common fixes:
from sklearn.impute import SimpleImputer
Option 1: Remove rows with NaN
X_clean = X[~np.isnan(X).any(axis=1)]
Option 2: Impute missing values
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_train)
Option 3: Replace infinity
X_train = np.clip(X_train, -1e10, 1e10)
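Note that Option 2's imputer should be fitted on the training split only and then reused on the test split; otherwise the fix itself leaks test-set statistics. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])  # toy values
X_test = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # medians learned from train only
X_test_imp = imputer.transform(X_test)        # same medians reused on test
```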
- Feature Mismatch Train/Test
Error: ValueError: X has 10 features, but model expects 12 features
Cause: Different preprocessing on train vs test
Debug feature alignment:
def debug_feature_mismatch(X_train, X_test, model=None):
    print(f"Train features: {X_train.shape[1]}")
    print(f"Test features: {X_test.shape[1]}")
    if model and hasattr(model, 'n_features_in_'):
        print(f"Model expects: {model.n_features_in_} features")
    if hasattr(X_train, 'columns') and hasattr(X_test, 'columns'):
        train_cols = set(X_train.columns)
        test_cols = set(X_test.columns)
        print(f"In train but not test: {train_cols - test_cols}")
        print(f"In test but not train: {test_cols - train_cols}")
Fix: Use ColumnTransformer with remainder='passthrough' or 'drop'
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='drop'  # Explicitly handle remaining columns
)
- Cross-Validation Issues
Error: ValueError: Cannot have number of splits n_splits=5 greater than the number of samples
Cause: Too few samples for specified fold count
from sklearn.model_selection import cross_val_score, StratifiedKFold
Debug cross-validation setup:
def debug_cv_setup(X, y, cv=5):
    print(f"Total samples: {len(X)}")
    print(f"CV folds requested: {cv}")
    print(f"Min samples per fold: {len(X) // cv}")
    if hasattr(y, 'value_counts'):
        print(f"Class distribution:\n{y.value_counts()}")
    else:
        unique, counts = np.unique(y, return_counts=True)
        print(f"Class distribution: {dict(zip(unique, counts))}")
Fix: Use appropriate CV strategy
For small datasets:
from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold
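For example, on a dataset too small for 5-fold CV, LeaveOneOut builds one fold per sample and never runs out of splits (the tiny arrays below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Six samples: k-fold with n_splits=5 would be fragile here
X = np.array([[0.0], [0.1], [1.0], [1.1], [0.05], [1.05]])
y = np.array([0, 0, 1, 1, 0, 1])

# One fold per sample, so the n_splits error cannot occur
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(len(scores))
```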
For imbalanced data:
cv = StratifiedKFold(n_splits=min(5, y.value_counts().min()))
For time series:
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
- Pipeline Configuration Errors
Error: TypeError: All estimators should implement fit and transform
Cause: Final estimator in Pipeline doesn't have transform method
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
Debug pipeline structure:
def debug_pipeline(pipe):
    print("Pipeline steps:")
    for name, step in pipe.named_steps.items():
        has_fit = hasattr(step, 'fit')
        has_transform = hasattr(step, 'transform')
        has_predict = hasattr(step, 'predict')
        print(f"  {name}: fit={has_fit}, transform={has_transform}, predict={has_predict}")
        print(f"    Type: {type(step).__name__}")
        if hasattr(step, 'get_params'):
            params = step.get_params()
            print(f"    Params: {params}")
debug_pipeline(my_pipeline)
Common fixes:
1. Only the last step can be a predictor (no transform)
2. Intermediate steps must have fit_transform or fit + transform
3. Use 'passthrough' for no-op steps
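Fix 3 in practice: the string `'passthrough'` is accepted as a step and can later be swapped for a real transformer with `set_params`. A sketch on synthetic data (whether you need a placeholder step at all depends on your pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# 'passthrough' is a valid no-op placeholder for an intermediate step
pipe = Pipeline([
    ("scaler", "passthrough"),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Later, swap the placeholder for a real transformer without rebuilding the pipeline
pipe.set_params(scaler=StandardScaler())
pipe.fit(X, y)
```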
- Convergence Warnings
Warning: ConvergenceWarning: lbfgs failed to converge
Cause: Optimization didn't reach convergence criteria
from sklearn.linear_model import LogisticRegression
from sklearn.exceptions import ConvergenceWarning
import warnings
Capture and analyze warnings:
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    model.fit(X_train, y_train)

for warning in w:
    if issubclass(warning.category, ConvergenceWarning):
        print(f"Convergence issue: {warning.message}")
Common fixes:
1. Increase max_iter
model = LogisticRegression(max_iter=1000)
2. Scale features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
3. Try different solver
model = LogisticRegression(solver='saga', max_iter=1000)
4. Adjust tolerance
model = LogisticRegression(tol=1e-3)
- Data Leakage Detection
Symptom: Suspiciously high cross-validation scores (>0.99)
Cause: Information from test set leaking into training
Debug data leakage:
def check_for_leakage(X, y, model, cv=5):
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"CV scores: {scores}")
    print(f"Mean: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    if scores.mean() > 0.99:
        print("WARNING: Suspiciously high scores - check for data leakage!")
        print("Common causes:")
        print("  - Target variable encoded in features")
        print("  - Future information in time series")
        print("  - Preprocessing before train-test split")
    return scores
Fix: Use Pipeline to prevent leakage
from sklearn.pipeline import Pipeline
WRONG - leakage:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fitted on ALL data
X_train, X_test = train_test_split(X_scaled, ...)
CORRECT - no leakage:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
fit_transform only sees training data in each CV fold
scores = cross_val_score(pipe, X, y, cv=5)
Debugging Tools
Model Inspection
Get all model parameters
print(model.get_params())
Get only non-default parameters
print(model)  # the repr shows only non-default parameters by default (sklearn >= 0.23)
Check fitted attributes (attributes ending with _)
fitted_attrs = [a for a in dir(model) if a.endswith('_') and not a.startswith('__')]
for attr in fitted_attrs:
    val = getattr(model, attr)
    if hasattr(val, 'shape'):
        print(f"{attr}: shape={val.shape}")
    else:
        print(f"{attr}: {type(val).__name__}")
Cross-Validation Diagnostics
from sklearn.model_selection import cross_val_score, cross_validate
Basic CV score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
Detailed CV with multiple metrics
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True,
    return_estimator=True
)

for metric in ['accuracy', 'precision', 'recall', 'f1']:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric}:")
    print(f"  Train: {cv_results[train_key].mean():.4f}")
    print(f"  Test: {cv_results[test_key].mean():.4f}")
    print(f"  Gap: {cv_results[train_key].mean() - cv_results[test_key].mean():.4f}")
Learning Curve Analysis
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curve(estimator, X, y, cv=5, train_sizes=None):
    if train_sizes is None:
        train_sizes = np.linspace(0.1, 1.0, 10)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes,
        scoring='accuracy', n_jobs=-1
    )
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    test_mean = test_scores.mean(axis=1)
    test_std = test_scores.std(axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training score')
    plt.plot(train_sizes, test_mean, 'o-', label='Cross-validation score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
    plt.xlabel('Training Examples')
    plt.ylabel('Score')
    plt.title('Learning Curve')
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()

    # Diagnose from learning curve
    final_gap = train_mean[-1] - test_mean[-1]
    if final_gap > 0.1:
        print("DIAGNOSIS: High variance (overfitting)")
        print("  - Try regularization")
        print("  - Reduce model complexity")
        print("  - Get more training data")
    elif test_mean[-1] < 0.7:
        print("DIAGNOSIS: High bias (underfitting)")
        print("  - Increase model complexity")
        print("  - Add more features")
        print("  - Reduce regularization")
Classification Metrics
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_curve, roc_auc_score
)
Comprehensive classification report
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
Visualize confusion matrix
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Debug class imbalance
print("\nClass Distribution:")
print(f"Train: {np.bincount(y_train)}")
print(f"Test: {np.bincount(y_test)}")
Pandas Output Configuration
Enable pandas output for transformers (sklearn 1.2+)
from sklearn import set_config
set_config(transform_output="pandas")
Now transformers return DataFrames with column names
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # Returns a DataFrame!
print(X_scaled.head())
print(X_scaled.columns.tolist())
Reset to default
set_config(transform_output="default")
The Four Phases (Sklearn-specific)
Phase 1: Data Validation
def validate_data(X, y, name="Dataset"):
    """Comprehensive data validation before training."""
    print(f"=== {name} Validation ===")

    # Shape check
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape if hasattr(y, 'shape') else len(y)}")

    # Type check
    print(f"X dtype: {X.dtype if hasattr(X, 'dtype') else type(X)}")
    print(f"y dtype: {y.dtype if hasattr(y, 'dtype') else type(y)}")

    # NaN/Inf check
    X_arr = np.asarray(X)
    print(f"X contains NaN: {np.isnan(X_arr).any()}")
    print(f"X contains Inf: {np.isinf(X_arr).any()}")

    # Value range
    print(f"X value range: [{X_arr.min():.4f}, {X_arr.max():.4f}]")

    # Target distribution
    if len(np.unique(y)) <= 20:  # Classification
        unique, counts = np.unique(y, return_counts=True)
        print(f"Class distribution: {dict(zip(unique, counts))}")
        # Check for class imbalance
        ratio = max(counts) / min(counts)
        if ratio > 10:
            print(f"WARNING: Severe class imbalance (ratio: {ratio:.1f})")
    else:  # Regression
        print(f"y range: [{y.min():.4f}, {y.max():.4f}]")
        print(f"y mean: {y.mean():.4f}, std: {y.std():.4f}")

    return True
Phase 2: Model Configuration Validation
def validate_model_config(model, X, y):
    """Validate model configuration before fitting."""
    print("=== Model Configuration Validation ===")

    params = model.get_params()
    print(f"Model: {type(model).__name__}")
    print(f"Parameters: {params}")

    # Check for common misconfigurations
    issues = []

    # Classification-specific checks
    n_classes = len(np.unique(y))
    if n_classes == 2:
        # Binary classification
        if hasattr(model, 'multi_class') and params.get('multi_class') == 'multinomial':
            issues.append("Using multinomial for binary classification")

    # Regularization checks
    if hasattr(model, 'C'):
        if params.get('C', 1.0) > 1000:
            issues.append("Very weak regularization (high C)")
        if params.get('C', 1.0) < 0.001:
            issues.append("Very strong regularization (low C)")

    # n_estimators check for ensemble methods
    if hasattr(model, 'n_estimators'):
        n_est = params.get('n_estimators', 100)
        if n_est < 10:
            issues.append(f"Low n_estimators ({n_est})")

    # max_depth check
    if hasattr(model, 'max_depth'):
        max_depth = params.get('max_depth')
        if max_depth is not None and max_depth > 50:
            issues.append(f"Deep tree (max_depth={max_depth}) - potential overfitting")

    if issues:
        print("Potential issues:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("No obvious configuration issues found")

    return len(issues) == 0
Phase 3: Training Diagnostics
def train_with_diagnostics(model, X_train, y_train, X_test, y_test):
    """Train model with comprehensive diagnostics."""
    import time
    import warnings
    from sklearn.exceptions import ConvergenceWarning

    print("=== Training Diagnostics ===")

    # Capture warnings
    with warnings.catch_warnings(record=True) as caught_warnings:
        warnings.simplefilter("always")
        start_time = time.time()
        model.fit(X_train, y_train)
        train_time = time.time() - start_time

    print(f"Training time: {train_time:.2f}s")

    # Report warnings
    if caught_warnings:
        print("\nWarnings during training:")
        for w in caught_warnings:
            print(f"  - {w.category.__name__}: {w.message}")

    # Training vs test scores
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"\nTraining score: {train_score:.4f}")
    print(f"Test score: {test_score:.4f}")
    print(f"Gap: {train_score - test_score:.4f}")

    # Diagnose
    if train_score - test_score > 0.2:
        print("\nDIAGNOSIS: Significant overfitting detected")
    elif test_score < 0.6:
        print("\nDIAGNOSIS: Model may be underfitting")
    elif train_score > 0.99:
        print("\nDIAGNOSIS: Perfect training score - possible data leakage")

    return model
Phase 4: Prediction Validation
def validate_predictions(model, X_test, y_test):
    """Validate model predictions."""
    print("=== Prediction Validation ===")

    y_pred = model.predict(X_test)

    # Basic checks
    print(f"Prediction shape: {y_pred.shape}")
    print(f"Unique predictions: {np.unique(y_pred)}")

    # Check for constant predictions
    if len(np.unique(y_pred)) == 1:
        print("WARNING: Model predicting only one class!")
        print(f"  Predicted class: {y_pred[0]}")
        print(f"  Actual class distribution: {np.bincount(y_test.astype(int))}")

    # Check prediction distribution against actual distribution
    pred_dist = np.bincount(y_pred.astype(int), minlength=len(np.unique(y_test)))
    actual_dist = np.bincount(y_test.astype(int))
    print(f"\nPrediction distribution: {pred_dist}")
    print(f"Actual distribution: {actual_dist}")

    # Probability calibration check (if available)
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)
        print(f"\nProbability range: [{y_proba.min():.4f}, {y_proba.max():.4f}]")
        # Check for overconfident predictions
        max_proba = y_proba.max(axis=1)
        if (max_proba > 0.99).mean() > 0.5:
            print("WARNING: Many overconfident predictions (>99% probability)")
Quick Reference Commands
Data Inspection
Quick data summary
print(X.describe() if hasattr(X, 'describe') else f"Shape: {X.shape}, dtype: {X.dtype}")
Check for issues
print(f"NaN: {np.isnan(X).sum()}, Inf: {np.isinf(X).sum()}")
Feature statistics
print(f"Mean: {X.mean(axis=0)}")
print(f"Std: {X.std(axis=0)}")
Model Debugging
Check if fitted
from sklearn.utils.validation import check_is_fitted
check_is_fitted(model)
Get parameters
model.get_params()
Get feature importances (tree-based models)
model.feature_importances_
Get coefficients (linear models)
model.coef_, model.intercept_
Pipeline Debugging
Inspect pipeline steps
pipe.named_steps
Get intermediate results
pipe[:-1].transform(X)
Debug specific step
pipe.named_steps['scaler'].mean_
Quick Diagnostics
One-liner diagnostics
from sklearn.model_selection import cross_val_score
print(f"CV: {cross_val_score(model, X, y, cv=5).mean():.4f}")
Quick classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
Learning curve quick check
from sklearn.model_selection import learning_curve
sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)
# Scores have shape (n_sizes, n_folds): index the last training size, then average over folds
print(f"Learning curve gap: {train_scores[-1].mean() - test_scores[-1].mean():.4f}")
Memory and Performance
Check memory usage
import sys
print(f"X memory: {sys.getsizeof(X) / 1024**2:.2f} MB")
Use sparse matrices for high-dimensional data
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)
Reduce precision
X_float32 = X.astype(np.float32)
Debugging with Verbose Mode
Enable verbose output during training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(verbose=1)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(verbose=2)

from sklearn.svm import SVC
model = SVC(verbose=True)
Sources
This guide was compiled using information from:
- Scikit-learn Common Pitfalls Documentation
- Scikit-learn Developer Tips for Debugging
- 50+ Common Scikit-learn Mistakes and Solutions
- Sling Academy: Scikit-Learn Common Errors Series
- Best Practices in Scikit-learn
- 8 Mistakes I Made with Scikit-learn