# Scikit-learn Machine Learning

Industry-standard Python library for classical machine learning.
## When to Use

- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines
## Algorithm Selection

### Classification

| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic outputs |
| Random Forest | General purpose | Handles non-linear data, feature importance |
| Gradient Boosting | Best accuracy | State of the art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training phase, instance-based |
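A minimal sketch of the baseline-then-ensemble workflow the table suggests, fitting a Logistic Regression baseline and a Random Forest on synthetic data (the dataset and hyperparameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic tabular data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Fast, interpretable baseline first; non-linear ensemble second.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
```

Comparing the ensemble against the baseline shows whether the added model complexity actually pays off.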
### Regression

| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs. L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost and LightGBM provide sklearn-compatible wrappers |
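A quick sketch of the L2 vs. L1 distinction: Ridge shrinks coefficients toward zero while Lasso can zero them out entirely, acting as feature selection (the synthetic data and `alpha` value are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually carry signal.
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: can drive coefficients to exactly 0

n_zeroed = int((lasso.coef_ == 0).sum())
```

Inspecting `lasso.coef_` after fitting shows which features the L1 penalty discarded.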
### Clustering

| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | `n_clusters` (must specify) |
| DBSCAN | Arbitrary shapes | `eps` (density) |
| Agglomerative | Hierarchical clustering | `n_clusters` or distance threshold |
| Gaussian Mixture | Soft clustering | `n_components` |
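The key-parameter contrast can be sketched on synthetic blobs: KMeans needs the cluster count up front, while DBSCAN infers it from density via `eps` (the data shape and `eps` value are assumptions for illustration):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated spherical blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans: n_clusters must be specified in advance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: cluster count emerges from the density parameter eps;
# points labeled -1 are treated as noise.
dbscan_labels = DBSCAN(eps=0.9).fit_predict(X)
```

On non-spherical shapes (e.g. `make_moons`) the two would disagree sharply, which is the main reason to reach for DBSCAN.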
### Dimensionality Reduction

| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local and global structure | Visualization + downstream tasks |
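A minimal PCA sketch on the iris dataset, standardizing first since PCA is variance-based (the choice of dataset and `n_components=2` are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # shape (150, 4)

# Standardize so no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the original variance retained by the 2 components.
explained = pca.explained_variance_ratio_.sum()
```

Checking `explained_variance_ratio_` tells you how much information the reduction throws away.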
## Pipeline Concepts

**Key concept:** Pipelines prevent data leakage by ensuring transformations are fit only on training data.

| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
Common preprocessing flow:

1. Impute missing values (`SimpleImputer`)
2. Scale numeric features (`StandardScaler`, `MinMaxScaler`)
3. Encode categoricals (`OneHotEncoder`, `OrdinalEncoder`)
4. Optional: feature selection or polynomial features
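The flow above can be sketched as a `Pipeline` wrapping a `ColumnTransformer` (the tiny DataFrame and its column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data with missing values and a categorical column.
df = pd.DataFrame({
    "age": [25, 32, None, 47],
    "income": [40_000, 55_000, 62_000, None],
    "city": ["NY", "SF", "NY", "LA"],
    "bought": [0, 1, 1, 0],
})
numeric, categorical = ["age", "income"], ["city"]

# Different transforms per column group: impute+scale numerics, encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# fit() runs every transform's fit on this data only, which is what
# prevents leakage when the pipeline is later used inside cross-validation.
model.fit(df[numeric + categorical], df["bought"])
preds = model.predict(df[numeric + categorical])
```

Because the whole chain is one estimator, it can be passed directly to `cross_val_score` or `GridSearchCV` without leaking test-fold statistics into the transforms.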
## Model Evaluation

### Cross-Validation Strategies

| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
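A sketch of `StratifiedKFold` on an imbalanced problem, where plain `KFold` could produce folds with almost no minority samples (the 80/20 class split and model choice are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced classes: roughly 80% negative, 20% positive.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# StratifiedKFold keeps the 80/20 class ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
```

The mean and spread of `scores` give a more reliable performance estimate than a single train/test split.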
### Metrics

| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| Classification | F1-score | Imbalanced classes |
| Classification | ROC-AUC | Ranking, threshold tuning |
| Classification | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Robust to outliers |
| Regression | R² | Explained variance |
## Hyperparameter Tuning

| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow with many parameters |
| RandomizedSearchCV | Faster | May miss the optimum |
| HalvingGridSearchCV | Efficient | Requires scikit-learn 0.24+ and the experimental `enable_halving_search_cv` import |
**Key concept:** Always tune on a validation set (or via cross-validation), and evaluate the final model once on a held-out test set.
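A sketch of that split: hold out the test set first, let `GridSearchCV` tune via cross-validation on the training portion, then score once on the held-out data (the dataset and parameter grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)

# Hold out a test set BEFORE tuning; the search never sees it.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,  # validation happens inside these folds of the training data
)
search.fit(X_train, y_train)

# Final, unbiased estimate: evaluated once on held-out data.
test_score = search.score(X_test, y_test)
```

Reporting `search.best_score_` (the cross-validated score) as the final number would be optimistically biased; `test_score` is the honest estimate.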
## Best Practices

| Practice | Why |
|---|---|
| Split data first | Prevents leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based models | KNN, SVM, and PCA need scaled features |
| Stratify imbalanced splits | Preserves class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over-/underfitting |
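The learning-curve check can be sketched with `learning_curve` (the dataset, model, and training-size grid are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, random_state=0)

# Score the model at increasing training-set sizes, with 5-fold CV at each size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5,
)

# A large train/validation gap suggests overfitting;
# both scores low and flat suggests underfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```

Plotting the two mean curves against `sizes` makes the diagnosis visual: converging curves mean more data won't help much.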
## Common Pitfalls

| Pitfall | Solution |
|---|---|
| Fitting scaler on all data | Use a pipeline, or fit only on the training set |
| Using accuracy on imbalanced data | Use F1, ROC-AUC, or balanced accuracy |
| Tuning too many hyperparameters | Start simple, add complexity gradually |
| Ignoring feature importance | Use `feature_importances_` or permutation importance |
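A sketch of the permutation-importance alternative, which measures the held-out score drop when each feature is shuffled (the dataset shape and `n_repeats` value are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 8 features, only 3 informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the score drop,
# computed on held-out data so memorized features aren't overstated.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
```

Unlike `feature_importances_`, this works for any fitted estimator and reflects importance on unseen data rather than impurity statistics from training.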
## Resources

- User Guide: https://scikit-learn.org/stable/user_guide.html
- Algorithm Cheat Sheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/