AI/ML Expert
Core Framework Guidelines
PyTorch
When reviewing or writing PyTorch code, apply these guidelines:
- Use `torch.nn.Module` for all model definitions; avoid raw function-based models
- Move tensors and models to the correct device explicitly: `model.to(device)`, `tensor.to(device)`
- Switch between `model.train()` and `model.eval()` modes appropriately
- Clear accumulated gradients with `optimizer.zero_grad()` at the top of each training iteration
- Use `torch.no_grad()` or `@torch.inference_mode()` for all inference code
- Pin memory (`pin_memory=True`) and use multiple workers in `DataLoader` for GPU training
- Use `torch.compile()` (PyTorch 2.x) for production inference speedups
- Prefer `F.cross_entropy` over manual softmax + `NLLLoss` (numerically stable)
TensorFlow / Keras
When reviewing or writing TensorFlow code, apply these guidelines:
- Use the Keras functional API or subclassing API; avoid `Sequential` for complex models
- Prefer `tf.data.Dataset` pipelines over manual batching for scalability
- Use `tf.function` for graph execution on performance-critical paths
- Apply mixed precision training: `tf.keras.mixed_precision.set_global_policy('mixed_float16')`
- Use `tf.saved_model` for portable model export; avoid pickling
Hugging Face Transformers
When reviewing or writing Hugging Face code, apply these guidelines:
- Always use the tokenizer associated with the model checkpoint
- Set `padding=True` and `truncation=True` when tokenizing batches
- Use `AutoModel`, `AutoTokenizer`, and `AutoConfig` for checkpoint portability
- Apply `model.gradient_checkpointing_enable()` to reduce memory for large models
- Use the `Trainer` API for standard fine-tuning; use custom loops only when `Trainer` is insufficient
- Cache models with the `TRANSFORMERS_CACHE` environment variable in CI/CD pipelines
scikit-learn
When reviewing or writing scikit-learn code, apply these guidelines:
- Use `Pipeline` to chain preprocessing and model steps; this prevents data leakage
- Use `StratifiedKFold` for classification tasks with class imbalance
- Prefer `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning
- Call `.fit()` only on training data; transform test data with the fitted transformer
- Serialize models with `joblib.dump` / `joblib.load` (faster than pickle for large arrays)
LLM Integration Patterns
Prompt Engineering
- Structure prompts with a clear system message, context, and user instruction
- Use few-shot examples in the system prompt for consistent output formatting
- Apply chain-of-thought prompting ("Think step by step...") for complex reasoning tasks
- Set `temperature=0` for deterministic, fact-based outputs; increase it for creative tasks
- Manage token budgets explicitly: estimate prompt tokens before sending
- Implement output parsing with structured formats (JSON mode, XML tags)
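The last two bullets can be sketched in plain Python. The chars/4 ratio is only a rough heuristic (use the provider's tokenizer, e.g. tiktoken, for exact counts), and the fence-stripping logic assumes the model may wrap JSON in a markdown code block:

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use the provider's tokenizer for exact counts before sending.
    return max(1, len(text) // 4)

def parse_json_output(raw: str) -> dict:
    # Strip markdown code fences the model may wrap around its JSON.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)
```

Checking the budget before the call (and failing fast when it is exceeded) is cheaper than retrying a rejected request.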
RAG Pipelines
Standard RAG pipeline components:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS  # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA

# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
```
RAG best practices:
- Chunk documents at natural boundaries (paragraphs, sections), not fixed character counts
- Use hybrid retrieval: combine dense embeddings with sparse BM25 for better recall
- Implement semantic caching for repeated queries to reduce latency and cost
- Validate retrieved context relevance before passing it to the LLM
- Store metadata alongside embeddings for filtering (date, source, author)
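Paragraph-boundary chunking from the first bullet needs no framework at all; a minimal sketch, where `max_chars` is an illustrative budget rather than a recommended value:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks up to max_chars.

    A paragraph longer than max_chars becomes its own chunk rather than
    being split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The same idea extends to section headings or sentence boundaries; the key point is that split points follow the document's structure, not an arbitrary character offset.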
LangChain / LangGraph
- Use LCEL (LangChain Expression Language) for composable chains
- Apply `RunnableParallel` for concurrent retrieval steps
- Use LangGraph for stateful multi-agent workflows with cycles
- Implement retry logic with `RunnableRetry` for unreliable external calls
- Trace and evaluate chains with LangSmith in development
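Outside LangChain, the retry pattern behind `RunnableRetry` amounts to exponential backoff around an unreliable call; a minimal generic sketch (the parameter defaults are illustrative):

```python
import time

def with_retry(fn, max_attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on any exception with exponential backoff.

    Delays grow as base_delay * 2**(attempt - 1); the final failure
    propagates to the caller instead of being swallowed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In production you would typically narrow the caught exception types to transient errors (timeouts, rate limits) so genuine bugs still fail fast.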
Training Loop Standards
Standard PyTorch training loop with best practices:

```python
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()

    # Validation loop
    model.eval()
    with torch.no_grad():
        for batch in val_dataloader:
            ...  # evaluate
```
Key standards:
- Proper train/validation/test splits: 80/10/10, or stratified for imbalanced datasets
- Gradient clipping (`max_norm=1.0`) for stability in Transformer training
- Learning rate scheduling: cosine annealing with warmup for Transformers
- Early stopping based on validation loss, not training loss
- Checkpoint the best model by validation metric, not the final epoch
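Early stopping and best-checkpoint tracking (the last two bullets) reduce to a little bookkeeping; this framework-agnostic sketch assumes you save a checkpoint whenever `best_epoch` updates:

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs,
    and remember which epoch held the best model for checkpointing."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record this epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Because the monitored quantity is validation loss, a model that keeps improving on training data while overfitting still triggers the stop.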
Fine-Tuning Standards
Full Fine-Tuning
- Reduce the learning rate 10-100x compared to training from scratch
- Freeze early layers; fine-tune the upper layers and task head first
- Use discriminative learning rates: lower LR for pretrained layers, higher for newly added layers
- Apply label smoothing (`smoothing=0.1`) to reduce overconfidence
Parameter-Efficient Fine-Tuning (PEFT)
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # verify < 1% of parameters are trainable
```
PEFT guidelines:
- Use LoRA rank `r=8` to `r=64`; higher rank = more capacity, more memory
- Use QLoRA (4-bit quantization + LoRA) for fine-tuning 7B+ models on consumer GPUs
- Merge adapter weights before serving to eliminate inference overhead
- Prefer adapter-based methods over full fine-tuning for limited data (< 10K examples)
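A rough sanity check for the "< 1% trainable" claim: LoRA adds two matrices, A (r x hidden) and B (hidden x r), per adapted projection. The dimensions below (4096 hidden size, 32 layers, 2 target modules, a ~7B-parameter base) are illustrative assumptions, not the config above:

```python
def lora_trainable_params(hidden: int, r: int, n_layers: int, n_targets: int) -> int:
    # Each adapted square projection (hidden x hidden) gains A (r x hidden)
    # and B (hidden x r), i.e. 2 * r * hidden extra parameters.
    return n_layers * n_targets * 2 * r * hidden

added = lora_trainable_params(hidden=4096, r=16, n_layers=32, n_targets=2)
fraction = added / 7e9  # relative to a ~7B-parameter base model
```

With these numbers the adapters contribute on the order of 8M parameters, about 0.1% of the base model, which is why `print_trainable_parameters()` typically reports well under 1%.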
MLOps and Experiment Tracking
MLflow
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
```
Weights & Biases
```python
import wandb

wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()
```
MLOps standards:
- Log every hyperparameter and dataset version before training starts
- Track system metrics (GPU utilization, memory, throughput) alongside model metrics
- Version datasets with DVC or Delta Lake; never overwrite raw data
- Use reproducible seeds: `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)`
- Register production models in a model registry with stage gates (Staging → Production)
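The seed bullet can be wrapped in one helper; in this sketch the NumPy/PyTorch calls are guarded so it also runs where those libraries are absent:

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness in one place."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # also seeds CUDA on current PyTorch
    except ImportError:
        pass
```

Call it once at the start of every run, and log the seed alongside the other hyperparameters so the run can be replayed.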
Model Evaluation Standards
Metrics by Task Type
| Task | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary Classification | AUC-ROC, F1, Precision/Recall | Calibration (Brier Score) |
| Multi-class | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual Analysis |
| NLP Generation | BLEU, ROUGE, BERTScore | Human Evaluation |
| Ranking/Retrieval | NDCG@k, MRR, MAP | Hit Rate@k |
| LLM Evaluation | LLM-as-judge, exact match, pass@k | Hallucination Rate |
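As a reference point for the table above, macro F1 is just the unweighted mean of per-class F1 scores; a dependency-free sketch (in practice use `sklearn.metrics.f1_score(average='macro')`):

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores: rare classes count
    equally with common ones, unlike plain accuracy."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Weighted F1 differs only in averaging per-class scores weighted by class support.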
Evaluation Best Practices
- Never tune hyperparameters on the test set; use a held-out validation set
- Report confidence intervals (bootstrap or cross-validation) for all metrics
- Disaggregate metrics by subgroup for fairness analysis
- Use statistical significance tests (McNemar, paired t-test) when comparing models
- Establish a simple baseline before reporting model results
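The percentile bootstrap from the second bullet, sketched over per-example 0/1 correctness scores; `n_boot` and `alpha` are conventional defaults, not requirements:

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-example scores (e.g. 0/1 correctness)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean, and take percentiles.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Reporting "0.80 (95% CI 0.72-0.87)" instead of a bare 0.80 makes it obvious when two models are statistically indistinguishable on a small test set.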
Production ML Systems
Model Deployment
- Export to ONNX for cross-platform inference: `torch.onnx.export(model, ...)`
- Use TorchServe, Triton Inference Server, or BentoML for serving
- Apply quantization for CPU deployment: `torch.quantization.quantize_dynamic(model, ...)`
- Set up batching with a maximum batch size and timeout to balance throughput against latency
- Use model warming (pre-load and run a dummy inference) to eliminate cold-start latency
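The batching bullet (maximum batch size plus timeout) can be sketched as a queue drain. Serving frameworks like Triton implement this natively, so treat this as an illustration of the throughput/latency tradeoff rather than a serving recommendation:

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int, timeout_s: float) -> list:
    """Drain up to max_batch requests, waiting at most timeout_s for the
    batch to fill, then return whatever arrived."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired with a partial batch
    return batch
```

A larger `max_batch` raises GPU throughput; a shorter `timeout_s` caps the latency any single request pays waiting for companions.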
Monitoring and Drift Detection
Example: data drift detection with Evidently
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```
Monitoring standards:
- Track feature distribution drift (KS test, PSI) on a daily schedule
- Alert on prediction distribution shift (concept drift)
- Log and sample model inputs/outputs for downstream evaluation
- Implement shadow mode (run the new model alongside production and compare outputs)
- Define retraining triggers based on drift thresholds, not fixed schedules
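PSI from the first bullet is simple enough to compute directly. The 0.1/0.25 thresholds in the docstring are a common rule of thumb, and the epsilon guard for empty bins is an implementation choice:

```python
import math

def psi(reference: list, current: list, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0
    eps = 1e-6  # avoid log(0) for empty bins

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1  # clip values outside the reference range
        return [max(c / len(sample), eps) for c in counts]

    r, c = fractions(reference), fractions(current)
    return sum((ci - ri) * math.log(ci / ri) for ri, ci in zip(r, c))
```

Bin edges come from the reference (training-time) sample, so the same edges are reused for every production window being scored.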
Data Preprocessing Standards
Proper train/test split to avoid leakage
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)
```
Fit scaler ONLY on training data
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform
```
Standards:
- Use a separate preprocessing pipeline per data modality (text, image, tabular)
- Validate schema and types before data enters the pipeline
- Handle missing values with domain-aware strategies (median, mode, forward-fill)
- Detect and document outliers; do not silently remove them
- Apply augmentation only to training data, never to validation or test data
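The fit-on-train-only rule from the scaler snippet above, restated as a minimal pure-Python scaler (use `StandardScaler` in practice; this just makes the leakage boundary explicit):

```python
class Standardizer:
    """Minimal fit/transform scaler: statistics come from training data
    only; test data is transformed with those same statistics."""

    def fit(self, xs: list) -> "Standardizer":
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # guard against constant features
        return self

    def transform(self, xs: list) -> list:
        return [(x - self.mean) / self.std for x in xs]
```

If `fit` ever sees test data, the test set's mean and variance leak into preprocessing and the reported performance is optimistically biased.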
Iron Laws
- ALWAYS fix random seeds and log all hyperparameters before training — non-reproducible experiments cannot be shared, audited, or debugged; use `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)` and log via MLflow/W&B.
- NEVER fit preprocessing transformers on test data — fit only on training data, then `.transform()` the test data; fitting on test causes data leakage and inflated performance estimates.
- ALWAYS evaluate with multiple metrics aligned to business goals — never report accuracy alone on imbalanced datasets; use F1, the precision-recall curve, and ROC-AUC at minimum.
- NEVER tune hyperparameters on the test set — use a held-out validation set for tuning; the test set is a one-time final evaluation only.
- ALWAYS establish a simple baseline before reporting model results — a heuristic or random baseline is mandatory; without it, model quality cannot be assessed.
Anti-Patterns
| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Ignoring class imbalance | Model biased to majority class | Stratified sampling, class weights, SMOTE |
| No validation set | Overfitting undetected | Hold out 10-20% for validation |
| Optimizing a single metric | Missing failure modes | Multiple metrics (precision, recall, F1, AUC) |
| No baseline comparison | Cannot assess model quality | Establish heuristic baseline before ML |
| Accuracy on imbalanced data | Misleading performance estimate | Use F1, precision-recall curve, ROC-AUC |
| Data leakage (test in train) | Inflated performance estimates | Fit on train only; transform test with fitted obj |
| No error analysis | Cannot improve strategically | Analyze failure cases by error type |
| Training without checkpoints | Lost progress on failure | Save best model by validation metric |
| Mutable global random state | Non-reproducible experiments | Fix all seeds; log in experiment metadata |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC) |
| No latency budget | Inference too slow for production | Profile and set SLO before deployment |
Training a Transformer classifier:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
Minimal RAG pipeline:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")
```
Assigned Agents
This skill is used by:
- developer — Implements ML models, data pipelines, and LLM integrations
- researcher — Investigates novel architectures and evaluates research papers
- architect — Designs ML system architecture and deployment topology
- security-architect — Reviews data privacy, model security, and inference safety
Related Skills
- python-backend-expert — NumPy, Pandas, async Python patterns
- code-analyzer — Static analysis and complexity metrics for ML code
- debugging — Systematic debugging for training failures and inference errors
Memory Protocol (MANDATORY)
Before starting:
```shell
cat .claude/context/memory/learnings.md
```
Check for:
- Previously solved ML patterns in this codebase
- Known library version pinning requirements
- Infrastructure constraints (GPU type, memory limits)
After completing:
- New ML pattern or fix → .claude/context/memory/learnings.md
- Training failure root cause → .claude/context/memory/issues.md
- Architecture decision (framework choice, deployment strategy) → .claude/context/memory/decisions.md
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.