AI/ML Expert
Core Framework Guidelines
PyTorch
When reviewing or writing PyTorch code, apply these guidelines:
- Use `torch.nn.Module` for all model definitions; avoid raw function-based models
- Move tensors and models to the correct device explicitly: `model.to(device)`, `tensor.to(device)`
- Switch between `model.train()` and `model.eval()` modes appropriately
- Clear accumulated gradients with `optimizer.zero_grad()` at the top of each training iteration
- Use `torch.no_grad()` or `@torch.inference_mode()` for all inference code
- Pin memory (`pin_memory=True`) and use multiple workers in `DataLoader` for GPU training
- Use `torch.compile()` (PyTorch 2.x) for production inference speedups
- Prefer `F.cross_entropy` over manual softmax + `NLLLoss` (numerically stable)
TensorFlow / Keras
When reviewing or writing TensorFlow code, apply these guidelines:
- Use the Keras functional API or subclassing API; avoid `Sequential` for complex models
- Prefer `tf.data.Dataset` pipelines over manual batching for scalability
- Use `tf.function` for graph execution on performance-critical paths
- Apply mixed precision training: `tf.keras.mixed_precision.set_global_policy('mixed_float16')`
- Use `tf.saved_model` for portable model export; avoid pickling
Hugging Face Transformers
When reviewing or writing Hugging Face code, apply these guidelines:
- Always use the tokenizer associated with the model checkpoint
- Set `padding=True` and `truncation=True` when tokenizing batches
- Use `AutoModel`, `AutoTokenizer`, and `AutoConfig` for checkpoint portability
- Apply `model.gradient_checkpointing_enable()` to reduce memory for large models
- Use the `Trainer` API for standard fine-tuning; use custom loops only when `Trainer` is insufficient
- Cache models with the `TRANSFORMERS_CACHE` environment variable in CI/CD pipelines
scikit-learn
When reviewing or writing scikit-learn code, apply these guidelines:
- Use `Pipeline` to chain preprocessing and model steps; this prevents data leakage
- Use `StratifiedKFold` for classification tasks with class imbalance
- Prefer `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning
- Call `.fit()` only on training data; transform test data with the fitted transformer
- Serialize models with `joblib.dump` / `joblib.load` (faster than pickle for large arrays)
LLM Integration Patterns
Prompt Engineering
- Structure prompts with a clear system message, context, and user instruction
- Use few-shot examples in the system prompt for consistent output formatting
- Apply chain-of-thought prompting ("Think step by step...") for complex reasoning tasks
- Set `temperature=0` for deterministic, fact-based outputs; increase it for creative tasks
- Manage token budgets explicitly: estimate prompt tokens before sending
- Implement output parsing with structured formats (JSON mode, XML tags)
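The last two bullets can be sketched in plain Python. The chars/4 ratio is only a rough heuristic (use the provider's tokenizer, e.g. tiktoken, for exact counts), and the fence-stripping logic assumes the model may wrap JSON in a markdown code block:

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use the provider's tokenizer for exact counts before sending.
    return max(1, len(text) // 4)

def parse_json_output(raw: str) -> dict:
    # Strip markdown code fences the model may wrap around its JSON.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)
```

Checking the budget before the call (and failing fast when it is exceeded) is cheaper than retrying a rejected request.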
RAG Pipelines
Standard RAG pipeline components:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS  # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA

# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
```
RAG best practices:
- Chunk documents at natural boundaries (paragraphs, sections), not fixed character counts
- Use hybrid retrieval: combine dense embeddings with sparse BM25 for better recall
- Implement semantic caching for repeated queries to reduce latency and cost
- Validate retrieved context relevance before passing it to the LLM
- Store metadata alongside embeddings for filtering (date, source, author)
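Paragraph-boundary chunking from the first bullet needs no framework at all; a minimal sketch, where `max_chars` is an illustrative budget rather than a recommended value:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks up to max_chars.

    A paragraph longer than max_chars becomes its own chunk rather than
    being split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The same idea extends to section headings or sentence boundaries; the key point is that split points follow the document's structure, not an arbitrary character offset.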
LangChain / LangGraph
- Use LCEL (LangChain Expression Language) for composable chains
- Apply `RunnableParallel` for concurrent retrieval steps
- Use LangGraph for stateful multi-agent workflows with cycles
- Implement retry logic with `RunnableRetry` for unreliable external calls
- Trace and evaluate chains with LangSmith in development
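Outside LangChain, the retry pattern behind `RunnableRetry` amounts to exponential backoff around an unreliable call; a minimal generic sketch (the parameter defaults are illustrative):

```python
import time

def with_retry(fn, max_attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on any exception with exponential backoff.

    Delays grow as base_delay * 2**(attempt - 1); the final failure
    propagates to the caller instead of being swallowed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In production you would typically narrow the caught exception types to transient errors (timeouts, rate limits) so genuine bugs still fail fast.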
Training Loop Standards
Standard PyTorch training loop with best practices:

```python
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()

    # Validation loop
    model.eval()
    with torch.no_grad():
        for batch in val_dataloader:
            ...  # evaluate
```
Key standards:
- Proper train/validation/test splits: 80/10/10, or stratified for imbalanced datasets
- Gradient clipping (`max_norm=1.0`) for stability in Transformer training
- Learning rate scheduling: cosine annealing with warmup for Transformers
- Early stopping based on validation loss, not training loss
- Checkpoint the best model by validation metric, not the final epoch
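Early stopping and best-checkpoint tracking (the last two bullets) reduce to a little bookkeeping; this framework-agnostic sketch assumes you save a checkpoint whenever `best_epoch` updates:

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs,
    and remember which epoch held the best model for checkpointing."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record this epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Because the monitored quantity is validation loss, a model that keeps improving on training data while overfitting still triggers the stop.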
Fine-Tuning Standards
Full Fine-Tuning
- Reduce the learning rate 10-100x compared to training from scratch
- Freeze early layers; fine-tune the upper layers and task head first
- Use discriminative learning rates: lower LR for pretrained layers, higher for newly added layers
- Apply label smoothing (`smoothing=0.1`) to reduce overconfidence
Parameter-Efficient Fine-Tuning (PEFT)
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # verify < 1% of parameters are trainable
```
PEFT guidelines:
- Use LoRA rank `r=8` to `r=64`; higher rank = more capacity, more memory
- Use QLoRA (4-bit quantization + LoRA) for fine-tuning 7B+ models on consumer GPUs
- Merge adapter weights before serving to eliminate inference overhead
- Prefer adapter-based methods over full fine-tuning for limited data (< 10K examples)
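A rough sanity check for the "< 1% trainable" claim: LoRA adds two matrices, A (r x hidden) and B (hidden x r), per adapted projection. The dimensions below (4096 hidden size, 32 layers, 2 target modules, a ~7B-parameter base) are illustrative assumptions, not the config above:

```python
def lora_trainable_params(hidden: int, r: int, n_layers: int, n_targets: int) -> int:
    # Each adapted square projection (hidden x hidden) gains A (r x hidden)
    # and B (hidden x r), i.e. 2 * r * hidden extra parameters.
    return n_layers * n_targets * 2 * r * hidden

added = lora_trainable_params(hidden=4096, r=16, n_layers=32, n_targets=2)
fraction = added / 7e9  # relative to a ~7B-parameter base model
```

With these numbers the adapters contribute on the order of 8M parameters, about 0.1% of the base model, which is why `print_trainable_parameters()` typically reports well under 1%.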
MLOps and Experiment Tracking
MLflow
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
```
Weights & Biases
```python
import wandb

wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()
```
MLOps standards:
- Log every hyperparameter and dataset version before training starts
- Track system metrics (GPU utilization, memory, throughput) alongside model metrics
- Version datasets with DVC or Delta Lake; never overwrite raw data
- Use reproducible seeds: `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)`
- Register production models in a model registry with stage gates (Staging → Production)
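The seed bullet can be wrapped in one helper; in this sketch the NumPy/PyTorch calls are guarded so it also runs where those libraries are absent:

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness in one place."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # also seeds CUDA on current PyTorch
    except ImportError:
        pass
```

Call it once at the start of every run, and log the seed alongside the other hyperparameters so the run can be replayed.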
Model Evaluation Standards
Metrics by Task Type
| Task | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary Classification | AUC-ROC, F1, Precision/Recall | Calibration (Brier Score) |
| Multi-class | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual Analysis |
| NLP Generation | BLEU, ROUGE, BERTScore | Human Evaluation |
| Ranking/Retrieval | NDCG@k, MRR, MAP | Hit Rate@k |
| LLM Evaluation | LLM-as-judge, exact match, pass@k | Hallucination Rate |
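As a reference point for the table above, macro F1 is just the unweighted mean of per-class F1 scores; a dependency-free sketch (in practice use `sklearn.metrics.f1_score(average='macro')`):

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores: rare classes count
    equally with common ones, unlike plain accuracy."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Weighted F1 differs only in averaging per-class scores weighted by class support.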
Evaluation Best Practices
- Never tune hyperparameters on the test set; use a held-out validation set
- Report confidence intervals (bootstrap or cross-validation) for all metrics
- Disaggregate metrics by subgroup for fairness analysis
- Use statistical significance tests (McNemar, paired t-test) when comparing models
- Establish a simple baseline before reporting model results
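The percentile bootstrap from the second bullet, sketched over per-example 0/1 correctness scores; `n_boot` and `alpha` are conventional defaults, not requirements:

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-example scores (e.g. 0/1 correctness)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean, and take percentiles.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Reporting "0.80 (95% CI 0.72-0.87)" instead of a bare 0.80 makes it obvious when two models are statistically indistinguishable on a small test set.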
Production ML Systems
Model Deployment
- Export to ONNX for cross-platform inference: `torch.onnx.export(model, ...)`
- Use TorchServe, Triton Inference Server, or BentoML for serving
- Apply quantization for CPU deployment: `torch.quantization.quantize_dynamic(model, ...)`
- Set up batching with a maximum batch size and timeout to balance throughput against latency
- Use model warming (pre-load and run a dummy inference) to eliminate cold-start latency
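The batching bullet (maximum batch size plus timeout) can be sketched as a queue drain. Serving frameworks like Triton implement this natively, so treat this as an illustration of the throughput/latency tradeoff rather than a serving recommendation:

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int, timeout_s: float) -> list:
    """Drain up to max_batch requests, waiting at most timeout_s for the
    batch to fill, then return whatever arrived."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired with a partial batch
    return batch
```

A larger `max_batch` raises GPU throughput; a shorter `timeout_s` caps the latency any single request pays waiting for companions.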
Monitoring and Drift Detection
Example: data drift detection with Evidently
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```
Monitoring standards:
- Track feature distribution drift (KS test, PSI) on a daily schedule
- Alert on prediction distribution shift (concept drift)
- Log and sample model inputs/outputs for downstream evaluation
- Implement shadow mode (run the new model alongside production and compare outputs)
- Define retraining triggers based on drift thresholds, not fixed schedules
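PSI from the first bullet is simple enough to compute directly. The 0.1/0.25 thresholds in the docstring are a common rule of thumb, and the epsilon guard for empty bins is an implementation choice:

```python
import math

def psi(reference: list, current: list, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0
    eps = 1e-6  # avoid log(0) for empty bins

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1  # clip values outside the reference range
        return [max(c / len(sample), eps) for c in counts]

    r, c = fractions(reference), fractions(current)
    return sum((ci - ri) * math.log(ci / ri) for ri, ci in zip(r, c))
```

Bin edges come from the reference (training-time) sample, so the same edges are reused for every production window being scored.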
Data Preprocessing Standards
Proper train/test split to avoid leakage
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)
```
Fit scaler ONLY on training data
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform
```
Standards:
- Use a separate preprocessing pipeline per data modality (text, image, tabular)
- Validate schema and types before data enters the pipeline
- Handle missing values with domain-aware strategies (median, mode, forward-fill)
- Detect and document outliers; do not silently remove them
- Apply augmentation only to training data, never to validation or test data
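The fit-on-train-only rule from the scaler snippet above, restated as a minimal pure-Python scaler (use `StandardScaler` in practice; this just makes the leakage boundary explicit):

```python
class Standardizer:
    """Minimal fit/transform scaler: statistics come from training data
    only; test data is transformed with those same statistics."""

    def fit(self, xs: list) -> "Standardizer":
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # guard against constant features
        return self

    def transform(self, xs: list) -> list:
        return [(x - self.mean) / self.std for x in xs]
```

If `fit` ever sees test data, the test set's mean and variance leak into preprocessing and the reported performance is optimistically biased.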
Iron Laws
- ALWAYS fix random seeds and log all hyperparameters before training — non-reproducible experiments cannot be shared, audited, or debugged; use `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)` and log via MLflow/W&B.
- NEVER fit preprocessing transformers on test data — fit only on training data, then `.transform()` the test data; fitting on test causes data leakage and inflated performance estimates.
- ALWAYS evaluate with multiple metrics aligned to business goals — never report accuracy alone on imbalanced datasets; use F1, the precision-recall curve, and ROC-AUC at minimum.
- NEVER tune hyperparameters on the test set — use a held-out validation set for tuning; the test set is a one-time final evaluation only.
- ALWAYS establish a simple baseline before reporting model results — a heuristic or random baseline is mandatory; without it, model quality cannot be assessed.
Anti-Patterns
| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Ignoring class imbalance | Model biased to majority class | Stratified sampling, class weights, SMOTE |
| No validation set | Overfitting undetected | Hold out 10-20% for validation |
| Optimizing a single metric | Missing failure modes | Multiple metrics (precision, recall, F1, AUC) |
| No baseline comparison | Cannot assess model quality | Establish heuristic baseline before ML |
| Accuracy on imbalanced data | Misleading performance estimate | Use F1, precision-recall curve, ROC-AUC |
| Data leakage (test in train) | Inflated performance estimates | Fit on train only; transform test with fitted obj |
| No error analysis | Cannot improve strategically | Analyze failure cases by error type |
| Training without checkpoints | Lost progress on failure | Save best model by validation metric |
| Mutable global random state | Non-reproducible experiments | Fix all seeds; log in experiment metadata |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC) |
| No latency budget | Inference too slow for production | Profile and set SLO before deployment |
Training a Transformer classifier:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
Minimal RAG pipeline:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")
```
Assigned Agents
This skill is used by:
- developer — Implements ML models, data pipelines, and LLM integrations
- researcher — Investigates novel architectures and evaluates research papers
- architect — Designs ML system architecture and deployment topology
- security-architect — Reviews data privacy, model security, and inference safety
Related Skills
- python-backend-expert — NumPy, Pandas, async Python patterns
- code-analyzer — Static analysis and complexity metrics for ML code
- debugging — Systematic debugging for training failures and inference errors
Memory Protocol (MANDATORY)
Before starting:
```shell
cat .claude/context/memory/learnings.md
```
Check for:
- Previously solved ML patterns in this codebase
- Known library version pinning requirements
- Infrastructure constraints (GPU type, memory limits)
After completing:
- New ML pattern or fix → .claude/context/memory/learnings.md
- Training failure root cause → .claude/context/memory/issues.md
- Architecture decision (framework choice, deployment strategy) → .claude/context/memory/decisions.md
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.