# ML Inference Optimization

## When to Use This Skill

Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
**Keywords:** inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
## Inference Optimization Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│                     Inference Optimization Stack                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                          Model Level                         │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Compiler Level                        │  │
│  │    Graph optimization │ Operator fusion │ Memory planning    │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Runtime Level                         │  │
│  │   Batching │ Caching │ Async execution │ Multi-threading     │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Hardware Level                        │  │
│  │     GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
## Model Compression Techniques

### Technique Overview

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|-----------|----------------|-------------------|-----------------|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
### Knowledge Distillation
```
┌─────────────────────────────────────────────────────────────────────┐
│                        Knowledge Distillation                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐                                                  │
│   │ Teacher Model│  (Large, accurate, slow)                         │
│   │    GPT-4     │                                                  │
│   └──────────────┘                                                  │
│          │                                                          │
│          ▼  Soft labels (probability distributions)                 │
│   ┌─────────────────────────────────────────────────────────────┐  │
│   │                      Training Process                       │  │
│   │  Loss = α × CrossEntropy(student, hard_labels)              │  │
│   │       + (1-α) × KL_Div(student, teacher_soft_labels)        │  │
│   └─────────────────────────────────────────────────────────────┘  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────┐                                                  │
│   │Student Model │  (Small, nearly as accurate, fast)               │
│   │  DistilBERT  │                                                  │
│   └──────────────┘                                                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
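A minimal PyTorch sketch of this combined loss, assuming logit-shaped inputs; the temperature `T`, weight `alpha`, and the `T*T` gradient-scaling factor are conventional choices rather than prescribed by the diagram:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combined hard-label + soft-label loss from the diagram above."""
    # Hard-label term: ordinary cross-entropy against ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence to the temperature-softened teacher.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard scaling so soft-target gradients stay comparable across T
    return alpha * hard + (1 - alpha) * soft
```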
Distillation Types:

| Type | Description | Use Case |
|------|-------------|----------|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
### Pruning Strategies
Unstructured Pruning (Weight-level):

```
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0,  0.7]   (50% sparse)
```

- Flexible, high sparsity possible
- Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):

```
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘

After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│   (Removed C2 entirely)
        └───┴───┴───┘
```

- Works with standard hardware
- Lower compression ratio
Pruning Decision Criteria:

| Method | Description | Effectiveness |
|--------|-------------|---------------|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
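A short sketch of magnitude-based (L1) unstructured pruning using PyTorch's built-in pruning utilities; the layer size and the 50% sparsity level are illustrative:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(256, 256)
# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)
# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")
print(f"sparsity: {(layer.weight == 0).float().mean():.2f}")  # ≈ 0.50
```

Note that without sparse kernels this only shrinks the effective parameter count, not the dense compute; the speedups in the table above assume sparse-aware hardware or libraries.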
### Quantization (Detailed)

Precision Hierarchy:

```
FP32 (32 bits):  ████████████████████████████████
FP16 (16 bits):  ████████████████
BF16 (16 bits):  ████████████████   (different mantissa/exponent split)
INT8 (8 bits):   ████████
INT4 (4 bits):   ████
Binary (1 bit):  █
```

Memory and compute scale roughly proportionally with bit width.
Quantization Approaches:

| Approach | When Applied | Quality | Effort |
|----------|--------------|---------|--------|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training, with calibration | Better | Medium |
| Quantization-aware training (QAT) | During training | Best | High |
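A minimal sketch of the lowest-effort row, post-training dynamic quantization in PyTorch (model architecture and shapes are illustrative). Weights are stored as INT8 and activations are quantized on the fly:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

# Replace Linear layers with dynamically quantized INT8 equivalents.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = qmodel(torch.randn(1, 512))
```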
## Compiler-Level Optimization

### Graph Optimization
Original Graph:

```
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
```

Optimized Graph (Operator Fusion):

```
Input → FusedConvBNReLU → FusedConvBNReLU → Output
```

Benefits:
- Fewer kernel launches
- Better memory locality
- Reduced memory bandwidth
### Common Optimizations

| Optimization | Description | Speedup |
|--------------|-------------|---------|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
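In practice these passes are applied automatically by the inference compiler. A minimal sketch with ONNX Runtime, assuming an exported `model.onnx` file whose input tensor is named `input`:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph-level optimizations (fusion, constant folding, etc.).
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "model_opt.onnx"  # persist the fused graph

session = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
```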
### Optimization Frameworks

| Framework | Vendor | Best For |
|-----------|--------|----------|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
## Runtime Optimization

### Batching Strategies
No Batching:

```
Request 1: [Process] → Response 1   10ms
Request 2: [Process] → Response 2   10ms
Request 3: [Process] → Response 3   10ms
Total: 30ms, GPU underutilized
```

Dynamic Batching:

```
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
```

Trade-off: latency vs. throughput
- Larger batches: higher throughput, higher latency
- Smaller batches: lower latency, lower throughput
Batching Parameters (a minimal batcher sketch follows the table):

| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| `batch_size` | Maximum batch size | Throughput vs. latency |
| `max_wait_time` | Wait time for batch fill | Latency vs. efficiency |
| `min_batch_size` | Minimum before processing | Latency predictability |
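A minimal asyncio sketch of dynamic batching, assuming a hypothetical `run_model` function that processes a list of inputs in a single call; the knob values mirror the parameters above and are illustrative:

```python
import asyncio

MAX_BATCH_SIZE = 32   # batch_size
MAX_WAIT_S = 0.005    # max_wait_time (5 ms)

requests: asyncio.Queue = asyncio.Queue()

async def infer(x):
    """Client-facing call: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await requests.put((x, fut))
    return await fut

async def batcher(run_model):
    """Collect requests until the batch fills or the wait deadline passes."""
    while True:
        batch = [await requests.get()]  # block until the first request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(requests.get(), timeout))
            except asyncio.TimeoutError:
                break
        inputs, futures = zip(*batch)
        # One batched model call serves every waiting request.
        for fut, out in zip(futures, run_model(list(inputs))):
            fut.set_result(out)
```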
### Caching Strategies

```
┌─────────────────────────────────────────────────────────────────────┐
│                       Inference Caching Layers                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                  │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities      │   │
│  │ Hit rate: Medium (common tokens repeat)                     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                         │   │
│  │ Hit rate: High (reuse across tokens in sequence)            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                 │   │
│  │ Hit rate: Variable (depends on query distribution)          │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Semantic Caching for LLMs:

```
Query: "What's the capital of France?"
        ↓
  Hash + embed query
        ↓
  Search cache (similarity > threshold)
        ↓
  ├── Hit:  Return cached response
  └── Miss: Generate → Cache → Return
```
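A small sketch of the idea, assuming a hypothetical `embed` function that returns unit-normalized query vectors and a hypothetical `generate` function for the expensive model call; the 0.92 threshold is illustrative:

```python
import numpy as np

SIM_THRESHOLD = 0.92  # minimum cosine similarity to count as a hit

class SemanticCache:
    def __init__(self):
        self.vectors = []    # unit-normalized query embeddings
        self.responses = []  # cached model outputs

    def answer(self, query, embed, generate):
        vec = embed(query)
        if self.vectors:
            # Dot product of unit vectors = cosine similarity.
            sims = np.stack(self.vectors) @ vec
            best = int(np.argmax(sims))
            if sims[best] >= SIM_THRESHOLD:
                return self.responses[best]  # hit: skip the model entirely
        response = generate(query)           # miss: run the model
        self.vectors.append(vec)
        self.responses.append(response)
        return response
```

A production version would use an approximate nearest-neighbor index rather than a linear scan, plus eviction and staleness policies.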
### Async and Parallel Execution

Sequential:

```
┌─────┐   ┌─────┐   ┌─────┐
│Prep │ → │Model│ → │Post │   Total: 30ms
│10ms │   │15ms │   │5ms  │
└─────┘   └─────┘   └─────┘
```

Pipelined:

```
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│
```

Throughput: 3x higher. Latency per request: unchanged.
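A minimal threaded sketch of this pipelined layout, with placeholder stage functions standing in for real preprocessing, inference, and postprocessing:

```python
import queue
import threading

def preprocess(r):  return r   # placeholder for the ~10ms stage
def run_model(x):   return x   # placeholder for the ~15ms stage
def postprocess(y): return y   # placeholder for the ~5ms stage

def stage_worker(fn, inq, outq):
    """Run one stage in its own thread so stages overlap across requests."""
    while True:
        item = inq.get()
        if item is None:       # sentinel: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(item))

queues = [queue.Queue() for _ in range(4)]
for fn, inq, outq in zip([preprocess, run_model, postprocess], queues, queues[1:]):
    threading.Thread(target=stage_worker, args=(fn, inq, outq), daemon=True).start()

for request in ["r1", "r2", "r3"]:
    queues[0].put(request)
queues[0].put(None)
while (result := queues[-1].get()) is not None:
    print(result)
```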
## Hardware Acceleration

### Hardware Comparison

| Hardware | Strengths | Limitations | Best For |
|----------|-----------|-------------|----------|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large-batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited model support | Mobile, edge |
| CPU | Flexible, widely available | Slower for ML | Low-batch, CPU-bound workloads |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
### GPU Optimization

| Optimization | Description | Impact |
|--------------|-------------|--------|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
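As a small example of the first row, a PyTorch sketch that runs inference under FP16 autocast so eligible ops can use Tensor Cores (model and input are illustrative; requires a CUDA device):

```python
import torch

model = torch.nn.Linear(1024, 1024).eval().cuda()
x = torch.randn(8, 1024, device="cuda")

# inference_mode disables autograd; autocast runs eligible ops in FP16.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    y = model(x)
```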
## Edge Deployment

### Edge Constraints

```
┌─────────────────────────────────────────────────────────────────────┐
│                     Edge Deployment Constraints                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory:  1-4 GB      (vs. 64+ GB cloud)                        │
│  ├── Compute: 1-10 TOPS   (vs. 100+ TFLOPS cloud)                   │
│  ├── Power:   5-15W       (vs. 300W+ cloud)                         │
│  └── Storage: 16-128 GB   (vs. TB cloud)                            │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                             │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Edge Optimization Strategies

| Strategy | Description | Use When |
|----------|-------------|----------|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to a tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
### Edge ML Frameworks

| Framework | Platform | Features |
|-----------|----------|----------|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
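A minimal TensorFlow Lite conversion sketch, assuming a SavedModel at an illustrative `saved_model_dir` path; `Optimize.DEFAULT` enables the converter's standard optimizations, including weight quantization:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```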
## Latency Profiling

### Profiling Methodology

```
┌─────────────────────────────────────────────────────────────────────┐
│                      Latency Breakdown Analysis                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:            ████████░░░░░░░░░░  15%                │
│  2. Preprocessing:           ██████░░░░░░░░░░░░  10%                │
│  3. Model Inference:         ████████████████░░  60%                │
│  4. Postprocessing:          ████░░░░░░░░░░░░░░   8%                │
│  5. Response Serialization:  ███░░░░░░░░░░░░░░░   7%                │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Profiling Tools

| Tool | Use For |
|------|---------|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
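A minimal PyTorch Profiler sketch (model and input are illustrative; add `ProfilerActivity.CUDA` when profiling on GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Show the most expensive operators first.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```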
### Key Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meets demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
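The percentile targets come straight from recorded per-request latencies, e.g.:

```python
import numpy as np

latencies_ms = np.array([12.1, 9.8, 11.4, 48.0, 10.2, 11.9])  # illustrative sample
p50, p99 = np.percentile(latencies_ms, [50, 99])
print(f"P50 = {p50:.1f} ms, P99 = {p99:.1f} ms")
```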
## Optimization Workflow

### Systematic Approach

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Optimization Workflow                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy)│
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use the right accelerator                         │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)             │
│     ├── Runtime:  Batching, caching, async                          │
│     ├── Model:    Quantization, pruning                             │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Optimization Priority Matrix

```
                           High Impact
                                │
          Compiler Opts ────────┼──────── Quantization
           (easy win)           │          (best ROI)
                                │
Low Effort ─────────────────────┼───────────────────── High Effort
                                │
               Batching ────────┼──────── Distillation
             (quick win)        │         (major effort)
                                │
                           Low Impact
```
## Common Patterns

### Multi-Model Serving

```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Request → ┌─────────┐                                             │
│             │ Router  │                                             │
│             └─────────┘                                             │
│                  │                                                  │
│        ┌─────────┼─────────┐                                        │
│        ▼         ▼         ▼                                        │
│   ┌───────┐ ┌───────┐ ┌───────┐                                     │
│   │ Tiny  │ │ Small │ │ Large │                                     │
│   │ <10ms │ │ <50ms │ │<500ms │                                     │
│   └───────┘ └───────┘ └───────┘                                     │
│                                                                     │
│   Routing strategies:                                               │
│   • Complexity-based: Simple→Tiny, Complex→Large                    │
│   • Confidence-based: Try Tiny, escalate if low confidence          │
│   • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
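A sketch of the confidence-based strategy, assuming hypothetical `tiny_model` and `large_model` callables that return `(prediction, confidence)` pairs; the threshold is illustrative:

```python
CONFIDENCE_THRESHOLD = 0.9

def route(x, tiny_model, large_model):
    pred, conf = tiny_model(x)       # fast path (<10ms)
    if conf >= CONFIDENCE_THRESHOLD:
        return pred                  # confident enough: stop here
    pred, _ = large_model(x)         # escalate to the accurate model
    return pred
```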
### Speculative Execution

```
Query: "Translate: Hello"
  │
  ├──▶ Small model (draft):  "Bonjour"        (5ms)
  │
  └──▶ Large model (verify): Check "Bonjour"  (10ms, in parallel)
         │
         ├── Accept: Return immediately
         └── Reject: Generate with large model
```
Speedup: 2-3x when drafts are often accepted
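A highly simplified greedy sketch of the draft-then-verify loop, with hypothetical `draft_next` and `verify_next` single-token functions for the small and large models. Real systems verify the whole draft block in one batched large-model forward pass rather than token by token:

```python
def speculative_generate(prompt, draft_next, verify_next, k=4, max_len=32):
    tokens = list(prompt)
    while len(tokens) < max_len:
        # Small model drafts k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Large model checks the draft; keep the longest agreeing prefix.
        kept = 0
        for i in range(k):
            if verify_next(tokens + draft[:i]) != draft[i]:
                break
            kept += 1
        tokens += draft[:kept]
        if kept < k:
            # First disagreement: take the large model's token instead.
            tokens.append(verify_next(tokens))
    return tokens
```

With greedy acceptance the output is identical to what the large model alone would produce; the speedup comes from verifying several draft tokens per large-model pass.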
### Cascade Models

```
Input → ┌────────┐
        │ Filter │  ← Cheap filter (reject obvious negatives)
        └────────┘
            │ (candidates only)
            ▼
        ┌────────┐
        │ Stage 1│  ← Fast model (coarse ranking)
        └────────┘
            │ (top-100)
            ▼
        ┌────────┐
        │ Stage 2│  ← Accurate model (fine ranking)
        └────────┘
            │ (top-10)
            ▼
          Output
```
Benefit: roughly 10x cheaper at similar accuracy, since the expensive model only scores a small fraction of candidates.
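A compact sketch of the cascade, with hypothetical `cheap_filter`, `fast_score`, and `accurate_score` functions standing in for the three stages:

```python
def cascade_rank(candidates, cheap_filter, fast_score, accurate_score):
    survivors = [c for c in candidates if cheap_filter(c)]           # reject obvious negatives
    top100 = sorted(survivors, key=fast_score, reverse=True)[:100]   # coarse ranking
    top10 = sorted(top100, key=accurate_score, reverse=True)[:10]    # fine ranking
    return top10
```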
## Optimization Checklist

### Pre-Deployment

- Profile baseline performance
- Identify primary bottleneck (model, data, system)
- Apply compiler optimizations (TensorRT, ONNX)
- Evaluate quantization (INT8 is usually safe)
- Tune batch size for target throughput
- Test accuracy after optimization
### Deployment

- Configure appropriate hardware
- Enable caching where applicable
- Set up monitoring (latency, throughput, errors)
- Configure auto-scaling policies
- Implement graceful degradation
### Post-Deployment

- Monitor P99 latency
- Track accuracy metrics
- Analyze cache hit rates
- Review cost efficiency
- Plan iterative improvements
## Related Skills

- `llm-serving-patterns` - LLM-specific serving optimization
- `ml-system-design` - End-to-end ML pipeline design
- `quality-attributes-taxonomy` - Performance as a quality attribute
- `estimation-techniques` - Capacity planning for ML systems
## Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
## Last Updated

Date: 2025-12-26