# ML Inference Optimization

## When to Use This Skill

Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
**Keywords:** inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
## Inference Optimization Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│                     Inference Optimization Stack                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                          Model Level                         │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Compiler Level                        │  │
│  │    Graph optimization │ Operator fusion │ Memory planning    │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Runtime Level                         │  │
│  │   Batching │ Caching │ Async execution │ Multi-threading     │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│                                ▼                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                        Hardware Level                        │  │
│  │     GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
## Model Compression Techniques

### Technique Overview

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|-----------|----------------|-------------------|-----------------|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
### Knowledge Distillation
```
┌─────────────────────────────────────────────────────────────────────┐
│                        Knowledge Distillation                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐                                                  │
│   │ Teacher Model│  (Large, accurate, slow)                         │
│   │    GPT-4     │                                                  │
│   └──────────────┘                                                  │
│          │                                                          │
│          ▼  Soft labels (probability distributions)                 │
│   ┌─────────────────────────────────────────────────────────────┐  │
│   │                      Training Process                       │  │
│   │  Loss = α × CrossEntropy(student, hard_labels)              │  │
│   │       + (1-α) × KL_Div(student, teacher_soft_labels)        │  │
│   └─────────────────────────────────────────────────────────────┘  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────┐                                                  │
│   │Student Model │  (Small, nearly as accurate, fast)               │
│   │  DistilBERT  │                                                  │
│   └──────────────┘                                                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
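A minimal PyTorch sketch of this combined loss, assuming logit-shaped inputs; the temperature `T`, weight `alpha`, and the `T*T` gradient-scaling factor are conventional choices rather than prescribed by the diagram:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combined hard-label + soft-label loss from the diagram above."""
    # Hard-label term: ordinary cross-entropy against ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence to the temperature-softened teacher.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard scaling so soft-target gradients stay comparable across T
    return alpha * hard + (1 - alpha) * soft
```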
Distillation Types:

| Type | Description | Use Case |
|------|-------------|----------|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
### Pruning Strategies
Unstructured Pruning (Weight-level):

```
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0,  0.7]   (50% sparse)
```

- Flexible, high sparsity possible
- Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):

```
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘

After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│   (Removed C2 entirely)
        └───┴───┴───┘
```

- Works with standard hardware
- Lower compression ratio
Pruning Decision Criteria:

| Method | Description | Effectiveness |
|--------|-------------|---------------|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
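A short sketch of magnitude-based (L1) unstructured pruning using PyTorch's built-in pruning utilities; the layer size and the 50% sparsity level are illustrative:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(256, 256)
# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)
# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")
print(f"sparsity: {(layer.weight == 0).float().mean():.2f}")  # ≈ 0.50
```

Note that without sparse kernels this only shrinks the effective parameter count, not the dense compute; the speedups in the table above assume sparse-aware hardware or libraries.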
### Quantization (Detailed)

Precision Hierarchy:

```
FP32 (32 bits):  ████████████████████████████████
FP16 (16 bits):  ████████████████
BF16 (16 bits):  ████████████████   (different mantissa/exponent split)
INT8 (8 bits):   ████████
INT4 (4 bits):   ████
Binary (1 bit):  █
```

Memory and compute scale roughly proportionally with bit width.
Quantization Approaches:

| Approach | When Applied | Quality | Effort |
|----------|--------------|---------|--------|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training, with calibration | Better | Medium |
| Quantization-aware training (QAT) | During training | Best | High |
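A minimal sketch of the lowest-effort row, post-training dynamic quantization in PyTorch (model architecture and shapes are illustrative). Weights are stored as INT8 and activations are quantized on the fly:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

# Replace Linear layers with dynamically quantized INT8 equivalents.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = qmodel(torch.randn(1, 512))
```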
## Compiler-Level Optimization

### Graph Optimization
Original Graph:

```
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
```

Optimized Graph (Operator Fusion):

```
Input → FusedConvBNReLU → FusedConvBNReLU → Output
```

Benefits:
- Fewer kernel launches
- Better memory locality
- Reduced memory bandwidth
### Common Optimizations

| Optimization | Description | Speedup |
|--------------|-------------|---------|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
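In practice these passes are applied automatically by the inference compiler. A minimal sketch with ONNX Runtime, assuming an exported `model.onnx` file whose input tensor is named `input`:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph-level optimizations (fusion, constant folding, etc.).
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "model_opt.onnx"  # persist the fused graph

session = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
```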
### Optimization Frameworks

| Framework | Vendor | Best For |
|-----------|--------|----------|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
## Runtime Optimization

### Batching Strategies
No Batching:

```
Request 1: [Process] → Response 1   10ms
Request 2: [Process] → Response 2   10ms
Request 3: [Process] → Response 3   10ms
Total: 30ms, GPU underutilized
```

Dynamic Batching:

```
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
```

Trade-off: latency vs. throughput
- Larger batches: higher throughput, higher latency
- Smaller batches: lower latency, lower throughput
Batching Parameters (a minimal batcher sketch follows the table):

| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| `batch_size` | Maximum batch size | Throughput vs. latency |
| `max_wait_time` | Wait time for batch fill | Latency vs. efficiency |
| `min_batch_size` | Minimum before processing | Latency predictability |
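A minimal asyncio sketch of dynamic batching, assuming a hypothetical `run_model` function that processes a list of inputs in a single call; the knob values mirror the parameters above and are illustrative:

```python
import asyncio

MAX_BATCH_SIZE = 32   # batch_size
MAX_WAIT_S = 0.005    # max_wait_time (5 ms)

requests: asyncio.Queue = asyncio.Queue()

async def infer(x):
    """Client-facing call: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await requests.put((x, fut))
    return await fut

async def batcher(run_model):
    """Collect requests until the batch fills or the wait deadline passes."""
    while True:
        batch = [await requests.get()]  # block until the first request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(requests.get(), timeout))
            except asyncio.TimeoutError:
                break
        inputs, futures = zip(*batch)
        # One batched model call serves every waiting request.
        for fut, out in zip(futures, run_model(list(inputs))):
            fut.set_result(out)
```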
### Caching Strategies

```
┌─────────────────────────────────────────────────────────────────────┐
│                       Inference Caching Layers                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                  │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities      │   │
│  │ Hit rate: Medium (common tokens repeat)                     │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                         │   │
│  │ Hit rate: High (reuse across tokens in sequence)            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                 │   │
│  │ Hit rate: Variable (depends on query distribution)          │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Semantic Caching for LLMs:

```
Query: "What's the capital of France?"
        ↓
  Hash + embed query
        ↓
  Search cache (similarity > threshold)
        ↓
  ├── Hit:  Return cached response
  └── Miss: Generate → Cache → Return
```
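A small sketch of the idea, assuming a hypothetical `embed` function that returns unit-normalized query vectors and a hypothetical `generate` function for the expensive model call; the 0.92 threshold is illustrative:

```python
import numpy as np

SIM_THRESHOLD = 0.92  # minimum cosine similarity to count as a hit

class SemanticCache:
    def __init__(self):
        self.vectors = []    # unit-normalized query embeddings
        self.responses = []  # cached model outputs

    def answer(self, query, embed, generate):
        vec = embed(query)
        if self.vectors:
            # Dot product of unit vectors = cosine similarity.
            sims = np.stack(self.vectors) @ vec
            best = int(np.argmax(sims))
            if sims[best] >= SIM_THRESHOLD:
                return self.responses[best]  # hit: skip the model entirely
        response = generate(query)           # miss: run the model
        self.vectors.append(vec)
        self.responses.append(response)
        return response
```

A production version would use an approximate nearest-neighbor index rather than a linear scan, plus eviction and staleness policies.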
### Async and Parallel Execution

Sequential:

```
┌─────┐   ┌─────┐   ┌─────┐
│Prep │ → │Model│ → │Post │   Total: 30ms
│10ms │   │15ms │   │5ms  │
└─────┘   └─────┘   └─────┘
```

Pipelined:

```
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│
```

Throughput: 3x higher. Latency per request: unchanged.
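A minimal threaded sketch of this pipelined layout, with placeholder stage functions standing in for real preprocessing, inference, and postprocessing:

```python
import queue
import threading

def preprocess(r):  return r   # placeholder for the ~10ms stage
def run_model(x):   return x   # placeholder for the ~15ms stage
def postprocess(y): return y   # placeholder for the ~5ms stage

def stage_worker(fn, inq, outq):
    """Run one stage in its own thread so stages overlap across requests."""
    while True:
        item = inq.get()
        if item is None:       # sentinel: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(item))

queues = [queue.Queue() for _ in range(4)]
for fn, inq, outq in zip([preprocess, run_model, postprocess], queues, queues[1:]):
    threading.Thread(target=stage_worker, args=(fn, inq, outq), daemon=True).start()

for request in ["r1", "r2", "r3"]:
    queues[0].put(request)
queues[0].put(None)
while (result := queues[-1].get()) is not None:
    print(result)
```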
## Hardware Acceleration

### Hardware Comparison

| Hardware | Strengths | Limitations | Best For |
|----------|-----------|-------------|----------|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large-batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited model support | Mobile, edge |
| CPU | Flexible, widely available | Slower for ML | Low-batch, CPU-bound workloads |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
### GPU Optimization

| Optimization | Description | Impact |
|--------------|-------------|--------|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
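As a small example of the first row, a PyTorch sketch that runs inference under FP16 autocast so eligible ops can use Tensor Cores (model and input are illustrative; requires a CUDA device):

```python
import torch

model = torch.nn.Linear(1024, 1024).eval().cuda()
x = torch.randn(8, 1024, device="cuda")

# inference_mode disables autograd; autocast runs eligible ops in FP16.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    y = model(x)
```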
## Edge Deployment

### Edge Constraints

```
┌─────────────────────────────────────────────────────────────────────┐
│                     Edge Deployment Constraints                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory:  1-4 GB      (vs. 64+ GB cloud)                        │
│  ├── Compute: 1-10 TOPS   (vs. 100+ TFLOPS cloud)                   │
│  ├── Power:   5-15W       (vs. 300W+ cloud)                         │
│  └── Storage: 16-128 GB   (vs. TB cloud)                            │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                             │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Edge Optimization Strategies

| Strategy | Description | Use When |
|----------|-------------|----------|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to a tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
### Edge ML Frameworks

| Framework | Platform | Features |
|-----------|----------|----------|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
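A minimal TensorFlow Lite conversion sketch, assuming a SavedModel at an illustrative `saved_model_dir` path; `Optimize.DEFAULT` enables the converter's standard optimizations, including weight quantization:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```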
## Latency Profiling

### Profiling Methodology

```
┌─────────────────────────────────────────────────────────────────────┐
│                      Latency Breakdown Analysis                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:            ████████░░░░░░░░░░  15%                │
│  2. Preprocessing:           ██████░░░░░░░░░░░░  10%                │
│  3. Model Inference:         ████████████████░░  60%                │
│  4. Postprocessing:          ████░░░░░░░░░░░░░░   8%                │
│  5. Response Serialization:  ███░░░░░░░░░░░░░░░   7%                │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Profiling Tools

| Tool | Use For |
|------|---------|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
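A minimal PyTorch Profiler sketch (model and input are illustrative; add `ProfilerActivity.CUDA` when profiling on GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Show the most expensive operators first.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```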
### Key Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meets demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
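The percentile targets come straight from recorded per-request latencies, e.g.:

```python
import numpy as np

latencies_ms = np.array([12.1, 9.8, 11.4, 48.0, 10.2, 11.9])  # illustrative sample
p50, p99 = np.percentile(latencies_ms, [50, 99])
print(f"P50 = {p50:.1f} ms, P99 = {p99:.1f} ms")
```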
## Optimization Workflow

### Systematic Approach

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Optimization Workflow                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy)│
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use the right accelerator                         │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)             │
│     ├── Runtime:  Batching, caching, async                          │
│     ├── Model:    Quantization, pruning                             │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Optimization Priority Matrix

```
                           High Impact
                                │
          Compiler Opts ────────┼──────── Quantization
           (easy win)           │          (best ROI)
                                │
Low Effort ─────────────────────┼───────────────────── High Effort
                                │
               Batching ────────┼──────── Distillation
             (quick win)        │         (major effort)
                                │
                           Low Impact
```
## Common Patterns

### Multi-Model Serving

```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Request → ┌─────────┐                                             │
│             │ Router  │                                             │
│             └─────────┘                                             │
│                  │                                                  │
│        ┌─────────┼─────────┐                                        │
│        ▼         ▼         ▼                                        │
│   ┌───────┐ ┌───────┐ ┌───────┐                                     │
│   │ Tiny  │ │ Small │ │ Large │                                     │
│   │ <10ms │ │ <50ms │ │<500ms │                                     │
│   └───────┘ └───────┘ └───────┘                                     │
│                                                                     │
│   Routing strategies:                                               │
│   • Complexity-based: Simple→Tiny, Complex→Large                    │
│   • Confidence-based: Try Tiny, escalate if low confidence          │
│   • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
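A sketch of the confidence-based strategy, assuming hypothetical `tiny_model` and `large_model` callables that return `(prediction, confidence)` pairs; the threshold is illustrative:

```python
CONFIDENCE_THRESHOLD = 0.9

def route(x, tiny_model, large_model):
    pred, conf = tiny_model(x)       # fast path (<10ms)
    if conf >= CONFIDENCE_THRESHOLD:
        return pred                  # confident enough: stop here
    pred, _ = large_model(x)         # escalate to the accurate model
    return pred
```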
### Speculative Execution

```
Query: "Translate: Hello"
  │
  ├──▶ Small model (draft):  "Bonjour"        (5ms)
  │
  └──▶ Large model (verify): Check "Bonjour"  (10ms, in parallel)
         │
         ├── Accept: Return immediately
         └── Reject: Generate with large model
```
Speedup: 2-3x when drafts are often accepted
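A highly simplified greedy sketch of the draft-then-verify loop, with hypothetical `draft_next` and `verify_next` single-token functions for the small and large models. Real systems verify the whole draft block in one batched large-model forward pass rather than token by token:

```python
def speculative_generate(prompt, draft_next, verify_next, k=4, max_len=32):
    tokens = list(prompt)
    while len(tokens) < max_len:
        # Small model drafts k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Large model checks the draft; keep the longest agreeing prefix.
        kept = 0
        for i in range(k):
            if verify_next(tokens + draft[:i]) != draft[i]:
                break
            kept += 1
        tokens += draft[:kept]
        if kept < k:
            # First disagreement: take the large model's token instead.
            tokens.append(verify_next(tokens))
    return tokens
```

With greedy acceptance the output is identical to what the large model alone would produce; the speedup comes from verifying several draft tokens per large-model pass.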
### Cascade Models

```
Input → ┌────────┐
        │ Filter │  ← Cheap filter (reject obvious negatives)
        └────────┘
            │ (candidates only)
            ▼
        ┌────────┐
        │ Stage 1│  ← Fast model (coarse ranking)
        └────────┘
            │ (top-100)
            ▼
        ┌────────┐
        │ Stage 2│  ← Accurate model (fine ranking)
        └────────┘
            │ (top-10)
            ▼
          Output
```
Benefit: roughly 10x cheaper at similar accuracy, since the expensive model only scores a small fraction of candidates.
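A compact sketch of the cascade, with hypothetical `cheap_filter`, `fast_score`, and `accurate_score` functions standing in for the three stages:

```python
def cascade_rank(candidates, cheap_filter, fast_score, accurate_score):
    survivors = [c for c in candidates if cheap_filter(c)]           # reject obvious negatives
    top100 = sorted(survivors, key=fast_score, reverse=True)[:100]   # coarse ranking
    top10 = sorted(top100, key=accurate_score, reverse=True)[:10]    # fine ranking
    return top10
```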
## Optimization Checklist

### Pre-Deployment

- Profile baseline performance
- Identify primary bottleneck (model, data, system)
- Apply compiler optimizations (TensorRT, ONNX)
- Evaluate quantization (INT8 is usually safe)
- Tune batch size for target throughput
- Test accuracy after optimization
### Deployment

- Configure appropriate hardware
- Enable caching where applicable
- Set up monitoring (latency, throughput, errors)
- Configure auto-scaling policies
- Implement graceful degradation
### Post-Deployment

- Monitor P99 latency
- Track accuracy metrics
- Analyze cache hit rates
- Review cost efficiency
- Plan iterative improvements
## Related Skills

- `llm-serving-patterns` - LLM-specific serving optimization
- `ml-system-design` - End-to-end ML pipeline design
- `quality-attributes-taxonomy` - Performance as a quality attribute
- `estimation-techniques` - Capacity planning for ML systems
## Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
## Last Updated

Date: 2025-12-26