# LLM Inference Scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.
## When to Use This Skill

Use this skill when:

- LLM API traffic is unpredictable and you need to scale up and down automatically
- Managing a fleet of vLLM or TGI inference pods on Kubernetes
- Reducing inference costs with spot/preemptible GPU instances
- Implementing queue-based autoscaling for batch inference jobs
- Building a multi-model serving platform that shares GPU resources
## Prerequisites

- Kubernetes cluster with GPU nodes (NVIDIA GPU Operator installed)
- KEDA (Kubernetes Event-Driven Autoscaler) installed
- Prometheus scraping GPU metrics (`dcgm-exporter` or the GPU Operator's bundled DCGM exporter)
- Helm 3+ for chart deployments
## GPU Node Setup

Install the NVIDIA GPU Operator, which manages drivers, the container toolkit, the device plugin, and DCGM monitoring:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set devicePlugin.enabled=true
```

Verify GPU nodes are recognized:

```bash
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
```
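To confirm scheduling works end to end, a throwaway pod can request one GPU and run `nvidia-smi`; the pod name and CUDA image tag below are illustrative, not part of the operator install:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed available tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

`kubectl logs gpu-smoke-test` should print the familiar `nvidia-smi` table if the driver, toolkit, and device plugin are all healthy.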
## vLLM Deployment with GPU Resources

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "128"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "20Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
```
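A rough back-of-envelope shows how `--gpu-memory-utilization` and `--max-num-seqs` interact: whatever the utilization budget leaves after the model weights becomes KV-cache space, which caps how many sequences can really run concurrently. The sketch below uses approximate Llama-3.1-8B figures (32 layers, 8 KV heads with GQA, head dim 128, ~16 GiB fp16 weights) and ignores activation overhead and cache-block rounding; the function names are illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer, per token (fp16 = 2 bytes/element).
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrent_seqs(gpu_mem_gib, gpu_mem_utilization, weights_gib,
                        avg_seq_len, per_token_bytes):
    # vLLM claims gpu_mem_gib * utilization in total; what is left after
    # the weights becomes PagedAttention KV-cache space.
    usable = (gpu_mem_gib * gpu_mem_utilization - weights_gib) * 1024**3
    return int(usable / per_token_bytes // avg_seq_len)

per_tok = kv_cache_bytes_per_token(32, 8, 128)           # 131072 bytes (128 KiB)
seqs = max_concurrent_seqs(24, 0.90, 16, 2048, per_tok)  # ~22 on a 24 GiB A10G
```

With ~22 truly concurrent 2048-token sequences on a 24 GiB card, a `--max-num-seqs` of 128 mostly governs queue admission, not parallel execution, which is why the KV-cache metrics below matter for scaling.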
KEDA Autoscaling on Prometheus Metrics
apiVersion: keda.sh/v1alpha1 kind: ScaledObject metadata: name: vllm-scaledobject spec: scaleTargetRef: name: vllm-llama-8b minReplicaCount: 1 maxReplicaCount: 8 cooldownPeriod: 300 # 5 min before scale-down pollingInterval: 15 triggers:
- type: prometheus metadata: serverAddress: http://prometheus-server.monitoring:9090 metricName: vllm_num_requests_waiting threshold: "10" # scale up if >10 requests waiting query: | sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
- type: prometheus metadata: serverAddress: http://prometheus-server.monitoring:9090 metricName: vllm_gpu_cache_usage threshold: "0.8" # scale up if KV cache >80% full query: | avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})
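To reason about what a given `threshold` does, it helps to know the effective formula: KEDA exposes the query result as an external metric with an average-value target, so the HPA computes roughly `ceil(metricValue / threshold)`, clamped to the replica bounds. A minimal sketch of that arithmetic (function name is illustrative):

```python
import math

def desired_replicas(metric_value, threshold, min_replicas, max_replicas):
    # Effective HPA computation for a KEDA Prometheus trigger with an
    # AverageValue target: ceil(query result / threshold), then clamp.
    desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(desired, max_replicas))

desired_replicas(35, 10, 1, 8)   # 35 waiting requests / threshold 10 -> 4
desired_replicas(0, 10, 1, 8)    # idle -> floor at minReplicaCount
```

With multiple triggers, as above, KEDA takes the maximum of the per-trigger desired counts, so either a long queue or a hot KV cache can force a scale-up.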
## Queue-Based Scaling (Redis + KEDA)

A `ScaledJob` spins up worker pods for async batch inference and scales back to zero when the queue drains:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: inference-worker
            image: myapp/inference-worker:latest
            env:
              - name: REDIS_URL
                value: redis://redis:6379
              - name: QUEUE_NAME
                value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: inference-jobs
        listLength: "5"    # 1 worker per 5 queued jobs
```
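The worker image above is not specified by this skill; a minimal sketch of its core loop is shown below. The pop function is injected so the drain logic is testable; in a real worker it would wrap redis-py's `r.blpop("inference-jobs", timeout=5)` and return the payload (or `None` on timeout), so that an empty queue lets the Job exit and the `ScaledJob` settle back to zero. All names here are illustrative.

```python
import json
from collections import deque

def drain_queue(pop_job, handle_job):
    """Pop and process jobs until pop_job returns None (queue empty)."""
    processed = 0
    while True:
        raw = pop_job()
        if raw is None:
            return processed          # queue drained: exit so the Job completes
        job = json.loads(raw)         # jobs assumed to be JSON payloads
        handle_job(job)
        processed += 1

# In-memory stand-in for the Redis list, for illustration only:
queue = deque(json.dumps({"prompt": p}) for p in ["a", "b", "c"])
results = []
n = drain_queue(lambda: queue.popleft() if queue else None,
                lambda job: results.append(job["prompt"]))
```

Exiting cleanly on an empty queue matters: a worker that blocks forever keeps its Job alive and defeats scale-to-zero.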
## Spot Instance Strategy

Run a mixed node pool (on-demand + spot GPUs) and steer the cluster autoscaler toward spot capacity with the priority expander. The expander requires the autoscaler to run with `--expander=priority` and reads a ConfigMap that must be named `cluster-autoscaler-priority-expander`; a *higher* number means higher priority:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:                  # higher priority = preferred
      - .*spot.*
    10:                  # on-demand fallback
      - .*on-demand.*
```

At the pod level, node affinity prefers spot nodes while allowing on-demand fallback:

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [spot]
        - weight: 20
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [on-demand]
```
## Cluster Autoscaler for GPU Nodes

On AWS EKS, install the cluster autoscaler for the GPU node group (note the escaped dots in the IRSA annotation key, which Helm would otherwise treat as nesting):

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set 'rbac.serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole' \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set 'extraArgs.expander=priority\,least-waste'   # priority honors the spot-preference ConfigMap
```
To keep the autoscaler from scaling down a specific node (e.g. one running a long batch job), annotate the node:

```bash
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
```

(The related `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation serves a similar purpose but is set on pods, not nodes.)
## Scaling Metrics to Monitor

Prometheus queries useful for scaling decisions:

```promql
# Requests waiting in the vLLM queue
sum(vllm:num_requests_waiting) by (model)

# GPU KV cache utilization (>80% = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)

# Tokens-per-second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model)

# P99 time-to-first-token
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))
```
## Common Issues

| Issue | Cause | Fix |
|---|---|---|
| Pods stuck in `Pending` | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase `minReplicaCount` |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add a `PodDisruptionBudget`; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test the query in the Prometheus UI first |
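For the spot-eviction row, a minimal `PodDisruptionBudget` might look like this, reusing the labels from the vLLM Deployment above (the PDB name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama-8b-pdb   # illustrative name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
```

A PDB limits voluntary disruptions (drains, rollouts); it cannot stop a hard spot reclamation, but it keeps planned node drains from taking every replica down at once.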
## Best Practices

- Set `minReplicaCount: 1` to avoid cold starts; scale to zero only for batch jobs.
- Use a `PodDisruptionBudget` with `minAvailable: 1` to survive spot evictions.
- Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.
- Separate model families across node pools (e.g. A10G for 7B models, A100 for 70B).
- Use the Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.
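The pre-pulled-weights practice can be sketched as a fragment of the vLLM pod spec: mount a pre-populated shared PVC over vLLM's default Hugging Face cache path so startup reads weights from disk instead of downloading them. The PVC name is an assumption, and the volume must be ReadWriteMany (or per-node) to be shared across replicas:

```yaml
# Fragment of the Deployment's pod spec (not a complete manifest)
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-weights        # assumed pre-populated RWX PVC
containers:
  - name: vllm
    volumeMounts:
      - name: model-cache
        mountPath: /root/.cache/huggingface   # vLLM's default HF cache path
```

Populating the PVC once (e.g. via a one-off Job running `huggingface-cli download`) turns every subsequent pod start into a local read.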
## Related Skills

- `vllm-server` - vLLM configuration and tuning
- `gpu-server-management` - GPU node setup
- `model-serving-kubernetes` - KServe
- `kubernetes-ops` - Core Kubernetes
- `llm-cost-optimization` - Cost strategies