LLM Inference Scaling

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "llm-inference-scaling" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-llm-inference-scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.

When to Use This Skill

Use this skill when:

  • Handling unpredictable LLM API traffic that needs automatic scale-up and scale-down

  • Managing a fleet of vLLM or TGI inference pods on Kubernetes

  • Reducing inference costs with spot/preemptible GPU instances

  • Implementing queue-based autoscaling for batch inference jobs

  • Building a multi-model serving platform that shares GPU resources

Prerequisites

  • Kubernetes cluster with GPU nodes (NVIDIA operator installed)

  • KEDA (Kubernetes Event-Driven Autoscaler) installed

  • Prometheus with GPU metrics (dcgm-exporter or gpu-operator)

  • Helm 3+ for chart deployments

GPU Node Setup

# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set devicePlugin.enabled=true

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
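
Node labels only show that the operator has tagged the node. To confirm that a container can actually reach a GPU, a throwaway pod that runs nvidia-smi is a common check. The sketch below is an assumption and not part of the upstream skill: the pod name gpu-smoke-test and the CUDA image tag are placeholders to swap for whatever your registry provides.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  restartPolicy: Never              # run once, then stay Completed
  tolerations:
    - key: nvidia.com/gpu           # same toleration the vLLM pods use
      operator: Exists
      effect: NoSchedule
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"       # requests default to the limit
```

If the pod reaches Completed and kubectl logs gpu-smoke-test prints the GPU table, scheduling and the device plugin are working. Delete the pod afterwards so it does not linger in the namespace.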

vLLM Deployment with GPU Resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "128"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "20Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
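
As written, each new replica downloads the model from Hugging Face before it can pass its readiness probe. One way to shorten that is to mount a shared cache volume so that only the first pod pays the download cost. The sketch below is an assumption layered on the Deployment above: the claim name vllm-model-cache, the storage class shared-fs, and the 100Gi size are hypothetical, and the mount path assumes the container runs as root with the default Hugging Face cache location.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache            # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                 # assumes a storage class that supports RWX
  storageClassName: shared-fs       # hypothetical; replace with your RWX class
  resources:
    requests:
      storage: 100Gi                # assumed size
---
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    spec:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
      containers:
        - name: vllm
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface   # assumed cache location
```

ReadWriteMany matters here because replicas may land on different nodes. With a ReadWriteOnce volume, only pods on a single node could share the cache.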

KEDA Autoscaling on Prometheus Metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300  # seconds to wait before scaling to zero (N-to-1 scale-down is governed by the HPA)
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"  # scale up if >10 requests waiting
        query: |
          sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_gpu_cache_usage
        threshold: "0.8"  # scale up if KV cache >80% full
        query: |
          avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})
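
With minReplicaCount: 1 this ScaledObject never scales to zero, so replica removal is decided by the HPA that KEDA creates on its behalf. KEDA passes HPA behavior settings through under advanced.horizontalPodAutoscalerConfig. The fragment below is a sketch of damping scale-down while keeping scale-up fast; every number in it is an illustrative assumption to tune against your model load time.

```yaml
# Fragment to merge into the vllm-scaledobject spec
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # assumed: act on the highest recommendation from the last 5 min
          policies:
            - type: Pods
              value: 1                      # remove at most one replica...
              periodSeconds: 120            # ...every two minutes
        scaleUp:
          stabilizationWindowSeconds: 0     # react to queue growth immediately
          policies:
            - type: Pods
              value: 2                      # add at most two replicas per minute
              periodSeconds: 60
```

Removing GPU replicas slowly is usually the cheaper mistake: a replica that is dropped too early has to reload the model before it can serve again.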

Queue-Based Scaling (Redis + KEDA)

# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: inference-worker
            image: myapp/inference-worker:latest
            env:
              - name: REDIS_URL
                value: redis://redis:6379
              - name: QUEUE_NAME
                value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: inference-jobs
        listLength: "5"  # 1 worker per 5 queued jobs

Spot Instance Strategy

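# Mixed node pool: on-demand + spot GPUs
# Takes effect only when the cluster autoscaler runs with --expander=priority
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:              # highest value wins: prefer spot
      - .*spot.*
    10:              # fall back to on-demand
      - .*on-demand.*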

# Node affinity for spot with on-demand fallback
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [spot]
        - weight: 20
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [on-demand]

Cluster Autoscaler for GPU Nodes

# AWS EKS: enable cluster autoscaler for GPU node group
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set "rbac.serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole" \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste

# Annotate a GPU node so the autoscaler does not scale it down
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

Scaling Metrics to Monitor

# Prometheus queries for scaling decisions

# Requests waiting in vLLM queue
sum(vllm:num_requests_waiting) by (model)

# GPU KV cache utilization (>80% = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)

# Tokens per second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model)

# P99 time-to-first-token
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))
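
These queries return data only if Prometheus actually scrapes each vLLM pod; the vLLM OpenAI-compatible server exposes its metrics at /metrics on the serving port. How scraping is configured depends on your Prometheus install, which the source does not specify. The sketch below assumes annotation-based pod discovery, a convention used by the community Prometheus Helm chart rather than a Kubernetes feature, so it has no effect unless your scrape config honors these annotations.

```yaml
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"    # assumed convention; requires a matching scrape job
        prometheus.io/port: "8000"      # same port the container serves on
        prometheus.io/path: /metrics
```

The labels used in the queries (model, pod, deployment) also come from the scrape job's relabeling rules. If a query returns nothing, check which labels the series actually carry before changing thresholds.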

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Pods stuck in Pending | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase minReplicaCount |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add PodDisruptionBudget; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test query in Prometheus UI first |
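
For the spot eviction row, "graceful shutdown" in Kubernetes terms usually means giving the pod time to leave the Service endpoints and finish in-flight requests before the container is stopped. The fragment below is a sketch under assumptions: the 120 and 20 second values are illustrative, the image is assumed to ship a shell, and something (for example a node termination handler) must drain the node when the reclamation notice arrives.

```yaml
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120    # assumed; must fit inside the provider's reclamation notice
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so load balancers stop routing new requests first
                command: ["sh", "-c", "sleep 20"]
```

The preStop delay counts against terminationGracePeriodSeconds, so the time left for in-flight generations is the grace period minus the sleep.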

Best Practices

  • Set minReplicaCount: 1 to avoid cold starts; scale to 0 only for batch jobs.

  • Use PodDisruptionBudget with minAvailable: 1 to survive spot evictions.

  • Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.

  • Separate model families across node pools (A10G for 7B, A100 for 70B).

  • Use Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.
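
The PodDisruptionBudget practice above can be sketched as the manifest below. The name vllm-llama-8b-pdb is hypothetical; the selector reuses the labels from the vLLM Deployment.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama-8b-pdb           # hypothetical name
spec:
  minAvailable: 1                   # keep at least one serving replica during drains
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
```

A PodDisruptionBudget only limits voluntary evictions such as node drains. It cannot stop a node from disappearing, and with a single replica minAvailable: 1 blocks drains entirely, so it is most useful once at least two replicas are running.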

Related Skills

  • vllm-server - vLLM configuration and tuning

  • gpu-server-management - GPU node setup

  • model-serving-kubernetes - KServe

  • kubernetes-ops - Core Kubernetes

  • llm-cost-optimization - Cost strategies

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
