Model Serving on Kubernetes

Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.

When to Use This Skill

Use this skill when:

Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
Implementing canary deployments and A/B testing for ML models
Autoscaling inference pods based on request rate or GPU metrics
Deploying LLMs with Triton or KServe on Kubernetes
Managing multiple model versions with traffic splitting

Prerequisites

Kubernetes 1.28+ with GPU nodes
KServe installed (or Triton standalone)
kubectl and helm configured
NVIDIA GPU Operator installed on cluster

KServe Installation

Install KServe with Helm

helm repo add kserve https://kserve.github.io/helm-charts helm repo update

helm install kserve kserve/kserve
--namespace kserve
--create-namespace
--set kserve.controller.gateway.ingressGateway.className=nginx

Verify

kubectl get pods -n kserve kubectl get crd | grep kserve

Basic InferenceService (KServe)

apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: sklearn-iris namespace: models spec: predictor: sklearn: storageUri: gs://kfserving-examples/models/sklearn/1.0/model resources: requests: cpu: "1" memory: 2Gi limits: cpu: "2" memory: 4Gi

kubectl apply -f inference-service.yaml

Get inference service URL

kubectl get inferenceservice sklearn-iris -n models

NAME URL READY ...

sklearn-iris http://sklearn-iris.models.example.com True

Test prediction

curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict
-H "Content-Type: application/json"
-d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'

GPU-Enabled LLM InferenceService

apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: llama-3-8b namespace: models annotations: serving.kserve.io/enable-prometheus-scraping: "true" spec: predictor: containers: - name: vllm-container image: vllm/vllm-openai:latest args: - "--model" - "meta-llama/Llama-3.1-8B-Instruct" - "--tensor-parallel-size" - "1" - "--gpu-memory-utilization" - "0.90" ports: - containerPort: 8080 protocol: TCP resources: requests: nvidia.com/gpu: "1" memory: "20Gi" cpu: "4" limits: nvidia.com/gpu: "1" memory: "24Gi" cpu: "8" readinessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 60 periodSeconds: 10 env: - name: HUGGING_FACE_HUB_TOKEN valueFrom: secretKeyRef: name: hf-token key: token nodeSelector: nvidia.com/gpu.present: "true" transformer: containers: - name: kserve-container image: kserve/kserve-transformer:latest

Canary Deployment (Traffic Splitting)

apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: llama-3-8b namespace: models spec: predictor: canaryTrafficPercent: 20 # 20% to new version, 80% to stable containers: - name: vllm-container image: vllm/vllm-openai:latest args: - "--model" - "meta-llama/Llama-3.1-8B-Instruct-v2" # new model version resources: limits: nvidia.com/gpu: "1"

Gradually increase canary traffic

kubectl patch inferenceservice llama-3-8b -n models
--type='json'
-p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'

Promote canary to stable

kubectl patch inferenceservice llama-3-8b -n models
--type='json'
-p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'

Autoscaling with KEDA

apiVersion: keda.sh/v1alpha1 kind: ScaledObject metadata: name: llama-scaler namespace: models spec: scaleTargetRef: apiVersion: serving.kserve.io/v1beta1 kind: InferenceService name: llama-3-8b minReplicaCount: 1 maxReplicaCount: 5 triggers:

type: prometheus metadata: serverAddress: http://prometheus-server.monitoring:9090 metricName: kserve_request_count threshold: "10" query: | sum(rate(kserve_request_count_total{namespace="models", service="llama-3-8b"}[1m]))

NVIDIA Triton Inference Server

apiVersion: apps/v1 kind: Deployment metadata: name: triton-server namespace: models spec: replicas: 2 selector: matchLabels: app: triton template: metadata: labels: app: triton spec: containers: - name: triton image: nvcr.io/nvidia/tritonserver:24.05-py3 args: - "tritonserver" - "--model-store=s3://my-model-store/models" - "--model-control-mode=poll" # auto-load new model versions - "--repository-poll-secs=30" - "--metrics-port=8002" ports: - containerPort: 8000 # HTTP - containerPort: 8001 # gRPC - containerPort: 8002 # Metrics resources: limits: nvidia.com/gpu: "1" readinessProbe: httpGet: path: /v2/health/ready port: 8000 initialDelaySeconds: 30

Triton Model Repository Structure

s3://my-model-store/models/ ├── text-classifier/ │ ├── config.pbtxt │ ├── 1/ │ │ └── model.onnx │ └── 2/ │ └── model.onnx # new version; auto-loaded ├── embedding-model/ │ ├── config.pbtxt │ └── 1/ │ └── model.onnx

config.pbtxt for ONNX model

name: "text-classifier" backend: "onnxruntime" max_batch_size: 64 dynamic_batching { preferred_batch_size: [16, 32] max_queue_delay_microseconds: 1000 } input [ { name: "input_ids" data_type: TYPE_INT64 dims: [-1] } { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] } ] output [ { name: "logits" data_type: TYPE_FP32 dims: [-1] } ] instance_group [ { kind: KIND_GPU count: 2 } # 2 model instances on GPU ]

Model Management Commands

List loaded models (Triton)

curl http://triton:8000/v2/models

Load a new model version

curl -X POST http://triton:8000/v2/repository/models/text-classifier/load

Unload a model

curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload

KServe — watch rollout status

kubectl rollout status deployment/llama-3-8b-predictor -n models kubectl get inferenceservice llama-3-8b -n models -w

Common Issues

Issue Cause Fix

InferenceService not ready

Model loading or OOM Check predictor pod logs; increase memory limits

Canary stuck at 0% KNative routing issue Check kubectl get ksvc -n models

Triton missing model S3 permissions or path Verify IAM role; check --model-store path

Low GPU utilization Dynamic batching off Enable dynamic_batching in Triton config

Autoscaler not triggering Prometheus query wrong Test query in Prometheus UI

Best Practices

Use canary deployments for all model updates — roll back in seconds if metrics degrade.
Enable Triton dynamic batching — it can increase GPU throughput 5–10× for small models.
Store models in S3/GCS with versioned paths (s3://bucket/model/v1/ , v2/ ).
Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
Monitor p99 latency and error rates per model version during canary rollouts.

Related Skills

vllm-server - vLLM for LLM serving
llm-inference-scaling - KEDA autoscaling
kubernetes-ops - Core Kubernetes operations
gpu-server-management - GPU nodes

model-serving-kubernetes

Safety Notice

Copy this and send it to your AI assistant to learn