# AI/ML Infrastructure
Model serving with KubeAI, GPU scheduling, and inference patterns.
## Model Deployment Options

| Feature | KubeAI | Ollama Operator | LlamaStack |
|---|---|---|---|
| Backend | vLLM (GPU-optimized) | Ollama (easy) | Multi-backend |
| Scale from zero | Yes | No | No |
| OpenAI API | Native | Compatible | Compatible |
| Best for | Production GPU | CPU/mixed | Full AI stack |
## KubeAI Setup

### Model CRD

```yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3-8b
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "ollama://llama3.1:8b"
  engine: OLlama
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0      # Scale to zero when idle
  maxReplicas: 3
  targetRequests: 10  # Scale-up threshold
```
### Resource Profiles

| Profile | GPUs | VRAM | Use Case |
|---|---|---|---|
| `cpu` | 0 | – | Embeddings, small models |
| `nvidia-gpu-l4:1` | 1x L4 | 24GB | 8B models |
| `nvidia-gpu-h100:1` | 1x H100 | 80GB | 70B models (quantized) |
| `nvidia-gpu-h100:2` | 2x H100 | 160GB | Large models |
### Custom Resource Profile

```yaml
resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      nvidia.com/gpu.product: "NVIDIA-L4"
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      nvidia.com/gpu: "1"
      cpu: "8"
      memory: "32Gi"
```
## Accessing Models

### OpenAI-Compatible API

```bash
# Port-forward the KubeAI service
kubectl port-forward svc/kubeai -n kubeai 8000:80

# List models
curl http://localhost:8000/openai/v1/models

# Chat completion
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### In-Cluster Access

```yaml
env:
  - name: OPENAI_API_BASE
    value: "http://kubeai.kubeai.svc/openai/v1"
  - name: OPENAI_API_KEY
    value: "not-needed"  # KubeAI doesn't require auth
```
### SDK Usage

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Hello!" }],
});
```
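Streaming works through the same endpoint. A minimal sketch reusing the env vars from the in-cluster example above (the variable names are conventions for the OpenAI SDK, not something KubeAI requires):

```typescript
import OpenAI from "openai";

// Configure the client from the injected env vars.
const client = new OpenAI({
  baseURL: process.env.OPENAI_API_BASE,
  apiKey: process.env.OPENAI_API_KEY ?? "not-needed",
});

// Stream tokens as they arrive instead of waiting for the full completion.
const stream = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```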
## GPU Operator

The NVIDIA GPU Operator manages GPU drivers, the device plugin, and GPU node labels such as `nvidia.com/gpu.product`.
### Verify GPU Nodes

```bash
# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.product

# Check GPU allocations
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Check device plugin
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
```
### GPU Pod Scheduling

```yaml
spec:
  containers:
    - name: gpu-app
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-L4"
```
## Model Selection Guide

| Model | Size | GPU Req | Best For |
|---|---|---|---|
| `llama3.1:8b` | 8B | 1x L4 | General, coding |
| `llama3.1:70b` | 70B | 2x H100 | Complex reasoning |
| `qwen2.5-coder` | 7B | 1x L4 | Code generation |
| `nomic-embed-text` | 137M | CPU | Embeddings |
| `deepseek-r1` | 1.5B | CPU | Light reasoning |
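The embedding model in the table is served through the same OpenAI-compatible endpoint, under `/embeddings`. A minimal sketch, assuming `nomic-embed-text` has been deployed as a KubeAI Model with the `TextEmbedding` feature:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

// One embedding vector comes back per input string.
const { data } = await client.embeddings.create({
  model: "nomic-embed-text",
  input: ["KubeAI serves models behind an OpenAI-compatible API."],
});

console.log(data[0].embedding.length); // vector dimensionality
```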
## Ollama Operator (Alternative)

Simpler setup for Ollama models:

```yaml
apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi4
  namespace: ollama-operator-system
spec:
  image: phi4
  resources:
    limits:
      nvidia.com/gpu: "1"
```
Access:

```bash
kubectl port-forward svc/ollama-model-phi4 -n ollama-operator-system 11434:11434
ollama run phi4
```
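The service also speaks Ollama's native REST API, so you can call it directly instead of using the CLI. A sketch against the port-forward above (`/api/generate` with `stream: false` returns a single JSON object):

```typescript
// Call the port-forwarded Ollama server's native generate endpoint.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model: "phi4", prompt: "Hello!", stream: false }),
});

const body = await res.json();
console.log(body.response); // the generated text
```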
## Validation Commands

```bash
# Check KubeAI models
kubectl get models -n kubeai
kubectl describe model <name> -n kubeai

# Check model pods
kubectl get pods -n kubeai -l app.kubernetes.io/name=kubeai

# Check GPU utilization
kubectl exec -n kubeai <pod> -- nvidia-smi

# Test API (from inside the cluster)
curl http://kubeai.kubeai.svc/openai/v1/models
```
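For CI or smoke tests, the same endpoint can be polled until a model shows up. A hypothetical helper (`waitForModel` is not part of KubeAI; it just reads the OpenAI-style list-models response):

```typescript
// Poll the OpenAI-compatible /models endpoint until `model` is listed,
// which confirms the proxy is up and the Model resource is registered.
async function waitForModel(baseURL: string, model: string, timeoutMs = 120_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${baseURL}/models`);
    if (res.ok) {
      const { data } = await res.json();
      if (data.some((m: { id: string }) => m.id === model)) return;
    }
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error(`model ${model} not listed after ${timeoutMs}ms`);
}

await waitForModel("http://kubeai.kubeai.svc/openai/v1", "llama-3-8b");
```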
## Troubleshooting

### Model not starting

```bash
# Check model status
kubectl describe model <name> -n kubeai

# Check pod events
kubectl get events -n kubeai --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n kubeai -l model=<name>
```
### Out of memory (OOM)

Reduce the context length or GPU memory utilization via engine args (these are vLLM flags):

```yaml
spec:
  args:
    - --max-model-len=4096           # Reduce from 8192
    - --gpu-memory-utilization=0.8   # Reduce from 0.9
```
### Slow first response

Set `minReplicas` to keep the model warm:

```yaml
spec:
  minReplicas: 1  # Always keep one replica running
```
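If you keep `minReplicas: 0`, you can instead pre-warm from the client side before user traffic arrives: the first request triggers scale-up, so a throwaway request hides the cold start. A sketch:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

// A tiny request is enough to trigger scale-from-zero; the reply is discarded.
await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "warmup" }],
  max_tokens: 1,
});
```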
## Best Practices

- **Use scale-from-zero** - Set `minReplicas: 0` to save resources
- **Right-size GPU profiles** - Don't over-allocate expensive GPUs
- **Use vLLM for production** - Better throughput than Ollama
- **Monitor GPU memory** - Set an appropriate `gpu-memory-utilization`
- **Keep frequently used models warm** - `minReplicas: 1`
- **Use the OpenAI-compatible API** - Easy integration with existing code