vllm-deployment

Deploy vLLM for high-performance LLM inference. Covers Docker CPU/GPU deployments and cloud VM provisioning with OpenAI-compatible API endpoints.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with:

npx skills add stakpak/community-paks/stakpak-community-paks-vllm-deployment

vLLM Model Serving and Inference

Quick Start

Docker (CPU)

docker run --rm -p 8000:8000 \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32
# Access: http://localhost:8000

Docker (GPU)

docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  <vllm-gpu-image> \
  --model <model-name>
# Access: http://localhost:8000

Docker Deployment

1. Assess Hardware Requirements

| Hardware | Minimum RAM | Recommended |
|----------|-------------|-------------|
| CPU | 2x model size | 4x model size |
| GPU | Model size + 2GB | Model size + 4GB VRAM |
  • Check model documentation for specific requirements
  • Consider quantized variants to reduce memory footprint
  • Allocate 50-100GB storage for model downloads
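The table above can be turned into rough numbers. As a sizing sketch (assuming a hypothetical 7B-parameter model; real requirements vary with quantization and context length), weight memory is approximately parameters times bytes per parameter:

```shell
# Rough memory estimate for a hypothetical 7B-parameter model.
# Weights only; KV cache and activations need headroom on top.
PARAMS_B=7            # parameters, in billions
BYTES_PER_PARAM=2     # fp16/bf16 = 2 bytes; float32 = 4 bytes
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))
echo "weights: ~${WEIGHTS_GB} GB"
echo "CPU minimum (2x): ~$(( WEIGHTS_GB * 2 )) GB RAM"
echo "CPU recommended (4x): ~$(( WEIGHTS_GB * 4 )) GB RAM"
```

For the same model served in float32 (as in the CPU quick start), double the estimate.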

2. Pull the Container Image

# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>

# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>

Notes:

  • Use CPU-specific images for CPU inference
  • Use CUDA-enabled images matching your GPU architecture
  • Verify CPU instruction set compatibility (AVX512, AVX2)
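On Linux, the instruction-set bullet can be checked directly against /proc/cpuinfo (a quick sketch; kernel flag naming is assumed, e.g. avx512f for the AVX-512 foundation subset):

```shell
# Print "yes" if the given CPU flag appears in /proc/cpuinfo, else "no".
has_cpu_flag() {
  grep -qwm1 "$1" /proc/cpuinfo && echo yes || echo no
}

echo "AVX2:    $(has_cpu_flag avx2)"
echo "AVX-512: $(has_cpu_flag avx512f)"
```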

3. Configure and Run

CPU Deployment:

docker run --rm \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32 \
  --max-model-len 2048
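VLLM_CPU_OMP_THREADS_BIND pins OpenMP worker threads to specific cores; the 0-7 value above assumes an 8-core host. A sketch for deriving the range from the actual core count:

```shell
# Map a core count N to an OpenMP binding range "0-(N-1)".
bind_range() {
  if [ "$1" -gt 1 ]; then
    echo "0-$(( $1 - 1 ))"
  else
    echo "0"
  fi
}

# Bind across all cores reported by the host:
echo "VLLM_CPU_OMP_THREADS_BIND=$(bind_range "$(nproc)")"
```

Binding all cores is a starting point; reserving a core or two for the serving loop can help under load, but the right split is workload-dependent.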

GPU Deployment:

docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --max-model-len 4096
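For models too large for a single GPU, --tensor-parallel-size (see the vLLM Arguments reference below) shards the model across devices. A hedged sketch for two GPUs, with image and model names as placeholders:

```shell
docker run --rm \
  --gpus all \
  --shm-size=8g \
  -p 8000:8000 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  <vllm-gpu-image> \
  --model <model-name> \
  --tensor-parallel-size 2 \
  --dtype auto \
  --max-model-len 4096
```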

4. Verify Deployment

# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'

5. Update

docker pull <vllm-image>
docker stop <container-id>
# Re-run with same parameters
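Sketched as one sequence (this uses a detached, named container rather than the foreground --rm runs above, so the stop/replace step is deterministic; names are placeholders):

```shell
# Pull the new image, replace the running container, re-verify.
docker pull <vllm-gpu-image>
docker rm -f vllm 2>/dev/null || true
docker run -d --name vllm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name>
curl -sf http://localhost:8000/health && echo "updated and healthy"
```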

Cloud VM Deployment

1. Provision Infrastructure

# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)

# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access
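As one concrete sketch using the AWS CLI (the instance type, AMI, and key pair are placeholder assumptions; any cloud with equivalent primitives works):

```shell
# Security group allowing SSH and the vLLM API port.
SG_ID=$(aws ec2 create-security-group \
  --group-name vllm-sg --description "vLLM API" \
  --query GroupId --output text)
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" --protocol tcp --port 8000 --cidr 0.0.0.0/0

# GPU instance with a 100 GB root volume.
aws ec2 run-instances \
  --image-id <ami-id> --instance-type g5.xlarge \
  --key-name <key-name> --security-group-ids "$SG_ID" \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'
```

Restricting the 0.0.0.0/0 CIDRs to your own IP range is strongly preferable for anything beyond a short-lived test.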

2. Connect and Deploy

ssh -i <key-file> <user>@<instance-ip>

# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)

3. Verify External Access

# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models
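A fresh deployment may still be downloading or warming up the model, so a one-shot curl can fail spuriously. A small polling helper avoids that race (the retry budget is an assumption to tune; the check command is a parameter so the logic can be exercised without a live server):

```shell
# Wait for the vLLM health endpoint to come up before running smoke tests.
wait_healthy() {
  check="$1"
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $check >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    i=$(( i + 1 ))
    sleep 2
  done
  echo "gave up after ${attempts} attempts" >&2
  return 1
}

# e.g.: wait_healthy "curl -sf http://<instance-ip>:8000/health"
```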

4. Cleanup

# Stop container
docker stop <container-id>

# Terminate instance to stop costs
# Delete associated resources (volumes, security groups) if temporary

Configuration Reference

Environment Variables

| Variable | Purpose | Example |
|----------|---------|---------|
| VLLM_CPU_KVCACHE_SPACE | KV cache size in GB (CPU) | 4 |
| VLLM_CPU_OMP_THREADS_BIND | CPU core binding (CPU) | 0-7 |
| CUDA_VISIBLE_DEVICES | GPU device selection | 0,1 |
| HF_TOKEN | HuggingFace authentication | hf_xxx |
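HF_TOKEN is passed at docker run time, typically alongside a mounted model cache so downloads survive container restarts (the container-side cache path below follows the Hugging Face default used in vLLM's Docker documentation; verify it against your image):

```shell
docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  <vllm-gpu-image> \
  --model <model-name>
```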

Docker Flags

| Flag | Purpose |
|------|---------|
| --shm-size=4g | Shared memory for IPC |
| --cap-add SYS_NICE | NUMA optimization (CPU) |
| --security-opt seccomp=unconfined | Memory policy syscalls (CPU) |
| --gpus all | GPU access |
| -p 8000:8000 | Port mapping |

vLLM Arguments

| Argument | Purpose | Example |
|----------|---------|---------|
| --model | Model name/path | <model-name> |
| --dtype | Data type | float32, auto, bfloat16 |
| --max-model-len | Max context length | 2048 |
| --tensor-parallel-size | Multi-GPU parallelism | 2 |

API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /health | GET | Health check |
| /v1/models | GET | List available models |
| /v1/completions | POST | Text completion |
| /v1/chat/completions | POST | Chat completion |
| /metrics | GET | Prometheus metrics |
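The /metrics endpoint emits Prometheus text format, so a quick scrape confirms metric names before wiring up a scraper (the vllm: name prefix is an assumption based on current vLLM metric naming; check your version's actual output):

```shell
curl -s http://localhost:8000/metrics | grep '^vllm:' | head -n 5
```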

Production Checklist

  • Verify model fits in available memory
  • Configure appropriate data type for hardware
  • Set up firewall/security group rules
  • Test API endpoints before production use
  • Configure monitoring (Prometheus metrics)
  • Set up health check alerts
  • Document model and configuration used
  • Plan for model updates and rollbacks

Troubleshooting

| Issue | Solution |
|-------|----------|
| Container exits immediately | Increase RAM or use a smaller model |
| Slow inference (CPU) | Verify OMP thread-binding configuration |
| Connection refused externally | Check firewall/security group rules |
| Model download fails | Set HF_TOKEN for gated models |
| Out of memory during inference | Reduce --max-model-len or batch size |
| Port already in use | Change the host port mapping |
| Warmup takes too long | Normal for large models (1-5 min) |

