vllm-server

vLLM Server Management

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "vllm-server" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-vllm-server

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source LLM serving engine built around PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

  • Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale

  • Building an OpenAI-compatible API endpoint for self-hosted models

  • Optimizing LLM throughput and latency for production traffic

  • Running multi-GPU inference with tensor or pipeline parallelism

  • Deploying quantized models to reduce GPU memory requirements

Prerequisites

  • NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)

  • Docker or Python 3.9+ with pip

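  • 40GB+ VRAM for 4-bit quantized 70B models (16-bit weights alone need about 140 GB, spread across GPUs); 8GB+ for 4-bit quantized 7B models (about 14 GB of weights at 16-bit)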

  • nvidia-container-toolkit for Docker GPU passthrough
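The VRAM figures above follow from simple arithmetic on parameter count and weight precision. A minimal sketch, assuming weights dominate memory and ignoring KV cache, activations, and runtime overhead (the helper name `weight_memory_gb` is hypothetical, not part of vLLM):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    params_billion * 1e9 parameters * (bits / 8) bytes, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


# Compare 16-bit and 4-bit weights for the two model sizes mentioned above.
for name, params_b in [("7B", 7), ("70B", 70)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params_b, bits)
        print(f"{name} @ {bits}-bit: ~{gb:.1f} GB for weights")
```

Treat the result as a lower bound: the server also needs room for the KV cache, so real deployments want noticeably more VRAM than the weight size alone.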

Quick Start

Install vLLM

pip install vllm

Serve a model (OpenAI-compatible API)

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

Test the endpoint

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}] }'
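The same request can be issued from Python with only the standard library. This is a sketch mirroring the curl call above; the helper names `build_chat_request` and `send` are hypothetical, and the response shape is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the quick start running, `send(build_chat_request("http://localhost:8000", "your-secret-key", "meta-llama/Llama-3.1-8B-Instruct", "Hello!"))` should return the model's reply.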

Docker Deployment

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key

Docker Compose (Production)

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
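Loading a 70B model can take minutes, so a deploy script may want to block until the same /health route used by the healthcheck starts answering. A minimal sketch using only the standard library; `http_ok` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 60,
                       delay_s: float = 5.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Usage would look like `wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))`. Passing the probe as a callable keeps the retry loop testable without a running server.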

Key Configuration Options

Multi-GPU Tensor Parallelism

Split one model across 4 GPUs

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
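When choosing --tensor-parallel-size, two quick checks help: the size generally has to divide the model's attention head count evenly, and each GPU holds roughly an equal share of the weights. The sketch below assumes 64 attention heads for Llama-3.1-70B (verify against the model's config.json) and about 140 GB of 16-bit weights; both helper names are hypothetical.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, gpu_count: int) -> list[int]:
    """List TP sizes up to `gpu_count` that divide the head count evenly."""
    return [tp for tp in range(1, gpu_count + 1) if num_attention_heads % tp == 0]


def weights_per_gpu_gb(total_weight_gb: float, tensor_parallel_size: int) -> float:
    """Approximate weight share per GPU, assuming an even split."""
    return total_weight_gb / tensor_parallel_size


# Assumed values: 64 heads, ~140 GB of 16-bit weights, an 8-GPU node.
for tp in valid_tensor_parallel_sizes(num_attention_heads=64, gpu_count=8):
    share = weights_per_gpu_gb(140.0, tp)
    print(f"TP={tp}: ~{share:.1f} GB of weights per GPU")
```

The per-GPU share excludes KV cache and activations, so a size that only just fits the weights will likely still run out of memory.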

Quantization (Lower VRAM)

AWQ quantization (70B on 2x A100 40GB)

vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

GPTQ quantization

vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

FP8 (H100 NVL native)

vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8

Structured Output & Tools

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines
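With auto tool choice enabled, clients send tool definitions in the OpenAI chat completions format and read any calls back from the response. A sketch of both halves; the `get_weather` tool and both helper names are hypothetical, and the response shape is assumed to match the OpenAI `tool_calls` format.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build a chat payload that offers the model one (hypothetical) tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from an OpenAI-style response.

    Arguments arrive as a JSON string, so they are decoded here.
    """
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

An empty list from `extract_tool_calls` means the model answered in plain text instead of calling a tool.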

LoRA Adapters

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
    code-lora=/path/to/code-lora \
  --max-lora-rank 64
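Each adapter registered with --lora-modules is addressed by its name in the request's "model" field, while the base model keeps its own name. A small routing sketch under that assumption; the task labels and the `payload_for_task` helper are hypothetical.

```python
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Maps a (hypothetical) task label to the adapter name given to --lora-modules.
ADAPTERS = {"sql": "sql-lora", "code": "code-lora"}


def payload_for_task(task: str, prompt: str) -> dict:
    """Pick the adapter for `task`, falling back to the base model."""
    model_name = ADAPTERS.get(task, BASE_MODEL)
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
    }


print(payload_for_task("sql", "List all users created this week.")["model"])
print(payload_for_task("chat", "Hello!")["model"])
```

The payload is then posted to /v1/chat/completions like any other request.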

Performance Tuning

Maximize throughput for batch workloads

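# --max-num-seqs 256: max concurrent sequences
# --max-num-batched-tokens 8192: tokens per batch
# --gpu-memory-utilization 0.95: use 95% of VRAM
# --swap-space 4: CPU swap space (GiB)
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4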

Minimize latency for interactive use

vllm serve <model> \
  --max-num-seqs 32 \
  --enforce-eager  # disables CUDA graph capture: faster startup and lower memory, but decoding may be slower

Benchmarking

Install benchmark tool

pip install vllm

Run a batch job from a request file (time it for a rough throughput number)

python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl
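The input file for the batch runner can be generated programmatically. This sketch assumes the runner expects one JSON object per line in the OpenAI Batch API shape (custom_id, method, url, body); the helper names and the max_tokens value are illustrative choices.

```python
import json


def batch_lines(model: str, prompts: list[str], max_tokens: int = 128) -> list[str]:
    """Return one JSON-encoded batch request per prompt."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"request-{index}",  # echoed back so results can be matched
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path: str, model: str, prompts: list[str]) -> None:
    """Write the requests as JSON Lines, one request per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(batch_lines(model, prompts)) + "\n")
```

For example, `write_batch_file("prompts.jsonl", "meta-llama/Llama-3.1-8B-Instruct", ["Hello!", "Summarize PagedAttention."])` produces a two-line input file.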

Benchmark with vllm bench

vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

Monitoring

Check running server stats

curl http://localhost:8000/metrics # Prometheus metrics

Key metrics to watch:

vllm:num_requests_running - active requests

vllm:gpu_cache_usage_perc - KV cache utilization

vllm:generation_tokens_total - generated-token counter (take its rate for throughput)

vllm:time_to_first_token_seconds - TTFT latency

vllm:e2e_request_latency_seconds - end-to-end latency
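The /metrics output is Prometheus text format, so a script can pull out the vLLM series without a full Prometheus setup. A minimal parser sketch; `SAMPLE` is hand-written illustrative input, not captured server output, and exact metric names can differ between vLLM versions.

```python
# Hand-written sample in Prometheus text format (illustrative values only).
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.42
process_cpu_seconds_total 12.5
"""


def parse_metrics(text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Collect sample values whose metric name starts with `prefix`.

    Samples sharing a name (different label sets) are summed. Histogram
    series stay under their own _bucket/_sum/_count names.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        if "}" in line:
            # "name{labels} value": labels may contain spaces, so split on the last "}".
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            # "name value"
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if not fields:
            continue
        values[name] = values.get(name, 0.0) + float(fields[0])
    return values


print(parse_metrics(SAMPLE))
```

In practice the text would come from fetching http://localhost:8000/metrics rather than from a constant.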

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| CUDA out of memory | Model too large for VRAM | Add --quantization awq or reduce --gpu-memory-utilization |
| Slow cold start | Model not cached | Pre-pull with huggingface-cli download <model> |
| Low throughput | Too few concurrent requests | Increase --max-num-seqs |
| KV cache full errors | Context length too long | Set --max-model-len lower |
| tokenizer error | Tokenizer mismatch | Use --tokenizer to specify correct tokenizer |
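To see why lowering --max-model-len relieves KV cache pressure, it helps to estimate cache size per token. The sketch below uses the standard estimate of one key and one value vector per layer and KV head; the Llama-3.1-8B dimensions (32 layers, 8 KV heads, head dimension 128) are assumptions to confirm in the model's config.json, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_gib_for_context(tokens: int, num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache in GiB for one sequence of `tokens` tokens."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return tokens * per_token / 2**30


# Assumed Llama-3.1-8B dimensions with a 16-bit cache.
for context in (8192, 32768):
    gib = kv_gib_for_context(context, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"{context} tokens: ~{gib:.1f} GiB of KV cache per sequence")
```

Under these assumptions a single full 32768-token sequence needs about 4 GiB of cache, so halving the maximum length halves the worst-case footprint per request.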

Best Practices

  • Use --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.

  • Pin model versions with --revision for reproducible deployments.

  • Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.

  • Try AWQ or GPTQ quantization before adding tensor parallelism; reducing VRAM use first can avoid the need for extra GPUs.

  • Enable --enable-chunked-prefill for long-context workloads.

  • Monitor vllm:gpu_cache_usage_perc; sustained values above 95% mean the KV cache is nearly full and new requests will queue.

Related Skills

  • llm-inference-scaling - Auto-scaling vLLM deployments

  • gpu-server-management - GPU driver setup

  • llm-gateway - Load balancing across vLLM instances

  • llm-cost-optimization - Cost management

  • model-serving-kubernetes - K8s deployment

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
