serving-llms-vllm

vLLM - High-Performance LLM Serving

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.

To install this skill, send the following command to your AI assistant:

npx skills add davila7/claude-code-templates/davila7-claude-code-templates-serving-llms-vllm

Quick start

vLLM achieves up to 24x higher throughput than standard HuggingFace Transformers through PagedAttention (a block-based KV cache) and continuous batching (interleaving prefill and decode requests).

Installation:

pip install vllm

Basic offline inference:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)

OpenAI-compatible server:

vllm serve meta-llama/Llama-3-8B-Instruct

Query with the OpenAI SDK:

python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

Common workflows

Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:

  • Step 1: Configure server settings
  • Step 2: Test with limited traffic
  • Step 3: Enable monitoring
  • Step 4: Deploy to production
  • Step 5: Verify performance metrics

Step 1: Configure server settings

Choose configuration based on your model size:

For 7B-13B models on a single GPU

vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

For 30B-70B models with tensor parallelism (use an AWQ-quantized checkpoint when passing --quantization awq)

vllm serve TheBloke/Llama-2-70B-AWQ \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

For production with prefix caching (Prometheus metrics are exposed at /metrics on the server port by default)

vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --port 8000 \
  --host 0.0.0.0

Step 2: Test with limited traffic

Run a load test before going to production:

Install the load testing tool:

pip install locust

Create test_load.py with sample requests (a minimal example is sketched below), then run:

locust -f test_load.py --host http://localhost:8000

Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
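
The skill does not ship a test_load.py; the following is a minimal sketch assuming the OpenAI-compatible /v1/chat/completions endpoint and the model served above (adjust the payload, model name, and wait times to your workload):

# test_load.py - minimal Locust user hitting the chat completions endpoint
from locust import HttpUser, task, between

class VLLMUser(HttpUser):
    # Simulated think time between requests per user
    wait_time = between(0.5, 2)

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "meta-llama/Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Explain quantum computing briefly."}],
                "max_tokens": 128,
            },
        )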

Step 3: Enable monitoring

vLLM exposes Prometheus metrics at the /metrics endpoint on the API server port:

curl http://localhost:8000/metrics | grep vllm

Key metrics to monitor:

  • vllm:time_to_first_token_seconds: latency (time to first token)

  • vllm:num_requests_running: currently active requests

  • vllm:gpu_cache_usage_perc: KV cache utilization
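
A quick way to spot-check these values without a full Prometheus setup is to scrape the endpoint directly; a small sketch (assumes the requests package and a server on port 8000):

# Print all vLLM metric samples from the Prometheus text endpoint
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):  # sample lines; comment lines start with '#'
        print(line)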

Step 4: Deploy to production

Use Docker for consistent deployment:

Run vLLM in Docker

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
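
A fuller variant, as a sketch: mounting the HuggingFace cache avoids re-downloading weights on every restart, and HUGGING_FACE_HUB_TOKEN is needed for gated models such as Llama 3 (the cache path and token handling are assumptions to adapt):

docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your_token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching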

Step 5: Verify performance metrics

Check that the deployment meets its targets (a quick TTFT spot-check sketch follows this list):

  • TTFT < 500ms (for short prompts)

  • Throughput > target req/sec

  • GPU utilization > 80%

  • No OOM errors in logs
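
For a rough TTFT spot check (not a substitute for the load test above), you can time the first streamed token; a sketch assuming the OpenAI SDK and the local server:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Stop at the first chunk that actually carries generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break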

Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

Batch Processing:

  • Step 1: Prepare input data
  • Step 2: Configure LLM engine
  • Step 3: Run batch inference
  • Step 4: Process results

Step 1: Prepare input data

Load prompts from file

prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")

Step 2: Configure LLM engine

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,        # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)

Step 3: Run batch inference

vLLM automatically batches requests for efficiency:

# Process all prompts in one call; vLLM batches them internally,
# so there is no need to manually chunk the prompt list.
outputs = llm.generate(prompts, sampling)

Step 4: Process results

Extract generated text

results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

Save to file

import json

with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")

Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

Quantization Setup:

  • Step 1: Choose quantization method
  • Step 2: Find or create quantized model
  • Step 3: Launch with quantization flag
  • Step 4: Verify accuracy

Step 1: Choose quantization method

  • AWQ: Best for 70B models, minimal accuracy loss

  • GPTQ: Wide model support, good compression

  • FP8: Fastest on H100 GPUs

Step 2: Find or create quantized model

Use pre-quantized models from HuggingFace:

Search for AWQ models

Example: TheBloke/Llama-2-70B-AWQ

Step 3: Launch with quantization flag

Using pre-quantized model

vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

Result: the 70B model fits in roughly 40GB of VRAM.

Step 4: Verify accuracy

Test outputs match expected quality:

  • Compare quantized vs. non-quantized responses on a sample of prompts (see the sketch below)

  • Verify task-specific performance is unchanged
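
One way to do this, sketched below, is to run the quantized server and a non-quantized reference server side by side and compare answers to the same prompts (the two ports and the prompt list are assumptions; low-temperature sampling keeps the comparison fair):

from openai import OpenAI

prompts = ["Summarize the theory of relativity in two sentences."]
endpoints = {
    "quantized": OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
    "reference": OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY"),
}

for prompt in prompts:
    for name, client in endpoints.items():
        # Ask each server for whatever model it has loaded
        model_id = client.models.list().data[0].id
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=128,
        )
        print(f"[{name}] {reply.choices[0].message.content}\n")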

When to use vs alternatives

Use vLLM when:

  • Deploying production LLM APIs (100+ req/sec)

  • Serving OpenAI-compatible endpoints

  • Limited GPU memory but need large models

  • Multi-user applications (chatbots, assistants)

  • Need low latency with high throughput

Use alternatives instead:

  • llama.cpp: CPU/edge inference, single-user

  • HuggingFace transformers: Research, prototyping, one-off generation

  • TensorRT-LLM: NVIDIA-only, need absolute maximum performance

  • Text-Generation-Inference: Already in HuggingFace ecosystem

Common issues

Issue: Out of memory during model loading

Reduce memory usage:

vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096

Or use quantization:

vllm serve MODEL --quantization awq

Issue: Slow first token (TTFT > 1 second)

Enable prefix caching for repeated prompts:

vllm serve MODEL --enable-prefix-caching

For long prompts, enable chunked prefill:

vllm serve MODEL --enable-chunked-prefill

Issue: Model not found error

Use --trust-remote-code for custom models:

vllm serve MODEL --trust-remote-code

Issue: Low throughput (<50 req/sec)

Increase concurrent sequences:

vllm serve MODEL --max-num-seqs 512

Check GPU utilization with nvidia-smi; it should be above 80%.

Issue: Inference slower than expected

Verify tensor parallelism uses a power-of-two number of GPUs:

vllm serve MODEL --tensor-parallel-size 4 # Not 3

Enable speculative decoding for faster generation:

vllm serve MODEL --speculative-model DRAFT_MODEL

Advanced topics

Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.

Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.

Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.

Hardware requirements

  • Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)

  • Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism

  • Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
