# SGLang

High-performance serving framework for LLMs and VLMs, with RadixAttention for automatic prefix caching.
## When to use SGLang

Use SGLang when:
- You need structured outputs (JSON, regex, grammar)
- You are building agents with repeated prefixes (system prompts, tools)
- You run agentic workflows with function calling
- You serve multi-turn conversations with shared context
- You need faster JSON decoding (about 3× vs unconstrained decoding)

Use vLLM instead when:
- You only need simple text generation without structured output
- You don't need prefix caching
- You want a mature, widely tested production system

Use TensorRT-LLM instead when:
- You need the lowest single-request latency (no batching)
- You deploy on NVIDIA hardware only
- You need FP8/INT4 quantization on H100
## Quick start

### Installation

```bash
# pip install (recommended)
pip install "sglang[all]"

# With FlashInfer (faster; CUDA 11.8/12.1)
pip install "sglang[all]" flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
### Launch server

```bash
# Basic server (Llama 3-8B)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# RadixAttention (automatic prefix caching) is enabled by default;
# pass --disable-radix-cache to turn it off
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --port 30000
```
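Once a server is up, a quick way to confirm it is serving is to query the OpenAI-compatible model listing. A minimal sketch, assuming the default port 30000 used above:

```python
import requests

# List the models served by the running SGLang instance
resp = requests.get("http://localhost:30000/v1/models")
resp.raise_for_status()
print(resp.json())
```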
### Basic inference

```python
import sglang as sgl

# Point the frontend at the local server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Simple generation
@sgl.function
def simple_gen(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)

# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
```
### Structured JSON output

```python
import sglang as sgl

@sgl.function
def extract_person(s, text):
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"

    # Constrained JSON generation
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}',
    )

# Run
state = extract_person.run(
    text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "age": 35, "occupation": "software engineer"}
```
## RadixAttention (Key Innovation)

What it does: automatically caches and reuses common prefixes across requests.

Performance:
- 5× faster for agentic workloads with shared system prompts
- 10× faster for few-shot prompting with repeated examples
- Zero configuration: works automatically

How it works:
- Builds a radix tree over all processed tokens
- Automatically detects shared prefixes across requests
- Reuses the KV cache for matching prefixes
- Computes only the new tokens

Example (agent with a system prompt):
- Request 1: [SYSTEM_PROMPT] + "What's the weather?" → computes the full prompt (1000 tokens)
- Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight" → reuses the system-prompt KV cache (998 tokens), computes only the 2 new tokens → 5× faster
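A minimal sketch of this effect using the frontend API from the quick start: two requests share a long prefix, and the second one should run noticeably faster because only the suffix is recomputed. The `SYSTEM_PROMPT` string and the timing loop are illustrative, not part of SGLang's API; it assumes a server is running on port 30000.

```python
import time
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

SYSTEM_PROMPT = "You are a travel assistant. " * 100  # long shared prefix

@sgl.function
def assistant(s, question):
    s += SYSTEM_PROMPT + "\n"
    s += "User: " + question + "\nAssistant:"
    s += sgl.gen("answer", max_tokens=64)

for q in ["What's the weather in NYC?", "Book a flight to LA"]:
    t0 = time.time()
    assistant.run(question=q)
    # The second iteration should be faster: the shared prefix is served
    # from the radix cache and only the new tokens are computed.
    print(f"{q!r}: {time.time() - t0:.2f}s")
```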
## Structured generation patterns

### JSON with schema

```python
import json

@sgl.function
def structured_extraction(s, article):
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"

    # JSON schema constraint (passed as a JSON string)
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        },
        "required": ["title", "author", "summary", "sentiment"],
    }
    s += sgl.gen("info", max_tokens=300, json_schema=json.dumps(schema))

state = structured_extraction.run(article="...")
print(state["info"])
# Output: valid JSON matching the schema
```
### Regex-constrained generation

```python
@sgl.function
def extract_email(s, text):
    s += f"Extract email from: {text}\n"
    s += "Email: "

    # Email regex pattern
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    )

state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# Output: "john.doe@example.com"
```
### Grammar-based generation

```python
@sgl.function
def generate_code(s, description):
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"

    # EBNF grammar for a Python function definition
    python_grammar = """
    ?start: function_def
    function_def: "def" NAME "(" [parameters] "):" suite
    parameters: parameter ("," parameter)*
    parameter: NAME
    suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
    """
    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
```
## Agent workflows with function calling

```python
import sglang as sgl

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
        },
    },
    {
        "name": "book_flight",
        "description": "Book a flight",
        "parameters": {
            "type": "object",
            "properties": {
                "from": {"type": "string"},
                "to": {"type": "string"},
                "date": {"type": "string"},
            },
        },
    },
]

@sgl.function
def agent_workflow(s, user_query, tools):
    # System prompt (cached with RadixAttention)
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"

    # User query
    s += f"User: {user_query}\n"
    s += "Assistant: "

    # Generate with function calling
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,  # SGLang handles the tool-call format
        stop=["User:", "\n\n"],
    )

# Multiple queries reuse the system prompt
state1 = agent_workflow.run(user_query="What's the weather in NYC?", tools=tools)
# First call: computes the full system prompt

state2 = agent_workflow.run(user_query="Book a flight to LA", tools=tools)
# Second call: reuses the system prompt KV cache (about 5× faster)
```
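The generated response still has to be parsed and executed by your application; SGLang only produces the text. A minimal, hypothetical dispatch sketch, in which the JSON tool-call format, `run_tool`, and the placeholder handlers are assumptions rather than SGLang APIs:

```python
import json

def get_weather(location):
    return f"Sunny in {location}"      # placeholder implementation

def book_flight(**kwargs):
    return f"Booked flight: {kwargs}"  # placeholder implementation

HANDLERS = {"get_weather": get_weather, "book_flight": book_flight}

def run_tool(response_text):
    # Assumes the model emitted a JSON object such as
    # {"tool": "get_weather", "arguments": {"location": "NYC"}}
    call = json.loads(response_text)
    return HANDLERS[call["tool"]](**call["arguments"])

example = '{"tool": "get_weather", "arguments": {"location": "NYC"}}'
print(run_tool(example))  # -> "Sunny in NYC"
```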
## Performance benchmarks

### RadixAttention speedup

Few-shot prompting (10 examples in the prompt):
- vLLM: 2.5 sec/request
- SGLang: 0.25 sec/request (10× faster)
- Throughput: 4× higher

Agent workflows (1000-token system prompt):
- vLLM: 1.8 sec/request
- SGLang: 0.35 sec/request (5× faster)

JSON decoding:
- Standard: 45 tok/s
- SGLang: 135 tok/s (3× faster)
### Throughput (Llama 3-8B, A100)

| Workload               | vLLM       | SGLang     | Speedup |
|------------------------|------------|------------|---------|
| Simple generation      | 2500 tok/s | 2800 tok/s | 1.12×   |
| Few-shot (10 examples) | 500 tok/s  | 5000 tok/s | 10×     |
| Agent (tool calls)     | 800 tok/s  | 4000 tok/s | 5×      |
| JSON output            | 600 tok/s  | 2400 tok/s | 4×      |
## Multi-turn conversations

```python
@sgl.function
def multi_turn_chat(s, history, new_message):
    # System prompt (always cached)
    s += "You are a helpful AI assistant.\n\n"

    # Conversation history (cached as it grows)
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"

    # New user message (the only uncached part)
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)

# Turn 1
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 2 (reuses the Turn 1 KV cache; only the new message is computed)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")
history.append({"role": "User", "content": "What's 2+2?"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 3 (reuses the Turn 1 + Turn 2 KV cache)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# Progressively faster as the history grows
```
## Advanced features

### Speculative decoding

```bash
# Launch with a draft model (2-3× faster decoding)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-num-steps 5
```
### Multi-modal (vision models)

```python
@sgl.function
def describe_image(s, image_path):
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)

state = describe_image.run(image_path="photo.jpg")
print(state["description"])
```
### Batching and parallel requests

```python
# Automatic batching (continuous batching): all three prompts are
# processed together in a single batch
states = simple_gen.run_batch(
    [
        {"question": "What is AI?"},
        {"question": "What is ML?"},
        {"question": "What is DL?"},
    ]
)
```
## OpenAI-compatible API

```bash
# Start the server (exposes an OpenAI-compatible API)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Use with any OpenAI client
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
```python
# Works with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
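Streaming works through the same endpoint via the standard OpenAI SDK streaming interface. A brief sketch, assuming the server above is running on port 30000:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
```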
## Supported models

Text models:
- Llama 2, Llama 3, Llama 3.1, Llama 3.2
- Mistral, Mixtral
- Qwen, Qwen2, QwQ
- DeepSeek-V2, DeepSeek-V3
- Gemma, Phi-3

Vision models:
- LLaVA, LLaVA-OneVision
- Phi-3-Vision
- Qwen2-VL

100+ models from HuggingFace are supported.
## Hardware support

- NVIDIA: A100, H100, L4, T4 (CUDA 11.8+)
- AMD: MI300, MI250 (ROCm 6.0+)
- Intel: Xeon with GPU (coming soon)
- Apple: M1/M2/M3 via MPS (experimental)
## References

- Structured Generation Guide - JSON schemas, regex, grammars, validation
- RadixAttention Deep Dive - how it works, optimization, benchmarks
- Production Deployment - multi-GPU, monitoring, autoscaling
## Resources

- Paper: RadixAttention (arXiv:2312.07104)
- Discord: https://discord.gg/sglang