# TensorRT-LLM
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
## When to use TensorRT-LLM

Use TensorRT-LLM when:

- Deploying on NVIDIA GPUs (A100, H100, GB200)
- You need maximum throughput (24,000+ tokens/sec on Llama 3)
- You require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes
Use vLLM instead when:

- You need a simpler setup and a Python-first API
- You want PagedAttention without TensorRT compilation
- Working with AMD GPUs or other non-NVIDIA hardware
Use llama.cpp instead when:

- Deploying on CPU or Apple Silicon
- You need edge deployment without NVIDIA GPUs
- You want the simpler GGUF quantization format
## Quick start

### Installation

#### Docker (recommended)

```bash
docker pull nvidia/tensorrt_llm:latest
```
#### pip install

```bash
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12.
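To confirm the environment is usable, a quick sanity check of the installed version and GPU visibility (a minimal sketch; it assumes only the `tensorrt_llm` package and its PyTorch dependency):

```python
# Verify the install: report the TensorRT-LLM version and the visible GPU
import torch
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```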
### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(max_tokens=100, temperature=0.7, top_p=0.9)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
### Serving with trtllm-serve

```bash
# Start the server (automatic model download and compilation);
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096
```
Client request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
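Because trtllm-serve exposes an OpenAI-compatible API, the standard `openai` Python client can be pointed at the same endpoint (a minimal sketch; the placeholder API key is an assumption for a local, unauthenticated server):

```python
# Query the local trtllm-serve endpoint via the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```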
## Key features

### Performance optimizations

- In-flight batching: Dynamic batching during generation
- Paged KV cache: Efficient memory management (see the sketch after this list)
- Flash Attention: Optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: Reduced kernel launch overhead
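Several of these options are exposed on the LLM constructor. A minimal sketch, assuming the `KvCacheConfig` helper in `tensorrt_llm.llmapi` with `free_gpu_memory_fraction` and `enable_block_reuse` fields (treat these names as assumptions and check the LLM API reference for your installed version):

```python
# Sketch: FP8 weights plus paged-KV-cache tuning via KvCacheConfig.
# KvCacheConfig and its field names are assumptions to verify against
# the installed tensorrt_llm.llmapi.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp8",  # quantized weights, as in the FP8 pattern below
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # portion of free GPU memory given to the paged KV cache
        enable_block_reuse=True,       # reuse KV blocks across requests with shared prefixes
    ),
)
```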
### Parallelism

- Tensor parallelism (TP): Split model across GPUs (see the sketch after this list)
- Pipeline parallelism (PP): Layer-wise distribution
- Expert parallelism: For Mixture-of-Experts models
- Multi-node: Scale beyond a single machine
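As a sketch, TP and PP can be combined on the LLM constructor; `tensor_parallel_size` matches the multi-GPU example below, while `pipeline_parallel_size` is assumed to be the corresponding pipeline-parallel argument (verify against your installed version):

```python
# Sketch: combine tensor and pipeline parallelism for a very large model.
# pipeline_parallel_size is an assumed argument name; check the LLM API docs.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,    # shard each layer across 8 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 pipeline stages
)
```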
### Advanced features

- Speculative decoding: Faster generation with draft models
- LoRA serving: Efficient multi-adapter deployment
- Disaggregated serving: Separate prefill and generation
## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load an FP8-quantized model (2x faster, ~50% of the memory)
llm = LLM(model="meta-llama/Meta-Llama-3-70B", dtype="fp8", max_num_tokens=8192)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```
### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(model="meta-llama/Meta-Llama-3-405B", tensor_parallel_size=8, dtype="fp8")
```
### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(prompts, sampling_params=SamplingParams(max_tokens=200))
# In-flight batching is applied automatically for maximum throughput
```
## Performance benchmarks

Meta Llama 3-8B (H100 GPU):

- Throughput: 24,000 tokens/sec
- Latency: ~10 ms per token
- vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):

- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
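These figures vary with batch size, sequence length, and quantization, so it is worth measuring on your own workload. A rough throughput check built on the quick-start API (a sketch; the prompt batch is a placeholder, and `token_ids` on the completion output is an assumption):

```python
# Rough throughput check: generated tokens per second over one batch
import time
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
prompts = ["Explain quantum computing"] * 64  # placeholder workload
params = SamplingParams(max_tokens=200)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# token_ids per completion is assumed; adjust if your version differs
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/sec")
```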
## Supported models

- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on HuggingFace
## References

- Optimization Guide - Quantization, batching, KV cache tuning
- Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
- Serving Guide - Production deployment, monitoring, autoscaling
## Resources