llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- You are running on CPU-only machines
- You are deploying on Apple Silicon (M1/M2/M3/M4)
- You are using AMD or Intel GPUs (no CUDA)
- You are deploying to the edge (Raspberry Pi, embedded systems)
- You need simple deployment without Docker or Python
Use TensorRT-LLM instead when:
- You have NVIDIA GPUs (A100/H100)
- You need maximum throughput (100K+ tok/s)
- You are running in a datacenter with CUDA
Use vLLM instead when:
- You have NVIDIA GPUs
- You need a Python-first API
- You want PagedAttention
Quick start
Installation
macOS/Linux
brew install llama.cpp
Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
With Metal (Apple Silicon)
make LLAMA_METAL=1
With CUDA (NVIDIA)
make LLAMA_CUDA=1
With ROCm (AMD)
make LLAMA_HIP=1
Download model
Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
Run inference
Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256  # Max tokens
Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
Server mode
Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32  # Offload 32 layers to GPU
Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
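The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server started above is listening on localhost:8080; the helper names `build_chat_request` and `chat` are made up for illustration, and the response is read using the usual OpenAI chat completion layout.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on port 8080, as started above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt, temperature=0.7, max_tokens=100):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(prompt, url=SERVER_URL, timeout=120):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-compatible responses put the text under choices[0].message.content.
    return payload["choices"][0]["message"]["content"]
```

Once the server is up, `print(chat("Hello!"))` should print the model's reply.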
Quantization formats
GGUF format overview
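| Format | Bits/weight | Size (7B) | Speed | Quality | Use case |
|--------|-------------|-----------|-------|---------|----------|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |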
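The size figures follow roughly from the bits per weight: size ≈ parameters × bits / 8. The sketch below assumes a nominal 7e9 parameters and ignores file metadata and any tensors stored at higher precision, so its estimates can differ from the sizes listed for real files. The function name is hypothetical.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (1 GB = 1e9 bytes) from bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits-per-weight values taken from the format overview.
BITS = {
    "Q2_K": 2.5,
    "Q4_K_S": 4.3,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.5,
    "Q8_0": 8.0,
}

for name, bits in BITS.items():
    # Nominal 7B model; real checkpoints have slightly different counts.
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

The same function scales to other model sizes by changing `n_params`, which is handy for checking whether a 70B file will fit before downloading it.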
Choosing quantization
General use (balanced): Q4_K_M (4-bit, medium quality)
Maximum speed (more degradation): Q2_K or Q3_K_M
Maximum quality (slower): Q6_K or Q8_0
Very large models (70B, 405B): Q3_K_M or Q4_K_S (lower bits to fit in memory)
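These rules of thumb can be turned into a small helper that picks the highest-bit format whose estimated footprint fits a memory budget. This is a sketch under stated assumptions: the bits-per-weight values come from the format overview (Q3_K_M is left out because no figure is listed for it), and the `headroom` multiplier for the KV cache and runtime buffers is a guess, not a measured value. The name `pick_quant` is hypothetical.

```python
# Bits per weight from the format overview, ordered from highest to lowest quality.
QUANTS = [
    ("Q8_0", 8.0),
    ("Q6_K", 6.5),
    ("Q5_K_M", 5.5),
    ("Q4_K_M", 4.5),
    ("Q4_K_S", 4.3),
    ("Q2_K", 2.5),
]


def pick_quant(n_params, memory_gb, headroom=1.2):
    """Return the highest-quality format whose estimated footprint fits.

    headroom is an assumed multiplier covering the KV cache and runtime
    buffers; tune it for your context size.
    """
    for name, bits in QUANTS:
        size_gb = n_params * bits / 8 / 1e9
        if size_gb * headroom <= memory_gb:
            return name
    return None  # Nothing fits; use a smaller model.
```

For example, `pick_quant(70e9, 48)` walks down the list until the estimate fits in 48 GB.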
Hardware acceleration
Apple Silicon (Metal)
Build with Metal
make LLAMA_METAL=1
Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
Performance: 40-60 tokens/sec on an M3 Max (Llama 2-7B Q4_K_M)
NVIDIA GPUs (CUDA)
Build with CUDA
make LLAMA_CUDA=1
Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
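A starting value for `-ngl` can be estimated by dividing the model file evenly across its layers and seeing how many fit in VRAM. This is only a rough sketch: it assumes all layers are the same size, and `reserve_gb` is an assumed allowance for the KV cache and scratch buffers. The function name is hypothetical.

```python
import math


def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers to pass to -ngl.

    Assumes every layer takes an equal share of the model file and that
    reserve_gb (an assumed figure) is kept free for the KV cache and buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))
```

With an illustrative 40 GB file split over 80 layers and a 24 GB card, `layers_that_fit(40.0, 80, 24.0)` suggests a value to try first; lower it if the run hits out-of-memory errors.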
AMD GPUs (ROCm)
Build with ROCm
make LLAMA_HIP=1
Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
Common patterns
Batch processing
Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
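If each prompt needs its own separate output, one alternative is to drive `llama-cli` from a short script, one invocation per line of the prompts file. This is a sketch, not a tuned pipeline: it uses only the `-m`, `-p`, and `-n` flags shown earlier, the helper names are hypothetical, and it reloads the model for every prompt, so the server is likely a better fit for large batches.

```python
import subprocess


def load_prompts(path):
    """Read one prompt per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_command(model_path, prompt, n_predict=100, binary="./llama-cli"):
    """Assemble the llama-cli argument list for a single prompt."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


def run_all(model_path, prompts_path):
    """Run each prompt in turn and collect the raw stdout of each run."""
    outputs = []
    for prompt in load_prompts(prompts_path):
        result = subprocess.run(
            build_command(model_path, prompt),
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout)
    return outputs
```

Calling `run_all("model.gguf", "prompts.txt")` returns one output string per prompt, in file order.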
Constrained generation
JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf
Outputs valid JSON only
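Even with a grammar, it is worth parsing the result before using it, since generation can stop at the token limit before the closing brace is emitted. A minimal check, assuming the generated text has already been separated from the echoed prompt (the function name is hypothetical):

```python
import json


def parse_json_output(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

A `None` result is a signal to retry with a larger `-n` or a shorter prompt.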
Context size
Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096  # 4K context window
Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
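Larger contexts cost memory because the KV cache grows linearly with the context length. The sketch below estimates that cost, assuming Llama 2 7B's shape (32 layers, 32 KV heads, head dimension 128) and a 16-bit cache; other models and cache types will give different numbers. The function name is hypothetical.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Bytes needed to store keys and values for n_ctx tokens.

    The leading factor of 2 accounts for storing both K and V per layer.
    Defaults assume Llama 2 7B with a 16-bit cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


for n_ctx in (512, 4096, 32768):
    gib = kv_cache_bytes(n_ctx) / 1024**3
    print(f"context {n_ctx}: {gib:.2f} GiB of KV cache")
```

Under these assumptions a 4K context needs about 2 GiB on top of the model weights, and a 32K context about 16 GiB.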
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|-----|---------|-------|------|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|-----|-------|--------|------|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
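To compare the hourly prices on an equal footing, speed and cost can be combined into a cost per million tokens. This is a back-of-the-envelope sketch that assumes a single stream running at the listed speed for the whole hour; the function name is hypothetical.

```python
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Convert an hourly price and a sustained speed into $ per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Figures copied from the benchmark tables.
for name, price, speed in [
    ("AWS c7i.16xlarge", 2.88, 40),
    ("NVIDIA A10", 1.00, 80),
    ("AMD MI250", 2.00, 70),
]:
    print(f"{name}: ${cost_per_million_tokens(price, speed):.2f} per 1M tokens")
```

On these figures the A10 comes out several times cheaper per token than the CPU instance, even though its hourly price is only about a third.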
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision); Whisper (audio) is handled by the separate whisper.cpp project
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU
Resources
- Discord: https://discord.gg/llama-cpp