llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when:

Running on CPU-only machines
Deploying on Apple Silicon (M1/M2/M3/M4)
Using AMD or Intel GPUs (no CUDA)
Edge deployment (Raspberry Pi, embedded systems)
Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:

Have NVIDIA GPUs (A100/H100)
Need maximum throughput (100K+ tok/s)
Running in datacenter with CUDA

Use vLLM instead when:

Have NVIDIA GPUs
Need Python-first API
Want PagedAttention

Quick start

Installation

macOS/Linux

brew install llama.cpp

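Or build from source (recent releases build with CMake; the Makefile build has been retired)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

Binaries are written to build/bin/ (for example build/bin/llama-cli). The commands below assume llama-cli and llama-server are in the current directory or on your PATH.

With Metal (Apple Silicon)

Metal is enabled by default on macOS, so the plain build above is sufficient.

With CUDA (NVIDIA)

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

With ROCm (AMD)

cmake -B build -DGGML_HIP=ON
cmake --build build --config Release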

Download model

Download from HuggingFace (GGUF format)

huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

Or convert from HuggingFace

python convert_hf_to_gguf.py models/llama-2-7b-chat/

Run inference

Simple chat

./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256 # Max tokens

Interactive chat

./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive

Server mode

Start OpenAI-compatible server

./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32 # Offload 32 layers to GPU

Client request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
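The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is running on localhost:8080 as started above and that its response follows the OpenAI chat-completions shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical and not part of llama.cpp.

```python
import json
import urllib.request

# Assumed to match the --host/--port used when starting llama-server above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt, model="llama-2-7b-chat", temperature=0.7,
                       max_tokens=100, url=SERVER_URL):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60.0, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.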

Quantization formats

GGUF format overview

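| Format | Bits | Size (7B) | Speed   | Quality   | Use case            |
|--------|------|-----------|---------|-----------|---------------------|
| Q4_K_M | 4.5  | 4.1 GB    | Fast    | Good      | Recommended default |
| Q4_K_S | 4.3  | 3.9 GB    | Faster  | Lower     | Speed critical      |
| Q5_K_M | 5.5  | 4.8 GB    | Medium  | Better    | Quality critical    |
| Q6_K   | 6.5  | 5.5 GB    | Slower  | Best      | Maximum quality     |
| Q8_0   | 8.0  | 7.0 GB    | Slow    | Excellent | Minimal degradation |
| Q2_K   | 2.5  | 2.7 GB    | Fastest | Poor      | Testing only        |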

Choosing quantization

General use (balanced)

Q4_K_M # 4-bit, medium quality

Maximum speed (more degradation)

Q2_K or Q3_K_M

Maximum quality (slower)

Q6_K or Q8_0

Very large models (70B, 405B)

Q3_K_M or Q4_K_S # Lower bits to fit in memory
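The choice above mostly comes down to what fits in memory. The sketch below picks the largest format from the overview table that fits a given memory budget. It rests on two assumptions that are not from the original guide: file size scales linearly with parameter count from the 7B figures, and a fixed headroom (1.5 GB here, an arbitrary placeholder) covers the KV cache and runtime buffers. Q3_K_M is left out because the table gives no size for it.

```python
# Approximate 7B file sizes in GB, copied from the format overview table,
# ordered from most to fewest bits per weight.
SIZE_7B_GB = {
    "Q8_0": 7.0,
    "Q6_K": 5.5,
    "Q5_K_M": 4.8,
    "Q4_K_M": 4.1,
    "Q4_K_S": 3.9,
    "Q2_K": 2.7,
}


def estimate_size_gb(fmt, params_billion):
    """Scale the 7B size linearly with parameter count (an assumption)."""
    return SIZE_7B_GB[fmt] * params_billion / 7.0


def pick_quant(params_billion, memory_gb, headroom_gb=1.5):
    """Return the highest-bit format that fits in memory, or None."""
    for fmt in SIZE_7B_GB:  # dicts preserve insertion order
        if estimate_size_gb(fmt, params_billion) + headroom_gb <= memory_gb:
            return fmt
    return None


print(pick_quant(7, 8))    # 7B model, 8 GB of memory
print(pick_quant(70, 48))  # 70B model, 48 GB of memory
```

Treat the result as a starting point only: real file sizes vary by model architecture, and long contexts need more headroom than the placeholder used here.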

Hardware acceleration

Apple Silicon (Metal)

Build with Metal (enabled by default on macOS)

cmake -B build
cmake --build build --config Release

Run with GPU acceleration (automatic)

./llama-cli -m model.gguf -ngl 999 # Offload all layers

Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)

NVIDIA GPUs (CUDA)

Build with CUDA

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Offload layers to GPU

./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers

Hybrid CPU+GPU for large models

./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
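A rough way to choose the `-ngl` value for hybrid offloading is to divide the usable VRAM by the average size of one layer. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM (2 GB here, an arbitrary placeholder) for the KV cache and scratch buffers. The function name `layers_to_offload` is hypothetical.

```python
import math


def layers_to_offload(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    per_layer_gb = model_size_gb / n_layers
    # Never report more layers than the model has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# A ~41 GB 70B Q4_K_M file with 80 layers on a 24 GB GPU.
print(layers_to_offload(41.0, 80, 24.0))
```

If loading fails with an out-of-memory error, lower the value; if VRAM is left unused, raise it. The estimate only narrows the search.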

AMD GPUs (ROCm)

Build with ROCm

cmake -B build -DGGML_HIP=ON
cmake --build build --config Release

Run with AMD GPU

./llama-cli -m model.gguf -ngl 999

Common patterns

Batch processing

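Process multiple prompts from a file (one prompt per line)

# --batch-size sets how many prompt tokens are evaluated per step;
# it does not batch separate prompts, so loop over the file instead.
while IFS= read -r prompt; do
  ./llama-cli -m model.gguf -p "$prompt" -n 100 --batch-size 512 < /dev/null
done < prompts.txt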

Constrained generation

JSON output with grammar

./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

Outputs valid JSON only
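When consuming this output from a script, it helps to parse it defensively, since the captured text may include the echoed prompt before the JSON. The sketch below assumes the output contains a single top-level JSON object. The helper name `extract_json` is hypothetical.

```python
import json


def extract_json(output):
    """Parse the first JSON object found in the captured CLI output.

    Assumes the grammar produced a top-level object ({...}); text before
    the first "{" (such as the echoed prompt) is skipped.
    """
    start = output.find("{")
    if start == -1:
        raise ValueError("no JSON object found in output")
    # raw_decode parses one JSON value starting at `start` and ignores
    # anything that follows it.
    obj, _end = json.JSONDecoder().raw_decode(output, start)
    return obj


sample = 'Generate a person: {"name": "Ada", "age": 36}\n'
print(extract_json(sample)["name"])  # prints "Ada"
```

If generation stops at the `-n` token limit before the object is closed, `raw_decode` raises `json.JSONDecodeError`, so callers should be ready to retry with a larger limit.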

Context size

Increase context (the default depends on the build: 512 in older releases, 4096 in recent ones)

./llama-cli \
  -m model.gguf \
  -c 4096 # 4K context window

Very long context (if model supports)

./llama-cli -m model.gguf -c 32768 # 32K context
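Larger contexts cost memory because the KV cache grows linearly with `-c`. The sketch below uses the standard estimate of 2 (keys and values) × layers × context × KV heads × head dimension × bytes per element. It assumes an unquantized f16 cache and uses Llama 2 7B's shape (32 layers, 32 KV heads, head dimension 128). Other models, especially those with grouped-query attention, need different numbers.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3

# Llama 2 7B shape (assumed: 32 layers, 32 KV heads, head_dim 128), f16 cache.
print(kv_cache_bytes(32, 4096, 32, 128) / GIB)   # 2.0 GiB at 4K context
print(kv_cache_bytes(32, 32768, 32, 128) / GIB)  # 16.0 GiB at 32K context
```

Under these assumptions, a 32K context on this model needs several times more memory for the cache than for the Q4_K_M weights themselves.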

Performance benchmarks

CPU performance (Llama 2-7B Q4_K_M)

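| CPU               | Threads | Speed    | Cost       |
|-------------------|---------|----------|------------|
| Apple M3 Max      | 16      | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32      | 35 tok/s | $0.50/hour |
| Intel i9-13900K   | 32      | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge  | 64      | 40 tok/s | $2.88/hour |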

GPU acceleration (Llama 2-7B Q4_K_M)

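| GPU                  | Speed     | vs CPU | Cost       |
|----------------------|-----------|--------|------------|
| NVIDIA RTX 4090      | 120 tok/s | 3-4×   | $0 (local) |
| NVIDIA A10           | 80 tok/s  | 2-3×   | $1.00/hour |
| AMD MI250            | 70 tok/s  | 2×     | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s  | ~Same  | $0 (local) |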
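To compare the hourly-priced rows on equal terms, speed and price can be folded into a cost per million generated tokens. This is a small sketch using the figures from the tables above. It assumes a single request stream running at the listed speed for the whole hour, which real workloads rarely sustain.

```python
def cost_per_million_tokens(tokens_per_sec, price_per_hour):
    """Convert throughput and hourly price into dollars per 1M tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Figures taken from the benchmark tables above.
print(round(cost_per_million_tokens(40, 2.88), 2))  # AWS c7i.16xlarge
print(round(cost_per_million_tokens(80, 1.00), 2))  # NVIDIA A10
```

On these numbers the A10 works out to about $3.47 per million tokens against about $20 for the c7i.16xlarge, so the GPU instance is cheaper per token despite the similar hourly scale.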

Supported models

LLaMA family:

Llama 2 (7B, 13B, 70B)
Llama 3 (8B, 70B, 405B)
Code Llama

Mistral family:

Mistral 7B
Mixtral 8x7B, 8x22B

Other:

Falcon, BLOOM, GPT-J
Phi-3, Gemma, Qwen
LLaVA (vision); Whisper (audio) is handled by the separate whisper.cpp project, not llama.cpp

Find models: https://huggingface.co/models?library=gguf

References

Quantization Guide - GGUF formats, conversion, quality comparison
Server Deployment - API endpoints, Docker, monitoring
Optimization - Performance tuning, hybrid CPU+GPU

Resources

GitHub: https://github.com/ggerganov/llama.cpp
Models: https://huggingface.co/models?library=gguf
Discord: https://discord.gg/llama-cpp
