# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
## When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- You need CPU inference without GPU requirements
- You want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
**Key advantages:**

- Universal hardware: CPU, Apple Silicon, NVIDIA, and AMD support
- No Python runtime: pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: importance matrix for better low-bit quality
**Use alternatives instead:**

- AWQ/GPTQ: maximum accuracy with calibration on NVIDIA GPUs
- HQQ: fast calibration-free quantization for HuggingFace
- bitsandbytes: simple integration with the transformers library
- TensorRT-LLM: production NVIDIA deployment with maximum speed
## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```
Convert model to GGUF
Install requirements
pip install -r requirements.txt
Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
Or specify output type
python convert_hf_to_gguf.py ./path/to/model
--outfile model-f16.gguf
--outtype f16
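Before quantizing, it can be worth sanity-checking the converted file. A minimal sketch, assuming the `gguf` Python package that ships with llama.cpp as gguf-py (`pip install gguf`):

```python
# Sketch: inspect a converted GGUF file's header and tensors.
# Assumes the `gguf` package from llama.cpp's gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")

# Metadata keys stored in the header (architecture, context length, tokenizer, ...)
for key in reader.fields:
    print(key)

# Number of tensors and a few example names/shapes
print(f"{len(reader.tensors)} tensors")
for t in reader.tensors[:5]:
    print(t.name, list(t.shape))
```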
### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with an importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
## Quantization types

### K-quant methods (recommended)

| Type | Bits/weight | Size (7B) | Quality | Use case |
|------|-------------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

Recommendation: use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
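The sizes in the table above follow roughly from bits per weight: size ≈ parameters × bits / 8. A quick back-of-the-envelope check (real files are slightly larger because some tensors are kept at higher precision and the header adds metadata):

```python
# Rough GGUF size estimate from parameter count and bits per weight.
# Real files differ slightly: output/embedding tensors often use higher
# precision, and the GGUF header adds metadata.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for quant, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: ~{estimate_size_gb(7e9, bpw):.1f} GB for a 7B model")
```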
## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
Add more diverse text samples...
EOF

# 3. Generate the importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```
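The three sentences in the example calibration file are only placeholders; imatrix quality improves with a larger, more diverse text sample. A sketch for building one, assuming the Hugging Face `datasets` package and the public wikitext corpus (any varied text works):

```python
# Sketch: build a larger calibration.txt for llama-imatrix.
# Assumes `pip install datasets`; wikitext is just one convenient source.
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

chars_written = 0
with open("calibration.txt", "w", encoding="utf-8") as f:
    for row in ds:
        text = row["text"].strip()
        if not text:
            continue
        f.write(text + "\n")
        chars_written += len(text)
        if chars_written >= 500_000:  # a few hundred KB of text is a common starting point
            break
```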
### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate the imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode

### Start an OpenAI-compatible server

```bash
# Start the server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with the Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```
### Use with the OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
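Because the server speaks the OpenAI protocol, streaming also works through the same client; a short sketch against the local endpoint started above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Stream tokens as the local server generates them
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```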
## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Select a specific GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
### CPU optimization

```bash
# Build (AVX2/AVX512 support is picked up automatically for the host CPU)
make clean && make

# Run with an appropriate thread count
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU configuration
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
## Integration with tools

### Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }} {{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel "Hello!"
```
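Once the model is created, it can also be called from code; a minimal sketch assuming the `ollama` Python package and a running Ollama daemon:

```python
# Minimal sketch, assuming `pip install ollama` and a local Ollama daemon
# serving the `mymodel` created above.
import ollama

response = ollama.chat(
    model="mymodel",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```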
LM Studio
-
Place GGUF file in ~/.cache/lm-studio/models/
-
Open LM Studio and select the model
-
Configure context length and GPU offload
-
Start inference
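If LM Studio's built-in local server is enabled, the loaded GGUF model can also be queried over its OpenAI-compatible API; a sketch assuming the default port 1234:

```python
# Sketch: query LM Studio's local server (OpenAI-compatible).
# Assumes the server is enabled in LM Studio and listening on the default port 1234.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```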
### text-generation-webui

```bash
# Place the GGUF file in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
## Best practices

- Use K-quants: Q4_K_M offers the best quality/size balance
- Use imatrix: always use an importance matrix for Q4 and below
- GPU offload: offload as many layers as VRAM allows
- Context length: start with 4096, increase if needed
- Thread count: match physical CPU cores, not logical
- Batch size: increase n_batch for faster prompt processing
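As an illustration, a llama-cpp-python configuration that follows these recommendations might look like the sketch below (physical-core detection assumes the optional `psutil` package):

```python
# Sketch of a Llama configuration reflecting the practices above.
# psutil is an optional dependency, used here only to count physical cores.
import psutil
from llama_cpp import Llama

physical_cores = psutil.cpu_count(logical=False) or 4

llm = Llama(
    model_path="./model-q4_k_m.gguf",  # K-quant, ideally built with an imatrix
    n_ctx=4096,                        # Start at 4096; raise only if needed
    n_gpu_layers=-1,                   # Offload all layers that fit (-1 = all)
    n_threads=physical_cores,          # Physical cores, not logical
    n_batch=512,                       # Larger batches speed up prompt processing
)
```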
## Common issues

**Model loads slowly:**

```bash
# Memory mapping (mmap) is enabled by default, so loading should be near-instant;
# avoid --no-mmap, and use --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use an imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
## References

- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
## Resources

- Repository: https://github.com/ggml-org/llama.cpp
- Python bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized models: https://huggingface.co/TheBloke
- GGUF converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT