# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
## When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- You need CPU inference without GPU requirements
- You want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
**Key advantages:**

- Universal hardware: CPU, Apple Silicon, NVIDIA, and AMD support
- No Python runtime: pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: importance matrix for better low-bit quality
**Use alternatives instead:**

- AWQ/GPTQ: maximum accuracy with calibration on NVIDIA GPUs
- HQQ: fast calibration-free quantization for HuggingFace
- bitsandbytes: simple integration with the transformers library
- TensorRT-LLM: production NVIDIA deployment with maximum speed
## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```
Convert model to GGUF
Install requirements
pip install -r requirements.txt
Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
Or specify output type
python convert_hf_to_gguf.py ./path/to/model
--outfile model-f16.gguf
--outtype f16
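Before quantizing, it can be worth sanity-checking the converted file. A minimal sketch, assuming the `gguf` Python package that ships with llama.cpp as gguf-py (`pip install gguf`):

```python
# Sketch: inspect a converted GGUF file's header and tensors.
# Assumes the `gguf` package from llama.cpp's gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")

# Metadata keys stored in the header (architecture, context length, tokenizer, ...)
for key in reader.fields:
    print(key)

# Number of tensors and a few example names/shapes
print(f"{len(reader.tensors)} tensors")
for t in reader.tensors[:5]:
    print(t.name, list(t.shape))
```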
### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with an importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
## Quantization types

### K-quant methods (recommended)

| Type | Bits/weight | Size (7B) | Quality | Use case |
|------|-------------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

Recommendation: use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
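The sizes in the table above follow roughly from bits per weight: size ≈ parameters × bits / 8. A quick back-of-the-envelope check (real files are slightly larger because some tensors are kept at higher precision and the header adds metadata):

```python
# Rough GGUF size estimate from parameter count and bits per weight.
# Real files differ slightly: output/embedding tensors often use higher
# precision, and the GGUF header adds metadata.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for quant, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: ~{estimate_size_gb(7e9, bpw):.1f} GB for a 7B model")
```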
## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
Add more diverse text samples...
EOF

# 3. Generate the importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```
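The three sentences in the example calibration file are only placeholders; imatrix quality improves with a larger, more diverse text sample. A sketch for building one, assuming the Hugging Face `datasets` package and the public wikitext corpus (any varied text works):

```python
# Sketch: build a larger calibration.txt for llama-imatrix.
# Assumes `pip install datasets`; wikitext is just one convenient source.
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

chars_written = 0
with open("calibration.txt", "w", encoding="utf-8") as f:
    for row in ds:
        text = row["text"].strip()
        if not text:
            continue
        f.write(text + "\n")
        chars_written += len(text)
        if chars_written >= 500_000:  # a few hundred KB of text is a common starting point
            break
```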
### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate the imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode

### Start an OpenAI-compatible server

```bash
# Start the server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with the Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```
### Use with the OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
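Because the server speaks the OpenAI protocol, streaming also works through the same client; a short sketch against the local endpoint started above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Stream tokens as the local server generates them
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```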
## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Select a specific GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
### CPU optimization

```bash
# Build (AVX2/AVX512 support is picked up automatically for the host CPU)
make clean && make

# Run with an appropriate thread count
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU configuration
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
## Integration with tools

### Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }} {{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel "Hello!"
```
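Once the model is created, it can also be called from code; a minimal sketch assuming the `ollama` Python package and a running Ollama daemon:

```python
# Minimal sketch, assuming `pip install ollama` and a local Ollama daemon
# serving the `mymodel` created above.
import ollama

response = ollama.chat(
    model="mymodel",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```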
LM Studio
-
Place GGUF file in ~/.cache/lm-studio/models/
-
Open LM Studio and select the model
-
Configure context length and GPU offload
-
Start inference
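If LM Studio's built-in local server is enabled, the loaded GGUF model can also be queried over its OpenAI-compatible API; a sketch assuming the default port 1234:

```python
# Sketch: query LM Studio's local server (OpenAI-compatible).
# Assumes the server is enabled in LM Studio and listening on the default port 1234.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```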
### text-generation-webui

```bash
# Place the GGUF file in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
## Best practices

- Use K-quants: Q4_K_M offers the best quality/size balance
- Use imatrix: always use an importance matrix for Q4 and below
- GPU offload: offload as many layers as VRAM allows
- Context length: start with 4096, increase if needed
- Thread count: match physical CPU cores, not logical
- Batch size: increase n_batch for faster prompt processing
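As an illustration, a llama-cpp-python configuration that follows these recommendations might look like the sketch below (physical-core detection assumes the optional `psutil` package):

```python
# Sketch of a Llama configuration reflecting the practices above.
# psutil is an optional dependency, used here only to count physical cores.
import psutil
from llama_cpp import Llama

physical_cores = psutil.cpu_count(logical=False) or 4

llm = Llama(
    model_path="./model-q4_k_m.gguf",  # K-quant, ideally built with an imatrix
    n_ctx=4096,                        # Start at 4096; raise only if needed
    n_gpu_layers=-1,                   # Offload all layers that fit (-1 = all)
    n_threads=physical_cores,          # Physical cores, not logical
    n_batch=512,                       # Larger batches speed up prompt processing
)
```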
## Common issues

**Model loads slowly:**

```bash
# Memory mapping (mmap) is enabled by default, so loading should be near-instant;
# avoid --no-mmap, and use --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use an imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
## References

- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
## Resources

- Repository: https://github.com/ggml-org/llama.cpp
- Python bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized models: https://huggingface.co/TheBloke
- GGUF converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT