qlora

Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.


QLoRA: Quantized Low-Rank Adaptation

QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.

Prerequisites: This skill assumes familiarity with LoRA. See the lora skill for LoRA fundamentals (LoraConfig, target_modules, training patterns).

Core Innovations

QLoRA introduces three techniques that reduce memory usage without sacrificing performance:

4-bit NormalFloat (NF4)

NF4 is an information-theoretically optimal quantization data type for normally distributed data. Pretrained neural network weights are approximately zero-centered and normally distributed, so NF4 represents them more accurately than a generic 4-bit float at the same bit width.

Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)

The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; LoRA adapters remain in full precision.
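The "optimal for normally distributed weights" property comes from choosing the 16 quantization levels as quantiles of a standard normal, so each level covers roughly equal probability mass. Below is a simplified sketch of that construction (evenly spaced quantiles normalized to [-1, 1]; the real NF4 codebook uses a slightly different asymmetric scheme so that zero is represented exactly):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal N(0, 1)

# 16 evenly spaced probabilities, offset away from the 0/1 tails
probs = [(i + 0.5) / 16 for i in range(16)]
levels = [nd.inv_cdf(p) for p in probs]

# Normalize so the levels span [-1, 1], like the NF4 codebook
m = max(abs(x) for x in levels)
levels = [x / m for x in levels]

# The levels cluster near zero, where most weights live
print([round(x, 3) for x in levels])
```

A uniform 4-bit grid would waste levels in the sparse tails of the weight distribution; quantile-based levels are densest near zero, exactly where a normal distribution concentrates.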

NF4 vs FP4:

| Quantization | Description | Use Case |
|---|---|---|
| nf4 | NormalFloat 4-bit, optimal for normally distributed weights | Default, recommended |
| fp4 | Standard 4-bit float | Legacy, rarely needed |

Double Quantization

Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:

First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants

This saves approximately 0.37 bits per parameter—significant for billion-parameter models:

  • 7B model: ~325 MB savings
  • 70B model: ~3.2 GB savings
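The 0.37-bit figure follows directly from the block sizes used in the QLoRA paper (64 parameters per quantization block, 256 first-level constants per second-level block). A quick back-of-envelope check:

```python
BLOCK = 64    # parameters per first-level quantization block
BLOCK2 = 256  # first-level constants per second-level block

# Without double quantization: one fp32 scale per 64-parameter block
overhead_single = 32 / BLOCK                         # 0.5 bits/param

# With double quantization: 8-bit scales plus one fp32 scale per 256 of them
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # ~0.127 bits/param

savings_bits = overhead_single - overhead_double     # ~0.373 bits/param

def savings_mb(n_params):
    """Memory saved by double quantization, in MB."""
    return n_params * savings_bits / 8 / 1e6

print(round(savings_bits, 3))   # 0.373
print(round(savings_mb(7e9)))   # 326 -> ~325 MB for a 7B model
print(round(savings_mb(70e9)))  # 3264 -> ~3.2 GB for a 70B model
```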

Paged Optimizers

During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:

Normal training: OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully

Paged optimizers are opt-in: select one via the training arguments (for example, optim="paged_adamw_8bit"); bitsandbytes then handles the GPU-to-CPU paging transparently.

BitsAndBytesConfig Deep Dive

All Parameters Explained

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,              # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",      # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True, # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: specific storage type (usually auto-detected)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)

Compute Dtype Selection

| Dtype | Hardware | Notes |
|---|---|---|
| torch.bfloat16 | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| torch.float16 | Older GPUs (V100, RTX 20xx) | Use if bf16 not supported |
| torch.float32 | Any | Slower, only for debugging |

Check bf16 support:

import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+

Comparison: Quantization Options

# Recommended: NF4 + double quant + bf16
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# fp16 compute for GPUs without bf16 support (pre-Ampere)
fp16_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # same footprint as bf16, narrower exponent range
)

# 8-bit alternative (less compression, sometimes more stable)
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

Memory Requirements

| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |

Notes:

  • QLoRA memory includes model + optimizer states + activations
  • Actual usage varies with batch size, sequence length, and gradient checkpointing
  • Add ~20% buffer for safe operation
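These figures can be sanity-checked with a rough estimator. The constants below are assumptions (4-bit base weights at 0.5 bytes/param, bf16 adapters and gradients, two 8-bit Adam states, a flat activation allowance), not measured values:

```python
def estimate_qlora_gb(n_base, n_trainable, activation_gb=2.0):
    """Back-of-envelope QLoRA training memory in GB (rough sketch only)."""
    base = n_base * 0.5          # 4-bit quantized base weights
    adapters = n_trainable * 2   # bf16 LoRA weights
    grads = n_trainable * 2      # bf16 gradients (adapters only)
    optim = n_trainable * 2      # two 8-bit Adam states per trainable param
    return (base + adapters + grads + optim) / 1e9 + activation_gb

# ~42M trainable params is a plausible figure for r=16 on a 7B model (assumption)
print(round(estimate_qlora_gb(7e9, 42e6), 1))  # lands in the ballpark of the ~6 GB row
```

Real usage varies with batch size, sequence length, and framework overhead, which is why the table above adds a buffer.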

GPU Recommendations

| GPU VRAM | Max Model Size (QLoRA) |
|---|---|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |

Complete Training Example

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 3. Prepare for k-bit training (critical step!)
model = prepare_model_for_kbit_training(model)

# 4. LoRA config (see lora skill for parameter details)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 5. Dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)

# 6. Training
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",   # Paged optimizer for memory efficiency
    dataset_text_field="text",  # Recent TRL expects this on SFTConfig, not SFTTrainer
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()

# 7. Save adapter
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")

Inference and Merging

Inference with Quantized Model

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"

# Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()

# Generate
inputs = tokenizer("### Instruction:\nExplain quantum computing.\n\n### Response:\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merging to Full Precision

To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model in full precision (on CPU to avoid OOM)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")

Note: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140GB RAM.
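That RAM figure is just parameter count times bytes per parameter; a one-line helper makes the arithmetic explicit:

```python
def merged_ram_gb(n_params, bytes_per_param=2):
    """RAM to hold a merged model in bf16/fp16 (2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

print(merged_ram_gb(70e9))  # 140.0 -> matches the ~140 GB note for 70B
print(merged_ram_gb(8e9))   # 16.0  -> an 8B model needs ~16 GB
```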

Troubleshooting

CUDA Version Issues

# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# bitsandbytes requires CUDA 11.7+
# If version mismatch, reinstall:
pip uninstall bitsandbytes
pip install bitsandbytes --upgrade

"cannot find libcudart" or Missing Library Errors

# Find CUDA installation
find /usr -name "libcudart*" 2>/dev/null

# Set environment variable
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or for conda:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Slow Training

Common cause: compute dtype mismatch

# Check if model is using expected dtype
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match

# Ensure bf16 is used in training args if BitsAndBytesConfig uses bf16
# Mismatch causes constant dtype conversions

Out of Memory

# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Reduce batch size, increase accumulation
per_device_train_batch_size = 1
gradient_accumulation_steps = 16

# 3. Use paged optimizer
optim = "paged_adamw_8bit"

# 4. Reduce sequence length
max_seq_length = 256

# 5. Target fewer modules
target_modules = ["q_proj", "v_proj"]  # Minimal set

Model Loads But Training Fails

# Ensure prepare_model_for_kbit_training is called
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # Don't skip this!

# Enable input gradients if needed
model.enable_input_require_grads()

Best Practices

  1. Always use prepare_model_for_kbit_training: This enables gradient computation through the frozen quantized layers

  2. Match compute dtype with training precision: If bnb_4bit_compute_dtype=torch.bfloat16, use bf16=True in training args

  3. Use paged optimizers for large models: optim="paged_adamw_8bit" or "paged_adamw_32bit" handles memory spikes

  4. Start with NF4 + double quantization: This is the recommended default; only change if debugging

  5. Gradient checkpointing is essential: Always enable for QLoRA training to fit larger batch sizes

  6. Test inference before long training runs: Load the model and generate a few tokens to catch configuration issues early

  7. Monitor GPU memory: Use nvidia-smi or torch.cuda.memory_summary() to track actual usage

  8. Consider 8-bit for unstable training: If 4-bit training shows instability, try load_in_8bit=True as a middle ground
