# HQQ - Half-Quadratic Quantization
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
## When to use HQQ
Use HQQ when:

- You need to quantize models without calibration data (no dataset required)
- You need fast quantization (minutes instead of hours for GPTQ/AWQ)
- You are deploying with vLLM or HuggingFace Transformers
- You are fine-tuning quantized models with LoRA/PEFT
- You are experimenting with extreme quantization (2-bit, 1-bit)
Key advantages:

- **No calibration**: Quantize any model instantly without sample data
- **Multiple backends**: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- **Flexible precision**: 8/4/3/2/1-bit with configurable group sizes
- **Framework integration**: Native HuggingFace and vLLM support
- **PEFT compatible**: Fine-tune quantized models with LoRA
Use an alternative instead:

- **AWQ**: you need calibration-based accuracy for production serving
- **GPTQ**: you want maximum accuracy and calibration data is available
- **bitsandbytes**: you want simple 8-bit/4-bit without custom backends
- **llama.cpp/GGUF**: you target CPU inference or Apple Silicon deployment
## Quick start

### Installation

```bash
pip install hqq

# With a specific backend
pip install hqq[torch]    # PyTorch backend
pip install hqq[torchao]  # TorchAO int4 backend
pip install hqq[bitblas]  # BitBlas backend
pip install hqq[marlin]   # Marlin backend
```
### Basic quantization

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,        # 4-bit quantization
    group_size=64,  # Group size for quantization
    axis=1,         # Quantize along the output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use normally (HQQLinear runs on CUDA in float16 by default)
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
output = hqq_linear(input_tensor)
```
### Quantize a full model with HuggingFace

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1,
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Model is quantized and ready to use
```
## Core concepts

### Quantization configuration

HQQ uses `BaseQuantizeConfig` to define quantization parameters:
```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,        # Bits per weight (1-8)
    group_size=64,  # Weights per quantization group
    axis=1,         # 0=input dim, 1=output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups for low-bit
    axis=1,
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```
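A per-layer dict like `layer_configs` can be passed to HQQ's model-level quantizer. A minimal sketch, assuming the `AutoHQQHFModel.quantize_model` pattern from the HQQ repository accepts such a mapping:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

# Load the base model in half precision, then quantize in place;
# each linear layer is matched against the keys in layer_configs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
AutoHQQHFModel.quantize_model(
    model,
    quant_config=layer_configs,  # per-layer dict defined above
    compute_dtype=torch.float16,
    device="cuda",
)
```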
### HQQLinear layer

The core quantized layer that replaces `nn.Linear`:
```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

config = BaseQuantizeConfig(nbits=4, group_size=64)

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights and quantization metadata
W_q = hqq_layer.W_q              # Packed quantized weights
scale = hqq_layer.meta["scale"]  # Scale factors
zero = hqq_layer.meta["zero"]    # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
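One practical use of `dequantize()` is estimating the reconstruction error a given config introduces. A minimal sketch (clone the float weights before quantizing, since `HQQLinear` may free the original layer's weights):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096)
W_orig = linear.weight.data.clone()  # keep a reference copy

hqq_layer = HQQLinear(linear, BaseQuantizeConfig(nbits=4, group_size=64))
W_dequant = hqq_layer.dequantize()

# Mean absolute reconstruction error of the quantized weights
err = (W_orig.to(W_dequant.device, W_dequant.dtype) - W_dequant).abs().mean()
print(f"mean |W - W_hat|: {err.item():.6f}")
```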
### Backends

HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Dequantize-then-matmul backends, set globally on HQQLinear:
#   HQQBackend.PYTORCH          - pure PyTorch (default)
#   HQQBackend.PYTORCH_COMPILE  - torch.compile optimized
#   HQQBackend.ATEN             - custom CUDA kernels
HQQLinear.set_backend(HQQBackend.ATEN)

# Fused low-bit kernels ("torchao_int4", "gemlite", "bitblas", "marlin")
# are patched into an already-quantized model:
from hqq.utils.patching import prepare_for_inference

prepare_for_inference(model, backend="torchao_int4")  # model: a quantized HF model
```
Backend selection guide:

| Backend           | Best for            | Requirements        |
|-------------------|---------------------|---------------------|
| `pytorch`         | Compatibility       | Any GPU             |
| `pytorch_compile` | Moderate speedup    | `torch>=2.0`        |
| `aten`            | Good balance        | CUDA GPU            |
| `torchao_int4`    | 4-bit inference     | `torchao` installed |
| `marlin`          | Maximum 4-bit speed | Ampere+ GPU         |
| `bitblas`         | Flexible bit-widths | `bitblas` installed |
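Since kernel availability depends on what is installed, a small fallback helper can keep scripts portable across machines. This is a hypothetical convenience function, not part of the HQQ API; the module names and preference order are assumptions:

```python
import importlib.util

def pick_kernel_backend() -> str:
    """Return the first available fused-kernel backend, in rough speed order."""
    preferences = [
        ("marlin", "marlin"),        # fastest 4-bit path on Ampere+ GPUs
        ("bitblas", "bitblas"),      # flexible bit-widths
        ("torchao_int4", "torchao"), # TorchAO int4 matmul
    ]
    for backend, module in preferences:
        if importlib.util.find_spec(module) is not None:
            return backend
    return "pytorch"  # always-available fallback

print(pick_kernel_backend())
```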
## HuggingFace integration

### Load pre-quantized models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an HQQ-quantized model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
```
### Quantize and save

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto",
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```
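After quantizing, it is worth confirming the memory savings before publishing; Transformers models expose `get_memory_footprint()` for a quick check:

```python
# Rough memory check after quantization (bytes -> GB)
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Quantized model footprint: {footprint_gb:.2f} GB")
```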
### Mixed precision quantization

```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type:
# attention layers at higher precision, MLP layers at
# lower precision for memory savings
q4 = {"nbits": 4, "group_size": 64}
q2 = {"nbits": 2, "group_size": 32}

config = HqqConfig(
    dynamic_config={
        "self_attn.q_proj": q4,
        "self_attn.k_proj": q4,
        "self_attn.v_proj": q4,
        "self_attn.o_proj": q4,
        "mlp.gate_proj": q2,
        "mlp.up_proj": q2,
        "mlp.down_proj": q2,
    }
)
```
## vLLM integration

### Serve HQQ models with vLLM

```python
from vllm import LLM, SamplingParams

# Load an HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16",
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
```
### vLLM with custom HQQ config

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64,
    },
)
```
## PEFT/LoRA fine-tuning

### Fine-tune quantized models

```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Train normally with Trainer or a custom loop
```
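To confirm that only the adapter weights are trainable while the quantized base stays frozen, PEFT models expose `print_trainable_parameters()`:

```python
# Show trainable (LoRA) vs. frozen (quantized base) parameter counts
model.print_trainable_parameters()
# -> trainable params: ... || all params: ... || trainable%: ...
```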
### QLoRA-style training

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()
```
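After training, only the small LoRA adapter needs to be saved; the quantized base is re-created at inference time and the adapter attached on top. A sketch using standard PEFT calls (paths are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save just the LoRA adapter (a few MB), not the full model
model.save_pretrained("./hqq-lora-adapter")

# Later: reload the quantized base and attach the adapter
# (assumes the same HqqConfig used during training)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./hqq-lora-adapter")
```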
## Quantization workflows

### Workflow 1: Quick model compression

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```
### Workflow 2: Optimize for inference speed

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
from hqq.utils.patching import prepare_for_inference

# 1. Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Patch in a fast backend
prepare_for_inference(model, backend="marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
model = torch.compile(model)

# 4. Benchmark (warm up first so compile time is not measured)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=100)  # warmup
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```
## Best practices

- **Start with 4-bit**: best quality/size tradeoff for most models
- **Use group_size=64**: a good balance; use smaller groups for extreme quantization
- **Choose the backend wisely**: Marlin for 4-bit on Ampere+, TorchAO for flexibility
- **Verify quality**: always test generation quality after quantization
- **Mixed precision**: keep attention at higher precision, compress MLP layers more
- **PEFT training**: use LoRA r=16-32 for good fine-tuning results
## Common issues

**Out of memory during quantization:**

```python
# Load on CPU first, then quantize layer-by-layer onto the GPU
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

config = BaseQuantizeConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
AutoHQQHFModel.quantize_model(
    model, quant_config=config, compute_dtype=torch.float16, device="cuda"
)
```
**Slow inference:**

```python
# Switch to an optimized backend (Marlin requires an Ampere+ GPU)
from hqq.utils.patching import prepare_for_inference

prepare_for_inference(model, backend="marlin")

# Or compile
import torch

model = torch.compile(model, mode="reduce-overhead")
```
**Poor quality at 2-bit:**

```python
# Use a smaller group size
config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bit-widths
    axis=1,
)
```
## References

- Advanced Usage - custom backends, mixed precision, optimization
- Troubleshooting - common issues, debugging, benchmarks
## Resources

- Repository: https://github.com/mobiusml/hqq
- Paper: Half-Quadratic Quantization
- HuggingFace models: https://huggingface.co/mobiuslabsgmbh
- Version: 0.2.0+
- License: Apache 2.0