
HQQ - Half-Quadratic Quantization

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install the "hqq-quantization" skill with this command:

npx skills add zechenzhangagi/ai-research-skills/zechenzhangagi-ai-research-skills-hqq-quantization


Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

When to use HQQ

Use HQQ when:

  • Quantizing models without calibration data (no dataset needed)

  • Quantizing quickly (minutes instead of hours for GPTQ/AWQ)

  • Deploying with vLLM or HuggingFace Transformers

  • Fine-tuning quantized models with LoRA/PEFT

  • Experimenting with extreme quantization (2-bit, 1-bit)

Key advantages:

  • No calibration: Quantize any model instantly without sample data

  • Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference

  • Flexible precision: 8/4/3/2/1-bit with configurable group sizes (see the memory sketch after this list)

  • Framework integration: Native HuggingFace and vLLM support

  • PEFT compatible: Fine-tune quantized models with LoRA
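
To make the flexible-precision point concrete, here is a back-of-the-envelope weight-memory estimate for an ~8B-parameter model. This is a sketch under assumptions: one fp16 scale and one fp16 zero point per quantization group (HQQ can also quantize this metadata, so real checkpoints may be smaller), and the estimate_gb helper is hypothetical, not part of the HQQ API.

# Rough weight-memory estimate, assuming fp16 scale + zero point per group
# (HQQ can quantize this metadata too, so treat these as upper bounds)
def estimate_gb(n_params: float, nbits: int, group_size: int) -> float:
    weight_bits = n_params * nbits
    meta_bits = (n_params / group_size) * 2 * 16  # scale + zero, fp16 each
    return (weight_bits + meta_bits) / 8 / 1e9

n = 8e9  # roughly Llama-3.1-8B
print(f"fp16 baseline:        {n * 2 / 1e9:.1f} GB")            # 16.0 GB
print(f"4-bit, group_size=64: {estimate_gb(n, 4, 64):.1f} GB")  # ~4.5 GB
print(f"2-bit, group_size=16: {estimate_gb(n, 2, 16):.1f} GB")  # ~4.0 GB

Note the metadata cost of small groups: 2-bit with group_size=16 is barely smaller than 4-bit with group_size=64, so extreme low-bit configs mostly trade away quality headroom rather than winning on memory.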

Use an alternative instead:

  • AWQ: calibration-based accuracy for production serving

  • GPTQ: maximum accuracy when calibration data is available

  • bitsandbytes: simple 8-bit/4-bit loading without custom backends

  • llama.cpp/GGUF: CPU inference or Apple Silicon deployment

Quick start

Installation

pip install hqq

With specific backend

pip install hqq[torch]    # PyTorch backend
pip install hqq[torchao]  # TorchAO int4 backend
pip install hqq[bitblas]  # BitBlas backend
pip install hqq[marlin]   # Marlin backend

Basic quantization

from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch
import torch.nn as nn

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,        # 4-bit quantization
    group_size=64,  # Group size for quantization
    axis=1          # Quantize along the output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use normally
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")  # match the layer's compute dtype/device
output = hqq_linear(input_tensor)

Quantize a full model with HuggingFace

from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# The model is quantized and ready to use

Core concepts

Quantization configuration

HQQ uses BaseQuantizeConfig to define quantization parameters:

from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,        # Bits per weight (8/4/3/2/1)
    group_size=64,  # Weights per quantization group
    axis=1          # 0 = input dim, 1 = output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups for low-bit
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}

HQQLinear layer

The core quantized layer that replaces nn.Linear:

from hqq.core.quantize import HQQLinear
import torch

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights
W_q = hqq_layer.W_q      # Quantized weights
scale = hqq_layer.scale  # Scale factors
zero = hqq_layer.zero    # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()

Backends

HQQ supports multiple inference backends for different hardware:

from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",             # Custom CUDA kernels
    "torchao_int4",     # TorchAO int4 matmul
    "gemlite",          # GemLite CUDA kernels
    "bitblas",          # BitBlas optimized
    "marlin",           # Marlin 4-bit kernels
]

# Set backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")

Backend selection guide:

Backend            Best for              Requirements
pytorch            Compatibility         Any GPU
pytorch_compile    Moderate speedup      torch>=2.0
aten               Good balance          CUDA GPU
torchao_int4       4-bit inference       torchao installed
marlin             Maximum 4-bit speed   Ampere+ GPU
bitblas            Flexible bit-widths   bitblas installed
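
The guide above can be folded into a small helper. A minimal sketch, assuming the string-based set_backend usage shown earlier and the backend names from the list above; the compute-capability thresholds (8.0 for Ampere) are a heuristic, not an official HQQ policy, and pick_backend is a hypothetical helper.

import torch
from hqq.core.quantize import HQQLinear

def pick_backend(nbits: int) -> str:
    # Hypothetical helper: map hardware + bit-width to a backend string
    if not torch.cuda.is_available():
        return "pytorch"               # safe default off-GPU
    major, _ = torch.cuda.get_device_capability()
    if nbits == 4 and major >= 8:      # Ampere or newer: Marlin 4-bit kernels
        return "marlin"
    if major >= 7:                     # other CUDA GPUs
        return "aten"
    return "pytorch_compile"           # fallback with torch.compile speedup

HQQLinear.set_backend(pick_backend(nbits=4))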

HuggingFace integration

Load pre-quantized models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load HQQ-quantized model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)

Quantize and save

from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")

Mixed precision quantization

from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type:
# attention layers keep higher precision, MLP layers are compressed harder
config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},
        "mlp": {"nbits": 2, "group_size": 32}
    }
)

vLLM integration

Serve HQQ models with vLLM

from vllm import LLM, SamplingParams

# Load HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)

vLLM with custom HQQ config

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64
    }
)

PEFT/LoRA fine-tuning

Fine-tune quantized models

from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train normally with Trainer or a custom loop

QLoRA-style training

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

trainer.train()

Quantization workflows

Workflow 1: Quick model compression

from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")

Workflow 2: Optimize for inference speed

from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Quantize with optimal backend
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # needed for the benchmark below

# 2. Set fast backend
HQQLinear.set_backend("marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
import torch
model = torch.compile(model)

# 4. Benchmark
import time
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")

Best practices

  • Start with 4-bit: Best quality/size tradeoff for most models

  • Use group_size=64: A good default; use smaller groups for extreme quantization

  • Choose backend wisely: Marlin for 4-bit Ampere+, TorchAO for flexibility

  • Verify quality: Always test generation quality after quantization (see the sketch after this list)

  • Mixed precision: Keep attention at higher precision, compress MLP more

  • PEFT training: Use LoRA r=16-32 for good fine-tuning results
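
As a concrete version of the verify-quality advice above, here is a minimal sanity check. It assumes model and tokenizer are the quantized model and tokenizer from the quick start; the prompts and reference text are illustrative placeholders, and perplexity is a hypothetical helper, not an HQQ API.

import torch

def perplexity(model, tokenizer, text: str) -> float:
    # Teacher-forced perplexity on a short reference text
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Spot-check greedy generations for obvious degradation
for prompt in ["The capital of France is", "def fibonacci(n):"]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

print(perplexity(model, tokenizer, "Machine learning studies algorithms that improve with data."))

Run the same check on the unquantized model; a large perplexity jump usually means the config is too aggressive for some layer type, and mixed precision (above) is the usual fix.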

Common issues

Out of memory during quantization:

# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel

model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"  # Load layers sequentially
)

Slow inference:

# Switch to an optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin")  # Requires Ampere+ GPU

# Or compile
model = torch.compile(model, mode="reduce-overhead")

Poor quality at 2-bit:

# Use a smaller group size
config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bits
    axis=1
)

References

  • Advanced Usage - Custom backends, mixed precision, optimization

  • Troubleshooting - Common issues, debugging, benchmarks
