onnx-webgpu-converter

Convert HuggingFace transformer models to ONNX format for browser inference with Transformers.js and WebGPU. Use when given a HuggingFace model link to convert to ONNX, when setting up optimum-cli for ONNX export, when quantizing models (fp16, q8, q4) for web deployment, when configuring Transformers.js with WebGPU acceleration, or when troubleshooting ONNX conversion errors. Triggers on mentions of ONNX conversion, Transformers.js, WebGPU inference, optimum export, model quantization for browser, or running ML models in the browser.


ONNX WebGPU Model Converter

Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.

Workflow Overview

  1. Check if ONNX version already exists on HuggingFace
  2. Set up Python environment with optimum
  3. Export model to ONNX with optimum-cli
  4. Quantize for target deployment (WebGPU vs WASM)
  5. Upload to HuggingFace Hub (optional)
  6. Use in Transformers.js with WebGPU

Step 1: Check for Existing ONNX Models

Before converting, check whether an ONNX version of the model already exists on the Hub: search the onnx-community and Xenova organizations, or filter Hub models by the transformers.js library tag.

If found, skip to Step 6.
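Repo naming on the Hub is not uniform, but a small helper can generate likely candidate IDs to look up. The suffix conventions below are heuristics based on common onnx-community/Xenova naming, not guaranteed repo names:

```javascript
// Sketch: candidate Hub repo IDs to check for an existing ONNX export.
// These are naming heuristics, not guaranteed to exist on the Hub.
function onnxCandidates(modelId) {
  const [org, name] = modelId.split("/");
  return [
    `${org}/${name}-onnx`,             // same org, -onnx suffix
    `onnx-community/${name}-ONNX`,     // onnx-community convention
    `Xenova/${name}`,                  // Xenova mirrors keep the name
  ];
}

console.log(onnxCandidates("Qwen/Qwen2.5-0.5B-Instruct"));
```

Check each candidate on the Hub (or via the Hub API) before falling back to a manual export.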

Step 2: Environment Setup

# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu

Verify installation:

optimum-cli export onnx --help

Step 3: Export to ONNX

Basic Export (auto-detect task)

optimum-cli export onnx --model <model_id_or_path> ./output_dir/

With Explicit Task

optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/

Common tasks: text-generation, text-classification, feature-extraction, image-classification, automatic-speech-recognition, object-detection, image-segmentation, question-answering, token-classification, zero-shot-classification

For decoder models, append -with-past for KV cache reuse (default behavior): text-generation-with-past, text2text-generation-with-past, automatic-speech-recognition-with-past
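When scripting exports (e.g., batch-converting several models), the invocation above can be assembled programmatically. A minimal sketch; the model ID, output directory, and default task are placeholders:

```javascript
// Sketch: build the optimum-cli argument list for a decoder export
// with KV cache reuse ("-with-past"). Pass the array to a process
// spawner such as Node's child_process.execFile.
function exportArgs(modelId, outDir, task = "text-generation-with-past") {
  return [
    "optimum-cli", "export", "onnx",
    "--model", modelId,
    "--task", task,
    outDir,
  ];
}
```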

Full CLI Reference

| Flag | Description |
| --- | --- |
| -m MODEL, --model MODEL | HuggingFace model ID or local path (required) |
| --task TASK | Export task (auto-detected if on Hub) |
| --opset OPSET | ONNX opset version (default: auto) |
| --device DEVICE | Export device: cpu (default) or cuda |
| --optimize {O1,O2,O3,O4} | ONNX Runtime optimization level |
| --monolith | Force single ONNX file (vs. split encoder/decoder) |
| --no-post-process | Skip post-processing (e.g., decoder merging) |
| --trust-remote-code | Allow custom model code from the Hub |
| --pad_token_id ID | Override pad token (needed for some models) |
| --cache_dir DIR | Cache directory for downloaded models |
| --batch_size N | Batch size for dummy inputs |
| --sequence_length N | Sequence length for dummy inputs |
| --framework {pt} | Source framework |
| --atol ATOL | Absolute tolerance for validation |

Optimization Levels

| Level | Description |
| --- | --- |
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires --device cuda) |

Step 4: Quantize for Web Deployment

Quantization Types for Transformers.js

| dtype | Precision | Best For | Size Reduction |
| --- | --- | --- | --- |
| fp32 | Full 32-bit | Maximum accuracy | None (baseline) |
| fp16 | Half 16-bit | WebGPU default, high quality | ~50% |
| q8 / int8 | 8-bit | WASM default, good balance | ~75% |
| q4 / bnb4 | 4-bit | Maximum compression | ~87% |
| q4f16 | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
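In application code, the table above often collapses into a device-to-dtype default. A heuristic sketch (tune per model; these are the defaults named in the table, not a rule):

```javascript
// Sketch: pick a Transformers.js dtype from the execution backend.
// fp16 is the WebGPU default, q8 the WASM default; anything else
// falls back to full precision.
function defaultDtype(device) {
  if (device === "webgpu") return "fp16"; // or "q4f16" for smaller downloads
  if (device === "wasm") return "q8";     // good size/accuracy balance on CPU
  return "fp32";
}
```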

Using optimum-cli quantization

# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/

Using Python API for finer control

from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model, then attach a quantizer to it
model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
# Dynamic int8 config (no calibration data) targeting AVX512-VNNI CPUs
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)

Producing Multiple dtype Variants for Transformers.js

To provide fp32, fp16, q8, and q4 variants (like onnx-community models), organize output as:

model_onnx/
├── onnx/
│   ├── model.onnx              # fp32
│   ├── model_fp16.onnx         # fp16
│   ├── model_quantized.onnx    # q8
│   └── model_q4.onnx           # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
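Transformers.js resolves a requested dtype to one of the files above by filename suffix. A minimal sketch of that mapping, assuming the layout shown (note the q8 variant is historically named model_quantized.onnx rather than model_q8.onnx):

```javascript
// Sketch: dtype-to-filename mapping matching the onnx/ layout above.
const DTYPE_SUFFIX = {
  fp32: "",           // model.onnx
  fp16: "_fp16",      // model_fp16.onnx
  q8: "_quantized",   // legacy name for the 8-bit variant
  q4: "_q4",          // model_q4.onnx
};

function onnxFileName(dtype) {
  return `model${DTYPE_SUFFIX[dtype]}.onnx`;
}
```

Keeping these exact names lets a single repo serve every dtype option without extra configuration.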

Step 5: Upload to HuggingFace Hub (Optional)

# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/

# Add transformers.js tag to model card for discoverability

Step 6: Use in Transformers.js with WebGPU

Install

npm install @huggingface/transformers

Basic Pipeline with WebGPU

import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu",    // GPU acceleration
  dtype: "q4",         // Quantization level
});

const result = await pipe("input text");
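Before requesting device: "webgpu", it is worth feature-detecting navigator.gpu and falling back to WASM. A minimal sketch (navigator.gpu is the standard WebGPU entry point; the helper takes the navigator object as a parameter so it can be tested outside a browser):

```javascript
// Sketch: choose "webgpu" only when the browser exposes the WebGPU API,
// otherwise fall back to the WASM backend.
function pickDevice(nav) {
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}

// In the browser: pipeline("task-name", "model-id", { device: pickDevice(navigator) })
```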

Per-Module dtypes (encoder-decoder models)

Some models (Whisper, Florence-2) need different quantization per component:

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);

For detailed Transformers.js WebGPU usage patterns: See references/webgpu-usage.md

Troubleshooting

For conversion errors and common issues: See references/conversion-guide.md

Quick Fixes

  • "Task not found": Use --task flag explicitly. For decoder models try text-generation-with-past
  • "trust_remote_code": Add --trust-remote-code flag for custom model architectures
  • Out of memory: Use --device cpu and smaller --batch_size
  • Validation fails: Try --no-post-process or increase --atol
  • Model not supported: Check supported architectures — 120+ architectures supported
  • WebGPU fallback to WASM: Ensure browser supports WebGPU (Chrome 113+, Edge 113+)

Supported Task → Pipeline Mapping

| Task | Transformers.js Pipeline | Example Model |
| --- | --- | --- |
| text-classification | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2 |
| text-generation | text-generation | Qwen2.5-0.5B-Instruct |
| feature-extraction | feature-extraction | mxbai-embed-xsmall-v1 |
| automatic-speech-recognition | automatic-speech-recognition | whisper-tiny.en |
| image-classification | image-classification | mobilenetv4_conv_small |
| object-detection | object-detection | detr-resnet-50 |
| image-segmentation | image-segmentation | segformer-b0 |
| zero-shot-image-classification | zero-shot-image-classification | clip-vit-base-patch32 |
| depth-estimation | depth-estimation | depth-anything-small |
| translation | translation | nllb-200-distilled-600M |
| summarization | summarization | bart-large-cnn |
