NOWAIT Reasoning Optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
Overview
NOWAIT is a training-free, inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation. It reduces chain-of-thought (CoT) trajectory length by up to 27-51%, depending on the model series, without compromising model utility.
When to Use
- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining
Supported Models
| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
Quick Start
- Basic Implementation
    from scripts.nowait_processor import NOWAITLogitProcessor

    # Initialize the processor for your model's tokenizer
    processor = NOWAITLogitProcessor(tokenizer)

    # Use during generation
    outputs = model.generate(
        inputs,
        logits_processor=[processor],
        max_new_tokens=32768,
    )
- Keywords Suppressed
See references/keywords.md for the complete list. Core keywords:
wait, alternatively, hmm, but, however, check, double-check, maybe, verify, again, oh, ah
How It Works
- Initialize Keywords: Identify reflection keywords from empirical analysis
- Expand to Token Variants: Map keywords to all token variants in the vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
- Suppress During Inference: Set the logits of reflection tokens to large negative values during decoding
| Token | Logit (Before) | Logit (After) |
|---|---|---|
| Wait | 0.8 | -inf |
| First | 0.6 | 0.6 |
| Hmm | 0.5 | -inf |
| Let | 0.4 | 0.4 |
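The three steps above can be sketched in plain Python over a toy vocabulary. This is only an illustration, not the contents of scripts/nowait_processor.py: the helper names `normalize`, `expand_to_token_ids`, and `suppress`, the toy token ids, and the matching rule (strip surrounding whitespace and punctuation, then compare case-insensitively) are all assumptions made for the example.

```python
import math
import string

KEYWORDS = ["wait", "hmm", "alternatively"]  # subset of the core keywords

def normalize(token: str) -> str:
    """Strip surrounding whitespace/punctuation and lowercase (assumed matching rule)."""
    return token.strip(string.whitespace + string.punctuation).lower()

def expand_to_token_ids(vocab: dict[str, int], keywords: list[str]) -> set[int]:
    """Step 2: collect ids of every vocabulary entry that is a variant of a keyword."""
    targets = set(keywords)
    return {tok_id for tok, tok_id in vocab.items() if normalize(tok) in targets}

def suppress(logits: list[float], banned_ids: set[int]) -> list[float]:
    """Step 3: return a copy of the logits with banned ids set to -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

# Toy vocabulary with hypothetical ids; a real tokenizer has far more entries.
vocab = {"Wait": 0, "First": 1, "Hmm": 2, "Let": 3,
         " wait": 4, ".wait": 5, "WAIT": 6, "waiter": 7}

banned = expand_to_token_ids(vocab, KEYWORDS)
logits = [0.8, 0.6, 0.5, 0.4, 0.1, 0.1, 0.1, 0.1]
new_logits = suppress(logits, banned)
```

Matching on the whole normalized token, rather than on substrings, keeps unrelated tokens such as "waiter" available to the model. A real processor would apply the same masking to a tensor of logits at every decoding step.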
Key Findings
Why It Works
- NOWAIT doesn't eliminate self-reflection entirely; it guides models to skip unnecessary "waiting" reasoning
- Models still perform essential verification at key decision points
- The result is more linear, straightforward reasoning paths
RL vs Distilled Models
| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
Integration Examples
HuggingFace Transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from scripts.nowait_processor import NOWAITLogitProcessor

    model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

    # Suppress reflection tokens at every decoding step
    processor = NOWAITLogitProcessor(tokenizer)

    prompt = "Your question here"  # placeholder prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response = model.generate(
        input_ids,
        logits_processor=[processor],
        max_new_tokens=32768,
        do_sample=True,
        temperature=0.7,
    )
vLLM
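    from vllm import LLM, SamplingParams
    from scripts.nowait_processor import get_nowait_bad_words_ids

    llm = LLM(model="Qwen/QwQ-32B")
    bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

    # Check that your vLLM version's SamplingParams accepts bad_words_ids;
    # some releases expose bad_words (strings) and logit_bias instead.
    sampling_params = SamplingParams(
        max_tokens=32768,
        bad_words_ids=bad_words_ids,
    )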
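Where the installed vLLM release rejects a `bad_words_ids` argument, one possible fallback is the `logit_bias` option of `SamplingParams`, which maps token ids to an additive bias. The sketch below assumes that `get_nowait_bad_words_ids` returns Hugging Face-style nested lists (`[[id], [id, id], ...]`), which the source does not confirm. `to_logit_bias` is a hypothetical helper and the token ids are made up.

```python
def to_logit_bias(bad_words_ids: list[list[int]], bias: float = -100.0) -> dict[int, float]:
    """Map single-token banned sequences to a token-id -> bias dict.

    Multi-token sequences are skipped: a per-token bias cannot ban a
    sequence without also banning each of its tokens everywhere.
    """
    return {seq[0]: bias for seq in bad_words_ids if len(seq) == 1}

# Made-up ids standing in for the helper's output (assumed shape).
example_ids = [[14190], [80022], [11, 3783]]
logit_bias = to_logit_bias(example_ids)

# Assumed usage: SamplingParams(max_tokens=32768, logit_bias=logit_bias)
```

A bias of -100 is a strong penalty rather than a hard `-inf` mask, so this is an approximation of the suppression step, and any multi-token keyword variants would need separate handling.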
Expected Results
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
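The Reduction column is the relative drop in generated tokens. A minimal sketch of that arithmetic follows; `token_reduction` is an illustrative helper, not part of the project's scripts. The token counts in the table appear to be rounded, so the recomputed Video QA figure (about 26.5%) differs slightly from the listed 27%.

```python
def token_reduction(original_tokens: int, nowait_tokens: int) -> float:
    """Fraction of tokens saved relative to the original trajectory length."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return (original_tokens - nowait_tokens) / original_tokens

# (original, NOWAIT) token counts taken from the table above.
rows = {
    "Math (AIME)": (15_000, 10_500),
    "Visual QA (MMMU)": (2_900, 1_450),
    "Video QA (MMVU)": (1_700, 1_250),
}

for task, (before, after) in rows.items():
    print(f"{task}: {token_reduction(before, after):.1%}")
```

The same function can be run on your own before/after token counts to check whether a given model and task land in the expected range.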
Limitations
- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning
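One way to approach model-specific keyword tuning is to count how often each candidate keyword appears in a sample of unsuppressed CoT outputs, then adjust the list accordingly. The sketch below is an assumption about how such an analysis could look, not a procedure described in the source. `count_reflection_keywords` is a hypothetical helper and the sample text is invented.

```python
import re
from collections import Counter

CORE_KEYWORDS = ["wait", "alternatively", "hmm", "but", "however", "check",
                 "double-check", "maybe", "verify", "again", "oh", "ah"]

def count_reflection_keywords(text: str, keywords: list[str] = CORE_KEYWORDS) -> Counter:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts = Counter()
    for kw in keywords:
        # Lookarounds reject matches glued to letters, digits, or hyphens,
        # so "check" is not counted inside "double-check".
        pattern = r"(?<![\w-])" + re.escape(kw) + r"(?![\w-])"
        counts[kw] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

sample = "Wait, maybe I should double-check. Hmm, but the check passed. Wait again."
counts = count_reflection_keywords(sample)
```

Keywords that never appear for a given model add nothing to the suppression list, while frequent domain terms that happen to match a keyword (for example "check" in code-review prompts) may be candidates for removal.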
References
- Paper: arXiv:2506.08343v2
- Complete keyword list: references/keywords.md
- Implementation: scripts/nowait_processor.py