Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
In three lines: HQMQ, a calibration-free KV cache compression method for LLMs, quantizes each 4-element chunk as a Hurwitz quaternion. Tested on Mistral-7B, Llama-3-8B, Qwen2.5-7B, Qwen3-8B, and gpt-oss-20b, it matches fp16 quality at ~5 bits, achieves up to 5.05× compression (Llama-3-70B: 43 GB → 8.5 GB), and outperforms naive int4 by 3–1900×.
## HQMQ: Calibration-Free KV Cache Compression via Hurwitz Quaternions
### What's actually happening
The KV cache has become one of the most expensive bottlenecks in long-context LLM inference. For Llama-3-70B at 128k context, this cache consumes 43 GB in fp16, which can exceed the footprint of the model weights themselves when those are aggressively quantized. HQMQ (Hurwitz Quaternion Multiplicative Quantization) achieves up to 5.05× compression of this cache, reducing those 43 GB to 8.5 GB, with no calibration data required.
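The 43 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-70B attention shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), a single sequence of 128 × 1024 tokens, and decimal gigabytes. None of these parameters are stated in the summary, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Size of the KV cache for one sequence; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Assumed Llama-3-70B shape (not given in the text above)
fp16_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                            n_tokens=128 * 1024)
fp16_gb = fp16_bytes / 1e9          # decimal GB
compressed_gb = fp16_gb / 5.05      # compression ratio quoted above

print(f"fp16: {fp16_gb:.1f} GB, compressed: {compressed_gb:.1f} GB")
```

Under these assumptions the fp16 cache comes to about 42.9 GB, and dividing by 5.05 gives about 8.5 GB, consistent with the numbers quoted.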
### The mechanics: why Hurwitz quaternions
The core idea is geometric. Each 4-element chunk of a K or V vector is treated as a unit quaternion on S³. Quantization decomposes this quaternion into a product q_p · q_s, where q_p belongs to the Hurwitz group 2T (24 vertices of the 24-cell, minimum pairwise angle 60°) and q_s is drawn from a per-(layer, head) secondary codebook of S random unit quaternions.
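The geometry of 2T is easy to check numerically. The following sketch (NumPy, with hypothetical names, not taken from the HQMQ implementation) enumerates the 24 Hurwitz units in (w, x, y, z) order and measures the smallest angle between any two distinct vertices.

```python
import itertools
import numpy as np

def hurwitz_units():
    """24 unit Hurwitz quaternions (the group 2T), one per row as (w, x, y, z)."""
    # 8 axis units: +-1, +-i, +-j, +-k
    axes = np.concatenate([np.eye(4), -np.eye(4)])
    # 16 half-integer units: (+-1 +-i +-j +-k) / 2
    halves = np.array(list(itertools.product((0.5, -0.5), repeat=4)))
    return np.concatenate([axes, halves])

G = hurwitz_units()
cos = G @ G.T                      # pairwise cosines between vertices
np.fill_diagonal(cos, -np.inf)     # exclude each vertex paired with itself
min_angle_deg = np.degrees(np.arccos(cos.max()))

print(G.shape[0], "vertices, minimum pairwise angle:", round(min_angle_deg, 6))
```

The largest off-diagonal cosine is 0.5, which corresponds to the 60° minimum angle mentioned above.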
The trick: left-multiplication by a unit quaternion is an isometry of S³. The 24 elements of 2T therefore rigidly rotate the S secondary codewords, producing 24S effective codewords for only S stored quaternions. In practice, S=8 yields 192 effective codewords at ~3.79 bits; S=32 yields 768 codewords at ~5 bits.
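A minimal sketch of the product codebook, under stated assumptions, may make this concrete. The summary does not say how the chunk norm is stored or how the nearest codeword is searched, so the code below keeps the norm in full precision, assumes a nonzero chunk, and uses a brute-force cosine search. All function names are hypothetical.

```python
import itertools
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def hurwitz_units():
    """The 24 elements of 2T as rows (w, x, y, z)."""
    axes = np.concatenate([np.eye(4), -np.eye(4)])
    halves = np.array(list(itertools.product((0.5, -0.5), repeat=4)))
    return np.concatenate([axes, halves])

def random_unit_quaternions(n, rng):
    """Secondary codebook: n random points on S^3 (normalized Gaussian samples)."""
    q = rng.standard_normal((n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def effective_codebook(G, sec):
    """All products g * q_s, shape (24, S, 4); only `sec` needs to be stored."""
    return qmul(G[:, None, :], sec[None, :, :])

def quantize_chunk(x, eff):
    """Return (norm, p_idx, s_idx) for one nonzero 4-element chunk (brute-force search)."""
    norm = np.linalg.norm(x)
    scores = eff @ (x / norm)      # cosine similarity to every effective codeword
    p_idx, s_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return norm, int(p_idx), int(s_idx)

rng = np.random.default_rng(0)
G = hurwitz_units()
sec = random_unit_quaternions(8, rng)          # S = 8
eff = effective_codebook(G, sec)
n_eff = eff.shape[0] * eff.shape[1]            # 24 * S effective codewords

x = rng.standard_normal(4)
norm, p, s = quantize_chunk(x, eff)
x_hat = norm * eff[p, s]                       # dequantized chunk
cosine = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

Because every product `g * q_s` is again a unit quaternion, the 24 × 8 = 192 effective codewords all lie on S³ while only the 8 secondary quaternions are stored per (layer, head).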
Direct consequence: random initialization of the secondary codebook suffices. Perplexity variance across different seeds stays below 1.5%, eliminating the need for calibration — unlike KIVI, QuIP#, or other methods requiring a representative calibration corpus.
### The outlier problem: Med3×
Modern architectures (Qwen2.5, Qwen3) exhibit severe outliers in KV activations. Without treatment, naive int4 collapses to perplexities above 10⁴ on these models, which makes its output unusable. HQMQ integrates a per-batch median-multiplier outlier extraction step (Med3×, with multiplier C=3 and no calibration) that brings perplexity back to within ±0.02–0.10 points of fp16 at ~5 bits on Qwen2.5-7B and Qwen3-8B.
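The summary does not spell out how the median-multiplier step works, so the following is only one plausible reading, offered as an illustration. It assumes the threshold is C times the median absolute value of the batch, and that entries above it are pulled out and stored exactly while the remainder goes to the quantizer. The function names and the sparse storage layout are hypothetical.

```python
import numpy as np

def extract_outliers(x, c=3.0):
    """Split x into a dense remainder and sparse outliers (assumed rule: |x| > c * median|x|)."""
    threshold = c * np.median(np.abs(x))
    mask = np.abs(x) > threshold
    outlier_idx = np.flatnonzero(mask)     # flat (C-order) positions of outliers
    outlier_val = x.flat[outlier_idx]      # exact values, kept outside the quantizer
    dense = np.where(mask, 0.0, x)         # remainder that would be quantized
    return dense, outlier_idx, outlier_val

def restore_outliers(dense_hat, outlier_idx, outlier_val):
    """Write the exact outlier values back into a dequantized tensor."""
    out = dense_hat.copy()
    out.flat[outlier_idx] = outlier_val
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 128))
x[3, 7] = 250.0                            # synthetic outlier

dense, idx, val = extract_outliers(x, c=3.0)
restored = restore_outliers(dense, idx, val)   # no quantizer here, so this round-trips exactly
```

The threshold is computed from the batch itself, which is what lets a step of this kind run without a calibration pass.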
### The numbers that matter
**vs. naive int4**: HQMQ Pareto-dominates naive int4 by 3× to 1900× depending on model and bit budget. The 1900× factor corresponds to Qwen cases where int4 completely collapses.
**vs. KIVI-4** (strongest calibrated baseline): At 3.79 bits (16% fewer bits than KIVI-4 at ~4.5 bits), HQMQ stays within 1 pt on CoQA, 0.6 pt on TruthfulQA, and 2.3 pts on GSM8K, all without a calibration pass. On Mistral-7B, zero-shot accuracy at 3.79 bits matches fp16.
**Absolute quality**: On Mistral-7B and Qwen3-8B, HQMQ reaches fp16 within ±0.02–0.03 perplexity points at ~5 bits. That's a near-zero margin for roughly 3× cache compression (16 bits down to ~5).
### Who loses in this scenario
**Vendors of KV calibration solutions**: KIVI, SmoothQuant-KV, and similar methods justified their operational complexity (calibration data collection, additional passes) through superior quality. HQMQ reduces that advantage to a few points on specific benchmarks while eliminating the calibration constraint entirely.
**Pure int4/int8 approaches**: On outlier-heavy architectures (Qwen2.5, Qwen3, and likely future models), naive int4 becomes unusable. HQMQ offers a structured alternative that handles these cases without tuning.
**Memory-constrained deployments**: The impact is largest for long-context scenarios. At 128k tokens on Llama-3-70B, going from 43 GB to 8.5 GB of KV cache can mean the difference between deploying on 2×A100 80GB and on 4×A100, a 2× infrastructure cost factor.
### Limitations and open questions
The method is evaluated on five models (Mistral-7B, Llama-3-8B, Qwen2.5-7B, Qwen3-8B, gpt-oss-20b MoE). Extension to models with sliding window attention or hybrid architectures (Mamba, RWKV) is not covered. Decompression latency at inference time is not quantified in the abstract — a critical point for real-time applications where KV compression must come with acceptable decoding overhead. The absence of calibration is a strong operational advantage, but the per-(layer, head) secondary codebook introduces metadata storage overhead that needs evaluation at very large model scales.
Summary generated by Claude — human-verified