BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
In three lines: BitsMoE introduces spectral-energy-guided bit allocation for MoE LLM quantization. Using SVD, it keeps a shared basis unquantized and quantizes the expert-specific factors at fine-grained mixed precision, with bit-widths chosen by integer linear programming. On Qwen3-30B at 2-bit, it improves accuracy by 27.83 percentage points and increases decoding speed 1.76× over GPTQ.
## BitsMoE: Ultra-Low-Bit Quantization for MoE Models, Finally Workable
### 1. The Concrete Problem
Mixture-of-Experts (MoE) models reduce per-token compute through sparse expert activation, but their memory footprint scales with total parameter count — all expert weights must reside in memory simultaneously. Qwen3-30B-A3B activates ~3B parameters per token but stores ~30B. 2-bit quantization is the natural path to reducing this footprint, but existing approaches break down at this precision.
GPTQ, the dominant baseline, applies coarse-grained quantization that ignores two structural realities of MoE architectures: (1) experts share a common representational basis, and (2) weight-direction importance varies massively across experts. The result on Qwen3-30B at 2 bits with GPTQ is accuracy degradation severe enough to render the model unusable on downstream benchmarks.
### 2. What BitsMoE Does Differently
The core insight is structural: decompose each MoE layer via SVD into a **shared basis** (common across all experts) and **expert-specific spectral factors**. The shared basis is kept at full precision — it encodes cross-expert structure that, if quantized, degrades all experts uniformly. Only the expert-specific factors are quantized, at mixed precision.
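To make the decomposition concrete, here is a minimal NumPy sketch of one way a shared basis could be extracted. It is an assumption for illustration, not the paper's published procedure: the function name `shared_basis_decompose`, the choice to stack experts along the input dimension, and the toy shapes are all hypothetical.

```python
import numpy as np

def shared_basis_decompose(expert_weights, rank):
    """Split expert weights into one shared basis and per-expert factors.

    expert_weights: list of (d_out, d_in) arrays, one per expert.
    Returns (U, factors): U is (d_out, rank) with orthonormal columns and
    factors[e] is (rank, d_in), so that U @ factors[e] approximates expert e.
    """
    # Concatenate experts along the input dimension: (d_out, E * d_in).
    stacked = np.concatenate(expert_weights, axis=1)
    # The leading left singular vectors of the stacked matrix span the
    # output directions that the experts have in common.
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U = U[:, :rank]
    # Project each expert onto the shared basis to get its own factor.
    factors = [U.T @ W for W in expert_weights]
    return U, factors

# Toy experts that share a 6-dimensional output subspace by construction.
rng = np.random.default_rng(0)
d_out, d_in, n_experts = 16, 24, 4
common = rng.standard_normal((d_out, 6))
experts = [common @ rng.standard_normal((6, d_in)) for _ in range(n_experts)]

U, factors = shared_basis_decompose(experts, rank=6)
max_err = max(np.abs(U @ F - W).max() for F, W in zip(factors, experts))
```

In this toy setup the experts are built from a common subspace, so `U @ factors[e]` reproduces each expert up to floating-point error. In the scheme described above, `U` would stay at full precision and only the entries of `factors` would be quantized.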
Bit-width assignment per unit is formulated as an integer linear program (ILP): minimize estimated reconstruction loss under a fixed bit budget. Reconstruction is activation-aware, enabling the allocation of more bits to high-energy spectral directions — those contributing most to model output.
This simultaneously addresses two failure modes: inter-expert redundancy (captured and preserved in the shared basis) and intra-layer heterogeneity (handled by mixed-precision allocation).
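The bit-allocation step can be sketched as a small integer program: pick exactly one bit-width per unit so that total estimated error is minimal and total bits stay within budget. The sketch below solves that program exactly with dynamic programming instead of an ILP solver, to stay dependency-free. The function name `allocate_bits` and the error values are hypothetical, and the paper's actual objective and constraints may differ.

```python
def allocate_bits(errors, sizes, budget_bits):
    """Pick one bit-width per unit to minimize total error under a bit budget.

    errors[i]: dict mapping bit-width -> estimated reconstruction error of unit i.
    sizes[i]: number of weights in unit i.
    budget_bits: total number of bits allowed across all units.
    Returns (total_error, [bit-width per unit]).
    """
    # best maps "bits used so far" -> (lowest error, choices) for that usage.
    best = {0: (0.0, [])}
    for err_by_bits, size in zip(errors, sizes):
        nxt = {}
        for used, (err, picks) in best.items():
            for b, e in err_by_bits.items():
                cost = used + b * size
                if cost > budget_bits:
                    continue  # this choice would exceed the budget
                cand = (err + e, picks + [b])
                if cost not in nxt or cand[0] < nxt[cost][0]:
                    nxt[cost] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda t: t[0])

# Three units of 100 weights each; made-up errors, largest for unit 0
# (standing in for a high-energy spectral direction).
errors = [
    {1: 9.0, 2: 4.0, 3: 1.0},
    {1: 3.0, 2: 1.5, 3: 0.5},
    {1: 0.4, 2: 0.2, 3: 0.1},
]
sizes = [100, 100, 100]
total_error, picks = allocate_bits(errors, sizes, budget_bits=600)  # 2-bit average
```

With these illustrative numbers the optimum is `[3, 2, 1]`: the average stays at 2 bits, but the unit whose error is most sensitive to precision receives the most bits, which is the behavior the energy-guided allocation is meant to produce.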
### 3. The Numbers That Matter
On **Qwen3-30B-A3B-Base at 2-bit quantization**:

- **+27.83 percentage points average accuracy** on downstream tasks vs. GPTQ, a gap that converts an unusable model into a deployable one
- **1.76× decoding speedup** vs. GPTQ, driven by effective memory bandwidth reduction
- **12.3× faster quantization process** vs. GPTQ, so compression itself is dramatically faster
The 12.3× quantization speedup deserves attention: it suggests BitsMoE can be applied with far less GPU time than calibration-heavy quantization methods, which lowers the compute barrier to producing a quantized checkpoint.
Experiments cover "multiple MoE LLMs" per the abstract, though Qwen3-30B is the most documented case in available material. Generalization to other MoE architectures (Mixtral, DeepSeek-MoE) requires verification against the full paper's tables.
### 4. Winners and Losers
**Direct winners**: teams deploying MoE models on constrained hardware (edge servers, low-cost inference). At 2 bits, Qwen3-30B drops from ~60GB (FP16) to ~7-8GB theoretical, making single A100-40GB or even consumer GPU deployment plausible. Code and models are publicly available on GitHub.
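The footprint figures above follow from simple arithmetic on the parameter count. A quick check, counting weights only and ignoring activations, KV cache, and any quantization metadata (the helper name `weight_memory_gb` is illustrative):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

total_params = 30e9  # approximate total parameter count quoted in the text

fp16_gb = weight_memory_gb(total_params, 16)    # 60.0
two_bit_gb = weight_memory_gb(total_params, 2)  # 7.5
```

The 7.5 GB result sits inside the ~7-8GB range quoted above. Real checkpoints will be somewhat larger once scales, zero points, and any unquantized tensors are included.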
**GPTQ as the losing baseline**: a 27.83-point gap on a single architecture is devastating for GPTQ in the ultra-low-bit MoE context. GPTQ remains relevant for dense models and 4-bit precision, but its position on 2-bit MoE quantization is now hard to defend.
**Potential losers**: vendors of proprietary quantization solutions that haven't integrated MoE-specific spectral structure into their pipelines. AWQ and GPTQ variants without MoE-specific adaptation face the same exposure.
**Key blind spot**: the SVD step introduces preprocessing overhead and may increase checkpoint size, because the full-precision shared basis is stored in addition to the quantized factors. The abstract does not quantify this storage overhead, which is a significant gap for practical evaluation. Performance on long-form generation or complex reasoning tasks (as opposed to average accuracy on short benchmarks) also remains undocumented.
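Since the abstract gives no figure for the storage concern, a back-of-the-envelope estimate helps frame it. The sketch below assumes one shared basis per layer stored at 16 bits and one quantized factor per expert. The layer shape, the rank, and the function name `effective_bits_per_weight` are hypothetical and not taken from the paper.

```python
def effective_bits_per_weight(d_out, d_in, n_experts, rank, factor_bits, basis_bits=16):
    """Average stored bits per original weight for one MoE layer.

    Assumes one shared (d_out, rank) basis kept at basis_bits and one
    (rank, d_in) factor per expert stored at factor_bits.
    """
    basis = d_out * rank * basis_bits
    factors = n_experts * rank * d_in * factor_bits
    original_weights = n_experts * d_out * d_in
    return (basis + factors) / original_weights

# Hypothetical layer shape, chosen only for illustration.
bits = effective_bits_per_weight(
    d_out=1024, d_in=1024, n_experts=64, rank=1024, factor_bits=2
)  # 2.25
```

Under these assumptions the basis adds 0.25 bits per weight on top of the nominal 2 bits. The overhead shrinks as the number of experts sharing the basis grows and rises when there are few experts or a wide basis, which is why the missing figure matters.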
Summary generated by Claude — human-verified