Reddit r/LocalLLaMA

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

In three lines: MTP (speculative decoding) support merged into llama.cpp (PR #22673, May 16). Qwen3.6 27B benchmarks: 1.81×–2.44× speedup on Strix Halo (ROCm), 1.54×–2.17× on RTX 3090 (single and dual-card). MoE 35B-A3B shows smaller gains (1.24×–1.40×). Enable with `--spec-type draft-mtp --spec-draft-n-max N`.

## MTP in llama.cpp: what the numbers actually mean

### What landed

PR #22673 (commit 4f13cb7), merged May 16, brings **Multi-Token Prediction (MTP) speculative decoding** into llama.cpp mainline. This is not an experimental fork or external patch — it's in the main branch, enabled via two flags (`--spec-type draft-mtp --spec-draft-n-max N`). Output is byte-identical to baseline at the same seed and temperature: no quality regression, no tradeoff.

Before this PR, speculative decoding in llama.cpp required a separate draft model, meaning a second checkpoint to load, two memory contexts to manage, and meaningful deployment complexity. MTP instead uses the multi-token prediction heads already built into certain models (Qwen3 here), so no external model is required.
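As a minimal sketch of how the two flags fit into a launch command, the snippet below builds an argument list for `llama-server`. The flag spellings are copied from the post as summarized here and have not been checked against the merged PR; the model filename and the helper name `mtp_server_command` are hypothetical.

```python
import shlex


def mtp_server_command(model_path, draft_n_max, binary="llama-server"):
    """Build an argv list that enables MTP speculative decoding.

    The --spec-type / --spec-draft-n-max spellings are taken from the post,
    not verified against llama.cpp itself.
    """
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        binary,
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


# Hypothetical GGUF filename, for illustration only.
cmd = mtp_server_command("qwen3.6-27b-q4_k_m.gguf", 3)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` and to sweep over several values of `N` when benchmarking.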

### Raw numbers and what they mean

On **Qwen3.6 27B** (dense), gains are substantial:

- **Strix Halo / ROCm 7.0.2**: Q4_K_M goes from 11.7 to 21.2 tok/s (×1.81); Q8_0 from 7.4 to 18.1 tok/s (×2.44)
- **Single RTX 3090 @ 450W / CUDA 12.9**: Q4_K_M from 38.7 to 59.5 tok/s (×1.54, n=2)
- **Dual RTX 3090 layer-split**: Q8_0 from 25.7 to 55.9 tok/s (×2.17, n=3)
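The multipliers can be rechecked from the throughput figures. The short sketch below recomputes each ratio and compares it with the reported value; because the tok/s numbers are already rounded to one decimal, a recomputed ratio may differ from the reported one in the last digit, so the comparison uses a small tolerance (an assumption of this sketch, not something stated in the post).

```python
# (label, baseline tok/s, MTP tok/s, reported speedup) from the list above.
runs = [
    ("Strix Halo Q4_K_M", 11.7, 21.2, 1.81),
    ("Strix Halo Q8_0", 7.4, 18.1, 2.44),
    ("Single RTX 3090 Q4_K_M", 38.7, 59.5, 1.54),
    ("Dual RTX 3090 Q8_0", 25.7, 55.9, 2.17),
]


def speedup(baseline_tps, mtp_tps):
    """Ratio of MTP throughput to baseline throughput."""
    return mtp_tps / baseline_tps


def ms_per_token(tps):
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tps


for label, base, mtp, reported in runs:
    ratio = speedup(base, mtp)
    consistent = abs(ratio - reported) < 0.01
    print(f"{label}: x{ratio:.3f} (reported x{reported}, consistent={consistent}), "
          f"{ms_per_token(base):.0f} ms/token -> {ms_per_token(mtp):.0f} ms/token")
```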

The ×2.44 on Strix Halo at Q8_0 is particularly meaningful: it moves from a throughput that made local inference painful (7.4 tok/s) to something usable in real time (18.1 tok/s). The Strix Halo (AMD APU, unified LPDDR5X memory) benefits more than the 3090 because its decoding is limited by memory bandwidth: each forward pass must stream the weights from memory, and MTP lets one pass verify several drafted tokens, so fewer passes are needed per generated token.

Optimal `n` varies by rig: uncapped 3090 prefers n=2 at Q4, while the capped 3090 and Strix Halo prefer n=3. This is not a minor detail — choosing the wrong n can erase a portion of the gain.
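A toy model helps show why the best `n` can shift between rigs. It assumes each drafted token is accepted independently with probability `accept_prob` (the usual simplification in speculative-decoding analyses) and that every extra drafted token adds a fixed relative cost `cost_per_draft` to a pass. Both parameters and all values below are hypothetical; nothing here is measured from the post's hardware.

```python
def expected_tokens_per_pass(accept_prob, n_draft):
    """Expected tokens produced per verification pass with n_draft drafted tokens.

    Assumes independent acceptance with probability accept_prob; the pass always
    yields at least one token.
    """
    if accept_prob >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def modeled_speedup(accept_prob, n_draft, cost_per_draft):
    """Tokens per pass divided by the relative cost of that pass (toy model)."""
    return expected_tokens_per_pass(accept_prob, n_draft) / (1.0 + n_draft * cost_per_draft)


def best_n(accept_prob, cost_per_draft, candidates=range(1, 9)):
    """Draft length with the highest modeled speedup among the candidates."""
    return max(candidates, key=lambda n: modeled_speedup(accept_prob, n, cost_per_draft))


# Same acceptance rate, two hypothetical per-draft costs.
for cost in (0.10, 0.25):
    n = best_n(0.7, cost)
    print(f"cost_per_draft={cost}: best n={n}, modeled speedup x{modeled_speedup(0.7, n, cost):.2f}")
```

In this model, a rig where extra drafted tokens are cheap favors longer drafts, while a rig where they are expensive favors shorter ones, which is one plausible reading of why different configurations prefer different `n`.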

### Why MoE benefits less

On **Qwen3.6 35B-A3B** (MoE), gains are noticeably smaller: ×1.40 on Strix Halo (49.5 → 69.4 tok/s), ×1.24 on RTX 3090 (120.0 → 148.3 tok/s). The explanation is mechanical: with only ~3B active parameters per token out of 35B total, each forward pass is already cheap. MTP replaces up to N-1 single-token passes with one verification pass, but drafting and verification carry overhead that does not shrink with the active parameter count, so that overhead eats a larger share of an already small per-token cost. In these benchmarks, the MoE model gains clearly less from speculative decoding than the dense one.
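A complementary factor often raised for MoE models is expert routing: when one pass verifies several tokens, each token may select different experts, so the pass reads more expert weights than a single-token pass would. The sketch below estimates the fraction of experts touched under the simplifying assumption of uniform, independent routing. The expert counts are hypothetical placeholders and are not taken from the Qwen3.6 35B-A3B configuration.

```python
def expected_expert_fraction(num_experts, active_per_token, tokens_in_pass):
    """Expected fraction of experts touched in one layer during a pass.

    Assumes every token picks its active experts uniformly and independently,
    which real routers do not do; this is only a rough estimate.
    """
    miss_prob = 1.0 - active_per_token / num_experts
    return 1.0 - miss_prob ** tokens_in_pass


# Hypothetical layer: 128 experts, 8 active per token.
single = expected_expert_fraction(128, 8, 1)
batch_of_four = expected_expert_fraction(128, 8, 4)
print(f"1 token:  {single:.3f} of experts read")
print(f"4 tokens: {batch_of_four:.3f} of experts read "
      f"({batch_of_four / single:.1f}x the single-token pass)")
```

For a dense model the weights read per pass stay the same however many tokens are verified, so the pass amortizes fully; under this estimate an MoE verification pass reads several times more expert weights than a single-token pass, which weakens the amortization.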

Important context: the MoE baseline throughput is already very high (120 tok/s on a single 3090), so the practical utility of the marginal gain is different — you're already well above comfort threshold for most use cases.

### The power limit context that changes the reference frame

The author discloses that previous 3090 benchmarks were skewed by an undisclosed 200W cap (breaker issue with 4 cards on one circuit). Re-benchmarks at 350W and 450W show gains of **+70% to +113%** on dense 27-32B models. This deserves attention: a significant fraction of llama.cpp benchmarks published online are likely affected by implicit, undocumented power limits, making cross-rig comparisons unreliable without this information.
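One way to make the power limit explicit is to record it alongside every benchmark result. The sketch below, which assumes an NVIDIA system with `nvidia-smi` on the PATH, queries the configured limit per GPU. The helper names are hypothetical, and the parsing is kept separate from the query so it can be checked without a GPU.

```python
import subprocess


def parse_power_limits(csv_text):
    """Parse 'nvidia-smi --query-gpu=power.limit' output into watts per GPU.

    Entries that are not numeric (for example '[N/A]') become None.
    """
    limits = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            limits.append(float(line))
        except ValueError:
            limits.append(None)
    return limits


def query_power_limits():
    """Return the configured power limit (W) of each visible NVIDIA GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.limit", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_power_limits(result.stdout)


# Example with canned output; call query_power_limits() on a real rig.
print(parse_power_limits("350.00\n200.00\n"))
```

Publishing these values next to the tok/s figures lets readers tell a capped card from an uncapped one before comparing rigs.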

### Who loses here

Speculative decoding solutions relying on external draft models lose their main differentiator for models that ship native MTP heads. Cloud inference providers monetizing reduced latency through proprietary solutions see that advantage erode for local users. Wrappers built on llama.cpp (such as llama-cpp-python) will need to pick up this PR to expose the speedup; separate engines such as vLLM cannot merge it and must rely on their own speculative-decoding implementations to stay competitive on throughput benchmarks.

### What to watch

MTP compatibility is currently documented on Qwen3. Extension to other architectures (Llama 3.x, Mistral, Gemma) depends on whether MTP heads are present in the checkpoints — this is not universal. The real question for the coming weeks: which published models include usable MTP heads, and will labs systematize this practice in their open-weight releases?


Summary generated by Claude — human-verified