Reddit r/LocalLLaMA

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

Signal: 78
Hype: 15
In three lines

RTX 5080 16GB benchmark of Qwen3.6 35B MoE at 128k context: 56 tok/s without MTP; 74 tok/s with MTP, but slower overall. MTP forces a 1.5GB buffer, which pushes 3 expert layers from GPU to CPU and creates a bottleneck. The 27B IQ3 reaches 73 tok/s and fits entirely on the GPU.
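
The offload described above is simple budget arithmetic, and a rough sketch may make it easier to see. The snippet below estimates how many expert layers have to leave the GPU when an extra buffer is reserved. The 1.5GB buffer comes from the summary; the 0.5GB per expert layer is an assumption chosen only so the result matches the 3 layers reported, and the function name is hypothetical, not part of any inference runtime.

```python
import math


def expert_layers_offloaded(buffer_gb: float, layer_gb: float, free_gb: float = 0.0) -> int:
    """Estimate how many expert layers must move to CPU to make room for a buffer.

    buffer_gb: extra VRAM the buffer needs (1.5 in the summary).
    layer_gb:  VRAM used by one expert layer (assumed, not from the source).
    free_gb:   VRAM still unused before the buffer is added (assumed 0 here).
    """
    shortfall = buffer_gb - free_gb
    if shortfall <= 0:
        return 0  # the buffer fits in spare VRAM, nothing is offloaded
    # Layers move as whole units, so round up.
    return math.ceil(shortfall / layer_gb)


# 1.5GB buffer, assumed 0.5GB per expert layer, no spare VRAM
print(expert_layers_offloaded(1.5, 0.5))
```

If the card had enough spare VRAM to absorb the buffer (`free_gb >= buffer_gb`), the same function returns 0, which is the case where the 27B IQ3 model sits: it fits entirely on the GPU.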
Tags: Qwen, Benchmarks, Open source, Infrastructure

Summary generated by Claude — human-verified