Reddit r/LocalLLaMA

Interesting paper advocates for quantized prefilling and precise decoding

In three lines: The paper proposes Mix-Quant, which uses W4A4 quantization (4-bit weights and 4-bit activations) for prefilling, for a theoretical 4x speedup, but keeps full precision for decoding. The argument is that prefilling tolerates quantization error because the error does not accumulate, whereas in autoregressive decoding each generated token affects all subsequent generation.
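
To make the split concrete, here is a minimal sketch in NumPy, not the paper's implementation. It assumes symmetric per-tensor rounding to 4 bits as a stand-in for whatever W4A4 scheme the paper actually uses, and the names `fake_quant_int4` and `linear` are hypothetical. It simulates only the numerical effect of quantization; the speedup would come from real 4-bit kernels, which this does not model.

```python
import numpy as np

def fake_quant_int4(x):
    """Round x to a signed 4-bit grid, then map back to floats.

    Assumption: symmetric per-tensor scaling, chosen for illustration only.
    """
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / 7.0                      # largest magnitude maps to level 7
    q = np.clip(np.round(x / scale), -8, 7)    # signed 4-bit integer range
    return q * scale

def linear(x, w, quantized):
    """One linear layer; quantizes both weights and activations if asked."""
    if quantized:
        return fake_quant_int4(x) @ fake_quant_int4(w)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)       # toy weight matrix
prompt = rng.standard_normal((16, 64)).astype(np.float32)  # 16 prompt tokens
token = rng.standard_normal((1, 64)).astype(np.float32)    # 1 decode-step token

# Prefill: the whole prompt in one pass, with simulated W4A4.
prefill_out = linear(prompt, w, quantized=True)

# Decode: one token at a time, with unquantized weights and activations.
decode_out = linear(token, w, quantized=False)

# Relative error that quantization introduces in the prefill output.
prefill_ref = prompt @ w
rel_err = np.linalg.norm(prefill_out - prefill_ref) / np.linalg.norm(prefill_ref)
print(f"relative prefill error: {rel_err:.3f}")
```

In this toy setup the decode step is numerically identical to the unquantized model, so the only error comes from the prefill pass. Whether that error stays harmless in a real model is the paper's claim, not something this sketch demonstrates.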

Summary generated by Claude — human-verified