Reddit r/LocalLLaMA · 25 May 2026

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

In three lines: RTPurbo transforms full-attention LLMs into sparse-attention models in hundreds of training steps. The method exploits three observations: only certain heads require full attention, long-range retrieval uses a 16-dimensional subspace, and token selection is query-dependent. Reported results: 9.36x prefill speedup at 1M context and 2.01x decode speedup, with accuracy preserved.
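
To make the second and third observations concrete, here is a minimal Python sketch of query-dependent token selection in a low-dimensional subspace. It is an illustration only, not RTPurbo's implementation, which this summary does not describe. The function names (`project`, `select_tokens`, `sparse_attention`), the random projection `basis`, and the sizes other than the 16-dimensional subspace are all assumptions made for the example; a real system would presumably learn the projection rather than sample it.

```python
import math
import random


def project(vec, basis):
    """Project a d-dim vector onto the rows of `basis` (r x d), giving r coordinates."""
    return [sum(b * v for b, v in zip(row, vec)) for row in basis]


def select_tokens(query, keys, basis, k):
    """Return the (sorted) indices of the k keys scoring highest against this
    query when both are compared in the low-dimensional subspace."""
    q_low = project(query, basis)
    scores = [
        sum(a * b for a, b in zip(q_low, project(key, basis))) for key in keys
    ]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def sparse_attention(query, keys, values, idx):
    """Standard scaled dot-product attention, restricted to the indices in `idx`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][j] for w, i in zip(weights, idx)) / total
        for j in range(dim)
    ]


# Hypothetical sizes: head dim 64, 256 cached tokens, 16-dim subspace, keep 32.
random.seed(0)
D, N, R, K = 64, 256, 16, 32
basis = [[random.gauss(0, 1) for _ in range(D)] for _ in range(R)]  # assumed, not learned
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
query = [random.gauss(0, 1) for _ in range(D)]

selected = select_tokens(query, keys, basis, K)
output = sparse_attention(query, keys, values, selected)
```

The selection step costs R multiplications per key instead of D, and the attention step touches only K of the N cached tokens. Because the scores depend on the query, a different query generally selects a different set of tokens.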

Reasoning Benchmarks Infrastructure

Summary generated by Claude — human-verified
