Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Signal: 78
Hype: 25
In three lines: RTPurbo transforms full-attention LLMs into sparse models in hundreds of training steps. The method exploits three observations: only certain heads require full attention, long-range retrieval uses a 16-dimensional subspace, and token selection is query-dependent. Results: 9.36x prefill speedup at 1M context, 2.01x decode speedup, accuracy preserved.
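To make the second and third observations concrete, here is a minimal NumPy sketch of query-dependent token selection with scoring in a low-dimensional subspace. It is an illustration under stated assumptions, not RTPurbo's implementation: the function name `sparse_attention_topk`, the projection matrix `P`, and the budget `k` are all hypothetical, and the summary does not say how the subspace is obtained or how many tokens are kept.

```python
import numpy as np


def sparse_attention_topk(q, K, V, P, k):
    """Attend from one query to the k keys ranked highest in a low-dim subspace.

    q: (d,) query vector
    K, V: (n, d) key and value matrices
    P: (d, r) projection into an r-dimensional subspace (hypothetical; r = 16 here)
    k: number of tokens to keep (hypothetical budget, 1 <= k <= n)
    """
    # Cheap ranking scores computed in the r-dimensional subspace.
    scores_low = (K @ P) @ (q @ P)  # shape (n,)

    # Query-dependent selection: indices of the k largest low-dim scores.
    idx = np.sort(np.argpartition(scores_low, -k)[-k:])

    # Exact scaled softmax attention, restricted to the selected tokens.
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], idx


rng = np.random.default_rng(0)
n, d, r = 64, 32, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
# Assumption: a random orthonormal basis stands in for a learned subspace.
P = np.linalg.qr(rng.standard_normal((d, r)))[0]

out, kept = sparse_attention_topk(q, K, V, P, k=8)
```

Because the ranking depends on `q`, different queries keep different tokens. Setting `k` equal to the sequence length recovers ordinary full attention, which is a convenient sanity check for a sketch like this.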
Summary generated by Claude — human-verified