arXiv cs.CL

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Signal: 78
Hype: 25
In three lines
RTPurbo converts full-attention LLMs into sparse-attention models in about 100 training steps. The approach exploits three observations: only certain heads require full attention, long-range retrieval uses a 16-dimensional subspace, and dynamic top-p selection outperforms fixed top-k. Reported results: 9.36× prefill speedup at 1M tokens and 2.01× decode speedup, with accuracy preserved.
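
To make the top-p versus top-k contrast concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's implementation: the function names, the toy score vectors, and the threshold p = 0.9 are all assumptions chosen for the example. Top-k always keeps a fixed number of keys, while top-p keeps the smallest set of keys whose attention mass reaches p, so the budget adapts to how peaked the distribution is.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(probs, k):
    """Keep the k highest-probability keys, regardless of their total mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(order[:k])


def top_p_indices(probs, p):
    """Keep the smallest set of keys whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


# Hypothetical toy inputs: one sharply peaked row and one uniform row.
peaked = softmax([8.0, 0.0, 0.0, 0.0, 0.0, 0.0])
flat = softmax([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

print(top_k_indices(peaked, 2), top_p_indices(peaked, 0.9))
print(top_k_indices(flat, 2), top_p_indices(flat, 0.9))
```

On the peaked row, top-p keeps a single key where top-k with k = 2 keeps one more than needed; on the uniform row, top-p keeps every key where top-k discards most of the attention mass. This is one plausible reading of why a dynamic budget could beat a fixed one.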
Reasoning · Benchmarks · Infrastructure

Summary generated by Claude — human-verified