arXiv cs.CL · 19 May 2026

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

In three lines

RTPurbo converts full-attention LLMs into sparse-attention models in ~100 training steps. The approach exploits three observations: only certain heads require full attention, long-range retrieval uses a 16-dimensional subspace, and dynamic top-p selection outperforms fixed top-k. Reported results: 9.36× prefill speedup at 1M tokens and 2.01× decode speedup, with accuracy preserved.
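
The contrast between fixed top-k and dynamic top-p selection can be illustrated with a minimal sketch. This is not RTPurbo's implementation: the function names, the toy scores, k = 2, and the threshold p = 0.9 are all assumptions chosen for illustration. Top-k keeps a fixed number of keys per query, while top-p keeps the smallest set of keys whose attention weights sum to at least p, so the budget adapts to how peaked the distribution is.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_keys(weights, k):
    # Fixed budget: always keep the k highest-weight key positions.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])

def top_p_keys(weights, p):
    # Dynamic budget: keep the smallest prefix of keys (by descending weight)
    # whose cumulative attention mass reaches at least p.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)

# Hypothetical toy queries: one sharply peaked, one perfectly flat.
peaked = softmax([8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
flat = softmax([1.0] * 8)

print(len(top_k_keys(peaked, 2)), len(top_k_keys(flat, 2)))      # same budget for both
print(len(top_p_keys(peaked, 0.9)), len(top_p_keys(flat, 0.9)))  # budget adapts
```

In this toy setting, top-k spends two keys on both queries, whereas top-p keeps a single key for the peaked query and all eight for the flat one. That is the sense in which a fixed budget can be wasteful on one query and lossy on another.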

Reasoning Benchmarks Infrastructure

Summary generated by Claude — human-verified
