arXiv cs.LG · 25 May 2026

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

In three lines: ThriftAttention applies mixed precision (FP16/FP4) to long-context attention on Blackwell GPUs. It computes the 5% of query-key blocks judged critical in FP16 and the remaining blocks in FP4, recovering 89.1% of FP16 performance while maintaining FP4 efficiency. Code is released.
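
To make the block-selection idea concrete, here is a minimal NumPy sketch of attention in which a small fraction of query-key blocks is scored in FP16 and the rest in a simulated FP4. It is not the authors' implementation. The importance proxy (dot product of mean-pooled blocks), the block size, the per-block scaling, and the names `fake_fp4` and `mixed_precision_attention` are all assumptions made for illustration; the summary does not say how critical blocks are chosen.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1). Used only to simulate quantization on CPU.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_fp4(x):
    """Simulate FP4: scale the block so its max magnitude maps to 6, round to the grid."""
    amax = np.abs(x).max()
    if amax == 0:
        return x.copy()
    scale = amax / FP4_GRID[-1]
    scaled = np.abs(x) / scale
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)  # nearest grid value
    return np.sign(x) * FP4_GRID[idx] * scale

def mixed_precision_attention(q, k, v, block=16, keep_frac=0.05):
    """Attention where the top `keep_frac` of query-key blocks are scored in FP16."""
    n, d = q.shape
    m = k.shape[0]
    nq, nk = n // block, m // block  # assumes n and m are multiples of `block`

    # ASSUMPTION: rank blocks by the dot product of mean-pooled Q and K blocks.
    q_pool = q.reshape(nq, block, d).mean(axis=1)
    k_pool = k.reshape(nk, block, d).mean(axis=1)
    importance = (q_pool @ k_pool.T).ravel()

    n_keep = max(1, int(round(keep_frac * nq * nk)))
    high = np.zeros(nq * nk, dtype=bool)
    high[np.argsort(importance)[-n_keep:]] = True
    high = high.reshape(nq, nk)

    scores = np.empty((n, m), dtype=np.float32)
    for i in range(nq):
        qs = slice(i * block, (i + 1) * block)
        for j in range(nk):
            ks = slice(j * block, (j + 1) * block)
            if high[i, j]:
                # Critical block: round inputs to FP16, accumulate in FP32.
                qb = q[qs].astype(np.float16).astype(np.float32)
                kb = k[ks].astype(np.float16).astype(np.float32)
            else:
                # Remaining blocks: simulated FP4 inputs.
                qb, kb = fake_fp4(q[qs]), fake_fp4(k[ks])
            scores[qs, ks] = qb @ kb.T

    # Standard scaled softmax over each query row.
    scores /= np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)
    return p @ v, high
```

Calling `mixed_precision_attention(q, k, v)` on arrays of shape `(64, 32)` with the defaults marks 1 of the 16 blocks as high precision. A real kernel would fuse these steps and use hardware FP4 matrix instructions; the sketch only shows the selection logic.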

Benchmarks Infrastructure Reasoning

Summary generated by Claude — human-verified
