ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Signal: 78
Hype: 15
In three lines: ThriftAttention applies mixed FP16/FP4 precision to long-context attention on Blackwell GPUs. It computes the 5% of query-key blocks selected as critical in FP16 and the remaining blocks in FP4, recovering 89.1% of FP16 performance while maintaining FP4 efficiency. Code released.
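The block-selection idea can be illustrated with a small NumPy sketch. This is not the released ThriftAttention code. It covers only the query-key score step (no softmax or value product), it simulates FP4 by rounding to the E2M1 value grid with a per-block scale, and the importance proxy (mean-pooled block dot products) is an assumption, since the summary does not say how critical blocks are chosen. The names `fake_fp4`, `mixed_precision_scores`, `block`, and `keep_frac` are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1); used only to simulate FP4 rounding.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_fp4(x):
    """Round x to the nearest E2M1 value after scaling by the tensor's max magnitude."""
    scale = np.abs(x).max() / FP4_GRID[-1]
    if scale == 0:
        return x.copy()
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale


def mixed_precision_scores(q, k, block=16, keep_frac=0.05):
    """Compute Q @ K.T block by block: top blocks in FP16, the rest in simulated FP4.

    Assumes q.shape[0] and k.shape[0] are multiples of `block`.
    """
    n_q, n_k = q.shape[0] // block, k.shape[0] // block

    # Assumed importance proxy: |dot product| of mean-pooled query and key blocks.
    q_pool = q.reshape(n_q, block, -1).mean(axis=1)
    k_pool = k.reshape(n_k, block, -1).mean(axis=1)
    importance = np.abs(q_pool @ k_pool.T)

    # Mark the top keep_frac fraction of blocks (at least one) as high precision.
    n_keep = max(1, int(round(keep_frac * n_q * n_k)))
    top = np.argsort(importance, axis=None)[::-1][:n_keep]
    hi_mask = np.zeros((n_q, n_k), dtype=bool)
    hi_mask[np.unravel_index(top, hi_mask.shape)] = True

    scores = np.empty((q.shape[0], k.shape[0]), dtype=np.float32)
    for i in range(n_q):
        qs = slice(i * block, (i + 1) * block)
        for j in range(n_k):
            ks = slice(j * block, (j + 1) * block)
            if hi_mask[i, j]:
                # Critical block: FP16 operands.
                scores[qs, ks] = q[qs].astype(np.float16) @ k[ks].astype(np.float16).T
            else:
                # Remaining block: operands rounded to the FP4 grid.
                scores[qs, ks] = fake_fp4(q[qs]) @ fake_fp4(k[ks]).T
    return scores, hi_mask


rng = np.random.default_rng(0)
q = rng.standard_normal((64, 32)).astype(np.float32)
k = rng.standard_normal((64, 32)).astype(np.float32)
scores, hi_mask = mixed_precision_scores(q, k)
print(scores.shape, int(hi_mask.sum()), "of", hi_mask.size, "blocks in FP16")
```

A real kernel would fuse the selection and both precisions into one pass on the GPU; the nested Python loops here only make the per-block precision choice explicit.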
Summary generated by Claude — human-verified