DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
Signal
78
Hype
25
In three lines
DashAttention introduces a differentiable hierarchical sparse attention method that uses an adaptive α-entmax transformation to select a variable number of KV blocks. Unlike NSA and InfLLMv2, it remains fully differentiable and reaches 75% sparsity with accuracy comparable to full attention. A GPU-aware Triton implementation provides a significant speedup.
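To make the block-selection idea concrete, here is a minimal NumPy sketch of how an α-entmax transformation can assign exactly zero weight to some KV blocks, so the number of selected blocks varies with the scores. It is an illustration under stated assumptions, not the paper's implementation: the function names `entmax_bisect` and `select_kv_blocks`, the mean-pooled block summaries, the bisection solver, and the choice α = 1.5 are all hypothetical and not taken from the source.

```python
import numpy as np


def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector via bisection (requires alpha > 1).

    Solves for the threshold tau such that
    p_i = max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1)) sums to 1.
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    d = z.shape[0]
    lo = z.max() - 1.0                          # here sum(p) >= 1
    hi = z.max() - (1.0 / d) ** (alpha - 1.0)   # here sum(p) <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # remove the small residual bisection error


def select_kv_blocks(q, K, block_size, alpha=1.5):
    """Hypothetical block selection: score mean-pooled key blocks against q.

    Returns the indices of blocks with nonzero entmax weight and the weights.
    Mean pooling is an assumption made for this sketch.
    """
    n_blocks = K.shape[0] // block_size
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, -1)
    summaries = blocks.mean(axis=1)                  # (n_blocks, dim)
    scores = summaries @ q / np.sqrt(q.shape[0])     # scaled dot product
    weights = entmax_bisect(scores, alpha)
    return np.flatnonzero(weights > 0), weights


# Usage: how many blocks survive depends on the scores, not on a fixed k.
rng = np.random.default_rng(0)
q = rng.normal(size=16)
K = rng.normal(size=(64, 16))
selected, weights = select_kv_blocks(q, K, block_size=8)
```

Unlike a hard top-k rule, the weights here are a continuous function of the scores, which is the property that lets gradients flow through the selection step. A production kernel would compute this per query block on the GPU rather than with a Python loop.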
Summary generated by Claude — human-verified