arXiv cs.CL

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Signal: 78
Hype: 25
In three lines: DashAttention introduces a differentiable hierarchical attention method that uses an adaptive α-entmax transformation to select a variable number of KV blocks per query. Unlike NSA and InfLLMv2, it remains fully differentiable, and it reaches 75% sparsity with accuracy comparable to full attention. Its GPU-aware Triton implementation outperforms FlashAttention-3.
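To make the block-selection idea concrete, here is a minimal sketch of how an α-entmax transformation can keep a variable number of KV blocks. It is not the paper's implementation: it uses the standard α-entmax definition, p_i = [(α-1)·s_i - τ]_+^(1/(α-1)), and solves for the threshold τ by bisection. The block scores, the fixed α = 1.5, and the helper names `entmax_bisect` and `select_blocks` are assumptions made for illustration; how DashAttention adapts α is not described in this summary.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """Generic alpha-entmax via bisection on the threshold tau (alpha > 1).

    Illustrative only; not the DashAttention kernel.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch requires alpha > 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)
    top = max(scaled)
    # At tau = top - 1 the total mass is >= 1; at the upper bound it is <= 1.
    lo = top - 1.0
    hi = top - (1.0 / len(scaled)) ** (alpha - 1.0)
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(x - tau, 0.0) ** exponent for x in scaled)
        if total >= 1.0:
            lo = tau  # mass too large: raise the threshold
        else:
            hi = tau  # mass too small: lower the threshold
    probs = [max(x - lo, 0.0) ** exponent for x in scaled]
    norm = sum(probs)  # >= 1 by construction, so never zero
    return [p / norm for p in probs]


def select_blocks(block_scores, alpha=1.5):
    """Return indices of blocks that receive non-zero entmax weight."""
    probs = entmax_bisect(block_scores, alpha)
    return [i for i, p in enumerate(probs) if p > 0.0]


# Hypothetical per-block relevance scores for one query.
peaked = [4.0, 0.0, 0.0, 0.0]   # one dominant block
mixed = [2.0, 1.8, 0.0, -1.0]   # two competitive blocks
flat = [1.0, 1.0, 1.0, 1.0]     # no block stands out

for name, scores in [("peaked", peaked), ("mixed", mixed), ("flat", flat)]:
    print(name, select_blocks(scores))
```

The number of surviving blocks depends on the score distribution rather than on a fixed top-k: a peaked distribution keeps one block, while a flat one keeps all four. Because the weights are a continuous function of the scores, gradients can flow through the selection, which is the property the summary contrasts with NSA and InfLLMv2.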
Reasoning, Infrastructure, Benchmarks

Summary generated by Claude — human-verified