arXiv cs.AI·19 May 2026

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

In three lines: DashAttention introduces a differentiable hierarchical sparse attention method that uses an adaptive α-entmax transformation to select a variable number of KV blocks. Unlike NSA and InfLLMv2, it remains fully differentiable, and it reaches 75% sparsity with accuracy comparable to full attention. A GPU-aware Triton implementation provides a significant speedup.
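
For intuition on how α-entmax can select a variable number of blocks, here is a minimal NumPy sketch of the generic α-entmax mapping, solved by bisection on its threshold. It is not the paper's Triton kernel or its block-selection rule. The function name `entmax_bisect` and the per-block scores are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """Generic alpha-entmax (alpha > 1) of a 1-D score vector.

    Solves p_i = max((alpha-1)*s_i - tau, 0) ** (1/(alpha-1)) with sum(p) = 1
    by bisection on the threshold tau. Illustrative sketch only.
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    n = z.size
    lo = z.max() - 1.0                         # here sum(p) >= 1
    hi = z.max() - (1.0 / n) ** (alpha - 1.0)  # here sum(p) <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # remove the small residual bisection error

# Hypothetical relevance scores for 8 KV blocks (assumed values).
peaked = [4.0, 0.5, 0.2, 0.1, 0.0, -0.3, -0.5, -1.0]
flat = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

p_peaked = entmax_bisect(peaked)
p_flat = entmax_bisect(flat)

# Blocks with exactly zero weight can be skipped entirely.
print("blocks kept (peaked):", int((p_peaked > 0).sum()))
print("blocks kept (flat):  ", int((p_flat > 0).sum()))
```

Unlike softmax, the output contains exact zeros, so the number of surviving blocks depends on the scores: the peaked vector keeps a single block, while the flat one keeps all eight. A fixed top-k rule would keep the same count in both cases.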

Reasoning Benchmarks Infrastructure

Summary generated by Claude — human-verified
