Reddit r/MachineLearning · 25 May 2026

Delta Attention Residuals [R]

In three lines: Delta Attention Residuals improves residual connections by routing over layer deltas (vᵢ = hᵢ₊₁ − hᵢ) instead of cumulative hidden states. Results: −8.2% PPL at 7.6B, 1.8× sharper cross-layer routing (max weight ~0.2 → ~0.6; 0.62 vs 0.35 on average), <0.01% parameter overhead. Code and paper released.

## Delta Attention Residuals: fixing cross-layer routing collapse at near-zero cost

### 1. The problem that wasn't actually solved

Standard residual connections (h_{i+1} = h_i + f(h_i)) are the backbone of every modern transformer. Since 2023, several works have pushed further with *Attention Residuals*: instead of simply adding the previous layer, they dynamically route over all past hidden states via a cross-layer attention mechanism. The intuition is sound — let the model choose which intermediate representation to reuse.

The problem: at scale, this routing collapses. Cumulative hidden states h_i are structurally redundant, because each layer adds only a small perturbation to an already semantically saturated vector. As a result, cross-layer attention converges to a near-uniform distribution, with a maximum weight of ~0.2 in deep layers, and the selection mechanism becomes useless. Worse: at 7.6B parameters, Attention Residuals end up with a higher (worse) perplexity than the standard baseline (18.58 vs 17.43), meaning the architectural cost is not offset by any gain.
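The collapse can be reproduced in a toy setting. The sketch below is not the paper's setup: the vectors, the query, and the plain scaled dot-product scoring are all hypothetical choices, picked only to show that near-parallel cumulative states produce nearly equal scores and therefore near-uniform routing weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical cumulative states: a large shared component plus a small
# per-layer perturbation along a different axis at each step.
states = [[4.0, 4.0, 4.0, 4.0]]
for axis in range(4):
    nxt = list(states[-1])
    nxt[axis] += 0.1
    states.append(nxt)

# Hypothetical query; scores are scaled dot products against each state.
query = [1.0, 0.0, 0.0, 0.0]
scale = math.sqrt(len(query))
scores = [sum(q * x for q, x in zip(query, s)) / scale for s in states]
weights = softmax(scores)

min_cos = min(cosine(states[i], states[i + 1]) for i in range(len(states) - 1))
print(f"lowest cosine between consecutive states: {min_cos:.5f}")
print(f"max routing weight: {max(weights):.3f}, uniform: {1 / len(states):.3f}")
```

With five candidate states, the largest weight stays within a hair of the uniform value, which is the qualitative behavior described above.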

### 2. The delta mechanics

Delta Attention Residuals (DAR) replaces cumulative hidden states with their inter-layer differences: v_i = h_{i+1} − h_i. These deltas represent the net contribution of each sublayer: what it actually changed, not the accumulation of everything before it.
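The relation between the deltas and the cumulative states can be written out directly. This is a short derivation from the definitions given here, not a formula quoted from the paper; indexing layers from 0 to L−1 is an assumption.

```latex
v_i = h_{i+1} - h_i
\quad\Longrightarrow\quad
\sum_{i=0}^{L-1} v_i = h_L - h_0 ,
\qquad
\text{and with } h_{i+1} = h_i + f(h_i): \quad v_i = f(h_i).
```

The deltas therefore carry the same information as the cumulative states, just split into per-layer contributions instead of running totals.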

Why this matters: deltas are structurally diverse. Some layers perform syntactic transformations, others semantic ones, others denoising. This structural diversity prevents routing collapse. The maximum attention weight rises from ~0.2 to ~0.6; on average it is 0.62 versus 0.35, a roughly 1.8× sharpness improvement (0.62 / 0.35 ≈ 1.77). The model genuinely learns to select relevant past contributions rather than averaging uniformly.
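One way to make "sharpness" concrete is the average of the largest routing weight per query. The sketch below assumes that reading of the metric; the summary does not spell out how the paper computes it, and the score rows and the choice of five sources are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_max_weight(score_rows):
    """Average, over rows, of the largest softmax weight in each row."""
    return sum(max(softmax(row)) for row in score_rows) / len(score_rows)

# Hypothetical routing scores over five past sources, two queries each.
flat_rows = [[0.0, 0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0, 0.0]]
peaked_rows = [[2.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0, 0.0]]

flat_score = mean_max_weight(flat_rows)      # close to 1/5, i.e. collapsed
peaked_score = mean_max_weight(peaked_rows)  # clearly selective

# Ratio of the averages reported in the text.
reported_ratio = 0.62 / 0.35
print(f"flat: {flat_score:.3f}, peaked: {peaked_score:.3f}")
print(f"reported sharpness ratio: {reported_ratio:.2f}")
```

Under this reading, a value near 1/n means the router is averaging over n sources, while values well above it mean it is actually selecting.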

Initialization is critical: the additive routing is zero-initialized, which guarantees that the module is an identity at the start, so the base checkpoint is not perturbed at initialization.
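The identity-at-initialization property is easy to check in a minimal sketch. The form below, a hidden state plus a gated weighted mix of past deltas, is an assumption made for illustration: the function name `route_over_deltas` and the scalar `gate` are hypothetical stand-ins for whatever zero-initialized parameter the actual module uses.

```python
def route_over_deltas(h, deltas, weights, gate):
    """Return h plus a gated, weighted mix of past deltas.

    `gate` is a hypothetical scalar standing in for the zero-initialized
    parameter of the routing branch; it is not taken from the paper.
    """
    dim = len(h)
    mix = [sum(w * v[k] for w, v in zip(weights, deltas)) for k in range(dim)]
    return [x + gate * m for x, m in zip(h, mix)]

# Hypothetical hidden state, past deltas, and routing weights.
h = [1.0, -2.0, 0.5]
deltas = [[0.1, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.0, -0.2]]
weights = [0.6, 0.3, 0.1]

# At initialization the gate is zero, so the output is exactly h.
at_init = route_over_deltas(h, deltas, weights, gate=0.0)

# Once the gate moves away from zero, the routed deltas start to contribute.
after_training = route_over_deltas(h, deltas, weights, gate=1.0)

print(at_init == h, after_training)
```

Because the routed branch is multiplied by zero, the routing weights can take any value at initialization without changing the output of the converted checkpoint.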

### 3. The numbers that matter

**Validation perplexity**: gains of 1.7% to 8.2% depending on scale, from 220M to 7.6B parameters. At 7.6B, DAR reaches −8.2% PPL versus standard Attention Residuals, which themselves degrade at this scale (18.58 versus 17.43 for the standard residual baseline). DAR also goes lower than that baseline.

**Parameter overhead**: 589K additional parameters for an 8B model, i.e. 0.008%. Memory increases by ~3%. On throughput, DAR runs at 14.0k tok/s versus 12.5k tok/s for Attention Residuals — DAR is both more accurate and faster than its direct predecessor.
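The overhead and throughput figures can be sanity-checked with a few lines of arithmetic. The inputs are the numbers quoted above; trying both 8B and 7.6B as the total parameter count is an assumption, since the summary does not say which total the percentage was computed against.

```python
extra_params = 589_000  # additional parameters quoted above

# Overhead as a percentage of total parameters, for two candidate totals.
overhead_pct = {
    total: 100 * extra_params / total
    for total in (8.0e9, 7.6e9)
}
for total, pct in overhead_pct.items():
    print(f"{total / 1e9:.1f}B total: {pct:.4f}% overhead")

# Relative throughput gain of DAR over Attention Residuals (tokens/s).
dar_tps = 14_000
ar_tps = 12_500
throughput_gain = dar_tps / ar_tps - 1
print(f"throughput gain: {throughput_gain:.0%}")
```

Both totals give an overhead below 0.01%, with the 7.6B total landing closer to the quoted 0.008%, and the throughput figures correspond to a 12% speedup over Attention Residuals.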

**Fine-tuning existing checkpoints**: Qwen3-0.6B converted to DAR via standard fine-tuning beats the original on 8 downstream benchmarks (aggregate score 55.6 vs 55.0). This is the most immediately actionable result: no need to pretrain from scratch.

### 4. Winners, losers, open questions

**Direct winners**: teams pretraining models in the 1B–10B range looking for perplexity gains without significant compute or parameter budget increases. The drop-in on existing checkpoints reduces the barrier to a few GPU-days of fine-tuning.

**Potential losers**: work on classic Attention Residuals (notably 2023–2024 papers proposing this mechanism as a general improvement) sees its approach invalidated at scale. If DAR holds up on models >10B and on MoE architectures, the case for standard AR disappears.

**What remains to be established**: the benchmarks cover 220M–7.6B, all dense. Extrapolation to 70B+ and MoE architectures (where inter-layer deltas have different properties depending on which experts are activated) is not documented. Robustness to specialized domains (code, math, multilingual) is also not independently evaluated. The Qwen3-0.6B fine-tuning result is promising but 0.6B is a regime where many techniques work without generalizing. The code is public (GitHub) and the paper is on arXiv (2605.18855). For practitioners, the immediate test is converting an existing checkpoint and measuring perplexity on their target domain — the overhead is low enough to make the experiment cheap.
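For the perplexity comparison suggested above, a minimal sketch of the bookkeeping is shown below. It assumes per-token negative log-likelihoods (in nats) have already been collected from each model on the target domain; the helper names and the sample losses are hypothetical, and nothing here uses the released code.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_change(new, old):
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old

# Hypothetical per-token losses for a base and a converted checkpoint.
base_nlls = [2.9, 3.1, 2.7, 3.0]
converted_nlls = [2.8, 3.0, 2.7, 2.9]

base_ppl = perplexity(base_nlls)
converted_ppl = perplexity(converted_nlls)
print(f"base PPL: {base_ppl:.2f}, converted PPL: {converted_ppl:.2f}")
print(f"relative change: {relative_change(converted_ppl, base_ppl):+.1%}")

# The same helper applied to the 7.6B figures quoted in this summary:
# Attention Residuals (18.58) against the standard baseline (17.43).
ar_gap = relative_change(18.58, 17.43)
print(f"Attention Residuals vs baseline: {ar_gap:+.1%}")
```

A negative relative change means the converted checkpoint is better on that domain; the last line shows that the quoted Attention Residuals figure sits roughly 6.6% above the baseline.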

Summary generated by Claude — human-verified
