arXiv cs.AI

WriteSAE: Sparse Autoencoders for Recurrent State

In three lines: WriteSAE introduces the first sparse autoencoder that decomposes and edits matrix cache writes in Gated DeltaNet, Mamba-2, and RWKV-7 recurrent models. Factored atoms expose closed-form logit shifts per token, achieving 92.4% successful substitutions across 4,851 firings on Qwen3.5-0.8B and 88.1% on Mamba-2-370M.

## WriteSAE: Opening the Black Box of Matrix Recurrent States

### 1. The Problem Nobody Had Solved

Sparse Autoencoders (SAEs) have become the standard mechanistic interpretability tool since Anthropic's work on superposed features. But they all operate on the same substrate: the residual stream, a vector. This assumption holds for standard Transformers. It breaks down for modern recurrent architectures.

Gated DeltaNet, Mamba-2, and RWKV-7 maintain a matrix cache of dimension $d_k \times d_v$ updated through rank-1 operations of the form $k_t v_t^\top$. A vector atom has no native way to express this outer-product structure. Applying a standard residual SAE to these models is like analyzing an image by reading only its first row of pixels: the relevant activations are elsewhere, in the matrix writes themselves.
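To make the write site concrete, here is a minimal sketch of a matrix cache receiving rank-1 writes. It assumes a simplified recurrence $S_t = \alpha_t S_{t-1} + k_t v_t^\top$ with a scalar decay $\alpha_t$ and a linear read $o_t = S_t^\top q_t$; the actual update rules of Gated DeltaNet, Mamba-2, and RWKV-7 differ in detail, and the function names are illustrative only.

```python
# Simplified matrix-cache recurrence: S_t = decay * S_{t-1} + k_t v_t^T.
# Assumption: scalar decay and no delta-rule correction; real architectures
# use more elaborate gating. All names here are hypothetical.

def outer(k, v):
    """Rank-1 matrix k v^T as a list of rows, shape len(k) x len(v)."""
    return [[ki * vj for vj in v] for ki in k]

def write(state, k, v, decay=1.0):
    """Apply one rank-1 write to a d_k x d_v state and return the new state."""
    kv = outer(k, v)
    return [[decay * s + w for s, w in zip(s_row, w_row)]
            for s_row, w_row in zip(state, kv)]

def read(state, q):
    """Read the cache with a query: o = S^T q, a vector of length d_v."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]

d_k, d_v = 2, 3
state = [[0.0] * d_v for _ in range(d_k)]
state = write(state, k=[1.0, 0.0], v=[2.0, 0.0, 1.0])
state = write(state, k=[0.0, 1.0], v=[0.0, 3.0, 0.0], decay=0.5)
output = read(state, q=[1.0, 1.0])
print(output)  # [1.0, 3.0, 0.5]
```

Every write is an outer product, so the object an SAE would need to decompose here is a $d_k \times d_v$ matrix rather than a residual-stream vector.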

### 2. What WriteSAE Actually Does

The architecture factors each SAE decoder atom into the native matrix write shape. Rather than a vector in $\mathbb{R}^d$, each atom is a rank-1 matrix compatible with the cache update mechanism. The core contribution is twofold:

**Closed form for per-token logit shift.** WriteSAE analytically exposes how each atom contributes to output logits, token by token. An $R^2$ of 0.98 between the analytical predictions and the empirically measured effects indicates that the decomposition tracks each atom's causal effect rather than a superficial correlation.
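As a hedged illustration of why a closed form is plausible (this is a sketch under simplifying assumptions, not the paper's exact expression), suppose an atom is the rank-1 matrix $\hat{k}\,\hat{v}^\top$ with $\hat{k} \in \mathbb{R}^{d_k}$ and $\hat{v} \in \mathbb{R}^{d_v}$, added to the cache with coefficient $c$. Assume further a linear read $o = S^\top q$ and a linear map $W_U$ from the read to the logits, ignoring decay, normalization, and downstream layers. The symbols $\hat{k}$, $\hat{v}$, $c$, $q$, and $W_U$ are assumptions introduced for this example. Then:

$$
\Delta o = \left(c\,\hat{k}\,\hat{v}^\top\right)^\top q = c\,(\hat{k}^\top q)\,\hat{v},
\qquad
\Delta \ell = W_U\,\Delta o = c\,(\hat{k}^\top q)\,W_U\,\hat{v}.
$$

Under these assumptions the shift factors into a scalar, measuring how strongly the query matches the atom's key side, times a fixed logit direction given by the atom's value side.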

**Training under matched Frobenius norm.** The training objective is calibrated so atoms are interchangeable one cache slot at a time, making substitutions surgically precise.
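One way to read the matched-norm idea is sketched below: a rank-1 atom is rescaled so that it has the same Frobenius norm as the write it replaces, using the identity $\lVert k v^\top \rVert_F = \lVert k \rVert \, \lVert v \rVert$. This is an assumption about how a substitution could be normalized, not the paper's training objective, and the helper names and example vectors are hypothetical.

```python
import math

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def frobenius_rank1(k, v):
    """Frobenius norm of k v^T, which equals ||k|| * ||v||."""
    return norm(k) * norm(v)

def frobenius(m):
    """Frobenius norm of an explicit matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in m for x in row))

def matched_substitute(k, v, atom_k, atom_v):
    """Return atom_k atom_v^T rescaled to the Frobenius norm of k v^T."""
    scale = frobenius_rank1(k, v) / frobenius_rank1(atom_k, atom_v)
    return [[scale * a * b for b in atom_v] for a in atom_k]

# Original write with Frobenius norm 5, hypothetical atom with norm 2.
k, v = [3.0, 4.0], [1.0, 0.0, 0.0]
atom_k, atom_v = [1.0, 0.0], [0.0, 2.0, 0.0]

swapped = matched_substitute(k, v, atom_k, atom_v)
print(frobenius_rank1(k, v), frobenius(swapped))
```

Because the substituted matrix carries the same norm as the original write, any change in downstream behavior can be attributed to the atom's direction rather than to its size.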

### 3. The Numbers That Matter

On Qwen3.5-0.8B (layer 9, head 4): 92.4% successful substitutions across 4,851 tested firings. The 87-atom population test holds at 89.8%. On Mamba-2-370M: 88.1% over 2,500 firings. These success rates are measured against a matched-norm ablation baseline, the natural control for confirming that the atom's structure, not merely its magnitude, produces the effect.

The behavioral install experiment is the most striking result: sustaining a substitution across three consecutive positions with a $3\times$ lift moves target-in-continuation recall from 33.3% to 100% under greedy decoding. This is the first demonstrated causal intervention at the matrix-recurrent write site.

### 4. Implications and Potential Losers

**For interpretability of recurrent LLMs.** Recurrent and hybrid architectures (Mamba, RWKV, DeltaNet) are gaining ground in memory- and latency-constrained deployments. Until now, they remained opaque to standard interpretability tooling. WriteSAE opens a direct analysis channel into their state memory, something residual SAEs could not reach.

**For steering and alignment.** The closed-form logit shift combined with successful behavioral installs suggests WriteSAE is not merely a post-hoc analysis tool. It also offers a mechanism for precise behavioral intervention on recurrent models without fine-tuning.

**Potential losers.** Teams that have invested in interpretability pipelines relying exclusively on residual SAEs to audit hybrid models will need to revisit their methodology — their analyses are likely incomplete. More structurally, recurrent architectures had an implicit advantage in opacity: they were less auditable than Transformers. WriteSAE narrows that gap.

The unresolved limitation: experiments target relatively small models (0.8B and 370M parameters). Whether the rank-1 decomposition scales to matrix caches in larger models, where $d_k$ and $d_v$ are substantially larger, remains undemonstrated. A cache has $d_k \times d_v$ entries, so the cost of SAE training over matrix spaces grows quadratically when both dimensions scale together, a non-trivial obstacle for production-scale models.
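The scaling concern can be illustrated with simple counting. The sketch below compares the number of entries in a full $d_k \times d_v$ write with the number of values needed to store one factored rank-1 atom. The dimensions are hypothetical and are not taken from the models above, and the summary does not say which part of training cost dominates.

```python
def flat_entries(d_k, d_v):
    """Entries in a full d_k x d_v cache write."""
    return d_k * d_v

def factored_entries(d_k, d_v):
    """Values needed to store one rank-1 atom as a (key, value) pair."""
    return d_k + d_v

# Hypothetical square head dimensions, chosen only for illustration.
for d in (64, 128, 256):
    print(d, flat_entries(d, d), factored_entries(d, d))
```

Doubling both dimensions quadruples the number of matrix entries an SAE must reconstruct, while the storage for a single factored atom only doubles.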


Summary generated by Claude — human-verified