The Weight Norm Sets the Grokking Timescale: A Causal Delay Law
**In three lines.** A causal study of grokking: the delay before generalization depends on the weight norm. Under free weight decay, networks grok at a stable critical norm Wc (CV 1–2%). When the norm is clamped to ρ×Wc, the delay follows T_grok ∝ exp(α·ρ) with α≈7.5 (R²=0.996 across 4 moduli). Over the swept ranges, the norm shifts the delay by ~19×, versus ~2× for the learning rate.
## Grokking: weight norm as a causal clock
### 1. What was disputed — and why it mattered
Since Power et al. (2022), grokking — generalization that emerges thousands of steps after perfect training memorization — has produced contradictory mechanistic accounts. One camp observed a critical weight norm at the transition; another reported grokking with no identifiable fixed norm. The disagreement was not academic: if the norm is merely an observational correlate, strategies built around aggressive weight decay rest on a false intuition. If it is causal, it is a direct control knob.
The methodological flaw was standard: all prior studies *observed* the norm without *intervening* on it. Correlation ≠ causation, even in deep learning.
### 2. The causal intervention: norm pinning
arXiv:2606.13753 resolves this by applying direct intervention logic. Instead of letting the norm evolve freely under weight decay, the authors *clamp* it to a target value ρ × Wc throughout training, where Wc is the critical norm measured under free training.
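A minimal sketch of what clamping the norm could look like, assuming the pin is enforced by rescaling all parameters to a fixed global L2 norm after each optimizer step. The summary does not say how the authors implement the clamp, so the helper names `global_norm` and `pin_norm`, the rescaling approach, and every numeric value here are illustrative assumptions.

```python
import math


def global_norm(params):
    """L2 norm over all parameters (params is a list of lists of floats)."""
    return math.sqrt(sum(w * w for layer in params for w in layer))


def pin_norm(params, target_norm):
    """Rescale every parameter so the global L2 norm equals target_norm.

    The direction of the weight vector is unchanged; only its length moves.
    """
    current = global_norm(params)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero parameter set")
    scale = target_norm / current
    return [[w * scale for w in layer] for layer in params]


# Hypothetical values for illustration only (not taken from the paper).
W_c = 40.0   # critical norm, as it would be measured under free weight decay
rho = 1.2    # pinning ratio
params = [[0.5, -1.5, 2.0], [3.0, -0.25]]

# In a training loop, this call would follow each optimizer step.
params = pin_norm(params, rho * W_c)
print(global_norm(params))
```

Rescaling is only one way to hold a norm fixed; a projected-gradient step or a per-layer constraint would be alternatives, and they need not produce identical dynamics.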
Key results:
- **Wc is remarkably stable**: coefficient of variation 1–2% across seeds and learning rates. Not a single-run artifact.
- **Wc follows a power law** with the modular base (the arithmetic task hyperparameter modulo p).
- **With norm clamped to ρ × Wc**, the grokking delay follows T_grok ∝ exp(α·ρ), with α ≈ 7.5, fit across four moduli with R² = 0.996. One universal exponent.
- **Lever comparison**: over the swept ranges, the pinned norm shifts the delay by ~19×; the learning rate shifts it by only ~2×. The norm dominates.
- **Norm above Wc slows, does not prevent grokking**: generalization eventually arrives, but exponentially later.
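The exponential law is linear in log space, log T_grok = log A + α·ρ, so α can be estimated with an ordinary least-squares line. The sketch below shows that fit on synthetic data generated from the law itself. The data points, the prefactor 100, and the function name `fit_exponential_law` are assumptions for illustration, not the paper's measurements or code.

```python
import math


def fit_exponential_law(rhos, delays):
    """Least-squares fit of log(T) = log(A) + alpha * rho.

    Returns (alpha, A, r_squared), with r_squared computed in log space.
    """
    ys = [math.log(t) for t in delays]
    n = len(rhos)
    mean_x = sum(rhos) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in rhos)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rhos, ys))
    alpha = sxy / sxx
    log_a = mean_y - alpha * mean_x
    ss_res = sum((y - (log_a + alpha * x)) ** 2 for x, y in zip(rhos, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return alpha, math.exp(log_a), 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated from the law (NOT measured values).
rhos = [1.0, 1.1, 1.2, 1.3, 1.4]
delays = [100.0 * math.exp(7.5 * r) for r in rhos]

alpha, prefactor, r2 = fit_exponential_law(rhos, delays)
print(alpha, prefactor, r2)
```

On real runs the delays would be noisy and the prefactor could differ per modulus. A shared α across moduli is what the reported "one universal exponent" would correspond to in this kind of fit.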
### 3. The LayerNorm control — and what it reveals
The most informative control experiment: adding a final LayerNorm *removes* the exponential dependence. The mechanism is that LayerNorm decouples weight scale from the function the network computes. The norm therefore does not act on the loss or gradients directly as a scalar; it acts through its effect on the *function* the network implements. Without the LayerNorm, weight scale and function are coupled again, and the exponential law returns.
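The decoupling can be checked numerically on a toy case. The sketch below assumes a single bias-free linear layer followed by a LayerNorm without learned gain or offset. These simplifications are for illustration only and are not the paper's architecture. Multiplying the weights by 3 triples the raw output but leaves the normalized output unchanged, up to the small `eps` term.

```python
import math


def linear(weights, x):
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def layer_norm(h, eps=1e-12):
    """LayerNorm without learned gain/offset: zero mean, unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


# Toy weights and input, chosen arbitrarily for illustration.
weights = [[0.2, -0.5, 1.0], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1]]
x = [1.0, 2.0, -1.0]
scaled = [[3.0 * w for w in row] for row in weights]  # same direction, 3x norm

plain_small = linear(weights, x)
plain_large = linear(scaled, x)          # raw outputs grow with the weight scale
normed_small = layer_norm(plain_small)
normed_large = layer_norm(plain_large)   # normalized outputs do not

print(plain_small, plain_large)
print(normed_small, normed_large)
```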
This also clarifies the relationship with prior theoretical work: the logarithmic delay predicted for a *freely contracting* norm (under pure weight decay) is the counterpart to the exponential law measured here for a *pinned* norm. Both are consistent within a unified framework.
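One idealized way to see why a freely contracting norm gives a logarithmic delay: assume (this is an assumption, not a derivation from the source or from the prior work it mentions) that decoupled weight decay dominates, so the norm shrinks by a factor (1 − lr·λ) per step. The number of steps to fall from W0 to Wc is then log(Wc/W0) / log(1 − lr·λ), which grows only logarithmically in W0/Wc. The function name and all values below are hypothetical.

```python
import math


def steps_to_reach(w0, w_c, lr, weight_decay):
    """Steps for a norm that shrinks by (1 - lr * weight_decay) per step
    to fall from w0 to w_c. Idealized: gradient contributions are ignored."""
    shrink = 1.0 - lr * weight_decay
    if not 0.0 < shrink < 1.0:
        raise ValueError("need 0 < lr * weight_decay < 1")
    if w_c <= 0.0 or w0 <= w_c:
        raise ValueError("need 0 < w_c < w0")
    return math.ceil(math.log(w_c / w0) / math.log(shrink))


# Hypothetical values: each doubling of the starting norm adds a roughly
# constant number of steps, which is the signature of a logarithmic delay.
W_c = 40.0
steps_2x = steps_to_reach(2 * W_c, W_c, lr=1e-3, weight_decay=1.0)
steps_4x = steps_to_reach(4 * W_c, W_c, lr=1e-3, weight_decay=1.0)
steps_8x = steps_to_reach(8 * W_c, W_c, lr=1e-3, weight_decay=1.0)
print(steps_2x, steps_4x, steps_8x)
```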
### 4. Practical implications and potential losers
**For practitioners trying to accelerate grokking** (or understand delayed generalization in their own models): the actionable lever is the norm, not the learning rate. Over the tested ranges, the norm moves the delay roughly 9× more than the learning rate does (~19× versus ~2×), which suggests calibrating weight decay to reach Wc quickly rather than tuning LR.
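The arithmetic behind these ratios can be made explicit. Assuming the law T_grok ∝ exp(α·ρ) holds exactly with α = 7.5, the delay ratio between two pinning ratios depends only on their difference, and a 19× shift corresponds to a ρ span of ln(19)/7.5. That span is inferred here from the law; the summary does not state the swept range. The function names are hypothetical.

```python
import math

ALPHA = 7.5  # exponent reported in the summary


def delay_ratio(rho_a, rho_b, alpha=ALPHA):
    """Ratio T_grok(rho_b) / T_grok(rho_a) under T_grok ∝ exp(alpha * rho)."""
    return math.exp(alpha * (rho_b - rho_a))


def rho_span_for_ratio(ratio, alpha=ALPHA):
    """Width of the rho interval that produces a given delay ratio."""
    return math.log(ratio) / alpha


span = rho_span_for_ratio(19.0)   # rho span implied by a 19x delay shift
lever_ratio = 19.0 / 2.0          # norm lever versus learning-rate lever

print(span, delay_ratio(1.0, 1.0 + span), lever_ratio)
```

Under this reading, a change of about 0.4 in ρ accounts for the whole 19× effect, so small errors in hitting the target norm translate into large changes in delay.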
**For generalization research**: the law T_grok ∝ exp(7.5·ρ) is precise enough (R²=0.996 across 4 configurations) to serve as a reference benchmark. Any theoretical model of grokking that fails to reproduce this exponential dependence with this exponent is incomplete.
**Potential losers**:
- Grokking explanations centered on *circuit structure* (the network must "discover" a modular algorithm) without reference to the norm are weakened: the norm alone, held constant, is sufficient to modulate the delay by 19×.
- Approaches that use LayerNorm by default in grokking experiments may have masked this causal signal; their negative results on norm dependence are now explicable.
- Work proposing learning rate as the primary delay control lever is relativized by the 19×/2× ratio.
**Limitation to note**: experiments use modular arithmetic (the canonical grokking task), not production architectures. Transferability to transformers trained on natural language remains to be established. But the causal rigor — direct intervention, LayerNorm control, universality of exponent α across multiple moduli — places this result above the typical correlational grokking study.
Summary generated by Claude — human-verified