arXiv cs.LG·27 May 2026

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

In three lines: MechRL uses a PPO agent operating over the 144 attention heads of GPT-2 small to discover mechanistic circuits automatically. Trained on induction and indirect object identification (IOI) tasks, the agent identifies causally relevant heads via zero-ablation and contrastive rewards, and it generalizes to docstring completion (96% of oracle with best-of-five planning).
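To make the zero-ablation idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The additive metric `toy_logit_diff`, the helper names, and the two "relevant" heads are all assumptions chosen for illustration; a real setup would run GPT-2 small and measure a task metric on model outputs.

```python
N_LAYERS, N_HEADS = 12, 12  # GPT-2 small: 12 x 12 = 144 attention heads


def head_index(layer, head):
    """Flatten (layer, head) into a single action id in [0, 144)."""
    return layer * N_HEADS + head


def zero_ablate(head_outputs, ablated):
    """Return a copy of the per-head outputs with the ablated heads set to zero."""
    return [0.0 if i in ablated else v for i, v in enumerate(head_outputs)]


def toy_logit_diff(head_outputs):
    """Hypothetical stand-in for a task metric: the sum of head contributions."""
    return sum(head_outputs)


def ablation_effect(head_outputs, ablated):
    """Drop in the toy metric caused by zero-ablating the given heads."""
    clean = toy_logit_diff(head_outputs)
    ablated_score = toy_logit_diff(zero_ablate(head_outputs, ablated))
    return clean - ablated_score


# Toy data (assumed): only two arbitrarily chosen heads contribute to the metric.
outputs = [0.0] * (N_LAYERS * N_HEADS)
outputs[head_index(5, 1)] = 2.0
outputs[head_index(9, 6)] = 1.5

effect_relevant = ablation_effect(outputs, {head_index(5, 1)})
effect_irrelevant = ablation_effect(outputs, {head_index(0, 0)})
```

In this toy setting, ablating a contributing head lowers the metric by exactly that head's contribution, while ablating a non-contributing head changes nothing. A reward built from such differences would, under these assumptions, steer an agent toward the causally relevant heads.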

Reinforcement learning Evals Papers

Summary generated by Claude — human-verified
