MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability
Signal: 78
Hype: 15
In three lines
MechRL uses a PPO agent operating over the 144 attention heads (12 layers × 12 heads) of GPT-2 small to automatically discover mechanistic circuits. Trained on induction and IOI tasks, the agent identifies causally relevant heads using zero-ablation and contrastive rewards, and it generalizes to docstring completion, reaching 96% of oracle performance with best-of-five planning.
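To make "zero-ablation" concrete, here is a minimal toy sketch in NumPy. It is not MechRL's implementation: the random `head_outputs`, the `readout` direction, the toy width `D_MODEL`, and the linear `task_score` are all hypothetical stand-ins for a real model and task metric. It only illustrates the general idea of zeroing one head's output at a time and measuring how much a task score drops.

```python
import numpy as np

# GPT-2 small has 12 layers x 12 heads = 144 attention heads.
# D_MODEL is a toy width chosen for illustration only.
N_LAYERS, N_HEADS, D_MODEL = 12, 12, 8

rng = np.random.default_rng(0)
# Hypothetical stand-ins: per-head output vectors and a task readout direction.
head_outputs = rng.normal(size=(N_LAYERS, N_HEADS, D_MODEL))
readout = rng.normal(size=D_MODEL)


def task_score(outputs, keep_mask):
    """Sum the outputs of kept heads (the rest are zero-ablated) and project onto readout."""
    ablated = outputs * keep_mask[:, :, None]  # masked-out heads contribute zero
    return float(ablated.sum(axis=(0, 1)) @ readout)


def ablation_effects(outputs):
    """Return the score drop caused by zero-ablating each head individually."""
    full = np.ones((N_LAYERS, N_HEADS), dtype=bool)
    base = task_score(outputs, full)
    effects = np.zeros((N_LAYERS, N_HEADS))
    for layer in range(N_LAYERS):
        for head in range(N_HEADS):
            mask = full.copy()
            mask[layer, head] = False  # ablate exactly one head
            effects[layer, head] = base - task_score(outputs, mask)
    return effects


effects = ablation_effects(head_outputs)
```

In this toy the score is linear in the head outputs, so each head's effect is simply its own projection onto `readout`. In a real transformer, heads interact, which is presumably part of why a search procedure such as an RL agent is used rather than a single pass of per-head ablations.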
Summary generated by Claude — human-verified