arXiv cs.LG

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

Signal: 78
Hype: 15
In three lines: MechRL uses a PPO agent that operates over the 144 attention heads of GPT-2 small to discover mechanistic circuits automatically. Trained on induction and indirect object identification (IOI) tasks, the agent identifies causally relevant heads using zero-ablation and contrastive rewards. It generalizes to docstring completion, reaching 96% of oracle with best-of-five planning.
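To make the zero-ablation idea concrete, here is a minimal toy sketch, not MechRL's environment or reward. It assumes a purely additive stand-in for a task metric, with one hypothetical contribution per head on a 12 x 12 grid (144 heads, as in GPT-2 small). The names `make_toy_head_effects`, `task_score`, and `ablation_effect`, and the two "important" heads, are arbitrary placeholders chosen for illustration.

```python
import random

N_LAYERS, N_HEADS = 12, 12  # GPT-2 small: 12 layers x 12 heads = 144 heads


def make_toy_head_effects(seed=0, important=((3, 4), (7, 8))):
    """Hypothetical per-head contributions to a task metric (toy data)."""
    rng = random.Random(seed)
    effects = {
        (layer, head): rng.uniform(-0.01, 0.01)
        for layer in range(N_LAYERS)
        for head in range(N_HEADS)
    }
    # Assumption: two placeholder heads carry almost all of the signal.
    for head in important:
        effects[head] = 1.0
    return effects


def task_score(effects, ablated=frozenset()):
    """Toy metric: sum of contributions from heads that are not zero-ablated."""
    return sum(value for head, value in effects.items() if head not in ablated)


def ablation_effect(effects, head):
    """Drop in the toy metric when one head's output is set to zero."""
    return task_score(effects) - task_score(effects, frozenset([head]))


effects = make_toy_head_effects()
# Rank heads by how much the metric falls when each is ablated on its own.
ranked = sorted(effects, key=lambda h: ablation_effect(effects, h), reverse=True)
top2 = ranked[:2]
```

In a real transformer, head contributions are not additive, so ablation effects would have to be measured by rerunning the model with the chosen head outputs zeroed. The additive metric here only keeps the example self-contained.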
Reinforcement learning · Evals · Papers

Summary generated by Claude — human-verified