Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning
Signal: 72
Hype: 28
In three lines
Agent-ToM is a learning-to-monitor framework that uses Theory-of-Mind reasoning to detect covert malicious behavior in autonomous LLM agents.
It infers agent beliefs, intent hypotheses, and behavioral deviations from task-consistent baselines.
Evaluated on the SHADE-Arena and CUA-SHADE-Arena benchmarks, its two-call reasoning pipeline outperforms ensemble monitoring baselines.
Summary generated by Claude — human-verified