arXiv cs.LG

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Signal: 72 · Hype: 28
In three lines: Agent-ToM is a learning-to-monitor framework that uses Theory-of-Mind reasoning to detect covert malicious behavior in autonomous LLM agents. It infers agent beliefs, intent hypotheses, and behavioral deviations from task-consistent baselines. Evaluated on the SHADE-Arena and CUA-SHADE-Arena benchmarks, it outperforms ensemble monitoring baselines with a two-call reasoning pipeline.
AI Agents · AI safety · Reasoning · Benchmarks

Summary generated by Claude — human-verified