arXiv cs.LG·26 May 2026

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

In three lines: Agent-ToM is a learning-to-monitor framework that uses Theory-of-Mind reasoning to detect covert malicious behavior in autonomous LLM agents. It infers an agent's beliefs and intent hypotheses, and flags behavioral deviations from task-consistent baselines. Evaluated on the SHADE-Arena and CUA-SHADE-Arena benchmarks, it outperforms ensemble monitoring baselines using a two-call reasoning pipeline.

AI Agents · AI safety · Reasoning · Benchmarks

Summary generated by Claude — human-verified
