arXiv cs.AI

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Signal: 78
Hype: 25
In three lines: OpenClawBench is a dataset of 31,264 annotated agent execution trajectories for detecting process-side anomalies that task success alone does not reveal. Among 31,135 passing executions, 2,904 contain anomalies (unresolved ambiguity, unsafe writes, ignored errors). A fine-tuned Gemma 3 12B detector reaches F1 = 0.729.
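To put the counts above in proportion, here is a minimal Python sketch that turns them into shares. It assumes the 31,135 passing executions are a subset of the 31,264 trajectories, which the summary implies but does not state outright. The constant names and the `share` helper are illustrative, not taken from the paper.

```python
# Counts quoted in the summary above.
TOTAL_TRAJECTORIES = 31_264
PASSING_EXECUTIONS = 31_135
PASSING_WITH_ANOMALIES = 2_904


def share(part: int, whole: int) -> float:
    """Return part / whole as a fraction (hypothetical helper)."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Assumption: passing executions are counted within the full dataset.
passing_share = share(PASSING_EXECUTIONS, TOTAL_TRAJECTORIES)
anomaly_share = share(PASSING_WITH_ANOMALIES, PASSING_EXECUTIONS)

print(f"Passing executions: {passing_share:.1%}")       # 99.6%
print(f"Passing with anomalies: {anomaly_share:.1%}")   # 9.3%
```

Under that assumption, roughly one in eleven executions that pass the task check still carries a process-side anomaly.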
AI Agents · Benchmarks · Evals · AI safety · Gemini

Summary generated by Claude — human-verified