arXiv cs.AI · 29 May 2026

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

In three lines: OpenClawBench is a dataset of 31,264 annotated trajectories for detecting process-side anomalies in agent execution beyond task success. Among 31,135 passing executions, 2,904 contain anomalies (unresolved ambiguity, unsafe writes, ignored errors). A fine-tuned Gemma 3 12B detector reaches F1=0.729.


AI Agents Benchmarks Evals AI safety Gemini

Summary generated by Claude — human-verified
