OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
Signal: 78
Hype: 25
In three lines
OpenClawBench is a dataset of 31,264 annotated agent execution trajectories for detecting process-side anomalies that task success alone does not reveal. Among the 31,135 executions that pass, 2,904 still contain anomalies (unresolved ambiguity, unsafe writes, ignored errors). A fine-tuned Gemma 3 12B detector reaches F1 = 0.729.
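To put the counts in proportion, here is a minimal Python sketch that derives the share of passing executions carrying an anomaly and restates how an F1 score is formed from precision and recall. The three counts come from the summary above. The variable names, the `f1_score` helper, and the precision/recall pair used to exercise it are illustrative assumptions. The summary does not report the detector's precision or recall, nor does it say what the trajectories outside the passing set are.

```python
# Counts taken from the summary above.
TOTAL_TRAJECTORIES = 31_264
PASSING_EXECUTIONS = 31_135
PASSING_WITH_ANOMALIES = 2_904


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Share of passing executions that still contain a process-side anomaly.
anomaly_rate = PASSING_WITH_ANOMALIES / PASSING_EXECUTIONS

# Trajectories not counted among the passing executions (simple subtraction;
# the summary does not describe them further).
outside_passing_set = TOTAL_TRAJECTORIES - PASSING_EXECUTIONS

# Hypothetical precision/recall values, used only to exercise the formula.
# They are NOT the detector's reported numbers.
example_f1 = f1_score(precision=0.5, recall=0.5)

print(f"Anomaly rate among passing executions: {anomaly_rate:.2%}")
print(f"Trajectories outside the passing set: {outside_passing_set}")
print(f"F1 for the hypothetical pair (0.5, 0.5): {example_f1}")
```

On these counts, roughly one passing execution in eleven carries an anomaly that a pass/fail check alone would not flag.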
Summary generated by Claude — human-verified