DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
Signal: 72
Hype: 18
In three lines
DiagEval is a trajectory-conditioned diagnostic evaluation protocol for GUI agents testing LLM-generated interactive software. It reuses failed trajectories to determine whether failures stem from the evaluator or the software itself. On WebDevJudge-Unit and RealDevBench, DiagEval recovers 45.6-62.1% of false negatives and improves accuracy from 69.9% to 78.3% and from 65.0% to 81.6%.
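The link between recovered false negatives and accuracy can be illustrated with a small sketch. It is not the paper's implementation: the `Confusion` class, the `recover_false_negatives` helper, and all counts below are hypothetical, and it assumes that diagnosis only flips evaluator-caused false negatives to passes without introducing new errors.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Evaluator verdicts versus ground truth (all counts hypothetical)."""
    tp: int  # working software judged as passing
    fn: int  # working software judged as failing (false negative)
    tn: int  # broken software judged as failing
    fp: int  # broken software judged as passing

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


def recover_false_negatives(c: Confusion, recovery_rate: float) -> Confusion:
    """Move a fraction of false negatives to true positives.

    Assumption: a recovered case is one where re-examining the failed
    trajectory attributes the failure to the evaluator, not the software.
    """
    recovered = round(c.fn * recovery_rate)
    return Confusion(tp=c.tp + recovered, fn=c.fn - recovered, tn=c.tn, fp=c.fp)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
before = Confusion(tp=50, fn=20, tn=25, fp=5)
after = recover_false_negatives(before, recovery_rate=0.5)

print(f"accuracy before: {before.accuracy():.2f}")
print(f"accuracy after:  {after.accuracy():.2f}")
```

Under these assumptions, every recovered false negative adds one correct verdict, so the accuracy gain equals the number of recovered cases divided by the total number of evaluated cases.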
Summary generated by Claude — human-verified