arXiv cs.AI

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Signal: 72
Hype: 18
In three lines
DiagEval is a trajectory-conditioned diagnostic evaluation protocol for GUI agents that test LLM-generated interactive software. It reuses failed trajectories to determine whether a failure stems from the evaluator or from the software itself. On WebDevJudge-Unit and RealDevBench, DiagEval recovers 45.6-62.1% of false negatives and improves accuracy from 69.9% to 78.3% and from 65.0% to 81.6%, respectively.
AI Agents, Evals, Code generation

Summary generated by Claude — human-verified