arXiv cs.AI · 19 May 2026

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents


In three lines: DiagEval is a trajectory-conditioned diagnostic evaluation protocol for GUI agents that test LLM-generated interactive software. It reuses failed trajectories to determine whether a failure stems from the evaluator or from the software itself. On WebDevJudge-Unit and RealDevBench, DiagEval recovers 45.6–62.1% of false negatives and improves accuracy from 69.9% to 78.3% and from 65.0% to 81.6%.
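To put the reported accuracy figures in perspective, here is a minimal sketch that converts them into absolute percentage-point gains and the share of the original error that was removed. The helper names are hypothetical and used only for illustration. The two pairs are taken in the order the summary quotes them; the summary does not label which pair belongs to which benchmark.

```python
# Accuracy figures (in percent) exactly as quoted in the summary.
# Assumption: pairs are listed in order of appearance; the summary does not
# say which pair belongs to which benchmark.
reported = [(69.9, 78.3), (65.0, 81.6)]


def point_gain(before: float, after: float) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the original error (100 - accuracy) that was removed."""
    return round((after - before) / (100.0 - before), 3)


gains = [point_gain(b, a) for b, a in reported]
reductions = [relative_error_reduction(b, a) for b, a in reported]

for (before, after), gain, cut in zip(reported, gains, reductions):
    print(f"{before}% -> {after}%: +{gain} points, {cut:.1%} of errors removed")
```

Under these figures the gains come to 8.4 and 16.6 percentage points, which corresponds to removing roughly 28% and 47% of the original errors.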


AI Agents Evals Code generation

Summary generated by Claude — human-verified
