arXiv cs.AI·3 June 2026

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

In three lines: DeskCraft is a desktop GUI benchmark that evaluates agents on long-horizon professional workflows (more than 50 steps) in design, video, audio, and 3D, including human-agent collaboration. Eighteen agents were tested on 538 tasks; GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. The results reveal persistent failures in proactive clarification and long-horizon workflow delivery.

AI Agents Benchmarks Evals

Summary generated by Claude — human-verified
