DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Signal
82
Hype
18
In three linesDeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D with human-agent collaboration. 18 agents tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Reveals persistent failures in proactive clarification and long-horizon workflow delivery.Read source
Your take?
Summary generated by Claude — human-verified