arXiv cs.AI

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

In three lines: A study of computer-use agent latency on OSWorld finds that LLM calls for planning and reflection dominate total time. The 16 agents tested take 2.7–4.3× more steps than optimal human trajectories. Steps late in a task can take 3× longer than steps at the beginning.

## OSWorld-Human: when latency makes computer-use agents practically unusable

### 1. What is being measured — and why it was missing

OSWorld is the flagship benchmark for computer-use agents. Since its release, the field has optimized along a single axis: task success rate. OSWorld-Human (arXiv:2506.16042) introduces an orthogonal, previously ignored dimension: temporal efficiency. This is the first systematic study of end-to-end latency on this benchmark.

The baseline observation is stark: tasks humans complete in a few minutes take **tens of minutes** for the best current agents. This is not an implementation detail — it is a hard usability barrier.

### 2. The numbers that matter

**Step overhead**: the 16 evaluated agents require **2.7 to 4.3× more steps** than the optimal human trajectories manually annotated in OSWorld-Human. Even the most efficient agent in the cohort nearly triples the number of required actions.

**Progressive degradation**: later steps can take **3× longer** than steps at the beginning of a task. This is driven by context accumulation in LLM calls: as trajectories grow longer, prompts grow larger, and inference becomes more expensive. Extra steps therefore cost more than their count suggests: when per-step latency rises with the step index, total latency grows faster than linearly in the number of steps.
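A minimal sketch of why growing per-step cost compounds, assuming (as a simplification that is not taken from the paper) that per-step latency rises linearly with the step index. The names `step_latency` and `total_latency` and the numbers 10 s and 1 s are hypothetical, chosen only so that step 20 costs 3× step 0.

```python
def step_latency(step_index: int, base_s: float, growth_s: float) -> float:
    """Latency of one step, assuming cost grows linearly with the step index."""
    return base_s + growth_s * step_index


def total_latency(num_steps: int, base_s: float, growth_s: float) -> float:
    """Sum of per-step latencies over a trajectory of num_steps steps."""
    return sum(step_latency(i, base_s, growth_s) for i in range(num_steps))


# Illustrative numbers only: 10 s for the first step, +1 s per later step.
BASE_S, GROWTH_S = 10.0, 1.0

short_run = total_latency(21, BASE_S, GROWTH_S)  # steps 0..20
long_run = total_latency(42, BASE_S, GROWTH_S)   # twice as many steps

print(f"step 0:  {step_latency(0, BASE_S, GROWTH_S):.0f} s")
print(f"step 20: {step_latency(20, BASE_S, GROWTH_S):.0f} s")
print(f"21 steps: {short_run:.0f} s, 42 steps: {long_run:.0f} s")
```

Under these assumed numbers, doubling the step count roughly triples total time (420 s versus 1281 s), which is why step overhead and per-step slowdown multiply rather than add.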

**Identified bottleneck**: large model calls for **planning, reflection, and judging** account for most of the total latency. The interface actions themselves (clicks, keystrokes) are cheap; the model reasoning between actions is what dominates.
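One way to locate this kind of bottleneck in your own agent is to attribute wall-clock time to phases and look at the shares. The sketch below is a hypothetical helper (`LatencyLedger` is not from the paper or any library), and the timings fed into it are made-up placeholders, not measurements.

```python
from collections import defaultdict
from typing import Dict


class LatencyLedger:
    """Accumulates seconds per phase and reports each phase's share of the total."""

    def __init__(self) -> None:
        self._seconds: Dict[str, float] = defaultdict(float)

    def record(self, phase: str, seconds: float) -> None:
        """Add a measured duration to the given phase."""
        if seconds < 0:
            raise ValueError("seconds must be non-negative")
        self._seconds[phase] += seconds

    def shares(self) -> Dict[str, float]:
        """Return phase -> fraction of total time (empty if nothing recorded)."""
        total = sum(self._seconds.values())
        if total == 0:
            return {}
        return {phase: s / total for phase, s in self._seconds.items()}


# Placeholder timings for a single step, purely illustrative.
ledger = LatencyLedger()
ledger.record("planning", 12.0)
ledger.record("action", 1.0)
ledger.record("reflection", 7.0)

for phase, share in sorted(ledger.shares().items(), key=lambda kv: -kv[1]):
    print(f"{phase:<10} {share:.0%}")
```

In a real harness the durations would come from timing each call, for example with `time.perf_counter()` before and after the LLM request or the UI action.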

### 3. What OSWorld-Human concretely contributes

The manually annotated dataset provides, for each OSWorld task, a human reference trajectory. This enables two things the original benchmark did not support:

- **Measuring relative efficiency**: agent-steps / human-steps ratio, independent of task success or failure.
- **Identifying over-navigation patterns**: where agents diverge from optimal paths, which task categories generate the most detours.
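The agent-steps / human-steps ratio can be sketched as follows. The function names and the sample step counts are hypothetical, and pooling totals across tasks is one possible aggregation choice, not necessarily the one the paper uses.

```python
from typing import Iterable, Tuple


def step_ratio(agent_steps: int, human_steps: int) -> float:
    """Agent-steps / human-steps for one task; 1.0 means the agent matched the human path."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return agent_steps / human_steps


def pooled_step_ratio(runs: Iterable[Tuple[int, int]]) -> float:
    """Total agent steps over total human steps across (agent_steps, human_steps) pairs."""
    agent_total = 0
    human_total = 0
    for agent_steps, human_steps in runs:
        agent_total += agent_steps
        human_total += human_steps
    if human_total <= 0:
        raise ValueError("total human steps must be positive")
    return agent_total / human_total


# Made-up step counts for three tasks: (agent_steps, human_steps).
runs = [(12, 4), (9, 3), (20, 8)]

for agent_steps, human_steps in runs:
    print(f"{agent_steps:>2} vs {human_steps}: {step_ratio(agent_steps, human_steps):.2f}x")
print(f"pooled: {pooled_step_ratio(runs):.2f}x")
```

Pooling weights long tasks more heavily than averaging per-task ratios would; which is preferable depends on whether you care about typical tasks or total time spent.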

Before this work, no structured human baseline existed on OSWorld. Rigorous latency comparisons were impossible.

### 4. Implications for practitioners — and who loses

**For agent-building teams**: the reflective loop architecture (plan → act → reflect → re-judge) is the primary culprit. Reducing the number of LLM calls per task — through more direct policies, aggressive caching, or smaller models for judging steps — is now a quantifiable research direction.
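As one illustration of the caching idea, the sketch below memoizes a judging call so that a repeated (state, action) pair does not trigger a second model call. `CachedJudge`, `fake_judge`, and the string state keys are all hypothetical; a real agent would need a reliable way to decide that two screen states are equivalent, which this sketch assumes away.

```python
from typing import Callable, Dict, Tuple


class CachedJudge:
    """Wraps a judging function and reuses verdicts for repeated (state, action) pairs."""

    def __init__(self, judge_fn: Callable[[str, str], bool]) -> None:
        self._judge_fn = judge_fn
        self._cache: Dict[Tuple[str, str], bool] = {}
        self.calls = 0  # number of times the underlying judge was actually invoked

    def __call__(self, state_key: str, action: str) -> bool:
        key = (state_key, action)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge_fn(state_key, action)
        return self._cache[key]


def fake_judge(state_key: str, action: str) -> bool:
    """Stand-in for an expensive LLM judging call (assumption: pure and deterministic)."""
    return action != "noop"


judge = CachedJudge(fake_judge)
verdicts = [
    judge("settings_open", "click_save"),
    judge("settings_open", "click_save"),  # served from the cache
    judge("settings_open", "noop"),
]
print(verdicts, "underlying calls:", judge.calls)
```

Caching only helps when the same judgment recurs and the judge is deterministic for a given input; it is a complement to, not a substitute for, issuing fewer reflective calls in the first place.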

**For model providers**: computer-use agents are a regime where per-token latency and long-context cost directly degrade user experience. Models optimized for batch throughput are poorly suited to this interactive setting.

**Direct losers**: systems that maximized OSWorld scores by stacking reflection and verification passes. An agent with 70% success but 4× too many steps is not production-deployable. Success rate alone was an incomplete compass — teams that over-optimized on it will need to rearchitect.

**Potential winners**: approaches that favor short, deterministic trajectories, such as agents trained by imitation on short human trajectories or architectures without explicit reflection loops. The human reference trajectories in OSWorld-Human are a natural yardstick for this kind of approach.

The core signal: optimizing solely for benchmark success rate without latency constraints produces unusable systems. OSWorld-Human imposes a dual constraint — do it right *and* do it fast — which finally matches what real-world deployment actually requires.


Summary generated by Claude — human-verified