MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
In three lines: MacArena is a benchmark of 421 tasks across 50 macOS applications that evaluates computer-use agents in native Apple Silicon environments. A leading model drops by more than 26 percentage points on macOS-native tasks, which suggests that existing benchmarks fail to capture genuine cross-platform GUI complexity.
## MacArena: How macOS Exposes Structural Limits in GUI Agents
### 1. The gap this benchmark fills
Since 2024, OSWorld has been the de facto standard for evaluating computer-use agents (CUAs) on virtualized Linux/Windows environments. The problem: macOS was nearly absent from the picture. The only existing benchmark, macOSWorld, covered a narrow slice (mostly first-party Apple apps) and ran on x86 VMs incompatible with Apple Silicon. Yet Apple has shipped Apple Silicon Macs since late 2020 and its current lineup is entirely Apple Silicon, so that is the actual deployment environment for any enterprise or developer targeting macOS.
MacArena fills this gap with 421 manually verified tasks across 50 applications, running natively on Apple's Virtualization framework — no x86 emulation. The composition is hybrid: ported OSWorld tasks, macOSWorld content, and 49 new macOS-native tasks. That last group is the most informative.
### 2. The number that matters: -26 points and rank inversions
The central finding is stark. A leading model on existing benchmarks drops by more than 26 percentage points on the MacArena native subset. More revealing: model rankings invert between ported tasks (from OSWorld) and macOS-native tasks. A model that dominates the former can rank last on the latter.
This inversion signals that current OSWorld performance partly measures familiarity with a specific task distribution — Linux/GNOME visual patterns, GTK/Qt interface conventions — rather than genuine cross-platform GUI competence. Agents have learned benchmark artifacts, not the underlying skill.
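To make the two quantities in this section concrete, here is a minimal Python sketch that computes a percentage-point drop, the corresponding relative drop, and the model pairs whose order flips between subsets. The model names and scores are hypothetical placeholders chosen for illustration, not figures from the paper, and the helper names are assumptions of this sketch.

```python
from itertools import combinations

# Hypothetical success rates in percent. These are NOT results from the paper.
scores = {
    "model_a": {"ported": 60.0, "native": 30.0},
    "model_b": {"ported": 50.0, "native": 40.0},
    "model_c": {"ported": 40.0, "native": 45.0},
}


def point_drop(model_scores):
    """Absolute drop in percentage points from ported to native tasks."""
    return model_scores["ported"] - model_scores["native"]


def relative_drop(model_scores):
    """Drop expressed as a fraction of the ported-task score."""
    return point_drop(model_scores) / model_scores["ported"]


def ranking(all_scores, subset):
    """Model names ordered from best to worst on one subset."""
    return sorted(all_scores, key=lambda m: all_scores[m][subset], reverse=True)


def inverted_pairs(all_scores, first="ported", second="native"):
    """Pairs of models whose relative order flips between the two subsets."""
    flipped = []
    for m1, m2 in combinations(sorted(all_scores), 2):
        gap_first = all_scores[m1][first] - all_scores[m2][first]
        gap_second = all_scores[m1][second] - all_scores[m2][second]
        if gap_first * gap_second < 0:  # opposite signs: the order reversed
            flipped.append((m1, m2))
    return flipped


print(ranking(scores, "ported"))
print(ranking(scores, "native"))
print(inverted_pairs(scores))
```

With these placeholder numbers, `model_a` loses 30 percentage points, which is a 50% relative drop. The two readings differ sharply, which is why the unit matters when quoting the headline figure.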
In practice, macOS presents distinct GUI challenges: a global menu bar (vs. window-embedded menus), the Dock, Mission Control, window management without a Windows-style maximize button, Cmd vs. Ctrl shortcuts, system-specific dialogs (permissions, Keychain), and different UI density in apps like Xcode, Final Cut Pro, or Logic Pro.
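The Cmd vs. Ctrl point can be illustrated with a deliberately naive Python sketch of how an agent trained on Linux conventions might remap hotkeys for macOS. The mapping table, the terminal exception set, and the function name are all assumptions made for this illustration. Real applications have many more exceptions, which is part of why shortcut habits do not transfer cleanly.

```python
# Assumed modifier remapping. Alt and Option share a physical key but do not
# always trigger the same function, so this is only a first approximation.
MODIFIER_MAP = {"ctrl": "cmd", "alt": "option"}

# In a macOS terminal, these Ctrl chords keep their Unix meaning
# (interrupt, end-of-input, suspend), so they must not be remapped.
TERMINAL_PASSTHROUGH = {"ctrl+c", "ctrl+d", "ctrl+z"}


def to_macos_hotkey(hotkey, in_terminal=False):
    """Translate a Linux-style hotkey string such as 'Ctrl+C' for macOS."""
    normalized = hotkey.lower().replace(" ", "")
    if in_terminal and normalized in TERMINAL_PASSTHROUGH:
        return normalized
    keys = normalized.split("+")
    return "+".join(MODIFIER_MAP.get(key, key) for key in keys)


print(to_macos_hotkey("Ctrl+C"))                    # copy in a GUI app
print(to_macos_hotkey("Ctrl+C", in_terminal=True))  # interrupt in a terminal
```

The same key chord maps to different actions depending on the focused application, so a fixed lookup table is not enough. An agent needs to track context as well.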
### 3. Implications for RL training pipelines
OSWorld serves not only as a benchmark but as a training environment for reinforcement learning. Models like Claude Computer Use, GPT-4o with vision, and open-source agents like UFO have likely been optimized on these distributions. If MacArena's results confirm that this optimization is platform-specific, current RL pipelines are producing agents overfit to Linux/Windows.
The practical consequence: any production deployment of CUA agents on macOS — workflow automation, RPA, desktop assistants — must be re-evaluated with MacArena or a native equivalent. OSWorld metrics do not transfer.
### 4. Potential losers and blind spots
**CUA agent vendors**: Anthropic (Computer Use), OpenAI (Operator), and RPA startups like Induced AI or Adept see their public benchmark claims weakened. If their models regress by more than 26 percentage points on native macOS, performance claims need recalibration for enterprise Mac customers.
**OSWorld as de facto standard**: The paper implicitly argues that OSWorld has introduced selection bias into research. Teams that invested heavily in OSWorld optimization — synthetic data, reward shaping — will need to revisit their stack.
**macOSWorld**: The prior benchmark is directly superseded — coverage too narrow, hardware incompatibility, tasks too simple. Its content survives inside MacArena, but it loses credibility as a standalone reference.
What MacArena does not yet solve: 50 applications is still limited relative to the real macOS ecosystem (App Store, pro tools, CLI via Terminal). The 49 new native tasks are promising but too few to train models at scale. Automated ground truth instrumentation on macOS — harder than on Linux — is not fully addressed. The benchmark is currently an evaluation tool; its use as an RL training environment at scale remains undemonstrated.
Summary generated by Claude — human-verified