Two papers today attack the same problem from opposite angles: knowing when to escalate. AEGIS (arXiv:2606.06660) addresses it in long-horizon robotics. Instead of letting a weak policy spiral on a critical step, the system reads that policy's frozen activations to detect risk, then hands off to a stronger policy. On LIBERO-Spatial, it recovers 10.1% of lost trajectories versus 4.6% for blind escalation, with the strong policy activated on only 38% of steps. The multi-agent study (arXiv:2602.04234) reaches a symmetric conclusion: in 43.3% of cases across 6 reasoning benchmarks, a single agent beats a multi-agent system. The proposed Entropy Judger selects a configuration based on the problem's baseline entropy, which implies that coordination overhead is only justified when initial uncertainty is high enough. Both papers converge: escalating by default is a bad heuristic.
MacArena drives home a point computer-use practitioners know intuitively but benchmarks had ignored: 421 tasks across 50 native macOS apps on Apple Silicon show a 26% regression for models that perform well on Linux. OSWorld and macOSWorld don't capture the complexity of cross-platform GUIs — meaning published scores on those benchmarks don't predict production performance on macOS. For teams deploying GUI agents on Apple fleets, this is a direct evaluation signal.
On the NLP side, PolyFact (100K factual questions, 12 languages, Wikidata-anchored) shows GRPO outperforms supervised fine-tuning for cross-lingual factual consistency on Qwen-2.5-7B and OLMo-2-1124-7B, by reducing language specialization in MLP layers and attention heads. HKJudge (~290K sentences, ~6.5M tokens of Hong Kong criminal judgments) is more niche but is the first sentence-level annotated resource for legal discourse in a common law jurisdiction, which makes it relevant for LegalTech teams working with legal text outside the US.
AEGIS detects high-risk steps in long-horizon robot manipulation by probing frozen activations of a weak policy. Upon detection, control switches to a stronger policy for only those steps. On LIBERO-Spatial, AEGIS recovers 10.1% of lost trajectories (vs 4.6% for blind escalation), activating the stronger policy on only 38% of steps.
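The gating idea can be illustrated with a minimal sketch. It is not the AEGIS implementation: the logistic probe, the function names (`probe_risk`, `route_steps`, `escalation_rate`), the weights, and the 0.5 threshold are all assumptions chosen for illustration. It only shows the general pattern of scoring each step from the weak policy's activations and escalating when the score crosses a threshold.

```python
import math
from typing import List, Sequence


def probe_risk(activations: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear probe: logistic score over one frozen activation vector."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def route_steps(
    trajectory: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Pick 'strong' for steps the probe flags as risky, 'weak' otherwise."""
    return [
        "strong" if probe_risk(step, weights, bias) >= threshold else "weak"
        for step in trajectory
    ]


def escalation_rate(routes: Sequence[str]) -> float:
    """Fraction of steps handled by the stronger policy."""
    return routes.count("strong") / len(routes)


# Toy activations for four steps (made-up numbers).
toy_trajectory = [[0.1, -0.3], [2.0, 1.5], [-1.0, 0.2], [0.4, 0.1]]
toy_routes = route_steps(toy_trajectory, weights=[1.0, 1.0], bias=-1.0)
print(toy_routes, escalation_rate(toy_routes))
```

In this toy run only the second step is escalated. In practice the probe would be trained on labeled failure data, and the threshold would trade recovered trajectories against how often the stronger policy is invoked.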
PolyFact, a 100K multilingual factual QA dataset grounded in Wikidata across 12 languages, evaluates three approaches to improve cross-lingual factual consistency in Qwen-2.5-7B and OLMo-2-1124-7B. GRPO outperforms supervised fine-tuning by reducing language specialization in MLP layers and attention heads, promoting shared cross-lingual representations.
HKJudge is the first sentence-level expert-annotated legal discourse corpus. It contains ~290K sentences and ~6.5M tokens from Hong Kong criminal judgments across all court levels, annotated by legal linguistics experts. It defines two benchmark tasks: rhetorical role classification (26 categories) and legal element extraction. The evaluation covers BERT models as well as open-source and commercial LLMs.
MacArena is a benchmark of 421 tasks across 50 macOS applications, evaluating computer-use agents in native Apple Silicon environments. Results show a 26% performance drop for leading models on macOS-native tasks, revealing that existing benchmarks fail to capture genuine cross-platform GUI complexity.
An empirical study of 245 entropy features (at the token, agent, and round level) across 6 reasoning benchmarks and 2 agentic tasks. Counterintuitive finding: a single agent outperforms a multi-agent system (MAS) in 43.3% of cases. Three key observations: certainty preference, base entropy drives performance, and task-dependent entropy dynamics. The authors propose an Entropy Judger algorithm to select MAS solutions.
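To make the entropy signal concrete, here is a minimal sketch of a token-level entropy computation followed by a simple threshold rule. It is an assumption-laden illustration, not the paper's Entropy Judger: the function names, the averaging over tokens, and the threshold are hypothetical, and the study itself draws on 245 features rather than one.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over the tokens of a single-agent baseline answer."""
    return sum(shannon_entropy(d) for d in token_distributions) / len(token_distributions)


def choose_configuration(base_entropy: float, threshold: float) -> str:
    """Assumed rule: pay for multi-agent coordination only when uncertainty is high."""
    return "multi_agent" if base_entropy > threshold else "single_agent"


# Toy distributions (made-up numbers): one confident answer, one uncertain answer.
confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

for name, dists in [("confident", confident), ("uncertain", uncertain)]:
    h = mean_token_entropy(dists)
    print(name, round(h, 3), choose_configuration(h, threshold=0.5))
```

The threshold of 0.5 nats is arbitrary here. A real selector would need to be calibrated per task, which is consistent with the study's observation that entropy dynamics are task-dependent.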