Week of 2026-06-01

Week of June 1, 2026: evaluation in crisis, agents failing the long-horizon test

By the editorial team

The dominant theme of this week is not a model announcement but a systemic challenge to how the field measures its own progress. Three papers converge on the same diagnosis: our evaluation metrics are structurally flawed. The geometric study on LLM-as-Judge (arXiv:2606.03043) quantifies what many suspected: across 41 LLM judges and 8 Indic languages, the models' evaluation axis is nearly orthogonal to that of humans (87-89° divergence), and inter-LLM consensus (r≈0.35) systematically exceeds LLM-human alignment (r≈0.27-0.32). Simultaneously, the audit of FOLIO and MALLS reveals 39% and 36% error rates in the FOL formalizations of these reference benchmarks, on which dozens of papers have been published. The practical consequence is severe: model rankings on these corpora are partly fictitious, and the +9 to +22 point gains observed after correction on Gemma 31B, Qwen3-30B, and GPT-4o-mini reflect corrected test data, not model improvement. The clinical counterfactual evaluation (CSS metric) drives the point home: six frontier models ranked similarly on traditional metrics nearly invert their ordering when assessed on the ability to adapt oncological recommendations to case mutations, with a universal blind spot on surgical status changes.

The second structural theme is the documented failure of agents on long-horizon tasks. LongDS-Bench (68 tasks, 2,225 turns drawn from real Kaggle notebooks) caps the best models at 48.45% accuracy, with a 47-point drop between early and late turns — long-horizon errors account for 52 to 69% of total failures. DeskCraft confirms the pattern on professional GUI workflows: GPT-5.4, the best agent tested across 538 tasks in design, video, audio, and 3D, reaches only 31.6% in standard mode. MedCUA-Bench adds a critical dimension: in authentic clinical interfaces (OpenEMR), the best closed models achieve 54.2% strict success, while open-source agents average 2.5%. This triptych draws a consistent boundary: current agents handle short, well-defined tasks adequately but lose analytical and procedural coherence once the horizon exceeds a few dozen steps. The Eywa memory architecture (90.19% on LoCoMo, 88.2% on LongMemEval-S) offers a serious direction — immutable source storage, typed validation, deterministic retrieval without LLM calls — but remains a partial solution to a deeper architectural problem.

A third signal, quieter but with strong operational impact, sees two foundational vulnerabilities surface simultaneously. WASH demonstrates that averaging the probability distributions of 3 to 5 models drops z-scores for six major watermarking schemes from 5-300 to below 2 (detection threshold: 4), rendering statistical traceability of generated content practically inoperative. On the internal safety side, the study on linear representations of synthetic deception (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) shows that linear probes detect deception with AUC ≥0.99 as early as layers 1-3, opening a concrete path for activation-based monitoring — but also confirming that the capacity for coherent deception is encoded very early in the network. On the infrastructure front, the merge of the tensor-mode multi-GPU KV cache fix in llama.cpp b9455 (JohannesGaessler) is the kind of silent fix that unblocks local deployment configurations that have been stalled for weeks. Finally, LEAP solving all 12 Putnam 2025 problems in Lean and reaching 70% on Lean-IMO-Bench (versus <10% for generic LLMs) confirms that formal verification via iterative agentic decomposition is now a mature research direction, not an experimental one.

The coming week will likely see at least one major paper attempt to propose an alternative evaluation protocol to LLM-as-Judge, as the critical pressure has reached a threshold that benchmark teams can no longer easily ignore.

Today's 5 picks

GitHub Trending·SIG 85

openai/whisper

OpenAI Whisper is a speech recognition model trained on 680,000 hours of multilingual weakly supervised data. The GitHub repository includes code, pre-trained models, and benchmarks for robust speech transcription across 99 languages.

OpenAI Voice Benchmarks

arXiv cs.AI·SIG 85

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

LEAP is an agentic framework enabling LLMs to generate mechanically verifiable formal proofs in Lean. The system decomposes complex problems into smaller units through iterative interaction with the Lean compiler. On the 2025 Putnam Competition (12 problems), LEAP solves all 12; on Lean-IMO-Bench, it achieves a 70% one-shot solve rate versus <10% for general-purpose LLMs.

AI Agents Reasoning Benchmarks

arXiv cs.LG·SIG 82

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.

Benchmarks Evals AI Agents

arXiv cs.CL·SIG 82

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use only 30-50% of the human score range, and their evaluation axis is nearly orthogonal to that of humans (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.

Evals Alignment Benchmarks
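To make the two headline quantities of the LLM-as-Judge item concrete, here is a minimal sketch of how agreement (Pearson r) and score-range compression could be computed. The scores are hypothetical and chosen only for illustration, and `pearson` and `range_ratio` are names introduced here; the paper's own geometric analysis (the angle between evaluation axes) is not reproduced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def range_ratio(judge_scores, human_scores):
    """Fraction of the human score range that the judge actually uses."""
    judge_span = max(judge_scores) - min(judge_scores)
    human_span = max(human_scores) - min(human_scores)
    return judge_span / human_span

# Hypothetical scores for eight responses (illustrative only, not from the paper).
human = [1, 2, 3, 4, 5, 6, 7, 8]
judge_a = [4, 4, 5, 4, 5, 5, 6, 5]
judge_b = [4, 5, 5, 4, 5, 5, 6, 6]

llm_llm = pearson(judge_a, judge_b)       # agreement between the two judges
llm_human = pearson(judge_a, human)       # agreement of one judge with humans
compression = range_ratio(judge_a, human) # share of the human range in use
```

In this toy data the two judges agree with each other more than either agrees with the human scores, which is the pattern the study reports at scale; high inter-judge r on its own therefore says little about human alignment.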

arXiv cs.LG·SIG 82

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) on linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.

Papers AI safety Alignment
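As a rough illustration of what a linear probe in the deception item does, the sketch below fits a difference-of-means direction on synthetic "activations" and scores held-out samples by projection. Everything here is an assumption for illustration: the data is random Gaussian noise with an artificial shift, not model activations, and a difference-of-means probe is a simple stand-in for whatever probe training the authors used.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference_probe(honest, deceptive):
    """Linear probe direction: mean(deceptive) - mean(honest)."""
    mu_h, mu_d = mean_vector(honest), mean_vector(deceptive)
    return [d - h for h, d in zip(mu_h, mu_d)]

def probe_score(direction, activation):
    """Project one activation vector onto the probe direction."""
    return sum(w * a for w, a in zip(direction, activation))

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 8                 # hypothetical activation width

def sample(shift, count):
    """Synthetic activations: unit Gaussian noise, shifted along dimension 0."""
    return [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(DIM)]
            for _ in range(count)]

train_honest, train_deceptive = sample(0.0, 100), sample(4.0, 100)
test_honest, test_deceptive = sample(0.0, 100), sample(4.0, 100)

direction = fit_mean_difference_probe(train_honest, train_deceptive)
probe_auc = auc([probe_score(direction, x) for x in test_deceptive],
                [probe_score(direction, x) for x in test_honest])
```

The probe is a single direction in activation space, which is why this kind of monitor is cheap to run at inference time once the direction has been fitted for a given layer.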

arXiv cs.LG·SIG 82

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.

AI Agents Benchmarks Evals

Reddit r/LocalLLaMA·SIG 82

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

llama.cpp b9455 merges a major fix for KV cache quantization in tensor mode on multi-GPU. The solution extends the meta backend to properly handle tensor flattening without losing shape information, avoiding changes to compute graphs.

Llama Open source Infrastructure

arXiv cs.CL·SIG 82

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Systematic audit of the FOLIO and MALLS benchmarks reveals error rates of 39% and 36%, respectively, in their FOL formalizations. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, achieving 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows +9 to +22 percentage point accuracy gains.

Benchmarks Evals Reasoning

arXiv cs.AI·SIG 82

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

DeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D with human-agent collaboration. 18 agents tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Reveals persistent failures in proactive clarification and long-horizon workflow delivery.

AI Agents Benchmarks Evals

arXiv cs.AI·SIG 82

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

MedCUA-Bench is an interactive benchmark for evaluating computer-use agents in clinical interfaces. It covers 18 medical scenarios across 10 domains with authentic interfaces. The best closed-source models reach 54.2% strict success, while open-source agents average 2.5%, exposing a major gap with required reliability.

AI Agents Benchmarks AI safety

arXiv cs.CL·SIG 82

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

Eywa is a provenance-grounded memory architecture for persistent AI agents, storing immutable source evidence before deriving facts and validating memories against typed signals. Retrieval uses a deterministic multi-route read path with zero LLM calls. Results: 90.19% judge accuracy on LoCoMo C1-C4, 88.2% on LongMemEval-S, 81.45% mean nugget score on BEAM.

AI Agents Benchmarks Papers
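The three design ideas named in the Eywa item (immutable source storage, validation of derived facts, retrieval without LLM calls) can be sketched in a few lines. This is a hypothetical toy, not Eywa's implementation: the class and method names are invented here, and a plain substring check stands in for the paper's typed validation signals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRecord:
    """Immutable piece of source evidence."""
    source_id: str
    text: str

@dataclass(frozen=True)
class Fact:
    """Derived fact that always points back to its source."""
    subject: str
    value: str
    source_id: str

class ProvenanceMemory:
    def __init__(self):
        self._sources = {}
        self._facts = []

    def add_source(self, source_id, text):
        # Sources are append-only: an existing record is never overwritten.
        if source_id in self._sources:
            raise ValueError(f"source {source_id!r} already stored")
        self._sources[source_id] = SourceRecord(source_id, text)

    def add_fact(self, subject, value, source_id):
        # Stand-in validation: the value must be a string found in the source.
        source = self._sources[source_id]
        if not isinstance(value, str) or value not in source.text:
            raise ValueError("fact is not supported by its source")
        self._facts.append(Fact(subject, value, source_id))

    def lookup(self, subject):
        # Deterministic read path: exact key match, stable ordering, no model call.
        matches = [f for f in self._facts if f.subject == subject]
        return sorted(matches, key=lambda f: (f.source_id, f.value))

memory = ProvenanceMemory()
memory.add_source("msg-1", "Alice moved to Lisbon in 2024.")
memory.add_fact("alice.city", "Lisbon", "msg-1")
city_facts = memory.lookup("alice.city")
```

Storing the evidence before deriving anything from it means a rejected or later-disputed fact can always be rechecked against the original text, and the same query always returns the same records in the same order.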

arXiv cs.CL·SIG 82

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.

AI safety Alignment Papers
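The averaging effect described in the watermark item can be illustrated with a toy green-list watermark. This sketch rests on several assumptions not confirmed by the summary: each model boosts its own hypothetical green list by a fixed logit offset, the detector uses the standard one-proportion z-test on green-token counts, and expected counts replace sampled text. It is not the WASH implementation.

```python
import math

def average_distributions(dists):
    """Token-wise mean of several next-token distributions."""
    n = len(dists)
    return [sum(column) / n for column in zip(*dists)]

def boost(base, green, delta):
    """Toy watermark: add `delta` to green-token logits, then renormalise."""
    weights = [p * math.exp(delta) if i in green else p
               for i, p in enumerate(base)]
    total = sum(weights)
    return [w / total for w in weights]

def green_z_score(green_count, total, gamma):
    """z-score of the green-token count against the unwatermarked rate gamma."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

VOCAB = 6
GAMMA = 0.5    # fraction of the vocabulary that is green for any one key
TOKENS = 200   # hypothetical text length
base = [1 / VOCAB] * VOCAB

# Three hypothetical models, each watermarked with a different green list.
greens = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
models = [boost(base, g, delta=2.0) for g in greens]
mixed = average_distributions(models)

# Probability mass on model 0's green list, before and after averaging.
single_mass = sum(models[0][i] for i in greens[0])
mixed_mass = sum(mixed[i] for i in greens[0])

# z-scores computed from the expected number of green tokens.
z_single = green_z_score(single_mass * TOKENS, TOKENS, GAMMA)
z_mixed = green_z_score(mixed_mass * TOKENS, TOKENS, GAMMA)
```

Because each model's boost favours a different subset of tokens, the mean distribution sits much closer to the unwatermarked base, and the detector keyed to any single model sees a green-token rate near gamma.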