Edition of 2026-06-03

AI benchmarks are broken — formal proofs advance while LLM judges diverge from humans

By the editorial team

Today's 5 picks

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

LEAP is an agentic framework that enables LLMs to generate mechanically verifiable formal proofs in Lean. The system decomposes complex problems into smaller units through iterative interaction with the Lean compiler. On the 2025 Putnam Competition (12 problems), LEAP solves all 12; on Lean-IMO-Bench, it achieves a 70% one-shot solve rate versus <10% for general-purpose LLMs.

AI Agents Reasoning Benchmarks

arXiv cs.CL·SIG 82

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic audit of the FOLIO and MALLS benchmarks finds errors in 39% and 36% of their first-order logic (FOL) formalizations, respectively. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, which reaches 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of +9 to +22 percentage points.
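To make the guided-versus-unguided comparison concrete, here is a minimal sketch of the underlying idea: review the instances an LLM flags as most suspicious first, and stop once the dataset reaches a target accuracy. This is not the authors' framework. The function name, the suspicion scores, and the toy data are all hypothetical, and the sketch assumes a human reviewer always fixes a wrong annotation once it is reviewed.

```python
def fraction_reviewed_to_reach(items, target, order):
    """Return the share of instances reviewed before dataset accuracy >= target.

    items: list of (suspicion_score, is_wrong) pairs (hypothetical toy format).
    order: sequence of indices giving the review order.
    Assumption: reviewing a wrong annotation always corrects it.
    """
    n = len(items)
    wrong = sum(1 for _, is_wrong in items if is_wrong)
    if (n - wrong) / n >= target:
        return 0.0
    for reviewed, idx in enumerate(order, start=1):
        if items[idx][1]:
            wrong -= 1  # the reviewer fixes this annotation
        if (n - wrong) / n >= target:
            return reviewed / n
    return 1.0


# Made-up data: 10 annotations, 4 of them wrong (40% error rate).
items = [
    (0.10, False), (0.90, True), (0.20, False), (0.30, False), (0.80, True),
    (0.15, False), (0.70, True), (0.05, False), (0.60, False), (0.65, True),
]

# Unguided review walks the dataset in its original order.
unguided_order = list(range(len(items)))
# Guided review visits the highest suspicion scores first.
guided_order = sorted(unguided_order, key=lambda i: items[i][0], reverse=True)

unguided = fraction_reviewed_to_reach(items, 0.9, unguided_order)
guided = fraction_reviewed_to_reach(items, 0.9, guided_order)
print(f"unguided: {unguided:.0%} reviewed, guided: {guided:.0%} reviewed")
```

The gap between the two numbers depends entirely on how well the suspicion scores separate wrong annotations from correct ones; the toy scores here are deliberately favorable.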

Benchmarks Evals Reasoning

arXiv cs.AI·SIG 82

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

DeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D with human-agent collaboration. Eighteen agents were tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. The results reveal persistent failures in proactive clarification and long-horizon workflow delivery.

AI Agents Benchmarks Evals

arXiv cs.CL·SIG 82

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

A geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use 30-50% of the human score range, with an evaluation axis nearly orthogonal to that of humans (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.
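The quantities in this summary (score-range usage, correlation, calibration) can be illustrated with a small sketch. It is not the paper's method: the scores are made up, the helper names are hypothetical, and a simple least-squares linear map stands in for whatever post-hoc calibration the authors use. It does not attempt the geometric axis analysis.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def range_usage(judge, human):
    """Share of the human score range that the judge's scores span."""
    return (max(judge) - min(judge)) / (max(human) - min(human))


def fit_linear_calibration(judge, human):
    """Least-squares slope and intercept mapping judge scores to human scores."""
    mj, mh = mean(judge), mean(human)
    slope = (sum((j - mj) * (h - mh) for j, h in zip(judge, human))
             / sum((j - mj) ** 2 for j in judge))
    return slope, mh - slope * mj


# Made-up scores on a 1-5 rubric; the judge compresses its scores.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
judge = [3.0, 3.5, 3.0, 4.0, 4.5]

slope, intercept = fit_linear_calibration(judge, human)
calibrated = [slope * j + intercept for j in judge]

print("range usage before:", range_usage(judge, human))
print("range usage after: ", range_usage(calibrated, human))
print("Pearson r:", pearson(judge, human))
```

A linear map like this stretches a compressed score range and recenters it, but it leaves the Pearson correlation unchanged, so it addresses scale mismatch rather than disagreement about ranking.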

Evals Alignment Benchmarks

arXiv cs.AI·SIG 82

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

MedCUA-Bench is an interactive benchmark for evaluating computer-use agents in clinical interfaces. It covers 18 medical scenarios across 10 domains with authentic interfaces. The best closed-source models reach 54.2% strict success, while open-source agents average 2.5%, exposing a major gap relative to the required reliability.

AI Agents Benchmarks AI safety