Edition of 2026-06-03

AI benchmarks are broken — formal proofs advance while LLM judges diverge from humans

Two signals converge today on evaluation reliability. First, FOLIO and MALLS, two standard logical reasoning benchmarks, have error rates of 39% and 36%, respectively, in their first-order logic (FOL) formalizations. Scores published on these datasets for years are therefore partially fictional. The proposed correction yields gains of 9 to 22 points on Gemma 31B, Qwen3-30B, and GPT-4o-mini, which suggests these models were under-scored on the flawed versions; more generally, annotation errors can push a score up or down depending on their direction. Second, a geometric study of LLM-as-Judge quantifies what many suspected: across 41 LLM judges and 8 Indic languages, model evaluation axes are nearly orthogonal to human axes (87-89° divergence), and inter-LLM agreement (r≈0.35) consistently exceeds LLM-human agreement (r≈0.27-0.32). Using an LLM to validate another LLM measures internal consistency, not alignment with human preferences.
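
For readers who want the two quantities above made concrete, here is a minimal sketch of how a Pearson correlation and an angle between two direction vectors can be computed. It is not the study's method: the function names `pearson` and `angle_degrees` and the sample vectors are hypothetical, chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def angle_degrees(u, v):
    """Angle in degrees between two non-zero vectors, via cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # Hypothetical data: two perpendicular "evaluation axes" and two raters.
    print(angle_degrees([1.0, 0.0], [0.0, 1.0]))
    print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]))
```

As a reading aid, an angle of 88° corresponds to a cosine of roughly 0.035, so two axes at that angle share almost no common direction.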

Meanwhile, LEAP demonstrates that formal verification is a credible escape hatch around the evaluation problem. By decomposing mathematical proofs into sub-goals verifiable by the Lean compiler, the framework solves all 12 Putnam 2025 problems and reaches 70% on Lean-IMO-Bench versus under 10% for generic LLMs. Verification here is not subjective: the compiler accepts or rejects. That is precisely what NLP benchmarks cannot offer. The open question is whether this approach generalizes beyond formalizable domains.
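
To illustrate what "sub-goals verifiable by the compiler" means, here is a toy Lean 4 sketch. It is unrelated to LEAP's actual proofs or pipeline; the theorem name `add_zero_twice` and the labels `h1` and `h2` are made up for this example. Each `have` line states an intermediate claim that Lean checks on its own before the final step combines them.

```lean
-- Toy example: a goal split into two sub-goals, each checked by Lean.
theorem add_zero_twice (n : Nat) : n + 0 + 0 = n := by
  -- Sub-goal 1: adding zero once leaves n unchanged.
  have h1 : n + 0 = n := Nat.add_zero n
  -- Sub-goal 2: the outer "+ 0" can be dropped.
  have h2 : n + 0 + 0 = n + 0 := Nat.add_zero (n + 0)
  -- Combine the two verified sub-goals by transitivity of equality.
  exact h2.trans h1
```

If either intermediate claim were false or its justification wrong, Lean would reject the file, with no rubric or judge model involved.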

On agents, DeskCraft and MedCUA-Bench deliver the same diagnosis from two angles. GPT-5.4 reaches 31.6% on desktop workflows exceeding 50 steps; on clinical interfaces, the best closed models reach 54.2% while open-source agents average 2.5%. GUI agents remain far from operational reliability. DeskCraft specifically surfaces weaknesses in proactive clarification: agents execute without asking, which in a medical context like MedCUA-Bench becomes a direct risk. These two benchmarks measure different things but reach the same conclusion: long-horizon execution and instruction ambiguity are the two current walls.
