Two signals converge today on evaluation reliability. First, FOLIO and MALLS, two standard logical reasoning benchmarks, contain 39% and 36% errors respectively in their FOL formalizations. Scores published on these datasets for years are therefore partially fictional. Evaluating against the corrected annotations yields +9 to +22 points on Gemma 31B, Qwen3-30B, and GPT-4o-mini, which suggests the faulty references were depressing measured accuracy for these models. Second, the geometric study on LLM-as-Judge quantifies what many suspected: across 41 LLM judges and 8 Indic languages, model evaluation axes are nearly orthogonal to human axes (87-89° divergence), and inter-LLM agreement (r≈0.35) consistently exceeds LLM-human agreement (r≈0.27-0.32). Using an LLM to validate another LLM measures internal consistency, not alignment with human preferences.
Meanwhile, LEAP demonstrates that formal verification is a credible escape hatch around the evaluation problem. By decomposing mathematical proofs into sub-goals verifiable by the Lean compiler, the framework solves all 12 Putnam 2025 problems and reaches 70% on Lean-IMO-Bench versus under 10% for generic LLMs. Verification here is not subjective: the compiler accepts or rejects. That is precisely what NLP benchmarks cannot offer. The open question is whether this approach generalizes beyond formalizable domains.
On agents, DeskCraft and MedCUA-Bench deliver the same diagnosis from two angles. GPT-5.4 reaches 31.6% on desktop workflows exceeding 50 steps; on clinical interfaces, the best closed models reach 54.2% while open-source agents average 2.5%. GUI agents remain far from operational reliability. DeskCraft specifically surfaces weaknesses in proactive clarification (agents execute without asking), which in a medical context like MedCUA-Bench becomes a direct risk. The two benchmarks measure different things but reach the same conclusion: long-horizon execution and instruction ambiguity are the two current walls.
LEAP is an agentic framework enabling LLMs to generate mechanically verifiable formal proofs in Lean. The system decomposes complex problems into smaller units through iterative interaction with the Lean compiler. On the 2025 Putnam Competition (12 problems), LEAP solves all 12; on Lean-IMO-Bench, it achieves a 70% one-shot solve rate versus <10% for general-purpose LLMs.
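To make the idea of compiler-checked sub-goals concrete, here is a toy Lean 4 sketch. It is not taken from LEAP; the theorem and its name are invented for illustration. It only shows the general pattern of splitting a goal into intermediate `have` steps, each of which the compiler accepts or rejects on its own.

```lean
-- Illustrative only: a hypothetical toy theorem, not a LEAP output.
theorem reorder_sum (a b c : Nat) : a + b + c = c + (b + a) := by
  -- Sub-goal 1: swap the inner summands.
  have h1 : a + b = b + a := Nat.add_comm a b
  -- Sub-goal 2: move c to the front.
  have h2 : (b + a) + c = c + (b + a) := Nat.add_comm (b + a) c
  -- Combine the two verified pieces by rewriting the goal.
  rw [h1, h2]
```

If either `have` step were wrong, the compiler would reject that step specifically, which is the kind of localized feedback an iterative decomposition loop can act on.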
A systematic audit of the FOLIO and MALLS benchmarks reveals 39% and 36% errors in their FOL formalizations respectively. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, which reaches 90% dataset accuracy by reviewing <24% of instances, versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of +9 to +22 percentage points.
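The intuition behind guided relabeling can be sketched as follows. This is a minimal illustration, assuming each instance carries a suspicion score (a hypothetical stand-in for whatever signal the LLM-based framework produces) and that a human review always fixes a wrong annotation. The function name, the scores, and the toy data are assumptions, not the authors' method or results.

```python
def accuracy_after_review(instances, budget):
    """Return dataset accuracy after reviewing the `budget` most suspicious items.

    instances: list of (suspicion_score, is_wrong) pairs (hypothetical format).
    Assumes every reviewed item ends up correctly annotated.
    """
    # Review the most suspicious annotations first.
    ranked = sorted(instances, key=lambda item: item[0], reverse=True)
    # Errors that fall outside the review budget stay in the dataset.
    remaining_errors = sum(1 for _, is_wrong in ranked[budget:] if is_wrong)
    return 1 - remaining_errors / len(instances)


# Toy data: 10 annotations, 4 of them wrong (made-up values).
toy = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.40, False),
    (0.30, True), (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]

baseline = accuracy_after_review(toy, budget=0)
guided = accuracy_after_review(toy, budget=4)
```

The better the suspicion scores separate wrong from correct annotations, the fewer items a reviewer has to inspect to reach a given accuracy, which is the effect the guided-versus-unguided comparison describes.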
DeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D, including human-agent collaboration. 18 agents were tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. The results reveal persistent failures in proactive clarification and long-horizon workflow delivery.
A geometric study showing that inter-LLM agreement on subjective evaluations does not reflect human alignment. Across 41 LLM judges and 8 Indic languages, models use only 30-50% of the human score range, with an evaluation axis nearly orthogonal to that of humans (87-89° vs 78-81°). LLM-LLM agreement (r≈0.35) exceeds LLM-human agreement (r≈0.27-0.32). Only post-hoc calibration improves all rubrics.
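The three quantities reported here (correlation, angle between axes, and fraction of score range used) can be computed as in the sketch below. It is a minimal illustration using standard definitions. How the study actually derives an "evaluation axis" is not described in this summary, so treating it as a plain direction vector is an assumption, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation r between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def angle_degrees(u, v):
    """Angle between two direction vectors; 90 degrees means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def range_fraction(judge_scores, human_scores):
    """Share of the human score range that the judge's scores span."""
    judge_span = max(judge_scores) - min(judge_scores)
    human_span = max(human_scores) - min(human_scores)
    return judge_span / human_span
```

Under these definitions, an angle close to 90° means the two axes share almost no direction, and a range fraction of 0.3 to 0.5 means the judge compresses its scores into a third to a half of the span humans use.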
MedCUA-Bench is an interactive benchmark for evaluating computer-use agents in clinical interfaces. It covers 18 medical scenarios across 10 domains with authentic interfaces. Best closed-source models reach 54.2% strict success, open-source agents average 2.5%, exposing a major gap with required reliability.