Edition of 2026-06-17

Fixed-budget evals understate model capabilities, e-commerce agents top out at 57%, and PreAct compiles successful runs into FSMs for speedups of up to 13x.

By the editorial team

The arXiv paper on inference compute (article 1) should force a revision of evaluation practices across product and research teams. Across 12 frontier models and seven benchmarks, increasing token budgets or allowing repeated attempts substantially improves scores on FrontierMath, Humanity's Last Exam, and TerminalBench, which means current leaderboards reflect evaluation budget constraints as much as actual model capabilities. The direct implication: comparing GPT-4o to Claude 3.7 on a fixed-budget benchmark is like comparing the range of cars with different tank sizes. Any model selection decision based on these benchmarks without controlling for inference compute is potentially confounded.
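To make the confound concrete, here is a minimal sketch of how repeated attempts alone can reorder two models. It assumes attempts are independent with a fixed per-attempt success rate, which is a simplification and not the paper's methodology; the function name `pass_at_k` and the two success rates are hypothetical values chosen for illustration, not results from the study.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Assumption: attempts are independent and share the same success rate.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical per-attempt success rates (illustrative only).
model_a_rate = 0.30
model_b_rate = 0.20

for attempts in (1, 4, 16):
    a = pass_at_k(model_a_rate, attempts)
    b = pass_at_k(model_b_rate, attempts)
    print(f"k={attempts:2d}  model A: {a:.3f}  model B: {b:.3f}")

# The confound: the weaker model with 4 attempts beats the stronger model with 1.
print(pass_at_k(model_b_rate, 4) > pass_at_k(model_a_rate, 1))
```

Under these assumptions, a ranking is only meaningful when both models are scored at the same number of attempts (and, by extension, the same token budget).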

On the agent side, there are two complementary signals. EComAgentBench (662 tasks, Amazon) tests a realistic scenario: user intent is distributed across the query, user profile, and successive clarifications, with a hard cap of 100 tool calls. The best model reaches 57.1%, which indicates that handling fragmented intent remains an open problem. PreAct addresses a different problem: latency and cost on repetitive tasks. It compiles successful agent runs into small finite-state machines (FSMs) that are replayed 8.5–13x faster without per-step LLM calls, and it adds an independent validator that prevents broken FSMs from accumulating (+1.75–2.6 tasks on mobile, desktop, and web benchmarks). The result is a concrete architecture for agents deployed in production on stable workflows.
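The compile, validate, and replay loop can be sketched roughly as follows. This is a toy illustration under stated assumptions, not PreAct's actual implementation: the trace format, the names `CompiledFSM`, `compile_run`, `replay`, and `validate_and_store`, and the dictionary-based environment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical trace format: (observed_state, action, resulting_state).
Step = Tuple[str, str, str]


@dataclass
class CompiledFSM:
    start: str
    goal: str
    transitions: Dict[str, Tuple[str, str]]  # state -> (action, expected next state)


def compile_run(trace: List[Step]) -> CompiledFSM:
    """Turn one recorded agent run into a small state machine."""
    if not trace:
        raise ValueError("cannot compile an empty trace")
    transitions = {state: (action, nxt) for state, action, nxt in trace}
    return CompiledFSM(start=trace[0][0], goal=trace[-1][2], transitions=transitions)


def replay(fsm: CompiledFSM, execute: Callable[[str, str], str], max_steps: int = 50) -> bool:
    """Replay the FSM with no LLM call per step; return True if the goal is reached."""
    state = fsm.start
    for _ in range(max_steps):
        if state == fsm.goal:
            return True
        if state not in fsm.transitions:
            return False  # unknown state: a real system would fall back to the agent
        action, expected = fsm.transitions[state]
        state = execute(state, action)
        if state != expected:
            return False  # environment diverged from the recorded run
    return state == fsm.goal


def validate_and_store(name: str, fsm: CompiledFSM,
                       execute: Callable[[str, str], str],
                       library: Dict[str, CompiledFSM]) -> bool:
    """Independent check: only programs that replay successfully are stored."""
    if replay(fsm, execute):
        library[name] = fsm
        return True
    return False


# Toy environment standing in for a real UI (hypothetical states and actions).
ENV = {("home", "open_settings"): "settings", ("settings", "toggle_wifi"): "wifi_on"}


def toy_execute(state: str, action: str) -> str:
    return ENV.get((state, action), "error")


library: Dict[str, CompiledFSM] = {}
good = compile_run([("home", "open_settings", "settings"),
                    ("settings", "toggle_wifi", "wifi_on")])
bad = compile_run([("home", "tap_banner", "settings")])  # action does not exist in ENV

good_stored = validate_and_store("enable_wifi", good, toy_execute, library)
bad_stored = validate_and_store("broken_run", bad, toy_execute, library)
print(good_stored, bad_stored, sorted(library))
```

The design point the sketch tries to capture is that the validator gates the library: a program that cannot be replayed end to end is never stored, so faulty programs do not pile up.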

The Discrete-Log Clock paper (Nanda et al.) is the most mechanistic entry in today's selection: on the task a·b mod 113, a transformer implements not a standard DFT but a multiplicative character transform, with 96.9% of MLP neurons tuned to a single frequency and a sparse spectrum (Gini 0.58 vs 0.07). This is not an academic curiosity: it constrains what to expect from transformers on modular arithmetic tasks, and it provides a diagnostic tool for interpretability work. FllumaOne (100,000 CAD models, Qwen2.5-Coder-1.5B baseline at 99.14% STEP export validity) is primarily a signal on the maturity of code-native datasets for closed technical domains: syntactic validity is nearly solved, while the real problem remains geometric semantics.

Today's 5 picks

arXiv cs.AI·SIG 82

How Inference Compute Shapes Frontier LLM Evaluation

Study evaluating 12 frontier models on inference compute impact across seven benchmarks. Three interventions tested: larger token budgets, context compaction, repeated submission attempts. Results: increased budgets substantially improve performance on FrontierMath, Humanity's Last Exam, TerminalBench. Fixed-budget evaluations increasingly understate newer model capabilities.

Benchmarks Evals Reasoning

arXiv cs.AI·SIG 82

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

EComAgentBench is a benchmark of 662 e-commerce tasks evaluating LLM-based shopping agents on hidden intents distributed across query, user profile, and clarifications. Requirements are scattered and agents must uncover them within 100 tool calls. The strongest model achieves only 57.1% accuracy.

AI Agents Benchmarks Evals

arXiv cs.LG·SIG 82

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

Researchers show that a transformer learning modular multiplication uses a multiplicative character transform rather than a standard DFT. On a·b mod 113, the spectrum becomes sparse (Gini 0.58 vs 0.07), with 96.9% of MLP neurons tuned to a single frequency. The algorithm implements a "Discrete-Log Clock" that reduces multiplication to addition in discrete-log space.

Reasoning Papers Evals
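The reduction named in the Discrete-Log Clock summary above can be checked directly with standard number theory. The sketch below shows the underlying identity, not the transformer's learned circuit: for a prime p and a primitive root g, every nonzero residue is a power of g, so a·b mod p becomes an addition of exponents mod p−1. The helper names (`find_primitive_root`, `mul_via_dlog`, `character`) are illustrative, and `character` uses the textbook definition of a multiplicative character, which may differ in normalization from the paper's.

```python
import cmath

P = 113  # the modulus used in the paper's task


def find_primitive_root(p: int) -> int:
    """Smallest g whose powers generate all nonzero residues mod prime p."""
    n = p - 1
    factors, m, d = set(), n, 2
    while d * d <= m:  # trial-division factorization of p - 1
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    for g in range(2, p):
        if all(pow(g, n // q, p) != 1 for q in factors):
            return g
    raise ValueError("no primitive root found (is p prime?)")


g = find_primitive_root(P)
exp_table = [pow(g, k, P) for k in range(P - 1)]        # k -> g^k mod P
log_table = {value: k for k, value in enumerate(exp_table)}  # g^k mod P -> k


def mul_via_dlog(a: int, b: int) -> int:
    """Compute a*b mod P by adding discrete logs instead of multiplying."""
    if a % P == 0 or b % P == 0:
        return 0  # zero has no discrete log; handle it separately
    k = (log_table[a % P] + log_table[b % P]) % (P - 1)
    return exp_table[k]


def character(freq: int, a: int) -> complex:
    """Multiplicative character: a point on the unit circle at angle ~ freq * log_g(a)."""
    return cmath.exp(2j * cmath.pi * freq * log_table[a % P] / (P - 1))


# Multiplication mod P matches addition in log space for every pair.
print(all(mul_via_dlog(a, b) == (a * b) % P for a in range(P) for b in range(P)))

# Characters turn products into products of rotations: chi(a) * chi(b) == chi(a*b).
print(abs(character(5, 7) * character(5, 9) - character(5, 63)) < 1e-9)
```

The "clock" picture follows from the second check: each character maps a residue to an angle, and multiplying residues adds angles.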

arXiv cs.AI·SIG 82

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

PreAct compiles successful runs of computer-using agents into small state-machine programs, replayed 8.5-13x faster with no per-step LLM calls. An independent evaluator validates each program before storage. Across three benchmarks (mobile, desktop, web), this verification prevents faulty program accumulation (+1.75-2.6 tasks).

AI Agents Code generation Benchmarks

arXiv cs.AI·SIG 82

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, STEP representation, point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.

Code generation Benchmarks Vision