Edition of 2026-06-17

Fixed-budget evals understate model capabilities, e-commerce agents top out at 57%, and PreAct compiles successful runs into FSMs for 13x speedups.

The arXiv paper on inference compute (article 1) should force a revision of evaluation practices across product and research teams. Across 12 frontier models tested on FrontierMath, Humanity's Last Exam, and TerminalBench, increasing the token budget or allowing repeated attempts significantly improves scores, which means current leaderboards reflect evaluation budget constraints as much as actual model capabilities. The direct implication: comparing GPT-4o to Claude 3.7 on a fixed-budget benchmark is like comparing the range of cars with different tank sizes. Any model selection decision based on these benchmarks that does not control for inference compute is potentially confounded.
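To make the repeated-attempts effect concrete, here is a minimal sketch using the standard unbiased pass@k estimator from the code-generation literature. It is not necessarily the metric the paper uses, and the sample counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts is correct).

    n: total attempts sampled for the task
    c: how many of those n attempts were correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset contains a success.
        return 1.0
    # 1 - (ways to pick k attempts that are all incorrect) / (ways to pick any k).
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 3 correct attempts out of 20 sampled.
    for budget in (1, 4, 16):
        print(f"pass@{budget} = {pass_at_k(20, 3, budget):.3f}")
```

The same model on the same task moves from a low score at one attempt to a near-certain pass at a larger budget, which is why two leaderboard entries are only comparable when the attempt budget (and, by the same logic, the token budget) is held fixed or reported.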

On the agent side, two complementary signals. EComAgentBench (662 tasks, Amazon) tests a realistic scenario: user intent is distributed across the query, the user profile, and successive clarifications, with a hard cap of 100 tool calls. The best model reaches 57.1%, which indicates that handling fragmented intent remains an open problem. PreAct addresses a different problem: latency and cost on repetitive tasks. It compiles successful agent runs into small finite-state machines (FSMs) that are replayed 8.5–13x faster, without per-step LLM calls. An independent validator prevents the accumulation of broken FSMs (+1.75–2.6 tasks on mobile/desktop/web benchmarks). Together, these give PreAct a concrete architecture for agents deployed in production on stable workflows.
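The compile-then-replay idea can be sketched in a few lines. This is an illustrative toy, not PreAct's implementation: every name (`Step`, `CompiledFSM`, `compile_trace`, `replay`, `admit`) is hypothetical, states are assumed to be plain string identifiers, and each state is assumed to appear at most once in a trace.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Step:
    state: str   # identifier of the screen/page observed before acting
    action: str  # action the agent took in that state


@dataclass
class CompiledFSM:
    start: str
    final: str = "DONE"
    # state -> (action to take, state expected afterwards)
    transitions: dict[str, tuple[str, str]] = field(default_factory=dict)


def compile_trace(trace: list[Step]) -> CompiledFSM:
    """Turn one successful run into a deterministic state machine."""
    if not trace:
        raise ValueError("cannot compile an empty trace")
    fsm = CompiledFSM(start=trace[0].state)
    for i, step in enumerate(trace):
        next_state = trace[i + 1].state if i + 1 < len(trace) else fsm.final
        fsm.transitions[step.state] = (step.action, next_state)
    return fsm


def replay(fsm: CompiledFSM, execute: Callable[[str, str], str],
           max_steps: int = 100) -> bool:
    """Replay without any LLM call. `execute(state, action)` returns the
    observed next state. False means: fall back to the full agent."""
    state = fsm.start
    for _ in range(max_steps):
        if state == fsm.final:
            return True
        if state not in fsm.transitions:
            return False  # unknown state: the workflow has drifted
        action, expected = fsm.transitions[state]
        observed = execute(state, action)
        if observed != expected:
            return False  # environment diverged from the recorded run
        state = observed
    return state == fsm.final


def admit(library: dict[str, CompiledFSM], task: str, fsm: CompiledFSM,
          execute: Callable[[str, str], str]) -> bool:
    """Validator step: store the FSM only if a check replay succeeds."""
    if replay(fsm, execute):
        library[task] = fsm
        return True
    return False
```

In this sketch the validator is simply a second replay against the environment before the FSM enters the library; any mismatch during later replays returns `False`, so the caller can hand the task back to the LLM-driven agent.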

The Discrete-Log Clock paper (Nanda et al.) is the most mechanistic entry in today's selection: on the task a·b mod 113, a transformer implements not a standard DFT but a multiplicative character transform, with 96.9% of MLP neurons tuned to a single frequency and a sparse spectrum (Gini 0.58 vs 0.07). This is not an academic curiosity: it is a constraint on what to expect from transformers on modular arithmetic tasks, and a diagnostic tool for interpretability work. FllumaOne (100,000 CAD models, Qwen2.5-Coder-1.5B baseline at 99.14% STEP export validity) is primarily a signal on the maturity of code-native datasets for closed technical domains: syntactic validity is nearly solved, while the real problem remains geometric semantics.
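The arithmetic behind a "discrete-log clock" can be checked directly. Because 113 is prime, every nonzero residue is a power of a primitive root g, so multiplication mod 113 becomes addition of exponents mod 112. The multiplicative characters are then ordinary Fourier modes in the exponent, χ_j(a) = exp(2πi·j·log_g(a)/112). The sketch below only verifies this identity; it says nothing about the trained model's actual circuit, and the function names are illustrative.

```python
def find_primitive_root(p: int) -> int:
    """Smallest g whose powers cover every nonzero residue mod an odd prime p.

    Brute force, which is fine for a modulus as small as 113.
    """
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")


def build_log_tables(p: int) -> tuple[list[int], dict[int, int]]:
    """Return (exp, log) with exp[k] = g^k mod p and log[g^k mod p] = k."""
    g = find_primitive_root(p)
    exp = [pow(g, k, p) for k in range(p - 1)]
    log = {value: k for k, value in enumerate(exp)}
    return exp, log


def mul_via_logs(a: int, b: int, p: int,
                 exp: list[int], log: dict[int, int]) -> int:
    """Compute a*b mod p by adding discrete logs mod (p - 1)."""
    a, b = a % p, b % p
    if a == 0 or b == 0:
        return 0  # zero has no discrete log; handle it separately
    return exp[(log[a] + log[b]) % (p - 1)]


P = 113
EXP, LOG = build_log_tables(P)

# Exhaustive check: the log-domain "clock" reproduces the full multiplication table.
assert all(
    mul_via_logs(a, b, P, EXP, LOG) == (a * b) % P
    for a in range(P)
    for b in range(P)
)
```

Under this change of variables, the task a·b mod 113 on nonzero inputs has the same structure as modular addition mod 112, which is the sense in which a Fourier-style solution applies to the exponents and not to the raw residues.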

Signal IA