The arXiv paper on inference compute (article 1) should force a revision of evaluation practices across product and research teams. Across 12 frontier models tested on FrontierMath, Humanity's Last Exam, and TerminalBench, increasing token budgets or allowing repeated attempts substantially improves scores, which means current leaderboards reflect evaluation budget constraints as much as actual model capabilities. The direct implication: comparing GPT-4o to Claude 3.7 on a fixed-budget benchmark is like comparing the range of two cars without accounting for tank size. Any model selection decision based on these benchmarks without controlling for inference compute is potentially confounded.
On the agent side, there are two complementary signals. EComAgentBench (662 tasks, Amazon) tests a realistic scenario: user intent is distributed across the query, the user profile, and successive clarifications, with a hard cap of 100 tool calls. The best model reaches 57.1%, which indicates that handling fragmented intent remains an open problem. PreAct addresses a different problem: latency and cost on repetitive tasks. It compiles successful agent runs into small finite-state machines (FSMs) that are replayed 8.5–13x faster without per-step LLM calls, and it adds an independent validator that prevents broken FSMs from accumulating (+1.75 to +2.6 tasks on mobile, desktop, and web benchmarks). Together these give PreAct a concrete architecture for agents deployed in production on stable workflows.
The Discrete-Log Clock paper (Nanda et al.) is the most mechanistic entry in today's selection: on the task a·b mod 113, a transformer implements not a standard DFT but a multiplicative character transform, with 96.9% of MLP neurons tuned to a single frequency and a sparse spectrum (Gini 0.58 vs 0.07). This is not an academic curiosity: it is a constraint on what to expect from transformers on modular arithmetic tasks, and a diagnostic tool for interpretability work. FllumaOne (100,000 CAD models, Qwen2.5-Coder-1.5B baseline at 99.14% STEP export validity) is primarily a signal on the maturity of code-native datasets for closed technical domains: syntactic validity is nearly solved; the real problem remains geometric semantics.
The study evaluates how inference compute affects the scores of 12 frontier models across seven benchmarks. It tests three interventions: larger token budgets, context compaction, and repeated submission attempts. Increased budgets substantially improve performance on FrontierMath, Humanity's Last Exam, and TerminalBench, and fixed-budget evaluations increasingly understate the capabilities of newer models.
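To make the repeated-attempts effect concrete, here is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled attempts of which c succeed. The summary does not say how the study scores repeated attempts, so using pass@k here is an assumption, and the numbers in the usage lines are purely illustrative, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts sampled for a task
    c: number of those attempts that were correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative only: the same hypothetical model (3 correct out of 10 samples)
# looks very different depending on the attempt budget it is scored under.
single_attempt = pass_at_k(n=10, c=3, k=1)
five_attempts = pass_at_k(n=10, c=3, k=5)
```

Two models ranked at k=1 can swap places at a larger k, which is one way a fixed attempt budget can confound a leaderboard comparison.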
EComAgentBench is a benchmark of 662 e-commerce tasks evaluating LLM-based shopping agents on hidden intents distributed across query, user profile, and clarifications. Requirements are scattered and agents must uncover them within 100 tool calls. The strongest model achieves only 57.1% accuracy.
Researchers show that a transformer learning modular multiplication uses a multiplicative character transform rather than a standard DFT. On a·b mod 113, the spectrum becomes sparse (Gini 0.58 vs 0.07), with 96.9% of MLP neurons tuned to a single frequency. The algorithm implements a "Discrete-Log Clock" that reduces multiplication to addition in discrete-log space.
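The reduction of multiplication to addition in discrete-log space can be checked directly in a few lines. The sketch below illustrates only the underlying number theory, not the transformer's learned circuit: because 113 is prime, every nonzero residue is a power of a generator g, so a·b mod 113 equals g raised to (log a + log b) mod 112. The function and table names are illustrative and do not come from the paper.

```python
P = 113  # prime modulus from the task a*b mod 113


def find_primitive_root(p: int) -> int:
    """Return the smallest generator of the multiplicative group mod prime p."""
    order = p - 1
    # Collect the prime factors of p - 1 by trial division.
    factors, n, d = set(), order, 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    # g generates the group iff g^(order/q) != 1 for every prime factor q.
    for g in range(2, p):
        if all(pow(g, order // q, p) != 1 for q in factors):
            return g
    raise ValueError("no primitive root found")


G = find_primitive_root(P)
EXP = [pow(G, k, P) for k in range(P - 1)]   # exponent k -> G^k mod P
LOG = {x: k for k, x in enumerate(EXP)}      # nonzero residue -> discrete log


def mul_via_dlog(a: int, b: int) -> int:
    """Multiply mod P by adding discrete logs mod P - 1."""
    a, b = a % P, b % P
    if a == 0 or b == 0:
        return 0  # zero has no discrete log, so it is handled separately
    return EXP[(LOG[a] + LOG[b]) % (P - 1)]
```

Addition mod 112 is the kind of operation a "clock" over frequencies can represent, which is why the log-space view makes the multiplicative task look like the better-known modular addition task.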
PreAct compiles successful runs of computer-using agents into small state-machine programs, which are replayed 8.5–13x faster with no per-step LLM calls. An independent evaluator validates each program before it is stored. Across three benchmarks (mobile, desktop, web), this verification prevents faulty programs from accumulating (+1.75 to +2.6 tasks).
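The compile, validate, then replay loop described above might look roughly like the sketch below. It is a toy under stated assumptions, not PreAct's implementation: the names (`compile_run`, `replay`, `ProgramStore`), the state and action strings, and the dictionary-based environment are all hypothetical, and the validator here simply re-executes the program, which may differ from the paper's independent evaluator.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]            # (state, action, next_state) from a run
Program = Dict[str, Tuple[str, str]]   # state -> (action, expected next_state)


def compile_run(run: List[Step]) -> Program:
    """Turn one successful run into a state -> (action, next_state) table."""
    return {state: (action, nxt) for state, action, nxt in run}


def replay(program: Program, env_step: Callable[[str, str], str],
           start: str, goal: str, max_steps: int = 50) -> bool:
    """Execute the table without any LLM call; stop on the first surprise."""
    state = start
    for _ in range(max_steps):
        if state == goal:
            return True
        if state not in program:
            return False  # unknown state: a real system would fall back to the agent
        action, expected = program[state]
        state = env_step(state, action)
        if state != expected:
            return False  # environment diverged from the recorded run
    return state == goal


class ProgramStore:
    """Keeps only programs that an independent validator accepts."""

    def __init__(self, validator: Callable[[Program], bool]) -> None:
        self._validator = validator
        self._programs: Dict[str, Program] = {}

    def add(self, task: str, program: Program) -> bool:
        if not self._validator(program):
            return False  # a faulty program is rejected rather than accumulated
        self._programs[task] = program
        return True

    def get(self, task: str) -> Optional[Program]:
        return self._programs.get(task)


# Hypothetical toy environment: (state, action) -> next state.
TOY_ENV = {
    ("home", "open_settings"): "settings",
    ("settings", "tap_wifi"): "wifi",
    ("wifi", "toggle_on"): "wifi_on",
}


def toy_step(state: str, action: str) -> str:
    return TOY_ENV.get((state, action), "error")


good = compile_run([("home", "open_settings", "settings"),
                    ("settings", "tap_wifi", "wifi"),
                    ("wifi", "toggle_on", "wifi_on")])
broken = compile_run([("home", "open_settings", "settings"),
                      ("settings", "tap_bluetooth", "wifi")])

store = ProgramStore(lambda prog: replay(prog, toy_step, "home", "wifi_on"))
stored_good = store.add("enable_wifi", good)
stored_broken = store.add("enable_wifi_v2", broken)
```

Gating storage on validation is the design point: a replayed program skips the model entirely, so a bad one would fail silently and repeatedly if it were ever cached.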
FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (an OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, a STEP representation, a point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.