Week of 2026-05-18

Week of May 18, 2026: formal reasoning breakthroughs, $1.25B/month compute deals, and the safety benchmark illusion

By the editorial team

The week's standout result is the refutation of Erdős's unit-distance conjecture, open since 1946, by an OpenAI reasoning model using algebraic number theory tools that mathematicians had not previously considered for this problem; Fields Medalist Tim Gowers called it "a milestone in AI mathematics." On the formal-proving side, OProver-32B reached 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench in Lean 4, through a loop of continued pretraining and iterative post-training with compiler feedback. These two results are not isolated: they signal that reasoning models are beginning to produce non-trivial, formally verifiable mathematical contributions rather than proofs of concept. The 305 formally verified Lean 4 theorems embedded in the DASH paper (arXiv:2605.16282) fit the same pattern: AI-assisted formal reasoning is moving from benchmark performance to actual scientific output.

The second dominant theme is infrastructural and financial, with strategic implications that go well beyond accounting. The agreement disclosed in SpaceX's S-1 commits Anthropic to $1.25 billion per month for compute capacity on COLOSSUS and COLOSSUS II through May 2029, potentially $45 billion over the contract's lifetime. SpaceX simultaneously uses those same clusters to train Grok 5, creating a co-dependency and a direct competitive overlap between vendor and customer that is rarely seen at this scale. The figure reframes the usual conversations about inference costs: the real battle is now over access to sovereign training clusters, and actors without proprietary access to this class of infrastructure are structurally disadvantaged for the coming training cycles.
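As a back-of-envelope check on the $45 billion figure, here is a minimal sketch. It assumes 36 months billed at the full rate, which is our assumption: the S-1 excerpt quoted in this issue gives the monthly fee, the end date, and a reduced-fee ramp-up, but not the full payment schedule. The function name and the `ramp_discount_billion` parameter are hypothetical.

```python
from fractions import Fraction

# $1.25B per month, as quoted from the S-1 excerpt.
MONTHLY_FEE_BILLION = Fraction(5, 4)


def contract_total_billion(full_rate_months, ramp_discount_billion=Fraction(0)):
    """Total contract value in billions of dollars.

    full_rate_months: number of months counted at the full fee (assumption).
    ramp_discount_billion: total fee reduction during ramp-up, in billions.
        The actual reduction is not given in the excerpt, so it defaults to 0.
    """
    if full_rate_months < 0 or ramp_discount_billion < 0:
        raise ValueError("months and discount must be non-negative")
    return MONTHLY_FEE_BILLION * full_rate_months - ramp_discount_billion


if __name__ == "__main__":
    print(float(contract_total_billion(36)))
```

Under that assumption the total comes to 45 billion. Any ramp-up discount lowers it, which is consistent with the "potentially" in the estimate.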

The third theme, quieter but potentially the most durable for practitioners, is the methodological collapse of agent safety evaluations. The systematic analysis of 40 agent safety benchmarks published between 2023 and 2026 (arXiv:2605.16282) yields a Kendall's W of 0.10 (p = 0.94): the benchmarks show no significant ranking concordance, their threat models are incompatible, and their metrics are fragmented. ASPI makes the same point from a different angle: in clarification mode, prompt injection success rates jump from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash, an attack surface created by a behavior widely considered good UX practice. ContractBench rounds out the picture: across 38 models, none exceeds 80% observation-contract preservation, with Claude-Opus-4.6 topping out at 77.8% and a non-monotonic regression within the GPT-5 family. The cross-cutting lesson is that agent security in production cannot rely on current benchmarks to establish guarantees, and that certain behavioral improvements, such as clarification and chain-of-thought, introduce unanticipated vulnerabilities.

The coming week will likely see the first institutional reactions to the SpaceX-Anthropic compute agreement, particularly questions about the governance of a compute provider that simultaneously trains a competing model on the same infrastructure.

Today's 5 picks

Simon Willison·SIG 85

Quoting SpaceX S-1

SpaceX signed a Cloud Services Agreement with Anthropic to provide compute capacity on the COLOSSUS and COLOSSUS II clusters. Anthropic will pay $1.25 billion per month through May 2029, with reduced fees during the May-June 2026 ramp-up. SpaceX also uses these resources to train Grok 5.

Anthropic Infrastructure Business

arXiv cs.AI·SIG 85

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

Systematic analysis of 40 agent safety benchmarks (2023-2026). Benchmarks exhibit incompatible threat models, fragmented metrics, and inconsistent risk coverage. Concordance test (Kendall's W = 0.10, p = 0.94) reveals no ranking alignment across evaluation dimensions. Releases structured metadata and proposes minimum reporting standards.

AI Agents AI safety Evals
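For readers unfamiliar with the concordance statistic, here is a minimal sketch of how Kendall's W is computed from a table of ranks. It assumes no tied ranks and uses hypothetical toy data; it is not the paper's code, and which objects the paper treats as rankers and as ranked items is not stated in this summary.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance W, without tie correction.

    ranks: list of m rows, one per ranker; each row holds the ranks
           1..n that the ranker assigns to the same n items.
    Returns a value in [0, 1]: 1 means identical rankings, and values
    near 0 mean the rankers show no agreement.
    """
    m = len(ranks)
    if m < 2:
        raise ValueError("need at least two rankers")
    n = len(ranks[0])
    if n < 2 or any(len(row) != n for row in ranks):
        raise ValueError("every ranker must rank the same n >= 2 items")

    # Sum of ranks received by each item across all rankers.
    totals = [sum(row[j] for row in ranks) for j in range(n)]
    mean_total = sum(totals) / n

    # Squared deviation of the rank sums from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)

    # Normalize by the maximum possible deviation for m rankers, n items.
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    agree = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
    oppose = [[1, 2, 3], [3, 2, 1]]
    print(kendalls_w(agree), kendalls_w(oppose))
```

A value of 0.10, as reported above, sits close to the no-agreement end of the scale.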

arXiv cs.LG·SIG 85

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

Impossibility theorem: no feature ranking can be simultaneously faithful, stable, and complete under collinearity. Authors quantify the result for 4 model classes, propose DASH (Diversified Aggregation of SHAP) as resolution, and formally verify 305 Lean 4 theorems. Consequence: 68% of public datasets exhibit attribution instability.

Evals Papers AI safety

OpenAI Blog·SIG 85

An OpenAI model has disproved a central conjecture in discrete geometry

An OpenAI model disproved a major conjecture in discrete geometry by solving the 80-year-old unit distance problem. This breakthrough marks a milestone in AI-driven mathematics.

OpenAI Reasoning Benchmarks

GitHub Trending·SIG 85

openai / whisper

OpenAI Whisper is a speech recognition model trained on 680,000 hours of multilingual weakly supervised data. The GitHub repository includes code, pre-trained models, and performance benchmarks across multiple languages and acoustic conditions.

OpenAI Voice Open source

The Decoder·SIG 85

OpenAI shifts the boundary of automated reasoning with a "milestone in AI mathematics" that experts are now unpacking

OpenAI's reasoning model disproved a 1946 Erdős conjecture in unit-distance geometry using unexpected algebraic number theory tools. Fields Medalist Tim Gowers calls it "a milestone in AI mathematics."

OpenAI Reasoning Benchmarks

GitHub Trending·SIG 85

facebookresearch / sam3

Meta releases code and checkpoints for SAM 3 (Segment Anything Model 3). Repository includes inference, fine-tuning, and example notebooks for image segmentation.

Meta AI Vision Open source

arXiv cs.AI·SIG 82

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

ASPI is a benchmark of 728 task-attack scenarios measuring how clarification amplifies prompt injection vulnerability. Testing on 10 frontier LLMs shows attack success rates rise from 1.8% to 34.0% for o3 and 2.2% to 35.7% for Gemini-3-Flash in clarification mode. Code and data released.

AI Agents AI safety Benchmarks

arXiv cs.AI·SIG 82

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver is a unified framework for agentic formal theorem proving in Lean 4. The system iteratively revises failed proof attempts using retrieved compiler-verified proofs and Lean compiler feedback. Trained via continued pretraining and iterative post-training, OProver-32B achieves 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench.

AI Agents Reasoning Reinforcement learning
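To make the Pass@32 figure concrete, here is a minimal sketch of the pass@k estimator commonly used in code-generation and theorem-proving evaluations. Whether OProver uses exactly this estimator, or simply draws 32 attempts per problem, is not stated in the summary above, so treat this as an illustration of the metric rather than the paper's protocol. Function names are our own.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of proof attempts sampled for the problem
    c: number of those attempts that the checker accepted
    k: attempt budget being evaluated (k <= n)
    Returns the probability that at least one of k attempts drawn
    without replacement from the n samples is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over a benchmark; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical toy data: three problems, 32 attempts each.
    toy = [(32, 5), (32, 0), (32, 1)]
    print(benchmark_pass_at_k(toy, 32))
```

When k equals n, a problem counts as solved if any single attempt succeeds, so the benchmark score is just the fraction of problems with at least one accepted proof.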

arXiv cs.AI·SIG 82

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench is a benchmark of 18 ML tasks across 10 domains evaluating 6 AI research agents. Key findings: strategy complexity alone does not ensure performance (a greedy hill-climber matches tree search); effectiveness depends on the structure of improvement opportunities; an adaptive agent that detects stagnation outperforms the others. Includes 12 process-level behavioral metrics.

AI Agents Benchmarks Reasoning

arXiv cs.AI·SIG 82

ContractBench: Can LLM Agents Preserve Observation Contracts?

ContractBench benchmarks LLM agents' ability to preserve observation contracts (temporally valid, byte-level intact artifacts) in API calls. Of 38 models tested, none exceed 80%: Claude-Opus-4.6 leads at 77.8%. Results show integrity and validity failures uncorrelated with model size, and non-monotonic regression in the GPT-5 family despite larger scale.

AI Agents Benchmarks Claude

arXiv cs.AI·SIG 82

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the GRPO training of a stronger learner (Mathstral-7B) improves performance on MATH-500 (+1.62pp) and AIME 2025/2026 (+14.2pp at pass@1024). Intentional mismatch between problems and drafts is critical: the learner reaches 71.98% on MATH-500, the highest published result for this model.

Reinforcement learning Reasoning Benchmarks