Week of 2026-05-18

Week of May 18, 2026: formal reasoning breakthroughs, $1.25B/month compute deals, and the safety benchmark illusion

The week's standout result is the formal refutation of Erdős's unit-distance conjecture, open since 1946, by an OpenAI reasoning model using tools from algebraic number theory that mathematicians had not previously considered for this problem. Fields medalist Tim Gowers explicitly called it a "milestone." In the same vein, OProver-32B reached 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench in Lean 4, using a loop of continued pretraining and iterative post-training driven by compiler feedback. These two results are not isolated: they signal that reasoning models are beginning to produce non-trivial, formally verifiable mathematical contributions, which takes the field beyond proof-of-concept demonstrations. The formal verification of 305 Lean 4 theorems embedded in the DASH paper (arXiv:2605.16282) fits the same pattern: AI-assisted formal reasoning is moving from benchmark performance to actual scientific output.
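For readers unfamiliar with the Pass@32 figure quoted above, here is a minimal sketch of the widely used unbiased pass@k estimator. Whether OProver-32B's evaluation uses exactly this estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of proof attempts sampled
    c: number of attempts accepted by the checker (e.g. the Lean compiler)
    k: attempt budget being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures exist, so any k-subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random k-subset of the n attempts contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over all problems (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

When exactly 32 attempts are sampled per problem, Pass@32 reduces to the fraction of problems solved at least once, so under that assumption 93.3% on MiniF2F would mean roughly that share of theorems received one compiler-accepted proof within 32 tries.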

The second dominant theme is infrastructural and financial, with strategic implications that go well beyond accounting. The agreement disclosed in SpaceX's S-1 commits Anthropic to $1.25 billion per month in compute capacity on COLOSSUS and COLOSSUS II through May 2029, potentially $45 billion over the contract's lifetime (36 months at that rate). SpaceX simultaneously uses those same clusters to train Grok 5, creating a co-dependency and direct competitive overlap between vendor and customer that is rarely seen at this scale. This figure reframes the usual conversations about inference costs: the real battle is now over access to sovereign training clusters, and actors without proprietary access to this class of infrastructure are structurally disadvantaged for the next training cycles.
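The contract arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch that assumes a flat monthly rate with no escalators or discounts; the source quotes only the monthly figure, the end date, and the potential total, and the helper names are hypothetical.

```python
MONTHLY_COMMITMENT_USD = 1_250_000_000   # $1.25B per month, as quoted above
QUOTED_LIFETIME_USD = 45_000_000_000     # "potentially $45 billion"


def implied_months(total_usd: int, monthly_usd: int) -> int:
    """Number of flat monthly payments that add up to the quoted total."""
    months, remainder = divmod(total_usd, monthly_usd)
    if remainder:
        raise ValueError("total is not a whole number of monthly payments")
    return months


def cumulative_commitment(monthly_usd: int, months: int) -> list[int]:
    """Running total after each month, assuming a constant rate."""
    return [monthly_usd * (i + 1) for i in range(months)]


months = implied_months(QUOTED_LIFETIME_USD, MONTHLY_COMMITMENT_USD)
schedule = cumulative_commitment(MONTHLY_COMMITMENT_USD, months)
```

Under the flat-rate assumption, the quoted total corresponds to 36 monthly payments, and the commitment passes $15 billion after the first 12 months.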

The third theme, quieter but potentially the most durable for practitioners, is the methodological collapse of agent safety evaluations. The systematic analysis of 40 agent safety benchmarks published between 2023 and 2026 (arXiv:2605.16282) yields a Kendall's W (coefficient of concordance) of 0.10 (p = 0.94): existing benchmarks show almost no agreement with one another, their threat models are incompatible, and their metrics are fragmented. ASPI makes the same point from a different angle: in clarification mode, prompt injection success rates jump from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash, an attack surface created by a behavior widely considered good UX practice. ContractBench rounds out the picture: across 38 models, none exceeds 80% observation-contract preservation, with Claude-Opus-4.6 capping at 77.8% and a non-monotone regression within the GPT-5 family. The cross-cutting lesson is that agent security in production cannot rely on current benchmarks to establish guarantees, and that certain behavioral improvements, such as clarification and chain-of-thought, introduce unanticipated vulnerabilities.
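To make the Kendall's W figure concrete, here is a minimal sketch of the coefficient of concordance without tie correction. It assumes that each benchmark acts as a "rater" that ranks the same set of models; how the cited analysis actually constructs its rankings is not stated here, and `kendalls_w` is an illustrative name.

```python
def kendalls_w(rankings: list[list[int]]) -> float:
    """Kendall's coefficient of concordance W (no tie correction).

    rankings: one list per rater (here, per benchmark); each list gives
    the rank 1..n that the rater assigns to each of the same n items.
    Returns a value in [0, 1]: 1 means identical rankings, 0 means none
    of the raters' preferences line up.
    """
    m = len(rankings)          # number of raters
    n = len(rankings[0])       # number of ranked items
    if m < 2 or n < 2:
        raise ValueError("need at least 2 raters and 2 items")
    for r in rankings:
        if sorted(r) != list(range(1, n + 1)):
            raise ValueError("each ranking must be a permutation of 1..n")

    # Sum of the ranks each item received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Squared deviation of the rank sums from their mean.
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three hypothetical benchmarks ranking four hypothetical models.
disagreeing = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
agreeing = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
```

In this toy data, `kendalls_w(agreeing)` is 1.0 while `kendalls_w(disagreeing)` is about 0.11, close to the 0.10 reported above: at that level, a model's rank on one benchmark says almost nothing about its rank on another.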

The coming week will likely see the first institutional reactions to the SpaceX-Anthropic compute agreement, particularly questions about the governance of a compute provider that simultaneously trains a competing model on the same infrastructure.

Signal IA