The week's standout result is the formal refutation of Erdős's unit-distance conjecture, open since 1946, by an OpenAI reasoning model using algebraic number theory tools that mathematicians had not previously considered for this problem; Fields medalist Tim Gowers called it a "milestone." Also in formal mathematics, OProver-32B reached 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench in Lean 4, through a loop of continued pretraining and iterative post-training with compiler feedback. These two results are not isolated: they signal that reasoning models are beginning to produce non-trivial, formally verifiable mathematical contributions, which changes what counts as a proof of concept. The formal verification of 305 Lean 4 theorems embedded in the DASH paper (arXiv:2605.16282) fits the same pattern: AI-assisted formal reasoning is moving from benchmark performance to actual scientific output.
The second dominant theme is infrastructural and financial, with strategic implications that go well beyond accounting. The agreement disclosed in SpaceX's S-1 commits Anthropic to $1.25 billion per month in compute capacity on COLOSSUS and COLOSSUS II through May 2029, potentially $45 billion over the contract's lifetime. SpaceX simultaneously uses those same clusters to train Grok 5, creating a co-dependency and direct competitive overlap between vendor and customer that is rarely seen at this scale. This figure reframes the usual conversations about inference costs: the real battle is now over access to sovereign training clusters, and actors without proprietary access to this class of infrastructure are structurally disadvantaged in the coming training cycles.
The third theme, quieter but potentially the most durable for practitioners, is the methodological collapse of agent safety evaluations. The systematic analysis of 40 agent safety benchmarks published between 2023 and 2026 (arXiv:2605.16282) yields a Kendall's W of 0.10 (p = 0.94): existing benchmarks agree on nothing, their threat models are incompatible, and their metrics are fragmented. ASPI makes the same point from a different angle: in clarification mode, prompt injection success rates jump from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash, an attack surface created by a behavior widely considered good UX practice. ContractBench rounds out the picture: across 38 models, none exceeds 80% observation-contract preservation, with Claude-Opus-4.6 capping at 77.8% and a non-monotonic regression within the GPT-5 family. The cross-cutting lesson is that agent security in production cannot rely on current benchmarks to establish guarantees, and that certain behavioral improvements, such as clarification and chain-of-thought, introduce unanticipated vulnerabilities.
The coming week will likely see the first institutional reactions to the SpaceX-Anthropic compute agreement, particularly questions about the governance of a compute provider that simultaneously trains a competing model on the same infrastructure.
SpaceX signed a Cloud Services Agreement with Anthropic to provide compute capacity on the COLOSSUS and COLOSSUS II clusters. Anthropic will pay $1.25 billion per month through May 2029, with reduced fees during the May-June 2026 ramp-up. SpaceX also uses these resources to train Grok 5.
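The roughly $45 billion lifetime figure quoted for this agreement can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check: the start months are assumptions, the reduced ramp-up fees are not disclosed in this summary and are therefore ignored, and the helper names are illustrative.

```python
MONTHLY_FEE_BN = 1.25  # $ billions per month, as reported in the S-1 summary


def months_inclusive(start, end):
    """Count calendar months from start to end, both included.

    start and end are (year, month) tuples.
    """
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


def contract_total_bn(start, end, monthly_fee_bn=MONTHLY_FEE_BN):
    """Total payments in $ billions, assuming the full fee every month."""
    return months_inclusive(start, end) * monthly_fee_bn


# Assumption: 36 full-rate months (June 2026 through May 2029).
print(contract_total_bn((2026, 6), (2029, 5)))
# Assumption: billing starts in May 2026 at the full rate (an upper bound,
# since the ramp-up months are reportedly discounted).
print(contract_total_bn((2026, 5), (2029, 5)))
```

Under the first assumption the total is $45 billion, matching the figure above. The actual total depends on the ramp-up discount, which this sketch does not model.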
Systematic analysis of 40 agent safety benchmarks (2023-2026). Benchmarks exhibit incompatible threat models, fragmented metrics, and inconsistent risk coverage. A concordance test (Kendall's W = 0.10, p = 0.94) reveals no ranking alignment across evaluation dimensions. The authors release structured metadata and propose minimum reporting standards.
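For readers unfamiliar with the statistic, Kendall's W measures agreement among several rankings of the same items, from 0 (no agreement) to 1 (identical rankings). The sketch below implements the textbook formula without tie correction. How the paper maps benchmarks and evaluation dimensions onto "raters" and "items" is not stated in this summary, so the toy inputs are purely illustrative.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, without tie correction.

    rankings: list of m rankings, each a list of n ranks (1..n) assigned
    by one rater to the same n items, in the same item order.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of ranks each item received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviations of the rank sums from their mean.
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three raters in full agreement on four items.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))
# Two raters with exactly opposite rankings.
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))
```

A value of 0.10 sits near the bottom of this range, which is consistent with the summary's reading that the rankings show essentially no concordance.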
Impossibility theorem: no feature ranking can be simultaneously faithful, stable, and complete under collinearity. The authors quantify the result for 4 model classes, propose DASH (Diversified Aggregation of SHAP) as a resolution, and formally verify 305 Lean 4 theorems. Consequence: 68% of public datasets exhibit attribution instability.
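A toy example helps build intuition for why collinearity destabilizes feature rankings. The sketch below is not the paper's theorem and does not implement DASH. It assumes two perfectly collinear features and uses the standard per-feature attribution for a linear model, weight times deviation from the feature mean, as a stand-in for SHAP values. All names and numbers are illustrative.

```python
def predict(weights, x):
    """Prediction of a linear model without intercept."""
    return sum(w * xi for w, xi in zip(weights, x))


def linear_attributions(weights, x, means):
    """Per-feature attribution for a linear model: w_i * (x_i - mean_i)."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, means)]


def ranking(attributions):
    """Feature indices sorted by decreasing absolute attribution."""
    return sorted(range(len(attributions)), key=lambda i: -abs(attributions[i]))


# Two perfectly collinear features: the second column copies the first.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
means = [2.0, 2.0]

# Any weights summing to 1 fit y = x1 equally well on this data.
model_a = [0.75, 0.25]
model_b = [0.25, 0.75]

for x in data:
    assert predict(model_a, x) == predict(model_b, x)

x = (3.0, 3.0)
print(ranking(linear_attributions(model_a, x, means)))
print(ranking(linear_attributions(model_b, x, means)))
```

Both models make identical predictions on every data point, yet they rank the two features in opposite order, so the data alone cannot settle which ranking is right.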
An OpenAI model disproved a major conjecture in discrete geometry by solving the 80-year-old unit distance problem. This breakthrough marks a milestone in AI-driven mathematics.
OpenAI Whisper is a speech recognition model trained on 680,000 hours of multilingual weakly supervised data. The GitHub repository includes code, pre-trained models, and performance benchmarks across multiple languages and acoustic conditions.
OpenAI's reasoning model disproved a 1946 Erdős conjecture in unit-distance geometry using unexpected algebraic number theory tools. Fields Medalist Tim Gowers calls it "a milestone in AI mathematics."
Meta releases code and checkpoints for SAM 3 (Segment Anything Model 3). Repository includes inference, fine-tuning, and example notebooks for image segmentation.
ASPI is a benchmark of 728 task-attack scenarios measuring how clarification amplifies prompt injection vulnerability. Testing on 10 frontier LLMs shows that, in clarification mode, attack success rates rise from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. Code and data are released.
OProver is a unified framework for agentic formal theorem proving in Lean 4. The system iteratively revises failed proof attempts using retrieved compiler-verified proofs and Lean compiler feedback. Trained via continued pretraining and iterative post-training, OProver-32B achieves 93.3% Pass@32 on MiniF2F and 58.2% on ProverBench.
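Pass@32 is the probability that at least one of 32 sampled proof attempts is accepted. The summary does not say how OProver's figure was computed. The sketch below shows the widely used unbiased pass@k estimator introduced with the HumanEval benchmark, purely as an illustration of the metric. The function names and sample counts are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: attempts sampled, c: attempts that verified, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if k > n:
        raise ValueError("k cannot exceed the number of sampled attempts")
    if n - c < k:
        # Every size-k subset contains at least one verified attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: three problems, 32 attempts each.
print(benchmark_pass_at_k([(32, 4), (32, 0), (32, 1)], k=32))
```

When exactly k attempts are sampled per problem, the estimator reduces to the fraction of problems with at least one verified proof.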
FML-Bench is a benchmark of 18 ML tasks across 10 domains, used to evaluate 6 AI research agents. Key findings: strategy complexity alone does not ensure performance (a greedy hill-climber matches tree search); effectiveness depends on the structure of improvement opportunities; an adaptive agent that detects stagnation outperforms the others. Includes 12 process-level behavioral metrics.
ContractBench benchmarks LLM agents' ability to preserve observation contracts (temporally valid, byte-level intact artifacts) in API calls. Of 38 models tested, none exceeds 80%; Claude-Opus-4.6 leads at 77.8%. Results show that integrity and validity failures are uncorrelated with model size, and that the GPT-5 family regresses non-monotonically despite larger scale.
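One way to read "temporally valid, byte-level intact" is as two independent checks on any artifact an agent forwards from an earlier observation. The sketch below is an interpretation under that assumption, not ContractBench's scoring code. The function names, the expiry model, and the example token are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def is_intact(observed: bytes, forwarded: bytes) -> bool:
    """Byte-level integrity: the forwarded artifact matches what was observed."""
    return hashlib.sha256(observed).digest() == hashlib.sha256(forwarded).digest()


def is_valid(expires_at: datetime, used_at: datetime) -> bool:
    """Temporal validity: the artifact is used no later than its expiry."""
    return used_at <= expires_at


def contract_preserved(observed, forwarded, expires_at, used_at) -> bool:
    """Both conditions must hold for the contract to count as preserved."""
    return is_intact(observed, forwarded) and is_valid(expires_at, used_at)


# Hypothetical scenario: a token observed at t0 that expires five minutes later.
t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
token = b"tok_3f9a"
expiry = t0 + timedelta(minutes=5)

print(contract_preserved(token, token, expiry, t0 + timedelta(minutes=1)))
print(contract_preserved(token, b"tok_3f9a ", expiry, t0 + timedelta(minutes=1)))
print(contract_preserved(token, token, expiry, t0 + timedelta(minutes=10)))
```

The second call fails on integrity because of a single trailing space, and the third fails on validity because the token is used after it expired.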
Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the GRPO training of a stronger learner (Mathstral-7B) improves performance on MATH-500 (+1.62pp) and AIME 2025/2026 (+14.2pp at pass@1024). Intentionally mismatching problems and drafts is critical: the resulting 71.98% on MATH-500 is the highest published result for this model.