Week of 2026-05-25

Anthropic nears $965B valuation while agentic IT benchmarks cap at 50%: a week that redraws frontier deployment limits

By the editorial team

The defining event of the week is Anthropic's Series H raise: $65 billion at a $965 billion valuation, with annualized revenue of $47 billion according to CFO Krishna Rao. That places the company at the edge of the trillion-dollar club alongside Microsoft and Apple. This is not merely a financial milestone: Anthropic paired the announcement with the launch of Claude Opus 4.8, Dynamic Workflows, and ultracode, and Simon Willison added the model to llm-anthropic 0.25.1 within hours, exposing a fast mode via -o fast 1 for enabled organizations. The tooling ecosystem around Anthropic releases now moves almost as fast as the releases themselves, so practitioners no longer need to wait for official wrappers before testing new capabilities. The open question remains: at what ARR does Anthropic reach structural profitability, given that compute costs and safety research absorb an unknown fraction of that $47 billion?

The second dominant theme is the collision between agentic ambition and empirical reality. ITBench-AA, co-developed by Artificial Analysis and IBM, is the first benchmark focused on enterprise IT tasks in agentic mode (ticketing, incident remediation, workflow orchestration), and the results are unambiguous: Claude, GPT-4, and Gemini all score below 50%. This figure must be read alongside IDS (Inductive Deductive Synthesis), which solves 7/7 distributed key-value store specifications in 6.8 hours for $106, where GPT-5.4 and Claude Opus 4.6 solve only 2/7. The two results are not contradictory: multi-agent systems with formal scaffolding (Lean 4, symbolic verification) outperform solo frontier models on structured tasks, but real IT environments, which are heterogeneous, underspecified, and lack a verification oracle, remain out of reach. Rarely has the gap between laboratory benchmark and operational deployment been documented this thoroughly in a single week.

Beyond the two central themes, the week produced an unusual density of architectural and interpretability work. FuRA (arXiv:2605.22869) proposes a LoRA alternative via full SVD decomposition with spectral preconditioning, gaining +1.37 on commonsense reasoning with LLaMA-3-8B while keeping LoRA-comparable efficiency, a result that warrants replication before adoption. Delta Attention Residuals (arXiv:2605.18855) reduces perplexity by 8.2% at 7.6B parameters by routing on inter-layer deltas rather than cumulative hidden states, with less than 0.01% parameter overhead. On the interpretability side, two related papers examine brain-LLM alignment. The first shows that sparse autoencoders decomposing GPT-2 XL and Llama-3.1-8B into 16K–32K features per layer recover 94% of peak brain encoding performance (r=0.285) and align with known semantic cortical topography (ρ=0.72, p<0.001), with results that generalize across English, Chinese, and French. This reinforces the thesis that LLM semantic representations are not arbitrary but converge toward universal cognitive structures. The second, Brain-LLM Alignment (arXiv:2605.23032v1), adds a qualification: the strength of alignment tracks the model's dominant training language rather than typology or inherent properties of English.

The coming week will likely see the first independent benchmarks on Claude Opus 4.8 and Dynamic Workflows, which will determine whether the raise at a $965 billion valuation translates into a measurable qualitative leap or whether the ITBench-AA gap persists despite the new release.

Today's 5 picks

Hugging Face Blog·SIG 85

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

ITBench-AA, a new benchmark from Artificial Analysis and IBM, evaluates frontier models on agentic enterprise IT tasks. Top models (Claude, GPT-4, Gemini) score below 50%, exposing significant gaps in automating complex IT workflows.

Benchmarks AI Agents Claude

Latent Space·SIG 85

[AINews] Anthropic raises $965B Series H, releases Opus 4.8 and Dynamic Workflows/ultracode

Anthropic raises a $65B Series H at a $965B valuation and launches Opus 4.8 with Dynamic Workflows and ultracode. Major funding expansion and new model capabilities.

Anthropic Claude Funding

Simon Willison·SIG 85

llm-anthropic 0.25.1

Release of llm-anthropic 0.25.1: adds Claude Opus 4.8 model, -o fast 1 option for fast mode (enabled organizations), and default max_tokens now matches each model's maximum output instead of 8192.

Claude Anthropic Tools

The Decoder·SIG 85

Claude company Anthropic nears a trillion-dollar valuation after raising $65 billion in Series H

Anthropic raises $65 billion in Series H at a $965 billion valuation. Annualized revenue reaches $47 billion according to CFO Krishna Rao. The company will invest in safety research, computing capacity, and expanding its Claude product lineup.

Claude Anthropic Funding
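
The reported figures imply a few ratios that help put the round in context. The short calculation below is only a back-of-the-envelope sketch: it uses the three numbers quoted above and assumes the $965 billion valuation is post-money, which the item does not state.

```python
# Figures as reported in the item above.
raise_usd_bn = 65.0        # Series H amount
valuation_usd_bn = 965.0   # reported valuation (ASSUMPTION: post-money)
arr_usd_bn = 47.0          # annualized revenue attributed to the CFO

# Valuation expressed as a multiple of annualized revenue.
revenue_multiple = valuation_usd_bn / arr_usd_bn

# Share of the company sold in the round, valid only under the post-money assumption.
stake_sold = raise_usd_bn / valuation_usd_bn

# Distance to a one-trillion-dollar valuation.
gap_to_trillion_bn = 1000.0 - valuation_usd_bn

print(f"valuation / ARR: {revenue_multiple:.1f}x")
print(f"share sold (if post-money): {stake_sold:.1%}")
print(f"gap to $1T: ${gap_to_trillion_bn:.0f}B")
```

Under that assumption the valuation is roughly 20.5 times annualized revenue, and the round corresponds to about 6.7% of the company.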

arXiv cs.LG·SIG 82

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

FuRA introduces full-rank parameter-efficient fine-tuning via spectral preconditioning through SVD decomposition. By freezing pretrained singular bases and optimizing only compact cores via block tensor-train factorization, FuRA outperforms full fine-tuning and LoRA on LLaMA-3-8B (+1.37 commonsense reasoning) and VLMs while maintaining LoRA-comparable efficiency.

Fine-tuning Llama Reinforcement learning

arXiv cs.LG·SIG 82

A Simple State Space Model Excels at Multivariate Time Series Classification

Systematic study comparing state space models (SSM) for time series classification. S4D outperforms Mamba variants in accuracy and efficiency. Authors introduce MS4 and MS4N, lightweight S4D variants with linear input projection and channel-mixing. Evaluation on 59 datasets (MONSTER, UEA): MS4N matches models 10× larger in parameters.

Benchmarks Papers Reasoning

Reddit r/MachineLearning·SIG 82

Delta Attention Residuals [R]

Delta Attention Residuals improves residual connections by routing over layer deltas (vᵢ = hᵢ₊₁ − hᵢ) instead of cumulative hidden states. Results: −8.2% PPL at 7.6B, 1.8× sharper cross-layer routing (max weight 0.2→0.6), <0.01% parameter overhead. Code and paper released.

Papers Benchmarks Open source
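
To make the delta definition in the item above concrete, here is a minimal pure-Python sketch. It computes vᵢ = hᵢ₊₁ − hᵢ from a toy stack of hidden states and checks that the deltas telescope back to the final state. The softmax-weighted `route` function and the `scores` values are illustrative assumptions, not the routing mechanism from the paper.

```python
import math

def layer_deltas(hidden_states):
    """Return v_i = h_{i+1} - h_i for each pair of consecutive layer states."""
    return [
        [b - a for a, b in zip(h_cur, h_next)]
        for h_cur, h_next in zip(hidden_states, hidden_states[1:])
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(vectors, scores):
    """Softmax-weighted mix of candidate vectors (hypothetical routing rule)."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

# Toy hidden states for 4 layers of width 3 (made-up numbers).
hidden = [
    [1.0, 0.0, 2.0],
    [1.5, 0.2, 2.0],
    [1.5, 1.0, 1.0],
    [2.0, 1.0, 0.5],
]
deltas = layer_deltas(hidden)

# Telescoping identity: first state plus the sum of all deltas gives the last state.
reconstructed = [h0 + sum(v[d] for v in deltas) for d, h0 in enumerate(hidden[0])]

# Hypothetical routing logits, one per delta.
scores = [0.1, 2.0, 0.3]
mixed_delta = route(deltas, scores)

print("deltas:", deltas)
print("reconstructed last state:", reconstructed)
print("routed delta mix:", mixed_delta)
```

The point of the sketch is the contrast with cumulative states: each delta isolates what a single layer contributed, so routing weights can single out one layer's update instead of a running sum.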

arXiv cs.AI·SIG 82

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

ImProver 2 is a neurosymbolic framework for automated proof optimization in Lean 4. The trained 7B-parameter model outperforms models that are orders of magnitude larger and is competitive with mid-tier frontier models. The scaffold exposes formal structure alongside lightweight informal abstractions.

Reasoning Fine-tuning Papers

arXiv cs.CL·SIG 82

Brain-LLM Alignment Tracks Training Data, Not Typology

Brain-LLM alignment depends on training language dominance, not inherent English properties. Tested on 112 participants (English, Chinese, French) with 7 LLMs: a Chinese-dominant model (Baichuan2-7B) reverses the alignment gradient. Typological distance and tokenization fertility explain the remaining variation.

Benchmarks Alignment Papers

arXiv cs.CL·SIG 82

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Sparse autoencoders decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. Semantic features alone recover 94% of peak encoding performance (r=0.285) and align with known cortical semantic organization (ρ=0.72, p<0.001). Results generalize across English, Chinese, and French.

Papers GPT Llama

arXiv cs.CL·SIG 82

Model Collapse as Cultural Evolution

Study showing that model collapse (progressive degradation of LLMs trained on their own outputs) follows cultural evolution laws. Tests on LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish reveal that compositionality follows a non-monotonic trajectory (rise, then fall). Task-grounded filtering, not random filtering, sustains quality.

Llama Mistral Papers

arXiv cs.AI·SIG 82

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

IDS (Inductive Deductive Synthesis) is a multi-agent LLM system that jointly synthesizes implementation and formal proof for distributed systems. On 7 key-value store specifications, IDS achieves 7/7 in 6.8h/$106, versus 2/7 for GPT-5.4 and Claude Opus 4.6. The result is 200x faster than expert effort and 17% cheaper than SOTA agents.

AI Agents Multi-agent Code generation
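
The IDS headline numbers can be unpacked with simple arithmetic. The sketch below derives per-specification averages and the baselines implied by the "200x faster" and "17% cheaper" claims. It assumes both ratios apply to the totals quoted in the item (6.8 hours, $106), and the implied figures are derived values, not numbers reported by the paper.

```python
# Totals as quoted in the item above.
specs_solved = 7
total_hours = 6.8
total_cost_usd = 106.0

# Average effort per verified key-value store specification.
hours_per_spec = total_hours / specs_solved
cost_per_spec = total_cost_usd / specs_solved

# ASSUMPTION: "200x faster than expert effort" means expert_hours = 200 * total_hours.
implied_expert_hours = 200 * total_hours

# ASSUMPTION: "17% cheaper than SOTA agents" means total_cost = (1 - 0.17) * sota_cost.
implied_sota_cost_usd = total_cost_usd / (1 - 0.17)

print(f"per spec: {hours_per_spec:.2f} h, ${cost_per_spec:.2f}")
print(f"implied expert effort: {implied_expert_hours:.0f} h")
print(f"implied SOTA agent cost: ${implied_sota_cost_usd:.2f}")
```

Read this way, each specification costs about $15 and just under an hour on average, while the implied expert baseline is on the order of 1,360 hours.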