The defining event of the week is Anthropic's Series H raise: $65 billion at a $965 billion valuation, with annualized revenue of $47 billion according to CFO Krishna Rao. That places the company at the edge of the trillion-dollar club, whose members include Microsoft and Apple. This is not merely a financial milestone: Anthropic paired the announcement with the launch of Claude Opus 4.8, Dynamic Workflows, and ultracode, and Simon Willison integrated the model into llm-anthropic 0.25.1 within hours, exposing a fast mode via -o fast 1 for enabled organizations. The tooling ecosystem around Anthropic releases now moves at near-instantaneous speed, so practitioners no longer need to wait for official wrappers to test new capabilities in production. The open question remains: at what ARR does Anthropic reach structural profitability, given that compute costs and safety research absorb an unknown fraction of that $47 billion?
The second dominant theme is the collision between agentic ambition and empirical reality. ITBench-AA, co-developed by Artificial Analysis and IBM, is the first benchmark focused on enterprise IT tasks in agentic mode (ticketing, incident remediation, workflow orchestration), and the results are unambiguous: Claude, GPT-4, and Gemini all score below 50%. This figure must be read alongside IDS (Inductive Deductive Synthesis), which solves 7/7 distributed key-value store specifications in 6.8 hours for $106, where GPT-5.4 and Claude Opus 4.6 solve only 2/7. The two results are not contradictory: multi-agent systems with formal scaffolding (Lean 4, symbolic verification) outperform solo frontier models on structured tasks, but real IT environments, which are heterogeneous, underspecified, and lack a verification oracle, remain out of reach. The gap between laboratory benchmark and operational deployment has rarely been documented so thoroughly in a single week.
Beyond the two central themes, the week produced an unusual density of architectural and interpretability work. FuRA (arXiv:2605.22869) proposes a LoRA alternative via full SVD decomposition with spectral preconditioning, gaining +1.37 points on commonsense reasoning on LLaMA-3-8B at LoRA-comparable efficiency, a result that warrants replication before adoption. Delta Attention Residuals (arXiv:2605.18855) reduces perplexity by 8.2% at 7.6B parameters by routing on inter-layer deltas rather than cumulative hidden states, with less than 0.01% parameter overhead. On the interpretability side, two papers examine how LLM representations map onto the brain. The first shows that sparse autoencoders decomposing GPT-2 XL and Llama-3.1-8B into 16K–32K features recover 94% of peak brain encoding performance (r=0.285) and align with known semantic cortical topography (ρ=0.72, p<0.001), with results that generalize across English, Chinese, and French. This reinforces the thesis that LLM semantic representations are not arbitrary but converge toward shared cognitive structures. The second, Brain-LLM Alignment (arXiv:2605.23032v1), adds a caveat: the strength of alignment tracks the model's dominant training language rather than any inherent property of English.
The coming week will likely see the first independent benchmarks on Claude Opus 4.8 and Dynamic Workflows, which will determine whether the $65 billion raise at a $965 billion valuation translates into a measurable qualitative leap or whether the ITBench-AA gap persists despite the new release.
ITBench-AA, a new benchmark from Artificial Analysis and IBM, evaluates frontier models on agentic enterprise IT tasks. Top models (Claude, GPT-4, Gemini) score below 50%, exposing significant gaps in automating complex IT workflows.
Anthropic raises $65 billion in Series H at a $965 billion valuation. Annualized revenue reaches $47 billion according to CFO Krishna Rao. The company will invest in safety research, computing capacity, and expanding its Claude product lineup.
FuRA introduces full-rank parameter-efficient fine-tuning via spectral preconditioning through SVD decomposition. By freezing pretrained singular bases and optimizing only compact cores via block tensor-train factorization, FuRA outperforms full fine-tuning and LoRA on LLaMA-3-8B (+1.37 commonsense reasoning) and VLMs while maintaining LoRA-comparable efficiency.
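The idea of freezing singular bases and training only a compact core can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the block-diagonal core below is a simplified stand-in for the block tensor-train factorization, and the function names, matrix size, and block size are all hypothetical choices made for illustration.

```python
import numpy as np

def svd_bases(weight):
    """Decompose a pretrained weight; u and vt are treated as frozen bases."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def block_diagonal(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    offset = 0
    for b in blocks:
        k = b.shape[0]
        out[offset:offset + k, offset:offset + k] = b
        offset += k
    return out

def adapted_weight(u, s, vt, core_blocks):
    """W' = U (diag(s) + core) V^T, where only core_blocks would be trained.

    Assumption: a block-diagonal core replaces the paper's block
    tensor-train factorization to keep the sketch short.
    """
    core = np.diag(s) + block_diagonal(core_blocks)
    return u @ core @ vt

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 8))          # stand-in for a pretrained layer
u, s, vt = svd_bases(weight)

block_size = 2                            # hypothetical block size
n_blocks = len(s) // block_size

# Zero-initialized core: the adapted layer starts identical to the original.
zero_blocks = [np.zeros((block_size, block_size)) for _ in range(n_blocks)]
w_init = adapted_weight(u, s, vt, zero_blocks)

# Random blocks stand in for the result of training.
trained_blocks = [rng.normal(scale=0.1, size=(block_size, block_size))
                  for _ in range(n_blocks)]
w_tuned = adapted_weight(u, s, vt, trained_blocks)

trainable_params = sum(b.size for b in trained_blocks)
update_rank = np.linalg.matrix_rank(w_tuned - weight)
```

In this toy setting the update touches every singular direction, so it is full rank, while the trainable parameter count stays well below the 64 entries of the full matrix. That is the trade-off the summary describes, although the real method's factorization and training procedure differ.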
Systematic study comparing state space models (SSMs) for time series classification. S4D outperforms Mamba variants in accuracy and efficiency. Authors introduce MS4 and MS4N, lightweight S4D variants with linear input projection and channel-mixing. Evaluation on 59 datasets (MONSTER, UEA): MS4N matches models with 10× more parameters.
Delta Attention Residuals improves residual connections by routing over layer deltas (vᵢ = hᵢ₊₁ − hᵢ) instead of cumulative hidden states. Results: −8.2% PPL at 7.6B, 1.8× sharper cross-layer routing (max weight 0.2→0.6), <0.01% parameter overhead. Code and paper released.
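The delta definition vᵢ = hᵢ₊₁ − hᵢ can be made concrete with a short pure-Python sketch. The routing rule used here (a softmax over scaled dot products between a query vector and each delta) is an assumption for illustration, not the paper's exact mechanism, and `query` is a hypothetical learned vector.

```python
import math

def layer_deltas(hidden_states):
    """Return v_i = h_{i+1} - h_i for consecutive layer outputs."""
    return [
        [b - a for a, b in zip(h_cur, h_next)]
        for h_cur, h_next in zip(hidden_states, hidden_states[1:])
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_over_deltas(hidden_states, query):
    """Mix layer deltas using softmax weights from dot(query, delta).

    Assumption: `query` is a hypothetical learned routing vector, and
    scaled dot-product scoring is used only for illustration.
    """
    deltas = layer_deltas(hidden_states)
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, d)) / scale for d in deltas]
    weights = softmax(scores)
    mixed = [
        sum(w * d[k] for w, d in zip(weights, deltas))
        for k in range(len(query))
    ]
    return weights, mixed

# Toy hidden states for four layer outputs in a 2-dimensional model.
hidden = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.5, 2.0]]
weights, mixed = route_over_deltas(hidden, query=[0.0, 1.0])
```

The deltas telescope: summed without weights they give back the last hidden state minus the first, so routing over deltas amounts to reweighting each layer's individual contribution instead of attending to the running total.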
ImProver 2 is a neurosymbolic framework for automated proof optimization in Lean 4. A trained 7B-parameter model outperforms models that are orders of magnitude larger and is competitive with mid-tier frontier models. The scaffold exposes formal structure alongside lightweight informal abstractions.
Brain-LLM alignment depends on training language dominance, not on inherent properties of English. Tested on 112 participants (English, Chinese, French) with 7 LLMs: a Chinese-dominant model (Baichuan2-7B) reverses the alignment gradient. Typological distance and tokenization fertility explain the remaining variation.
Sparse autoencoders decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. Semantic features alone recover 94% of peak encoding performance (r=0.285) and align with known cortical semantic organization (ρ=0.72, p<0.001). Results generalize across English, Chinese, and French.
Study showing that model collapse (progressive degradation of LLMs trained on their own outputs) follows cultural evolution laws. Tests on LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish reveal that compositionality follows a non-monotonic trajectory (rise then fall). Task-grounded filtering, not random filtering, sustains quality.
IDS (Inductive Deductive Synthesis) is a multi-agent LLM system that jointly synthesizes an implementation and a formal proof for distributed systems. On 7 key-value store specifications, IDS achieves 7/7 in 6.8h/$106, versus 2/7 for GPT-5.4 and Claude Opus 4.6. The result is 200× faster than expert effort and 17% cheaper than SOTA agents.
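The headline figures imply a few derived quantities that are worth spelling out. The arithmetic below is a back-of-the-envelope sketch under two assumptions not stated in the summary: that the 200× and 17% comparisons apply to the totals for the whole 7-specification run, and that "17% cheaper" means the IDS cost equals 83% of the baseline cost.

```python
# Figures quoted in the summary.
specs_solved = 7
wall_clock_hours = 6.8
total_cost_usd = 106.0
speedup_vs_expert = 200        # "200x faster than expert effort"
saving_vs_sota = 0.17          # "17% cheaper than SOTA agents"

# Per-specification averages (a simple mean; real runs may vary widely).
hours_per_spec = wall_clock_hours / specs_solved
cost_per_spec = total_cost_usd / specs_solved

# Implied baselines, valid only under the assumptions stated above.
implied_expert_hours = wall_clock_hours * speedup_vs_expert
implied_sota_cost = total_cost_usd / (1 - saving_vs_sota)

print(f"{hours_per_spec:.2f} h and ${cost_per_spec:.2f} per specification")
print(f"implied expert effort: {implied_expert_hours:.0f} h")
print(f"implied SOTA agent cost: ${implied_sota_cost:.2f}")
```

Under these assumptions the run averages just under one hour and about $15 per specification, the implied expert effort is about 1,360 hours, and the implied SOTA agent cost is roughly $128.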