Edition of 2026-06-12

Arbor shows tree-search multi-agent holds where single agents collapse — and Apple's fp8 turns out to be emulated

Two infrastructure papers stand out today. Arbor (arXiv:2606.12563) formalizes what many suspected: a single agent tasked with full-stack LLM inference optimization gains 33% and then collapses within hours. An Orchestrator + Critic architecture with tree search reaches a +193% Pareto improvement on the throughput-latency trade-off over the same baselines. The result itself isn't surprising, since tree search has been well known at least since AlphaGo, but validating it empirically on LLM inference optimization, with an explicit checks-and-balances mechanism, gives teams building long-running ops agents a reproducible blueprint. The second paper, Rigel, delivers the matching bad news on the hardware side: on Apple M4 Max, the matmul2d fp8 (E4M3) operation via Metal 4.1 is emulated on the GPU shader cores, with no dedicated matrix datapath, and runs at 0.94x fp16 throughput. Anyone who sized local inference on M4 Max expecting a real fp8 gain needs to revisit their numbers. Rigel's fused GEMM kernel recovers +6.5–12.9% in the cache-resident regime, but that's a workaround, not a fix.
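To make "revisit their numbers" concrete, here is a minimal sketch of the sizing gap. Only the 0.94x ratio comes from the Rigel result above; the function name, the 40 tokens/s fp16 baseline, and the 2x fp8 speedup a planner might have assumed are hypothetical. It also assumes a compute-bound workload where end-to-end throughput scales linearly with matmul throughput, which memory-bound decoding may not follow.

```python
def fp8_sizing_gap(fp16_tokens_per_s: float,
                   assumed_fp8_speedup: float,
                   measured_fp8_ratio: float = 0.94) -> dict:
    """Compare a plan built on an assumed fp8 speedup with the measured ratio.

    measured_fp8_ratio defaults to the 0.94x fp8/fp16 figure reported for
    M4 Max; every other input is a hypothetical planning number.
    """
    if fp16_tokens_per_s <= 0 or assumed_fp8_speedup <= 0:
        raise ValueError("throughput and speedup must be positive")
    planned = fp16_tokens_per_s * assumed_fp8_speedup   # what the plan expected
    revised = fp16_tokens_per_s * measured_fp8_ratio    # what emulated fp8 delivers
    shortfall = 1.0 - revised / planned                 # fraction of planned capacity missing
    return {"planned": planned, "revised": revised, "shortfall": shortfall}


# Hypothetical example: 40 tokens/s measured in fp16, plan assumed fp8 would double it.
gap = fp8_sizing_gap(fp16_tokens_per_s=40.0, assumed_fp8_speedup=2.0)
print(f"planned {gap['planned']:.1f} tok/s, revised {gap['revised']:.1f} tok/s, "
      f"shortfall {gap['shortfall']:.0%}")
```

Under these assumptions, a plan that counted on a 2x gain ends up with roughly half the capacity it budgeted for, and staying in fp16 is slightly faster than switching to fp8.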

On the formal reasoning front, Pythagoras-Prover (4B and 32B, open-source) beats DeepSeek-Prover-V2-671B on MiniF2F-Test: the 4B model scores 86.1% against 82.4% with roughly 167x fewer parameters, and the 32B reaches 93.0%. The gain comes from curriculum SFT combined with Augmented Lean Formalisation, not from a parameter race. This is the clearest signal to date that building efficient formal provers is a data and curriculum problem, not a model-size problem. The 32B also solves 93 of 672 PutnamBench problems (about 13.8%): modest, but measurable on a benchmark designed to be hard for humans.
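The ratios behind these figures are easy to check. The short sketch below uses only the nominal sizes from the model names (4B, 32B, 671B) and the PutnamBench count quoted above; the dictionary keys are illustrative labels, not official model identifiers.

```python
# Nominal parameter counts in billions, taken from the model names above.
DEEPSEEK_PROVER_V2_B = 671
PYTHAGORAS_SIZES_B = {"4B": 4, "32B": 32}  # keys are illustrative labels


def param_ratio(large_b: float, small_b: float) -> float:
    """How many times more parameters the larger model has."""
    return large_b / small_b


ratios = {label: param_ratio(DEEPSEEK_PROVER_V2_B, size)
          for label, size in PYTHAGORAS_SIZES_B.items()}

# PutnamBench solve rate for the 32B model: 93 problems out of 672.
putnam_rate = 93 / 672

for label, ratio in ratios.items():
    print(f"Pythagoras-Prover {label}: {ratio:.1f}x fewer parameters")
print(f"PutnamBench solve rate: {putnam_rate:.1%}")
```

The 167x figure applies to the 4B model only; the 32B is closer to 21x smaller than the 671B baseline.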

Two domain papers round out the selection. OpenMedQ (14 datasets, ~3.35M samples) outperforms Med-PaLM M 562B on PathVQA (75.9 BLEU-1) with a fraction of the parameters, the same pattern as Pythagoras-Prover. MARD (7B) gains +13.9pp over the best baseline and +6.7pp over GPT-4o on mechanism-level drug-drug interaction prediction on DrugBank (April 2026), and generalizes robustly to unseen drug pairs thanks to PRM-weighted DPO distillation. Both confirm that dense domain pretraining plus a fine-grained curriculum beats large generalist models on structured tasks.
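For readers unfamiliar with the term, here is one plausible reading of "PRM-weighted DPO", sketched in plain Python. It is an assumption, not MARD's published formulation: each preference pair gets the standard DPO loss, and a process-reward-model score (the hypothetical `weight` field) scales that pair's contribution. The log-probabilities, `beta`, and all names are illustrative.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def prm_weighted_dpo_loss(pairs: list[dict], beta: float = 0.1) -> float:
    """Weighted mean of per-pair DPO losses.

    Each pair carries a hypothetical 'weight' in [0, 1], standing in for a
    process-reward-model score of the chosen reasoning trace (assumption).
    """
    total_weight = sum(p["weight"] for p in pairs)
    if total_weight <= 0:
        raise ValueError("at least one pair needs a positive weight")
    weighted = sum(
        p["weight"] * dpo_pair_loss(p["policy_chosen"], p["policy_rejected"],
                                    p["ref_chosen"], p["ref_rejected"], beta)
        for p in pairs
    )
    return weighted / total_weight


# Illustrative pairs: sequence log-probabilities under the policy and reference models.
pairs = [
    {"policy_chosen": -12.0, "policy_rejected": -15.0,
     "ref_chosen": -13.0, "ref_rejected": -14.0, "weight": 0.9},
    {"policy_chosen": -20.0, "policy_rejected": -19.0,
     "ref_chosen": -20.0, "ref_rejected": -19.0, "weight": 0.2},
]
print(f"weighted DPO loss: {prm_weighted_dpo_loss(pairs):.4f}")
```

The design intuition, under this reading, is that pairs whose chosen reasoning the reward model trusts more pull harder on the student, while low-confidence pairs are downweighted rather than discarded.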

Arbor shows tree-search multi-agent holds where single agents collapse — and Apple's fp8 turns out to be emulated · Signal IA