Edition of 2026-06-12

Arbor shows tree-search multi-agent systems hold up where single agents collapse, and Apple's fp8 turns out to be emulated

By the editorial team

Two infrastructure papers stand out today. Arbor (arXiv:2606.12563) formalizes what many suspected: a single agent tasked with full-stack LLM inference optimization gains 33% over vendor-optimized baselines and then collapses within hours, while an Orchestrator + Critic architecture with tree search reaches a +193% Pareto improvement on throughput-latency over the same baselines. That tree search helps is not surprising in itself, since the technique was already well established by the time of AlphaGo. What is new is the empirical validation on LLM inference optimization, with an explicit checks-and-balances mechanism, which finally gives teams building long-running ops agents a reproducible blueprint.

Rigel delivers the matching bad news on the hardware side: on the Apple M4 Max, the matmul2d fp8 (E4M3) operation in Metal 4.1 is emulated on the GPU shader cores, with no dedicated matrix datapath, and runs at 0.94x fp16 throughput, that is, slightly slower than fp16 rather than faster. Anyone who sized local inference on the M4 Max expecting a real fp8 speedup needs to revisit their numbers. Rigel's hand-fused GEMM+bias+GELU kernel recovers +6.5–12.9% in the cache-resident regime, but that is a workaround, not a fix.

On the formal reasoning front, Pythagoras-Prover (4B and 32B, open-source) beats DeepSeek-Prover-V2-671B on MiniF2F-Test: the 4B scores 86.1% against 82.4% with roughly 167x fewer parameters, and the 32B reaches 93.0%. The gain comes from curriculum SFT combined with Augmented Lean Formalisation (ALF), not from a parameter race. This is the clearest signal to date that efficient formal proving is a data and curriculum problem rather than a model-size problem. The 32B also solves 93 of 672 PutnamBench problems (about 14%), a modest but measurable result on a benchmark designed to be hard for humans.

Two domain papers round out the selection. OpenMedQ (14 datasets, ~3.35M samples) outperforms Med-PaLM M 562B on PathVQA (75.9 BLEU-1) with a fraction of the parameters, the same pattern as Pythagoras-Prover. MARD (7B) gains +13.9pp over the best baseline and +6.7pp over GPT-4o on mechanism-level drug-drug interaction prediction on April-2026 DrugBank, with robust generalization to unseen drug pairs via PRM-weighted DPO distillation (DPO weighted by a process reward model). Both reinforce the point that dense domain training combined with a carefully designed curriculum can beat large generalist models on structured tasks.

Today's 5 picks

arXiv cs.AI·SIG 82

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor is a multi-agent framework that introduces tree search as a cognition layer for autonomous agents. Validated on full-stack LLM inference optimization, it pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture. Arbor achieves a 193% throughput-latency Pareto improvement over vendor-optimized baselines, versus 33% for a single agent that crashes within hours.

AI Agents Multi-agent Reasoning
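
The summary above does not describe Arbor's actual search procedure, so the following is only a generic sketch of the idea: a best-first tree search in which one function proposes candidate configurations and a second, independent check can veto them before they are explored. Every name here (`propose`, `critic_accepts`, `score`, the `batch` and `tp` knobs, the budget of 32) is hypothetical and chosen for illustration; the scoring function is a toy stand-in for a real benchmark run.

```python
import heapq
from itertools import count

def propose(config):
    """Hypothetical orchestrator step: return child configs, one knob doubled each."""
    children = []
    for knob in ("batch", "tp"):
        child = dict(config)
        child[knob] = config[knob] * 2
        children.append(child)
    return children

def score(config):
    """Toy stand-in for a measured benchmark: higher is better, peak at batch=8, tp=2."""
    return -abs(config["batch"] - 8) - 4 * abs(config["tp"] - 2)

def critic_accepts(config):
    """Hypothetical critic check: veto configs that exceed an assumed resource budget."""
    return config["batch"] * config["tp"] <= 32

def tree_search(root, max_expansions=20):
    tie = count()  # tie-breaker so heapq never has to compare dicts
    frontier = [(-score(root), next(tie), root)]
    seen = {tuple(sorted(root.items()))}
    best = root
    for _ in range(max_expansions):
        if not frontier:
            break
        # Expand the most promising node found so far, wherever it sits in the tree.
        _, _, node = heapq.heappop(frontier)
        if score(node) > score(best):
            best = node
        for child in propose(node):
            key = tuple(sorted(child.items()))
            if key in seen or not critic_accepts(child):
                continue  # duplicates and vetoed branches are never explored
            seen.add(key)
            heapq.heappush(frontier, (-score(child), next(tie), child))
    return best

best = tree_search({"batch": 1, "tp": 1})
print(best, score(best))
```

The design point the sketch tries to capture is that the frontier keeps unexplored alternatives alive, so a bad branch can be abandoned in favor of an earlier sibling, whereas a single linear agent has only its current state to build on.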

arXiv cs.AI·SIG 82

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover is an open-source family of efficient Lean theorem provers (4B and 32B parameters, including a diffusion-based prototype). Via curriculum SFT and Augmented Lean Formalisation (ALF), the 4B model outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test (86.1% vs 82.4%) with 167x fewer parameters. The 32B achieves 93.0% on MiniF2F-Test and solves 93/672 PutnamBench problems.

Reasoning Code generation Benchmarks
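
As a quick sanity check on the figures quoted above, the arithmetic can be reproduced in a few lines. The parameter counts are read off the model names (671B, 4B, 32B), which is an assumption about exact sizes; the variable names are ours.

```python
# Parameter counts in billions, taken from the model names in the summary above.
baseline_b = 671          # DeepSeek-Prover-V2-671B
small_b, large_b = 4, 32  # Pythagoras-Prover sizes

ratio_small = baseline_b / small_b  # size ratio for the 4B model
ratio_large = baseline_b / large_b  # size ratio for the 32B model
gap_pp = 86.1 - 82.4                # MiniF2F-Test margin of the 4B, in percentage points
putnam_rate = 93 / 672              # share of PutnamBench problems solved by the 32B

print(f"4B:  {ratio_small:.2f}x smaller, +{gap_pp:.1f} pp on MiniF2F-Test")
print(f"32B: {ratio_large:.2f}x smaller, {putnam_rate:.1%} of PutnamBench solved")
```

The "167x" figure therefore applies to the 4B model only (671 / 4 = 167.75, truncated); the 32B is about 21x smaller than the baseline.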

arXiv cs.AI·SIG 82

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ is a medical vision-language model pretrained on 14 datasets (~3.35M samples) covering pathology, radiology, microscopy, and clinical QA. It achieves 75.9 BLEU-1 on PathVQA (outperforming Med-PaLM M 562B) and 0.757 average macro-F1 on 8 unseen medical classification benchmarks.

Vision Benchmarks Open source
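
For readers unfamiliar with the PathVQA metric, BLEU-1 is the unigram case of BLEU: clipped unigram precision multiplied by a brevity penalty. The sketch below is a minimal single-reference version with lowercase whitespace tokenization. The summary does not say which tokenizer or BLEU implementation OpenMedQ's evaluation used, so those choices are assumptions, and `bleu1` is our own helper name.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Single-reference BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate token's count by how often it appears in the reference,
    # so repeating a correct word does not inflate the score.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = overlap / len(cand)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the tissue is benign", "the tissue is malignant"))
```

Scores such as 75.9 are this quantity expressed on a 0 to 100 scale and averaged over a test set; the exact aggregation used in the paper is not stated in the summary.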

arXiv cs.CL·SIG 82

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD is a 7B-parameter model for mechanism-level drug-drug interaction prediction (enzyme, pharmacodynamic axis). It uses reasoning distillation with process-reward-weighted DPO and mechanism-aware retrieval. On April-2026 DrugBank, it gains +13.9pp over the best baseline and +6.7pp over GPT-4o, with robust generalization to unseen drug pairs.

Reasoning Fine-tuning Reinforcement learning
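
To make "process-reward-weighted DPO" concrete, here is a minimal sketch of the standard DPO loss for one preference pair, scaled by a per-pair weight. The summary does not say how MARD derives or applies its process-reward weights, so the multiplicative `weight` argument is an assumption, as are the function name and the default `beta`.

```python
import math

def weighted_dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                      weight=1.0, beta=0.1):
    """DPO loss for one preference pair, scaled by an assumed per-pair weight.

    The first four arguments are summed token log-probabilities of the chosen
    and rejected responses under the policy and the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as log(1 + exp(-z)) in a numerically stable form.
    neg_log_sigmoid = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return weight * neg_log_sigmoid

# A pair the policy already ranks correctly, with a hypothetical weight of 0.8.
print(weighted_dpo_loss(-1.0, -5.0, -2.0, -2.0, weight=0.8))
```

Under this assumption, pairs whose reasoning steps score higher with the process reward model would contribute more to the gradient, while low-weight pairs are damped rather than discarded.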

arXiv cs.CL·SIG 82

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel empirically characterizes the Metal 4.1 tensor compute path on the Apple M4 Max. The researchers find that fp8 (E4M3) matmul2d is emulated rather than accelerated (0.94x fp16 throughput), executes on the GPU shader cores without a dedicated matrix datapath, and accumulates in fp32 or wider precision. A hand-fused GEMM+bias+GELU kernel gains +6.5–12.9% in the cache-resident regime.

Benchmarks Infrastructure Code generation
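
For context on what an emulated path has to do for every operand, the sketch below decodes a single fp8 E4M3 byte in plain Python. It assumes the common OCP-style E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern per sign); the summary does not state which E4M3 variant Metal 4.1 uses, and `decode_e4m3` is our own helper, not a Metal API.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one fp8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # assumed variant: all-ones pattern is NaN, no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6  # subnormal: no implicit leading 1
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

values = [decode_e4m3(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]
# Largest finite value and smallest positive value representable in this layout.
print(max(finite), min(v for v in finite if v > 0))
```

With only three mantissa bits per operand, products lose precision quickly, which is consistent with the finding that accumulation happens in fp32 or wider.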