Edition of 2026-06-11

RAG on mobile NPU, SLO-aware multi-agent orchestration, and science-reproducing agents: AI moves down the stack

By the editorial team

Two papers today address the same problem from opposite ends: where and how to run inference efficiently. The RAG implementation on Snapdragon X Elite (arXiv:2606.11447), which runs embedding, reranking, and LLM generation entirely on the Hexagon NPU, delivers 18.1× faster prefilling and 4× lower system energy than the CPU baseline, with answer quality rated on par by a GPT-4.1 judge (9.32 vs 8.95 for CPU). At the other end, INFRAMIND (arXiv:2606.11440) orchestrates agents with real-time awareness of infrastructure state (GPU queue depths, KV-cache pressure, latencies), achieving 7× lower latency and 99.9% SLO compliance under high load. The shared signal: inference optimization is no longer just a model-level problem but a systems-level one, whether you're on a 4W NPU or a cluster under pressure.

On the agents front, ISE (arXiv:2606.11520) is the most actionable result: fine-tuning Qwen3-8B on 23,132 multi-turn OS-agent trajectories generated with live sandbox execution raises ClawEval pass@1 from 19.3 to 37.7 (+18.4 points), outperforming zero-shot GPT-4o and Qwen3-32B. This is direct evidence that trajectory data quality, grounded in real execution rather than static synthesis, can matter more than model size. SocSci-Repro-Bench (arXiv:2606.11447) rounds out the evaluation side: 221 tasks reproducing published social science findings, with Claude Code substantially ahead of Codex. Agents also do well at identifying research questions, and the results do not appear to be driven primarily by memorization. A useful benchmark for anyone deploying agents on real analytical workflows.

ProHiFlo (arXiv:2606.11243) is the most domain-specific result: 58.9% success on enzyme active site scaffolding versus 41.2% for RFDiffusion, with 4× fewer sampling steps. The SE(3)-equivariant, coarse-to-fine architecture with functional guidance via pretrained predictors is a clear direction for de novo protein generation. Less immediately actionable for most practitioners, but worth tracking if you work on drug discovery or biodesign pipelines.

Today's 5 picks

arXiv cs.CL·SIG 82

AI Coding Agents Can Reproduce Social Science Findings

SocSci-Repro-Bench, a benchmark of 221 social science tasks, evaluates AI agents' ability to reproduce published findings. Claude Code substantially outperforms Codex, with reproduction rates exceeding those reported on previous LLM-based agent benchmarks. Agents also perform strongly on reasoning tasks that require identifying research questions, and the results do not appear to be driven primarily by memorization.

Claude Code · Benchmarks · Code generation

arXiv cs.CL·SIG 82

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

First end-to-end RAG pipeline running all neural stages on a mobile NPU (the Hexagon NPU in Snapdragon X Elite). Embedding, reranking, and LLM generation all run on-device. On a 120-query Wikipedia benchmark: 18.1× faster LLM prefilling and 4.0× lower system energy than CPU, with answer-quality parity (GPT-4.1 judge: 9.32 vs 8.95 for CPU).

RAG · Embeddings
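
For readers less familiar with the three stages named in this entry (embedding, reranking, generation), here is a minimal sketch of how they chain together. Everything in it is a toy stand-in and an assumption on our part, not the paper's implementation: `embed` is a bag-of-words counter rather than a neural embedding model, `rerank` scores by token overlap rather than a cross-encoder, and `generate` only assembles the prompt an LLM would receive. Nothing here touches the NPU.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (hypothetical stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Stage 1: rank every passage by embedding similarity and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def rerank(query: str, candidates: list[str], k: int = 1) -> list[str]:
    """Stage 2: toy reranker scoring the fraction of query tokens found in each passage."""
    q_tokens = set(query.lower().split())

    def score(doc: str) -> float:
        return len(q_tokens & set(doc.lower().split())) / (len(q_tokens) or 1)

    return sorted(candidates, key=score, reverse=True)[:k]


def generate(query: str, context: list[str]) -> str:
    """Stage 3: stub generator that only builds the prompt an LLM would be given."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, corpus: list[str]) -> str:
    """Run the three stages in order: retrieve, rerank, generate."""
    candidates = retrieve(query, corpus, k=3)
    top = rerank(query, candidates, k=1)
    return generate(query, top)


if __name__ == "__main__":
    docs = [
        "The Hexagon NPU is part of Snapdragon X Elite",
        "Paris is the capital of France",
        "Flow matching generates proteins",
    ]
    print(rag_answer("what is the capital of France", docs))
```

In a real deployment each of the three functions would be replaced by a model call, and the interesting engineering question raised by the paper is which processor each call runs on.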

arXiv cs.AI·SIG 82

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND is a framework for multi-agent orchestration that integrates real-time infrastructure state (GPU queue depths, KV-cache pressure, latencies). Through adaptive planning, per-step routing, and intelligent scheduling, it optimizes model selection and topologies under concurrent load. Results: +7.6 pp accuracy at low load, and 7× lower latency with 99.9% SLO compliance under high load.

Multi-agent · AI Agents · Reinforcement learning
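
To make the idea of per-step routing on infrastructure state concrete, here is a minimal sketch. It is not INFRAMIND's algorithm: the `Backend` fields, the latency model in `estimated_latency`, the 0.9 KV-cache threshold, and the "best quality within the SLO, else fastest" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    quality: float         # hypothetical offline accuracy score, higher is better
    base_latency_s: float  # latency of one step on an idle backend
    queue_depth: int       # requests currently waiting
    kv_cache_util: float   # fraction of KV cache in use, 0.0 to 1.0


def estimated_latency(b: Backend) -> float:
    """Assumed latency model: queueing delay, doubled when the KV cache is nearly full."""
    queueing = b.base_latency_s * b.queue_depth
    kv_penalty = 2.0 if b.kv_cache_util > 0.9 else 1.0
    return (b.base_latency_s + queueing) * kv_penalty


def route(backends: list[Backend], slo_s: float) -> Backend:
    """Pick the highest-quality backend expected to meet the SLO, else the fastest one."""
    within_slo = [b for b in backends if estimated_latency(b) <= slo_s]
    if within_slo:
        return max(within_slo, key=lambda b: b.quality)
    return min(backends, key=estimated_latency)


if __name__ == "__main__":
    large = Backend("large", quality=0.9, base_latency_s=1.0, queue_depth=0, kv_cache_util=0.2)
    small = Backend("small", quality=0.7, base_latency_s=0.2, queue_depth=0, kv_cache_util=0.1)
    print(route([large, small], slo_s=2.0).name)  # idle cluster
    large.queue_depth = 5
    print(route([large, small], slo_s=2.0).name)  # large model is backed up
```

The design choice worth noting is that the decision is re-made at every agent step from live state, so the same workflow can use a stronger model when the cluster is idle and degrade gracefully under load.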

arXiv cs.LG·SIG 82

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo is a hierarchical flow matching framework for de novo protein generation. It combines coarse-to-fine generation (backbone then atoms), functional guidance via pretrained predictors, and SE(3)-equivariant architecture. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate vs 41.2% for RFDiffusion, with 4× fewer sampling steps.

Papers · Benchmarks · Reasoning

arXiv cs.CL·SIG 82

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE is a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with live execution. It yields 43,956 structured intents and 23,132 trajectories (8.12 user turns on average), executed in an isolated OS workspace. Fine-tuning Qwen3-8B on ISETrace raises ClawEval pass@1 from 19.3 to 37.7, outperforming zero-shot GPT-4o and Qwen3-32B.

AI Agents · Benchmarks · Code generation
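
A note on the pass@1 figures in this entry. The summary does not say how ClawEval computes the metric, so the sketch below is an assumption: it uses the unbiased pass@k estimator that is standard in code-generation evaluation, where n attempts are sampled per task and c of them pass. With k = 1 it reduces to the fraction of passing attempts, c / n. The function names are ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts sampled, c of them passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over tasks, in percent. Each result is an (n, c) pair."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Four hypothetical tasks, one attempt each, one success.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))
```

Read this way, a move from 19.3 to 37.7 means the fine-tuned model solves roughly 38 of every 100 tasks on the first try, up from about 19.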