Two papers today address the same problem from opposite ends: where and how to run inference efficiently. The RAG implementation on Snapdragon X Elite (arXiv:2606.11447), which runs embedding, reranking, and LLM generation entirely on the Hexagon NPU, delivers 18.1× faster LLM prefilling and 4× lower system energy than the CPU baseline, with answer quality rated on par by a GPT-4.1 judge (9.32 for NPU vs 8.95 for CPU). At the other end, INFRAMIND (arXiv:2606.11440) orchestrates agents with real-time awareness of infrastructure state (GPU queues, KV cache, latencies), achieving 7× lower latency and 99.9% SLO compliance under overload. The shared signal: inference optimization is no longer just a model-level problem but a systems-level one, whether you're on a 4W NPU or a cluster under pressure.
On the agents front, ISE (arXiv:2606.11520) is the most actionable result: fine-tuning Qwen3-8B on 23,132 multi-turn OS-agent trajectories generated with live sandbox execution lifts ClawEval pass@1 from 19.3 to 37.7, beating zero-shot GPT-4o and Qwen3-32B. This is direct evidence that trajectory data quality, grounded in real execution rather than static synthesis, can matter more than model size. SocSci-Repro-Bench (arXiv:2606.11447) rounds out the evaluation side: 221 tasks reproducing published social science findings, with Claude Code ahead of Codex and agents able to identify the underlying research questions rather than simply recalling memorized outputs. A useful benchmark for anyone deploying agents on real analytical workflows.
ProHiFlo (arXiv:2606.11243) is the most domain-specific result: 58.9% success on enzyme active site scaffolding versus 41.2% for RFDiffusion, with 4× fewer sampling steps. The SE(3)-equivariant, coarse-to-fine architecture with functional guidance from pretrained predictors is a clear direction for de novo protein generation. Less immediately actionable for most practitioners, but worth tracking if you work on drug discovery or biodesign pipelines.
SocSci-Repro-Bench is a benchmark of 221 social science tasks that evaluates AI agents' ability to reproduce published findings. Claude Code substantially outperforms Codex, and reproduction rates exceed those of previous LLM-based agent benchmarks. Agents also perform strongly on reasoning tasks that involve identifying research questions, and the results do not appear to be primarily driven by memorization.
First end-to-end RAG pipeline to run all neural stages on a mobile NPU (the Hexagon NPU in Snapdragon X Elite): embedding, reranking, and LLM generation all execute on-device. On a 120-query Wikipedia benchmark: 18.1× faster LLM prefilling and 4.0× lower system energy vs CPU, with answer quality parity (GPT-4.1 judge: 9.32 NPU vs 8.95 CPU).
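To make the three stages concrete, here is a minimal sketch of the embed, rerank, generate flow in plain Python. It is not the paper's implementation: the bag-of-words `embed`, the token-overlap `rerank`, and the `generate` stub are hypothetical stand-ins for the neural models that the paper runs on the NPU, and the documents are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a neural embedder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k):
    """Stage 1: rank documents by embedding similarity and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query, candidates, k):
    """Stage 2: toy reranker scoring the fraction of query tokens found in a doc
    (assumption: stands in for a neural reranking model)."""
    q_tokens = set(query.lower().split())

    def score(doc):
        return len(q_tokens & set(doc.lower().split())) / len(q_tokens)

    return sorted(candidates, key=score, reverse=True)[:k]

def generate(query, context):
    """Stage 3: stub that only builds the prompt an LLM would receive."""
    return "Context:\n" + "\n".join(context) + "\nQuestion: " + query

def rag(query, docs, retrieve_k=3, rerank_k=1):
    """Chain the three stages: retrieve broadly, rerank narrowly, then generate."""
    candidates = retrieve(query, docs, retrieve_k)
    top = rerank(query, candidates, rerank_k)
    return generate(query, top)

# Invented example corpus.
DOCS = [
    "The accelerator runs quantized models at low power",
    "Paris is the capital of France",
    "Flow matching trains a velocity field for generative models",
]
prompt = rag("what is the capital of France", DOCS)
```

With these toy documents, only the sentence about Paris survives reranking and reaches the prompt. In a real deployment each stage would be a separate model, which is why running all three on one accelerator is the notable part of the result.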
INFRAMIND is a framework for multi-agent orchestration that integrates real-time infrastructure state (GPU queue depths, KV-cache pressure, latencies). Via adaptive planning, per-step routing, and intelligent scheduling, it optimizes model selection and topologies under concurrent load. Results: +7.6pp accuracy at low load; 7× lower latency and 99.9% SLO compliance under high load.
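The idea of per-step routing on infrastructure state can be illustrated with a small sketch. This is not INFRAMIND's algorithm: the `Backend` fields, the latency cost model, and the 0.8 cache-pressure threshold are all assumptions chosen for illustration. The router picks the highest-quality backend whose estimated latency fits the SLO and falls back to the fastest one when nothing fits.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    quality: float         # hypothetical offline accuracy score in [0, 1]
    base_latency_s: float  # hypothetical unloaded latency per request
    queue_depth: int       # requests currently waiting on this backend
    kv_cache_util: float   # fraction of KV cache in use, in [0, 1]

def estimated_latency(b):
    """Assumed cost model: each queued request adds one base latency, and KV-cache
    utilization above 0.8 scales the total (up to 2x at full utilization)."""
    queue_wait = b.queue_depth * b.base_latency_s
    cache_penalty = 1.0 + max(0.0, b.kv_cache_util - 0.8) * 5.0
    return (queue_wait + b.base_latency_s) * cache_penalty

def route(backends, slo_s):
    """Pick the best-quality backend that meets the SLO, else the fastest one."""
    feasible = [b for b in backends if estimated_latency(b) <= slo_s]
    if feasible:
        return max(feasible, key=lambda b: b.quality)
    return min(backends, key=estimated_latency)

# Low load: the larger model fits the 5 s SLO, so it is chosen.
low_load = [Backend("large", 0.9, 2.0, 0, 0.3), Backend("small", 0.7, 0.5, 0, 0.3)]
# High load: the larger model's queue pushes it past the SLO, so the router downgrades.
high_load = [Backend("large", 0.9, 2.0, 4, 0.3), Backend("small", 0.7, 0.5, 2, 0.3)]

choice_low = route(low_load, slo_s=5.0).name
choice_high = route(high_load, slo_s=5.0).name
```

The same request is served by different models depending on load, which is the general mechanism by which an infrastructure-aware orchestrator could trade a little accuracy for SLO compliance under pressure.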
ProHiFlo is a hierarchical flow matching framework for de novo protein generation. It combines coarse-to-fine generation (backbone, then atoms), functional guidance via pretrained predictors, and an SE(3)-equivariant architecture. On enzyme active site scaffolding, ProHiFlo achieves a 58.9% success rate vs 41.2% for RFDiffusion, with 4× fewer sampling steps.
ISE is a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with live execution. It yields 43,956 structured intents and 23,132 trajectories (avg 8.12 user turns), executed in an isolated OS workspace. Fine-tuning Qwen3-8B on ISETrace raises ClawEval pass@1 from 19.3 to 37.7, outperforming zero-shot GPT-4o and Qwen3-32B.
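For readers less familiar with the metric, the sketch below shows how a pass@1 score is typically computed (the share of tasks solved on the first attempt, as a percentage) and works through the size of the reported jump. The function names are hypothetical, and the four-task result list is invented; only the 19.3 and 37.7 figures come from the summary above.

```python
def pass_at_1(first_attempt_results):
    """Percentage of tasks whose first attempt succeeded."""
    results = list(first_attempt_results)
    if not results:
        raise ValueError("need at least one task result")
    return 100.0 * sum(results) / len(results)

def gains(before, after):
    """Return (absolute gain in points, ratio of after to before)."""
    return after - before, after / before

# Invented toy run: 2 of 4 tasks solved on the first try.
toy_score = pass_at_1([True, False, False, True])

# Reported ClawEval pass@1 before and after fine-tuning.
abs_gain, ratio = gains(19.3, 37.7)
abs_gain = round(abs_gain, 1)  # 18.4 points
ratio = round(ratio, 2)        # 1.95, i.e. nearly double
```

In other words, the fine-tuned 8B model solves nearly twice as many tasks on the first attempt as its base version.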