Two of today's papers deal with agents under realistic conditions, one of them in actual production. MapAgent (arXiv:2606.04513, Baidu Maps) is the most concrete case: a Judge-Planner-Worker loop deployed across 360+ Chinese cities, with 95% automation measured on lane-level map generation. What's notable isn't the raw performance but the architecture, which explicitly separates visual perception, specification verification, and deterministic editing. SGDR follows the same logic on the web agent side, although its evidence comes from the WebArena benchmark rather than a deployment: it dynamically retrieves sub-procedures grounded in the current page state instead of drawing on a static skill library, reaching 37.5% success with GPT-4.1, 10.6 points above the strongest baseline. Both systems converge on the same principle: a pure generalist agent doesn't scale, and you need specialized roles with explicit state.
On the inference side, SparDA (arXiv:2606.04511, NVlabs) and Recover-LoRA (arXiv:2606.04238) attack inference cost from opposite ends. SparDA adds a fourth projection per layer (Forecast) that predicts which KV blocks the next layer will need, so CPU-to-GPU prefetching overlaps with the current layer's execution. The reported result on 8B models in long-context settings is a 1.7× decode speedup and up to 5.3× higher decode throughput. Recover-LoRA works on the weights instead: aggressive 2-bit quantization with a mixed W2/W4 strategy (W2 on the MLP gate/up layers, W4 elsewhere), followed by accuracy recovery via logit distillation on 10k synthetic samples. On Qwen3-4B, it recovers 80–95% accuracy with a 7.5–23.3% throughput gain. The two papers are complementary: SparDA optimizes attention over long contexts, while Recover-LoRA compresses weights and limits the resulting accuracy loss. Potentially stackable.
Curation-Bench (arXiv:2606.04261) is the most underrated signal of the day. The evaluation shows that generalist agents reach published baselines within ten iterations on training data curation tasks, but without scaffolding they stay stuck on local policy variants. With scaffolding that requires method citation and adaptation, an agent autonomously composes a data-selection policy that beats strong baselines with one-tenth of their data budget. This isn't a result about the quality of the data produced; it's a result about agents' ability to automate the ML pipeline itself. Worth tracking for teams still spending significant time on dataset preparation.
MapAgent is a multi-agent architecture for city-scale lane-level map generation. The system couples visual perception, specification verification, and deterministic editing via a Judge-Planner-Worker loop. Integrated into Baidu Maps for 360+ cities, it achieves 95% production automation.
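To make the role separation concrete, here is a minimal Python sketch of a Judge-Planner-Worker loop. It is an illustration under stated assumptions, not MapAgent's implementation: the lane-width rule, the `Edit` type, and the function names `judge`, `plan`, `work`, and `run_loop` are all hypothetical, and the perception step is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical specification rule, chosen only for illustration.
MIN_LANE_WIDTH_M = 2.5


@dataclass(frozen=True)
class Edit:
    """One explicit, reviewable change to the map state."""
    lane_id: str
    new_width_m: float


def judge(lanes: dict[str, float]) -> list[str]:
    """Verification role: return ids of lanes violating the width rule."""
    return sorted(lane for lane, width in lanes.items() if width < MIN_LANE_WIDTH_M)


def plan(violations: list[str]) -> list[Edit]:
    """Planning role: turn each violation into one concrete edit."""
    return [Edit(lane_id, MIN_LANE_WIDTH_M) for lane_id in violations]


def work(lanes: dict[str, float], edits: list[Edit]) -> dict[str, float]:
    """Editing role: apply edits deterministically on a copy of the state."""
    updated = dict(lanes)
    for edit in edits:
        updated[edit.lane_id] = edit.new_width_m
    return updated


def run_loop(lanes: dict[str, float], max_rounds: int = 3) -> tuple[dict[str, float], int]:
    """Iterate judge -> plan -> work until the judge passes or rounds run out.

    Returns the final state and the number of edit rounds that were applied.
    """
    for round_idx in range(max_rounds):
        violations = judge(lanes)
        if not violations:
            return lanes, round_idx
        lanes = work(lanes, plan(violations))
    return lanes, max_rounds
```

The point of the split is that only `work` mutates state, and it does so from explicit `Edit` records, so every change can be checked against what the judge flagged.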
SparDA introduces a decoupled sparse attention architecture for efficient long-context LLM inference. A fourth per-layer projection (Forecast) predicts KV blocks needed by the next layer, overlapping CPU-to-GPU prefetch with current execution. On 8B models, SparDA achieves 1.25× prefill speedup and 1.7× decode speedup, reaching up to 5.3× higher decode throughput.
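The forecasting idea can be sketched as a top-k selection over KV blocks followed by a check of what still has to be moved to the GPU. This is a toy illustration, not SparDA's actual scoring: the dot product against per-block summary vectors, and the names `forecast_blocks` and `blocks_to_prefetch`, are assumptions, and the asynchronous transfer that provides the overlap is not modeled.

```python
def forecast_blocks(forecast_vec, block_summaries, k):
    """Pick the k KV blocks with the highest forecast score.

    forecast_vec: output of a (hypothetical) Forecast projection.
    block_summaries: one summary vector per KV block (an assumption).
    Returns block indices in ascending order.
    """
    scores = [
        sum(f * s for f, s in zip(forecast_vec, summary))
        for summary in block_summaries
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def blocks_to_prefetch(predicted, resident_on_gpu):
    """Blocks predicted for the next layer that are not yet on the GPU."""
    return [block for block in predicted if block not in resident_on_gpu]
```

In a real system the list returned by `blocks_to_prefetch` would be handed to an asynchronous copy while the current layer computes; here it only shows which blocks the prediction would request.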
Recover-LoRA extends a data-free accuracy recovery method to 2-bit quantized LLMs. A mixed-precision strategy selectively quantizes MLP gate/up layers to W2 while keeping others at W4, achieving 7.5–23.3% throughput gains. Low-rank adapters trained via logit distillation on synthetic data recover 80–95% accuracy on Qwen3-4B using only 10k samples.
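Logit distillation is commonly written as a KL divergence between the full-precision (teacher) and quantized (student) output distributions. The sketch below shows that standard formulation in plain Python; whether Recover-LoRA uses exactly this loss, a temperature, or additional terms is not stated here, so treat the function names and the `temperature` parameter as assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    Assumes the student assigns nonzero probability to every token,
    which holds for moderate logit ranges.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the quantized model reproduces the teacher's distribution and grows as the two diverge, which is the signal the low-rank adapters would be trained to minimize.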
SGDR, an online skill learning method, enables web agents to reuse sub-procedures at each execution step. Unlike static approaches, SGDR dynamically retrieves skills based on the current webpage state and the task goal. On WebArena, it achieves 37.5% success with GPT-4.1 and 24.3% with Qwen3-4B, outperforming the strongest baselines by 10.6% and 10.0% respectively.
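The difference from a static skill library is that retrieval is re-run at every step against the current page. A minimal sketch, assuming a simple lexical-overlap (Jaccard) score in place of whatever retriever SGDR actually uses; the skill names, descriptions, and the function `retrieve_skills` are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve_skills(skills, page_state, goal, top_k=1):
    """Rank skills by Jaccard overlap with the current page state and goal.

    skills: mapping of skill name -> textual description.
    Returns the names of the top_k best-matching skills.
    """
    query = tokens(page_state) | tokens(goal)

    def score(item):
        _, description = item
        desc = tokens(description)
        union = query | desc
        return len(query & desc) / len(union) if union else 0.0

    ranked = sorted(skills.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

Calling `retrieve_skills` again after each action means the same task goal can surface a different sub-procedure once the page changes, for example a search skill on a listing page and a checkout skill on a cart page.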
Curation-Bench evaluates whether generalist AI agents can automate training data curation. Agents reach published baselines within ten iterations but tend toward local policy variants. With scaffolding requiring method citation and adaptation, an agent autonomously composes a data-selection policy outperforming strong baselines at one-tenth their data budget.