arXiv cs.CL·19 May 2026

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning


In three lines: SD-Search introduces on-policy hindsight self-distillation for search-augmented reasoning agents. A single model acts as both student (conditioned on inference-time context only) and teacher (conditioned on search outcomes from rollout groups). Step-level supervision is applied via Jensen-Shannon divergence at query positions and is integrated into GRPO training without external models or annotations.
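To make the step-level signal concrete, here is a minimal sketch of a Jensen-Shannon divergence computed only at query positions, between a student distribution and a teacher distribution over the same tokens. It is an illustration, not the paper's implementation: the function names, the `query_mask` argument, the plain-list logits, and the mean reduction over positions are all assumptions, and the summary does not say how this term is weighted against the GRPO objective.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def query_position_jsd(student_logits, teacher_logits, query_mask):
    """Mean JSD over positions flagged as search-query tokens.

    student_logits / teacher_logits: per-position lists of vocabulary logits
    (hypothetical inputs; the teacher would see hindsight search outcomes).
    query_mask: per-position booleans, True where the token is part of a query.
    """
    total, count = 0.0, 0
    for s, t, is_query in zip(student_logits, teacher_logits, query_mask):
        if not is_query:
            continue  # non-query positions receive no distillation signal
        total += js_divergence(softmax(s), softmax(t))
        count += 1
    return total / count if count else 0.0
```

In a training loop, a value like this would be added to the reinforcement learning loss with some coefficient; that coefficient and the exact reduction are not stated in the summary and would need to be taken from the paper.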


Reasoning · Reinforcement learning · AI Agents · RAG

Summary generated by Claude — human-verified
