SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Signal: 78
Hype: 15
In three lines
SD-Search introduces on-policy hindsight self-distillation for search-augmented reasoning agents. A single model acts as both student (seeing only the context available at inference time) and teacher (additionally conditioned on search outcomes from rollout groups). Step-level supervision comes from a Jensen-Shannon divergence computed at query positions and is integrated into GRPO training without external models or annotations.
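To make the step-level signal concrete, here is a minimal sketch of a Jensen-Shannon divergence between student and teacher next-token distributions, averaged over query positions only. It is an illustration under assumptions, not the paper's implementation: the function names, the per-position logit lists, and the boolean `query_mask` are all hypothetical, and the summary does not say how this term is weighted or combined with the GRPO objective.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def query_position_jsd(student_logits, teacher_logits, query_mask):
    """Average JSD over the positions flagged as search-query tokens.

    student_logits / teacher_logits: one logit list per token position
    (hypothetical layout). query_mask: True where the token belongs to
    a search query (assumed to be known from the rollout).
    """
    total, count = 0.0, 0
    for s, t, is_query in zip(student_logits, teacher_logits, query_mask):
        if not is_query:
            continue  # non-query positions receive no distillation signal
        total += jsd(softmax(s), softmax(t))
        count += 1
    return total / count if count else 0.0
```

In this sketch the teacher and student logits would come from the same model run on two different contexts, so the term vanishes wherever hindsight information does not change the model's prediction.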
Summary generated by Claude — human-verified