arXiv cs.AI

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Signal: 78
Hype: 15
In three lines: SD-Search introduces on-policy hindsight self-distillation for search-augmented reasoning agents. A single model acts as both student and teacher: the teacher, conditioned on past rollout outcomes, guides the student through a token-level Jensen-Shannon divergence computed at query positions. No external teacher model or additional annotations are needed.
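To make the loss term concrete, here is a minimal sketch of a token-level Jensen-Shannon divergence restricted to query positions. It is an illustration under assumptions, not the paper's implementation: the function names (`softmax`, `kl`, `jsd`, `masked_token_jsd`), the equal-weight mixture, and the mean reduction over masked positions are all hypothetical choices, since the summary does not specify them. The teacher logits are assumed to come from the same model run with rollout outcomes in its context.

```python
import math


def softmax(logits):
    """Turn one position's logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence with an equal-weight mixture (assumed)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def masked_token_jsd(student_logits, teacher_logits, query_mask):
    """Mean JSD over positions where query_mask is True (assumed reduction)."""
    total, count = 0.0, 0
    for s, t, is_query in zip(student_logits, teacher_logits, query_mask):
        if is_query:
            total += jsd(softmax(s), softmax(t))
            count += 1
    return total / count if count else 0.0


# Toy example: 3 token positions over a 3-word vocabulary.
# Only the last two positions are treated as search-query tokens.
student = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0], [1.0, 3.0, 0.5]]
teacher = [[2.0, 0.0, -1.0], [1.5, 0.0, 0.0], [0.5, 3.5, 0.0]]
mask = [False, True, True]
print(masked_token_jsd(student, teacher, mask))
```

The Jensen-Shannon divergence is symmetric and bounded above by ln 2, so a single position where student and teacher disagree sharply cannot dominate the average the way an unbounded KL term could.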
Reasoning · Reinforcement learning · RAG

Summary generated by Claude — human-verified