arXiv cs.CL

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

Signal: 78 · Hype: 15
In three lines: TOPD (Trajectory-aware On-Policy Distillation) improves LLM reasoning by using near-future trajectory information to identify truly divergent states. On AIME24/25, TOPD reaches 63.3%/53.3% versus 60.0%/46.7% for standard OPD. The paper also reports that 30% of high-loss tokens are false positives rather than truly divergent states.
Reasoning · Reinforcement learning · Papers

Summary generated by Claude — human-verified