Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
Signal: 78
Hype: 15
In three lines: TOPD (Trajectory-aware On-Policy Distillation) improves LLM reasoning by using near-future trajectory information to identify truly divergent states. On AIME24/25, TOPD reaches 63.3%/53.3% versus 60.0%/46.7% for standard OPD. It also finds that 30% of high-loss tokens are false positives rather than truly divergent states.
Summary generated by Claude — human-verified