On-policy distillation: one of the hottest terms on PapersWithCode [R]
Signal
72
Hype
35
In three linesOn-policy distillation (OPD) is a key post-training technique used by Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4. The method uses an auxiliary model to identify errors in trajectories and inject correction tokens, allowing the main model to learn without regenerating new rollouts.Read source
Your take?
Summary generated by Claude — human-verified