Back to feed
Reddit r/MachineLearning·

On-policy distillation: one of the hottest terms on PapersWithCode [R]

Signal
72
Hype
35
In three linesOn-policy distillation (OPD) is a key post-training technique used by Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4. The method uses an auxiliary model to identify errors in trajectories and inject correction tokens, allowing the main model to learn without regenerating new rollouts.
Read source
Your take?
Fine-tuningReinforcement learningQwenDeepSeekReasoning

Summary generated by Claude — human-verified