Reddit r/MachineLearning·4 June 2026

On-policy distillation: one of the hottest terms on PapersWithCode [R]

In three lines: On-policy distillation (OPD) is a key post-training technique used by Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4. The method uses an auxiliary model to identify errors in trajectories and inject correction tokens, so the main model can learn from existing rollouts instead of regenerating them.
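
To make the data flow concrete, here is a minimal toy sketch of the idea as summarized above: an auxiliary checker scans an existing trajectory, flags the first token it rejects, and a correction token is spliced in so that only that position becomes a training target. The summary does not say how errors are detected, how many tokens are injected, or which loss is used, so everything here (`first_error`, `inject_correction`, `toy_aux`, the single-token correction, the boolean mask) is a hypothetical illustration, not the method used by any of the models named.

```python
from typing import Callable, List, Optional, Tuple

Token = str
# Hypothetical auxiliary-model interface: given the prefix and the next token,
# return a replacement token if the token looks wrong, otherwise None.
AuxCheck = Callable[[List[Token], Token], Optional[Token]]


def first_error(trajectory: List[Token], aux_check: AuxCheck) -> Optional[Tuple[int, Token]]:
    """Return (index, suggested_token) for the first token the checker rejects."""
    for i, tok in enumerate(trajectory):
        suggestion = aux_check(trajectory[:i], tok)
        if suggestion is not None:
            return i, suggestion
    return None


def inject_correction(trajectory: List[Token], aux_check: AuxCheck) -> Tuple[List[Token], List[bool]]:
    """Reuse an existing rollout instead of sampling a new one.

    Keeps the prefix the main model already produced, swaps in the correction
    token at the first flagged position, and returns a mask marking which
    positions would act as training targets (assumption: only the correction).
    """
    hit = first_error(trajectory, aux_check)
    if hit is None:
        # Nothing flagged: no correction, no training targets.
        return list(trajectory), [False] * len(trajectory)
    i, fix = hit
    corrected = trajectory[:i] + [fix]
    mask = [False] * i + [True]
    return corrected, mask


def toy_aux(prefix: List[Token], tok: Token) -> Optional[Token]:
    """Toy stand-in for an auxiliary model: checks 'a + b = c' arithmetic."""
    if len(prefix) == 4 and prefix[3] == "=":
        expected = str(int(prefix[0]) + int(prefix[2]))
        return expected if tok != expected else None
    return None


rollout = ["2", "+", "2", "=", "5"]
corrected, mask = inject_correction(rollout, toy_aux)
print(corrected)  # ['2', '+', '2', '=', '4']
print(mask)       # [False, False, False, False, True]
```

In a real pipeline the checker would presumably be a learned model scoring tokens rather than a hand-written rule, and the mask would feed a token-level loss; the summary leaves both unspecified.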

Fine-tuning · Reinforcement learning · Qwen · DeepSeek · Reasoning

Summary generated by Claude — human-verified
