arXiv cs.AI · 19 May 2026

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

In three lines

A unified study of LLM distillation that places SFT, DAgger, offline RL, and OPD along two orthogonal axes: prefix source and token-level KL direction. The authors propose KL mixing and an entropy-gated length curriculum, improving Pass@k by 5.8 points and reducing average response length by 3x on math reasoning.
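
To make the "KL direction" axis concrete, here is a minimal sketch of a token-level loss that blends forward KL (teacher to student) with reverse KL (student to teacher). The summary does not give the paper's exact formulation, so the convex combination, the weight `alpha`, and the function names are all assumptions for illustration. The "prefix source" axis sits outside this function: it only decides which sequences the two sets of logits were computed on.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kl(p, q, eps=1e-12):
    # KL(p || q) at each token position; p and q have shape (T, V).
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mixed_kl_loss(teacher_logits, student_logits, alpha=0.5):
    # Hypothetical mixing rule (an assumption, not the paper's definition):
    #   alpha * KL(teacher || student) + (1 - alpha) * KL(student || teacher),
    # averaged over token positions.
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    forward = token_kl(p_teacher, p_student)   # mass-covering direction
    reverse = token_kl(p_student, p_teacher)   # mode-seeking direction
    return float(np.mean(alpha * forward + (1.0 - alpha) * reverse))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))  # 4 token positions, vocabulary of 8
    student = rng.normal(size=(4, 8))
    print(mixed_kl_loss(teacher, student, alpha=0.5))
```

With `alpha=1.0` the loss reduces to pure forward KL and with `alpha=0.0` to pure reverse KL, so the two directions are the endpoints of the same knob.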

Fine-tuning · Reinforcement learning · Reasoning · Papers

Summary generated by Claude — human-verified
