arXiv cs.LG · 4 June 2026

Self-Distilled Policy Gradient


In three lines: SDPG (Self-Distilled Policy Gradient) combines policy self-distillation with group-relative verifier advantages and KL regularization. The method uses a full-vocabulary reverse KL divergence to supervise language model generations. Code is available on GitHub.
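
The summary names two standard building blocks, and a small sketch may help make them concrete. The plain-Python code below assumes the usual textbook definitions: group-relative advantages as verifier rewards standardized within one group of sampled completions, and reverse KL as KL(student || teacher) summed over every vocabulary entry at a single token position. The function names are hypothetical, and the paper's actual loss, weighting, and teacher construction may differ.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize verifier rewards within one group of completions.

    Assumption: mean/std normalization inside the group; the paper may
    use a different baseline or scaling.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the full vocabulary at one position."""
    log_p = log_softmax(student_logits)  # student (policy being trained)
    log_q = log_softmax(teacher_logits)  # teacher (assumed fixed here)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Four completions for one prompt, scored 1 (pass) or 0 (fail) by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Toy three-token vocabulary: the student is sharper than the teacher.
divergence = reverse_kl([2.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

In this sketch, completions that pass the verifier get positive advantages and failing ones get negative advantages, while the reverse KL term is zero only when the student and teacher distributions match.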


Reinforcement learning · Reasoning · Papers

Summary generated by Claude — human-verified
