Self-Distilled Policy Gradient
Signal: 72
Hype: 18
In three lines: SDPG combines policy self-distillation with group-relative verifier advantages and KL regularization. The method uses full-vocabulary reverse KL divergence to supervise language model generations. Code available on GitHub.
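As a rough illustration of two ingredients named in the summary, the Python sketch below computes a full-vocabulary reverse KL divergence and group-relative advantages from verifier rewards. It is not taken from the SDPG code. The function names, the reading of "reverse KL" as KL(student || teacher), and the mean/std normalization within a group are all assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_full_vocab(student_logits, teacher_logits):
    # Assumed convention: reverse KL = KL(student || teacher), summed over
    # every vocabulary entry rather than only the sampled token.
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def group_relative_advantages(rewards, eps=1e-8):
    # Assumed normalization: standardize verifier rewards within one group
    # of samples drawn for the same prompt.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical inputs: 2 token positions, vocabulary of size 3.
student = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
teacher = np.array([[0.0, 1.0, 2.0], [3.0, 0.0, -3.0]])
per_token_kl = reverse_kl_full_vocab(student, teacher)

# Hypothetical verifier rewards for a group of 4 samples.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

How these two quantities are weighted and combined into a single objective is not stated in the summary, so the sketch leaves them separate.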
Summary generated by Claude — human-verified