arXiv cs.LG·29 May 2026

OISD: On-Policy Internal Self-Distillation of Language Models

In three lines: OISD introduces on-policy internal self-distillation to improve language model reasoning. The final layer acts as a detached teacher for the intermediate layers, using logit alignment (for reasoning behaviors) and attention alignment (for attention patterns), and needs no external privileged information. The authors report positive results across four mathematical reasoning tasks.
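As a rough illustration of the idea, the sketch below computes a logit-alignment term and an attention-alignment term between a final layer (teacher) and one intermediate layer (student). Everything specific here is an assumption rather than the paper's method: the summary does not say which loss is used, so KL divergence stands in for both terms. The function names, the `logit_weight` and `attn_weight` parameters, and the idea that intermediate logits come from applying the output head to an intermediate hidden state are likewise hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def logit_alignment_loss(teacher_logits, student_logits):
    """Assumed form: KL between final-layer and intermediate-layer token distributions.

    The teacher values are plain constants here; in an autodiff framework
    this is where a stop-gradient (detach) would be applied.
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


def attention_alignment_loss(teacher_attn, student_attn):
    """Assumed form: mean row-wise KL between attention maps.

    Each argument is a list of rows; each row is an attention
    distribution over key positions and sums to 1.
    """
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)


def internal_distillation_loss(
    final_logits, inter_logits, final_attn, inter_attn,
    logit_weight=1.0, attn_weight=1.0,
):
    """Weighted sum of the two alignment terms (weights are hypothetical)."""
    return (
        logit_weight * logit_alignment_loss(final_logits, inter_logits)
        + attn_weight * attention_alignment_loss(final_attn, inter_attn)
    )


# Toy example: one token position, vocabulary of 3, two query positions.
final_logits = [2.0, 0.5, -1.0]
inter_logits = [0.0, 0.0, 0.0]
final_attn = [[1.0, 0.0], [0.3, 0.7]]
inter_attn = [[1.0, 0.0], [0.5, 0.5]]

loss = internal_distillation_loss(final_logits, inter_logits, final_attn, inter_attn)
```

In a real training loop the loss would be computed on sequences sampled from the model itself (the on-policy part) and backpropagated only through the intermediate-layer side; the toy values above only exercise the arithmetic.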

Summary generated by Claude — human-verified
