arXiv cs.LG

OISD: On-Policy Internal Self-Distillation of Language Models

Signal: 78
Hype: 15
In three lines: OISD introduces on-policy internal self-distillation to improve language model reasoning. The final layer acts as a detached teacher for the intermediate layers through two objectives: logit alignment, which transfers reasoning behaviors, and attention alignment, which transfers attention patterns. No external privileged information is required. The paper reports positive results across four mathematical reasoning tasks.
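
As a rough illustration of the two alignment terms described above, here is a minimal pure-Python sketch. It is not the paper's implementation. The summary does not state the loss functions, so KL divergence is an assumption for both terms. The function names are hypothetical, and the idea that intermediate-layer logits come from applying the output head to intermediate hidden states is also an assumption. "Detached" is modeled by treating the teacher distribution as a constant, so the gradient is taken with respect to the student logits only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def logit_alignment(teacher_logits, student_logits):
    """Hypothetical logit-alignment term (assumed to be KL).

    teacher_logits: final-layer logits, treated as a constant (detached).
    student_logits: intermediate-layer logits (assumed to be available).
    Returns the loss and its gradient with respect to the student logits.
    """
    p = softmax(teacher_logits)  # teacher distribution, no gradient
    q = softmax(student_logits)  # student distribution
    loss = kl_divergence(p, q)
    # d KL(p || softmax(z)) / dz_j = q_j - p_j when p is held constant.
    grad = [qi - pi for pi, qi in zip(p, q)]
    return loss, grad


def attention_alignment(teacher_attn, student_attn):
    """Hypothetical attention-alignment term (assumed to be mean row-wise KL).

    Each argument is a list of attention rows; every row sums to 1.
    """
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)


if __name__ == "__main__":
    loss, grad = logit_alignment([2.0, 0.5, -1.0], [1.0, 1.0, 0.0])
    attn = attention_alignment(
        [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
        [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]],
    )
    print("logit loss:", loss, "grad:", grad, "attention loss:", attn)
```

Both terms are zero when the intermediate layer already matches the final layer and grow as they diverge. In a real training loop an autodiff framework's stop-gradient would play the role of the constant teacher.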
Reinforcement learning, Reasoning, Papers

Summary generated by Claude — human-verified