arXiv cs.LG · 22 May 2026

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

In three lines: The authors show that, in self-distillation for reasoning, the reliability of teacher tokens depends on their position within the trajectory rather than on local entropy. They propose Position-Weighted OPSD (PW-OPSD), which applies increasing position weights to token-level supervision. On Qwen3-4B, AIME 2024 and AIME 2025 scores improve by +1.0 and +1.1 points, respectively; validation on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think confirms the gains.
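To make the idea of position-weighted token supervision concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the linear weight schedule, the `w_min`/`w_max` parameters, the forward KL direction, and all function names are assumptions chosen for illustration, and it assumes the weights grow with token position in the trajectory.

```python
import math


def position_weights(length, w_min=0.1, w_max=1.0):
    """Hypothetical schedule: weights rise linearly from w_min to w_max."""
    if length <= 0:
        return []
    if length == 1:
        return [w_max]
    step = (w_max - w_min) / (length - 1)
    return [w_min + step * t for t in range(length)]


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the vocabulary for a single token.

    The KL direction is an assumption; the summary does not specify it.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def position_weighted_loss(teacher_seq, student_seq, w_min=0.1, w_max=1.0):
    """Weighted mean of per-token KL, with later tokens counting more."""
    weights = position_weights(len(teacher_seq), w_min, w_max)
    if not weights:
        return 0.0
    kls = [token_kl(p, q) for p, q in zip(teacher_seq, student_seq)]
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)


if __name__ == "__main__":
    # Toy two-token trajectory over a two-word vocabulary.
    teacher = [[0.5, 0.5], [0.5, 0.5]]
    early_mismatch = [[0.9, 0.1], [0.5, 0.5]]
    late_mismatch = [[0.5, 0.5], [0.9, 0.1]]
    print(position_weighted_loss(teacher, early_mismatch))
    print(position_weighted_loss(teacher, late_mismatch))
```

Under this assumed schedule, the same disagreement with the teacher is penalized more when it occurs late in the trajectory than when it occurs early, which is the qualitative behavior the summary describes.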

Reasoning Fine-tuning Benchmarks

Summary generated by Claude — human-verified
