arXiv cs.LG

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Signal: 72
Hype: 18
In three lines
SC-SDPO improves LLM self-distillation by weighting losses with √[p(1-p)], where p is the pass rate, creating an implicit curriculum. Experiments on Qwen3-8B (+3.2/+4.3 mean@16/maj@16) and OLMo-3-7B (+1.8/+3.0) show stable gains with zero computational overhead.
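
To make the weighting concrete, here is a minimal sketch of the √[p(1-p)] factor on its own. It assumes p is an empirical pass rate in [0, 1], for example the fraction of sampled answers to a prompt that are correct. The function name `pass_rate_weight` is hypothetical and is not taken from the paper, and how the paper applies the weight inside its loss is not shown here.

```python
import math


def pass_rate_weight(p: float) -> float:
    """Return sqrt(p * (1 - p)) for a pass rate p in [0, 1].

    Hypothetical helper for illustration only; the paper's actual
    implementation may differ.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    # Print the weight across a range of pass rates.
    for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
        print(f"p={p:.2f}  weight={pass_rate_weight(p):.3f}")
```

The weight is zero when every attempt fails (p = 0) or every attempt succeeds (p = 1), and it reaches its maximum of 0.5 at p = 0.5. Prompts of intermediate difficulty therefore contribute most, which is one way to read the "sweet spot" and the implicit curriculum described above.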
Reasoning · Reinforcement learning · Papers

Summary generated by Claude — human-verified