arXiv cs.LG · 28 May 2026

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

In three lines: SC-SDPO improves LLM self-distillation by weighting losses with √[p(1-p)], where p is the pass rate, creating an implicit curriculum. Experiments show stable gains on Qwen3-8B (+3.2 mean@16, +4.3 maj@16) and OLMo-3-7B (+1.8 mean@16, +3.0 maj@16) with zero computational overhead.
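
To make the weighting concrete, here is a minimal Python sketch of the √[p(1-p)] factor. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how p is estimated, so the sketch assumes p is the fraction of correct answers among sampled rollouts for a prompt, and the function name pass_rate_weight is hypothetical.

```python
import math


def pass_rate_weight(num_correct: int, num_samples: int) -> float:
    """Return sqrt(p * (1 - p)) for an empirical pass rate p.

    Assumption (not stated in the summary): p is estimated as the
    fraction of correct rollouts sampled for a single prompt.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    p = num_correct / num_samples
    return math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    # Print the weight for a few pass rates out of 16 rollouts.
    for correct in (0, 2, 8, 14, 16):
        print(correct, round(pass_rate_weight(correct, 16), 4))
```

The factor is zero when a prompt is always solved or never solved and reaches its maximum of 0.5 at p = 0.5, which is one way to read the "implicit curriculum": prompts of intermediate difficulty contribute most to the loss.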

Reasoning Reinforcement learning Papers

Summary generated by Claude — human-verified
