Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Signal: 82 · Hype: 18
In three lines
Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the GRPO training of a stronger learner (Mathstral-7B) improves performance on MATH-500 (+1.62pp) and on AIME 2025/2026 (+14.2pp at pass@1024). Intentionally mismatching drafts with problems is critical: that setup reaches 71.98% on MATH-500, reported as the highest published result for this model.
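To make the "mismatch" idea concrete, here is a minimal sketch of one way to pair each problem with a draft written for a different problem before building the training prompt. This is an illustration under stated assumptions, not the paper's pipeline: the function names `mismatch_drafts` and `build_prompt`, the cyclic-shift pairing, and the prompt wording are all hypothetical, and the summary does not say how the authors select or format drafts.

```python
import random


def mismatch_drafts(problems, drafts, seed=0):
    """Pair every problem with a draft that was written for a different problem.

    Hypothetical helper: assumes drafts[i] is the weak model's draft for
    problems[i]. A cyclic shift over a shuffled order guarantees that no
    problem receives its own draft and that every draft is used exactly once.
    """
    if len(problems) != len(drafts):
        raise ValueError("problems and drafts must have the same length")
    n = len(problems)
    if n < 2:
        raise ValueError("need at least two problems to create a mismatch")

    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)

    # The problem at shuffled position pos takes the draft of the problem at
    # position pos + 1 (wrapping around), so indices never coincide.
    assignment = [0] * n
    for pos, idx in enumerate(order):
        assignment[idx] = order[(pos + 1) % n]

    return [(problems[i], drafts[assignment[i]]) for i in range(n)]


def build_prompt(problem, draft):
    """Assemble a training prompt; the template wording is an assumption."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Draft from a weaker model (may be wrong or unrelated):\n{draft}\n\n"
        "Solve the problem."
    )


if __name__ == "__main__":
    problems = ["p0", "p1", "p2", "p3"]
    drafts = ["d0", "d1", "d2", "d3"]
    for problem, draft in mismatch_drafts(problems, drafts, seed=1):
        print(build_prompt(problem, draft))
        print("---")
```

The resulting prompts would then be fed to whatever GRPO trainer is in use; the reward would still be computed against the answer to the problem in the prompt, not against the draft.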
Summary generated by Claude — human-verified