arXiv cs.AI·19 May 2026

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

In three lines: Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the GRPO training of a stronger learner (Mathstral-7B) improves performance on MATH-500 (+1.62pp) and on AIME 2025/2026 (+14.2pp at pass@1024). Intentionally mismatching problems and drafts is critical: the method reaches 71.98% on MATH-500, the highest published result for this model.
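
For readers unfamiliar with the pass@1024 figure: pass@k is the probability that at least one of k sampled solutions is correct. The sketch below shows the standard unbiased pass@k estimator computed from n samples per problem. It is an illustration only; the summary does not say how the paper computed pass@1024, and the function name `pass_at_k` and its parameters are assumptions made here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total number of sampled solutions
    c: number of those samples that are correct
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

At large k a single correct sample among many already contributes noticeably: with n = 2048 samples and c = 1 correct, `pass_at_k(2048, 1, 1024)` evaluates to 0.5. This is why pass@1024 is read as a measure of whether the model can reach a solution at all, rather than of how reliably it does so.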


Reinforcement learning · Reasoning · Benchmarks · Code generation

Summary generated by Claude — human-verified
