arXiv cs.CL·19 May 2026

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

In three lines

Placing mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) in the GRPO context of a stronger learner (Mathstral-7B) outperforms standard on-policy GRPO when the drafts are mismatched, that is, written for a different problem than the one being solved. On MATH-500, the mismatched-wrong variant reaches 71.98%, described as the highest published result for this model and 1.62 pp above the matched-wrong variant. The method uses neither SFT nor a reward model.
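To make the "mismatched wrong draft" idea concrete, here is a minimal Python sketch of how such a draft might be selected and placed in the learner's context before GRPO sampling. It is an illustration under stated assumptions, not the paper's implementation: the `Draft` record, the function names, and the prompt template are all hypothetical, and the summary does not say how drafts are actually formatted or sampled.

```python
import random
from dataclasses import dataclass


@dataclass
class Draft:
    problem_id: str   # problem the weak model was attempting (hypothetical field)
    text: str         # the weak model's draft solution
    is_correct: bool  # whether the draft's final answer matched the reference


def pick_mismatched_wrong_draft(drafts, current_problem_id, rng):
    """Return a wrong draft written for a *different* problem, or None if none exists."""
    pool = [
        d for d in drafts
        if not d.is_correct and d.problem_id != current_problem_id
    ]
    return rng.choice(pool) if pool else None


def build_context(problem_text, draft):
    """Prepend the draft to the learner's prompt. The template is an assumption."""
    if draft is None:
        # Fall back to a plain prompt when no suitable draft is available.
        return f"Problem: {problem_text}\nSolution:"
    return (
        f"Draft attempt:\n{draft.text}\n\n"
        f"Problem: {problem_text}\nSolution:"
    )


if __name__ == "__main__":
    drafts = [
        Draft("p1", "x = 3", False),
        Draft("p2", "y = 5", False),
        Draft("p2", "y = 4", True),
    ]
    rng = random.Random(0)
    chosen = pick_mismatched_wrong_draft(drafts, "p1", rng)
    print(build_context("Solve x + 1 = 4.", chosen))
```

In this sketch a matched-wrong baseline would differ only in the filter, keeping drafts whose `problem_id` equals the current problem instead of excluding them.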

Reinforcement learning · Reasoning · Benchmarks · Code generation

Summary generated by Claude — human-verified
