arXiv cs.CL

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Signal: 82
Hype: 15
In three lines

During GRPO training of a stronger learner (Mathstral-7B), the method injects mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) into the learner's context, using drafts written for a different problem than the one being solved. This mismatched-wrong setup outperforms standard on-policy GRPO. On MATH-500, the mismatched-wrong variant reaches 71.98%, reported as the highest published result for this model and 1.62 pp above the matched-wrong variant, without SFT or reward models.
Reinforcement learning · Reasoning · Benchmarks · Code generation

Summary generated by Claude — human-verified