arXiv cs.AI

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

Signal: 78 · Hype: 25
In three lines: Researchers demonstrate that reinforcement learning can encode reusable solvers into LLM weights rather than solving each instance at inference time. After fine-tuning Qwen2.5-Coder-14B with GRPO on Synergistic Dependency Selection, the model converges to a Simulated Annealing solver that reaches a 5.0% gap to the optimal solver and is 91× cheaper than a Best-of-64 baseline.
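For readers unfamiliar with the kind of solver the summary says the model converges to, here is a minimal Simulated Annealing sketch in Python. It is not the paper's code or the model's output. The objective (item values plus pairwise synergy bonuses over a fixed-size subset), the function names, and all parameters are hypothetical stand-ins chosen for illustration, since the summary does not define Synergistic Dependency Selection.

```python
import math
import random


def score(selected, value, synergy):
    """Hypothetical objective: item values plus pairwise synergy bonuses.

    synergy[a][b] is read only for a < b (upper triangle).
    """
    items = sorted(selected)
    total = sum(value[i] for i in items)
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            total += synergy[items[a]][items[b]]
    return total


def simulated_annealing(value, synergy, k, steps=5000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Pick k of n items (k < n), maximizing score() by annealed swaps."""
    rng = random.Random(seed)
    n = len(value)
    current = set(rng.sample(range(n), k))
    current_score = score(current, value, synergy)
    best, best_score = set(current), current_score

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))

        # Neighbor move: swap one selected item for one unselected item.
        out_item = rng.choice(sorted(current))
        in_item = rng.choice([i for i in range(n) if i not in current])
        candidate = (current - {out_item}) | {in_item}
        cand_score = score(candidate, value, synergy)

        # Always accept improvements; accept worse moves with
        # probability exp(delta / t), which shrinks as t cools.
        delta = cand_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score

    return best, best_score


if __name__ == "__main__":
    # Toy instance with made-up numbers, purely for demonstration.
    gen = random.Random(1)
    n = 12
    value = [gen.random() for _ in range(n)]
    synergy = [[gen.random() for _ in range(n)] for _ in range(n)]
    print(simulated_annealing(value, synergy, k=4))
```

The point of the sketch is the cost profile: once such a routine exists, solving a new instance costs one run of ordinary code rather than many sampled model generations, which is the contrast the summary draws with a Best-of-64 baseline.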
Reinforcement learning · Code generation · Qwen · Reasoning · Benchmarks

Summary generated by Claude — human-verified