Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers
Researchers demonstrate that reinforcement learning can encode reusable solvers into LLM weights, rather than solving each instance at inference time. When Qwen2.5-Coder-14B is fine-tuned with GRPO on Synergistic Dependency Selection, the model converges on Simulated Annealing, reaching a 5.0% gap to the optimal solver while being 91× cheaper than the Best-of-64 baseline.
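To give a feel for what a Simulated Annealing solver looks like, here is a minimal Python sketch. It is not the model's actual output, and the summary above does not define Synergistic Dependency Selection. The objective used here is a hypothetical stand-in: choose exactly `k` items to maximize item values plus pairwise bonuses. The function names, the swap move, and the geometric cooling schedule are all illustrative assumptions.

```python
import math
import random
from itertools import combinations


def score(selected, value, synergy):
    """Hypothetical toy objective: item values plus a bonus per selected pair.

    `synergy` maps (a, b) with a < b to a bonus; missing pairs count as 0.
    """
    total = sum(value[i] for i in selected)
    total += sum(synergy.get(pair, 0.0) for pair in combinations(sorted(selected), 2))
    return total


def simulated_annealing(value, synergy, k, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a size-k subset with a high toy score (assumed setup, not the paper's)."""
    n = len(value)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(value)")

    rng = random.Random(seed)
    current = set(rng.sample(range(n), k))
    current_score = score(current, value, synergy)
    best, best_score = set(current), current_score

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))

        # Neighbour move: swap one selected item for one unselected item.
        out_item = rng.choice(sorted(current))
        in_item = rng.choice(sorted(set(range(n)) - current))
        candidate = (current - {out_item}) | {in_item}
        candidate_score = score(candidate, value, synergy)

        # Accept improvements always; accept worse moves with Metropolis probability.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = set(current), current_score

    return best, best_score
```

Because the same function runs unchanged on any `(value, synergy, k)` instance, it illustrates the sense in which a solver is reusable: the search procedure is written once, and each new instance costs only the execution of that procedure rather than a fresh round of sampling.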