arXiv cs.CL

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Signal: 72
Hype: 18
In three lines
An RL-based method for adaptive sampling control at test time in LLMs. A lightweight controller trained with RL decides dynamically whether to stop or continue sampling, balancing answer correctness, latency, and compute cost. The stopping problem is formulated as an MDP with a Lagrangian interpretation, and the method outperforms ASC and ESC on these trade-offs.
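The stop-or-continue loop described in the summary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the agreement-threshold rule is a hand-written stand-in for the learned RL controller, the reward is one plausible reading of a Lagrangian trade-off (correctness minus a multiplier times the number of samples drawn), and the names `adaptive_sample`, `agreement_controller`, and `lagrangian_reward` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def lagrangian_reward(correct: bool, num_samples: int, lam: float) -> float:
    """Assumed reward shape: 1 for a correct answer, minus lam per sample drawn."""
    return float(correct) - lam * num_samples


def agreement_controller(
    answers: List[str], threshold: float = 0.6, min_samples: int = 3
) -> bool:
    """Hypothetical stand-in for the learned controller.

    Returns True (stop) once at least min_samples answers exist and the
    most common answer's share of all samples reaches threshold.
    """
    if len(answers) < min_samples:
        return False
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) >= threshold


def adaptive_sample(
    sample_fn: Callable[[], str],
    should_stop: Callable[[List[str]], bool],
    max_samples: int = 16,
) -> Tuple[str, int]:
    """Draw answers one at a time until the controller says stop or the budget runs out.

    Returns the majority-vote answer and the number of samples used.
    """
    answers: List[str] = []
    while len(answers) < max_samples:
        answers.append(sample_fn())
        if should_stop(answers):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, len(answers)


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_model() -> str:
        # Toy "LLM": answers "42" about 70% of the time, "41" otherwise.
        return "42" if rng.random() < 0.7 else "41"

    answer, used = adaptive_sample(toy_model, agreement_controller)
    print(answer, used, lagrangian_reward(answer == "42", used, lam=0.02))
```

In this sketch a larger `lam` penalizes long sampling runs more heavily. An RL-trained controller would replace `agreement_controller` with a policy learned to maximize such a reward.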
Reinforcement learning, Reasoning, Evals

Summary generated by Claude — human-verified