arXiv cs.CL · 3 June 2026

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

In three lines: An RL-based method for adaptive sampling control at test time in LLMs. A lightweight controller trained with RL decides dynamically whether to stop or continue sampling, balancing answer correctness, latency, and computation cost. The problem is formulated as an MDP with a Lagrangian interpretation. The method is reported to outperform ASC and ESC on these trade-offs.
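
To make the stop-or-continue loop concrete, here is a minimal sketch and not the paper's implementation. It assumes the final answer is chosen by majority vote, and it uses a hand-written vote-share threshold as a stand-in for the trained RL controller. The names `adaptive_sample`, `threshold_policy`, and `lagrangian_reward`, and the penalty weight `lam`, are hypothetical. The reward shows the usual Lagrangian relaxation of "maximize correctness subject to a sampling budget" into "correctness minus a weighted cost"; the exact reward in the paper may differ.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and its share of the votes."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def threshold_policy(answers: List[str], min_samples: int = 3,
                     share: float = 0.8) -> bool:
    """Hypothetical stand-in for the learned controller.

    Stops once at least `min_samples` answers exist and the leading
    answer holds at least `share` of the votes.
    """
    if len(answers) < min_samples:
        return False
    return majority(answers)[1] >= share


def adaptive_sample(draw: Callable[[], str],
                    should_stop: Callable[[List[str]], bool],
                    max_samples: int = 16) -> Tuple[str, int]:
    """Draw answers one at a time until the policy says stop or the cap is hit."""
    answers: List[str] = []
    while len(answers) < max_samples:
        answers.append(draw())      # one more LLM sample (the "continue" action)
        if should_stop(answers):    # the "stop" action
            break
    return majority(answers)[0], len(answers)


def lagrangian_reward(correct: bool, n_samples: int, lam: float = 0.05) -> float:
    """Assumed reward shape: correctness minus lam times the sampling cost."""
    return float(correct) - lam * n_samples
```

For example, with a fake sampler that replays `["42", "42", "42", "7"]`, `adaptive_sample(iter(...).__next__, threshold_policy)` stops after three agreeing answers instead of spending the full budget. A trained controller would replace `threshold_policy` with a policy learned from the reward.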

Reinforcement learning, Reasoning, Evals

Summary generated by Claude — human-verified
