Reddit r/LocalLLaMA

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Signal: 72 · Hype: 28
In three lines: Vector Policy Optimization (VPO) is an RL algorithm that trains language models to produce diverse solutions by anticipating multiple vector-valued reward functions. VPO replaces the GRPO advantage estimator and matches or beats scalar RL baselines across four tasks, with the gap widening as the search budget grows.
Reinforcement learning · Reasoning · Code generation · Evals

Summary generated by Claude — human-verified