Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Signal: 72
Hype: 28
In three lines
Vector Policy Optimization (VPO) is an RL algorithm that trains language models to produce diverse solutions by anticipating multiple vector-valued reward functions. VPO replaces the GRPO advantage estimator and matches or beats scalar RL baselines across four tasks, with the gap widening as the test-time search budget grows.
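For readers unfamiliar with the component being replaced, here is a minimal sketch of the standard GRPO group-relative advantage: each sampled completion's scalar reward is standardized against the other samples in its group. The summary does not describe VPO's own estimator, so the second function is a purely hypothetical illustration of what a vector-valued reward could look like if the same standardization were applied per dimension. It is not the paper's method. The function names, the `eps` parameter, and the use of the population standard deviation are all assumptions made for this sketch.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize scalar rewards within one group of sampled completions."""
    mean = fmean(rewards)
    # Assumption: population std; some implementations use the sample std.
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_component_advantages(reward_vectors, eps=1e-8):
    """HYPOTHETICAL: standardize each reward dimension independently.

    Illustrative only; this is NOT VPO's estimator, which the summary
    does not specify.
    """
    # Transpose so each column holds one reward dimension across the group.
    columns = list(zip(*reward_vectors))
    normalized = [grpo_advantages(list(col), eps) for col in columns]
    # Transpose back: one advantage vector per sampled completion.
    return [list(row) for row in zip(*normalized)]


if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
    print(per_component_advantages([[1.0, 0.0], [0.0, 1.0]]))
```

In the scalar case, a group whose samples all earn the same reward yields zero advantage for every sample, which is one intuition for why a single scalar signal can provide little pressure toward diverse solutions.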
Summary generated by Claude — human-verified