Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
Signal: 82
Hype: 15
In three lines: CVT-RL, a policy-gradient algorithm with dense verifiable rewards, improves reinforcement learning for long-horizon language agents. On QA, ALFWorld, ScienceWorld, and web/tool tasks, task success rises from 71.8% (non-causal RL) to 78.9%, evidence F1 from 78.9 to 82.8, and measured hacking falls from 7.2% to 3.9%. The reported statistical tests yield p<0.01 after Holm correction.
Summary generated by Claude — human-verified