arXiv cs.LG · 5 June 2026

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

In three lines: CVT-RL, a policy-gradient algorithm with dense verifiable rewards, improves reinforcement learning for long-horizon language agents. Across QA, ALFWorld, ScienceWorld, and web/tool tasks, task success rises from 71.8% with a non-causal RL baseline to 78.9%, evidence F1 rises from 78.9 to 82.8, and the measured hacking rate falls from 7.2% to 3.9%. The differences are statistically significant at p < 0.01 after Holm correction.

Reinforcement learning · AI Agents · Reasoning · Evals · AI safety

Summary generated by Claude — human-verified
