Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
New credit assignment method for reinforcement learning with LLMs. IBPO (Implicit Behavior Policy Optimization) uses counterfactual trajectories to convert sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.