Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
Signal: 72
Hype: 08
In three lines
Theoretical paper on stabilizing off-policy temporal-difference learning with function approximation. Proposes BA-TDC and BA-TDRC, which replace TDC's auxiliary matrix with the behavior Bellman matrix. Linear analysis with a convergence proof under a Hurwitz stability condition; experiments on Markov chains and classical counterexamples.
Summary generated by Claude — human-verified