arXiv cs.AI

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Signal: 72 · Hype: 08
In three lines
A theoretical paper on stabilizing off-policy temporal-difference (TD) learning with function approximation. It proposes BA-TDC and BA-TDRC, which replace the auxiliary matrix in TDC with the behavior Bellman matrix. The analysis covers the linear case, with a convergence proof under a Hurwitz stability condition; experiments use Markov chains and classical counterexamples.
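The Hurwitz condition can be illustrated with the standard linear off-policy TD(0) and TDC mean dynamics. This is a minimal sketch, not the paper's BA-TDC or BA-TDRC: the behavior Bellman matrix is not reproduced here. It assumes importance-weighted updates, so the expected TD(0) matrix is A = Φᵀ D_μ (γ P_π − I) Φ, and TDC's auxiliary matrix is the feature covariance C = Φᵀ D_μ Φ. Under the usual two-timescale argument, TDC's slow dynamics reduce to the matrix −Aᵀ C⁻¹ A. The two-state example and all helper names (`expected_td_matrix`, `tdc_slow_matrix`) are hypothetical, chosen for illustration.

```python
import numpy as np


def is_hurwitz(m: np.ndarray) -> bool:
    """True if every eigenvalue of m has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(m).real < 0))


def expected_td_matrix(phi, d_mu, p_pi, gamma):
    """Expected off-policy linear TD(0) matrix A = Phi^T D_mu (gamma P_pi - I) Phi.

    The mean dynamics are theta' = A theta + b, so they are stable iff A is Hurwitz.
    """
    n = p_pi.shape[0]
    d = np.diag(d_mu)
    return phi.T @ d @ (gamma * p_pi - np.eye(n)) @ phi


def tdc_slow_matrix(phi, d_mu, a):
    """Slow-timescale TDC matrix -A^T C^{-1} A, with auxiliary matrix C = Phi^T D_mu Phi."""
    c = phi.T @ np.diag(d_mu) @ phi
    return -a.T @ np.linalg.solve(c, a)


# Hypothetical two-state example in the spirit of the classical "theta -> 2 theta" counterexample.
phi = np.array([[1.0], [2.0]])        # one feature per state: phi(s1)=1, phi(s2)=2
p_pi = np.array([[0.0, 1.0],          # target policy moves s1 -> s2
                 [0.0, 1.0]])         # and stays in s2
d_mu = np.array([0.5, 0.5])           # behavior policy visits both states equally
gamma = 0.9

A_td = expected_td_matrix(phi, d_mu, p_pi, gamma)
M_tdc = tdc_slow_matrix(phi, d_mu, A_td)

print("TD(0) A =", A_td.ravel(), "Hurwitz:", is_hurwitz(A_td))
print("TDC slow matrix =", M_tdc.ravel(), "Hurwitz:", is_hurwitz(M_tdc))
```

In this example, A is about 0.2, so it is not Hurwitz and the expected TD(0) iterates diverge. The TDC slow matrix is about −0.016, so it is Hurwitz. In general, −Aᵀ C⁻¹ A is negative definite whenever A is nonsingular and C is positive definite, which is why TDC stays stable on such counterexamples.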
Reinforcement learning · Papers · Benchmarks

Summary generated by Claude — human-verified