arXiv cs.AI

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

Signal: 72 · Hype: 08
In three lines: STHTD-MP, a new off-policy temporal-difference method, replaces the covariance metric with the behavior-policy Bellman matrix in the primal-dual saddle-point formulation. A formal convergence analysis and a spectral comparison with GTD2-MP indicate potential gains on the Random Walk and Boyan Chain benchmarks.
Reinforcement learning · Papers · Benchmarks

Summary generated by Claude — human-verified