arXiv cs.AI·29 May 2026

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

In three lines

STHTD-MP, a new off-policy temporal-difference method, replaces the covariance metric with the behavior-policy Bellman matrix in the primal-dual saddle-point formulation. Formal convergence analysis and spectral comparison with GTD2-MP show potential gains on benchmarks (Random Walk, Boyan Chain).

Reinforcement learning Papers Benchmarks

Summary generated by Claude — human-verified
