arXiv cs.LG · 3 June 2026

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

In three lines

Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows that reward hacking stems from the finite-particle plug-in estimation of the Doob h-function used in practical implementations. The authors propose a closed-form reward damping schedule and validate it on Gaussian targets, a 2D checkerboard, and FLUX.1 text-to-image generation.
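
To make the phrase "finite-particle plug-in estimation" concrete, here is a minimal toy sketch; it is not the paper's algorithm or experimental setup. It assumes a standard normal sample x, a linear reward r(x) = x, and the common convention that tilting reweights by exp(r(x) / lam). Under those assumptions the exact value is log E[exp(x / lam)] = 1 / (2 * lam**2), so the average error of an N-particle plug-in estimate can be measured directly. The names `plugin_log_h`, `exact_log_h`, and `mean_bias` are hypothetical and chosen only for this illustration.

```python
import math
import random


def plugin_log_h(n_particles, lam, rng):
    """Log of the plug-in estimate (1/N) * sum_i exp(r(x_i) / lam), with r(x) = x."""
    # Assumption for illustration: particles are i.i.d. standard normal draws.
    scaled = [rng.gauss(0.0, 1.0) / lam for _ in range(n_particles)]
    top = max(scaled)  # log-sum-exp shift for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scaled) / n_particles)


def exact_log_h(lam):
    """Closed form for x ~ N(0, 1): log E[exp(x / lam)] = 1 / (2 * lam**2)."""
    return 1.0 / (2.0 * lam ** 2)


def mean_bias(n_particles, lam=0.5, trials=300, seed=0):
    """Average of (plug-in log estimate - exact value) over independent trials."""
    rng = random.Random(seed)
    total = sum(plugin_log_h(n_particles, lam, rng) for _ in range(trials))
    return total / trials - exact_log_h(lam)


if __name__ == "__main__":
    # Compare the average error at a few particle counts.
    for n in (4, 32, 512):
        print(n, round(mean_bias(n), 3))
```

In this toy setting the log of the plug-in estimate sits below the exact value on average (Jensen's inequality), and the gap shrinks as the particle count grows. How this kind of estimation error relates to reward hacking in guided flow and diffusion samplers is the subject of the paper itself, which this sketch does not attempt to reproduce.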

Reinforcement learning Reasoning Papers

Summary generated by Claude — human-verified
