Are we really tilting? The mechanics of reward guidance in flow and diffusion models
Signal: 78
Hype: 15
In three lines
Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows reward hacking stems from finite-particle plug-in estimation of the Doob h-function in practical implementations. Authors propose a closed-form reward damping schedule and validate on Gaussian targets, 2D checkerboard, and FLUX.1 text-to-image generation.
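To make "finite-particle plug-in estimation" concrete, here is a minimal toy sketch, not the paper's setup or code. It assumes a one-dimensional standard Gaussian sample X, a hypothetical linear reward r(x) = lam * x, and a single quantity log E[exp(r(X))], which stands in for the log of the h-function at one state. The function names `plugin_log_h` and `mean_bias` are illustrative. For this toy the exact value is lam^2 / 2, so the error of the K-particle plug-in estimate can be measured directly.

```python
import math
import random


def plugin_log_h(samples, reward):
    """Plug-in estimate of log E[exp(reward(X))] from a finite set of particles."""
    vals = [reward(x) for x in samples]
    m = max(vals)  # log-sum-exp shift for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))


def mean_bias(num_particles, lam=2.0, trials=2000, seed=0):
    """Average (estimate - truth) over repeated draws.

    Toy assumption: X ~ N(0, 1) and r(x) = lam * x, so the exact value
    is log E[exp(lam * X)] = lam**2 / 2.
    """
    rng = random.Random(seed)
    true_log_h = 0.5 * lam ** 2
    total = 0.0
    for _ in range(trials):
        particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
        total += plugin_log_h(particles, lambda x: lam * x) - true_log_h
    return total / trials


if __name__ == "__main__":
    # Print the average error of the plug-in estimate for a few particle counts.
    for k in (2, 16, 256):
        print(k, round(mean_bias(k), 3))
```

By Jensen's inequality the log of a sample mean underestimates the log of the true mean on average, so the reported bias is negative and shrinks as the particle count grows. This only shows that a plug-in estimate with few particles is systematically off; how that error translates into reward hacking, and how the damping schedule addresses it, is the subject of the paper itself.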
Summary generated by Claude — human-verified