arXiv cs.LG

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Signal: 78
Hype: 15
In three lines: Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows that reward hacking stems from finite-particle plug-in estimation of the Doob h-function in practical implementations. The authors propose a closed-form reward damping schedule and validate it on Gaussian targets, a 2D checkerboard, and FLUX.1 text-to-image generation.
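To make "finite-particle plug-in estimation" concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm and does not implement its damping schedule. It assumes a standard normal base distribution and a hypothetical linear reward r(x) = lam * x, for which the reward-tilted measure p(x) exp(r(x)) / Z is exactly N(lam, 1) and E[exp(r(X))] = exp(lam^2 / 2). The function names are illustrative only.

```python
import numpy as np


def plug_in_h(rewards):
    """Finite-particle plug-in estimate of E[exp(r(X))] from N sampled rewards."""
    return float(np.mean(np.exp(rewards)))


def tilted_mean(samples, rewards):
    """Self-normalized estimate of the mean of x under p(x) * exp(r(x)) / Z."""
    weights = np.exp(rewards - np.max(rewards))  # subtract max for stability
    return float(np.sum(weights * samples) / np.sum(weights))


def average_tilted_mean(lam, n_particles, n_trials=2000, seed=0):
    """Average the N-particle tilted-mean estimate over independent trials."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        x = rng.standard_normal(n_particles)  # base samples, p = N(0, 1)
        estimates.append(tilted_mean(x, lam * x))  # toy reward r(x) = lam * x
    return float(np.mean(estimates))


if __name__ == "__main__":
    lam = 3.0  # assumed reward strength; exact tilted mean equals lam
    print("exact tilted mean:", lam)
    for n in (2, 16, 1024):
        print(n, "particles:", average_tilted_mean(lam, n))
```

In this toy setting the estimate with very few particles cannot exceed the largest sampled value, so for a strong reward it falls well short of the exact tilted mean. That gap between the intended tilt and what a small particle set delivers is the kind of estimation error the summary refers to, though the paper's own analysis may frame it differently.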
Reinforcement learning, Reasoning, Papers

Summary generated by Claude — human-verified