arXiv cs.LG

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Signal: 72 · Hype: 18
In three lines: Q-ALIGN DT aligns return-conditioned sequence models by ensuring that the Q-value of the output policy matches the input return-to-go (RTG). The method uses a Q-function for dense guidance and fine-tunes with RTG perturbation. Results: improved controllability on the D4RL benchmark and generalization to velocity-tracking tasks where prior methods fail.
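To make the alignment idea in the summary concrete, here is a minimal sketch that measures how far the Q-value of a policy's action is from the RTG it was conditioned on, and averages that gap over perturbed RTGs. Everything in it is an assumption for illustration: the function names, the squared-gap form, the uniform perturbation, and the toy Q-function are hypothetical and are not taken from the paper.

```python
import random

def alignment_gap(policy, q_fn, state, rtg):
    """Return |Q(s, pi(s, rtg)) - rtg| for one state and one requested RTG."""
    action = policy(state, rtg)
    return abs(q_fn(state, action) - rtg)

def perturbed_alignment_loss(policy, q_fn, states, rtgs, noise=1.0, seed=0):
    """Mean squared gap over RTGs jittered uniformly within +/- noise.

    Hypothetical stand-in for an RTG-perturbation objective; the paper's
    actual loss may differ.
    """
    rng = random.Random(seed)
    total = 0.0
    for state, rtg in zip(states, rtgs):
        perturbed = rtg + rng.uniform(-noise, noise)
        total += alignment_gap(policy, q_fn, state, perturbed) ** 2
    return total / len(states)

# Toy setup (assumed): the value of an action is simply 10 * action.
def toy_q(state, action):
    return 10.0 * action

def aligned_policy(state, rtg):
    # Picks the action whose Q-value equals the requested RTG.
    return rtg / 10.0

def ignoring_policy(state, rtg):
    # Ignores the RTG entirely, so its Q-value is always 5.
    return 0.5

states = [0.0, 0.0]
rtgs = [7.0, 3.0]
aligned_loss = perturbed_alignment_loss(aligned_policy, toy_q, states, rtgs)
ignoring_loss = perturbed_alignment_loss(ignoring_policy, toy_q, states, rtgs)
```

In this toy case the aligned policy has a near-zero loss for any perturbed RTG, while the policy that ignores its conditioning does not, which is the kind of controllability gap the summary refers to.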
Reinforcement learning · Reasoning · Benchmarks

Summary generated by Claude — human-verified