Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning
Signal
72
Hype
18
In three lines
Q-ALIGN DT aligns return-conditioned sequence models by ensuring that the Q-value of the output policy matches the input return-to-go (RTG). The method uses a Q-function for dense guidance and RTG-perturbation fine-tuning. Results: improved controllability on the D4RL benchmark and generalization to velocity-tracking tasks where prior methods fail.
Summary generated by Claude — human-verified
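
Illustrative sketch (not from the paper)
One way to read "the Q-value of the output policy matches the input RTG" is as a gap to be driven toward zero: condition the policy on a target return, score the action it picks with a critic, and compare that score to the target. The toy below is only a sketch under that reading. The closed-form `q_value`, the `policy` function, and its `gain` parameter are all hypothetical stand-ins chosen so the arithmetic is easy to follow; they are not the paper's architecture, loss, or training procedure, and the RTG perturbation here is used only to probe alignment, not to fine-tune anything.

```python
import numpy as np


def q_value(state, action):
    # Hypothetical toy critic: the value peaks at 0 when action == state
    # and falls off quadratically as the action moves away.
    return -((action - state) ** 2)


def policy(state, rtg, gain):
    # Hypothetical return-conditioned policy: a lower target return (more
    # negative rtg) pushes the action further from the optimum.
    return state + gain * np.sqrt(np.maximum(-rtg, 0.0))


def alignment_gap(states, rtgs, gain):
    # Mean squared gap between the critic's value of the chosen action
    # and the return-to-go the policy was conditioned on.
    actions = policy(states, rtgs, gain)
    return float(np.mean((q_value(states, actions) - rtgs) ** 2))


rng = np.random.default_rng(0)
states = rng.normal(size=256)
rtgs = -rng.uniform(0.0, 4.0, size=256)  # achievable returns are <= 0 here

# RTG perturbation: probe targets near, but not equal to, the original ones,
# clipped so they stay achievable under the toy critic.
perturbed = np.minimum(rtgs + rng.normal(scale=0.5, size=256), 0.0)

gap_aligned = alignment_gap(states, perturbed, gain=1.0)
gap_misaligned = alignment_gap(states, perturbed, gain=0.5)
print(f"aligned gap: {gap_aligned:.3e}, misaligned gap: {gap_misaligned:.3e}")
```

With `gain=1.0` the toy policy's action has a critic value equal to the requested RTG, so the gap is zero up to floating-point error; with `gain=0.5` the policy systematically overshoots the requested return and the gap is clearly positive.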