From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Signal: 75
Hype: 25
In three lines
Demo2Reward optimizes the language instructions given to a VLM reward model at test time, using 3-10 expert demonstrations, to reduce false positives in robotics tasks. It requires no additional training. It was validated on simulated tasks and on real-world transfer.
Summary generated by Claude — human-verified