See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Signal: 75
Hype: 25
In three lines
SWIM aligns vision-language representations for fine-grained video object understanding from text prompts alone. Uses mask supervision during training to guide cross-modal attention. Constructs NL-Refer dataset with precise natural language referring expressions. Outperforms visual-prompt-based methods on fine-grained benchmarks.
Summary generated by Claude — human-verified