Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data
Signal: 72
Hype: 18
In three lines
A post-hoc multimodal alignment method that uses token-level relative representations to align separately pre-trained encoders with limited paired data.
It learns anchors in each modality's embedding space so that similarity patterns are consistent across modalities.
It outperforms existing methods on zero-shot classification, cross-modal retrieval, and zero-shot segmentation.
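To make the idea of a relative representation concrete, here is a minimal sketch in plain Python. It is not the paper's method: the anchors here are fixed by hand rather than learned, the "two modalities" are simply one toy 2-D space and a rotated copy of it, and every name and value (`relative_representation`, `tokens_a`, `theta`, and so on) is a hypothetical chosen for illustration. It shows only the general property this family of methods relies on: describing each token by its cosine similarities to a set of anchors gives the same description in both spaces when the anchors correspond.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(x, anchors):
    """Describe x by its similarity to each anchor instead of raw coordinates."""
    return [cosine(x, a) for a in anchors]


def rotate(v, theta):
    """Rotate a 2-D vector by theta radians (stands in for a second encoder)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


# Hypothetical toy data: token embeddings and hand-picked anchors in space A.
tokens_a = [[1.0, 0.2], [0.3, 0.9], [-0.5, 0.4]]
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumption: space B is space A under an unknown rotation.
theta = 0.7
tokens_b = [rotate(t, theta) for t in tokens_a]
anchors_b = [rotate(a, theta) for a in anchors_a]

rel_a = [relative_representation(t, anchors_a) for t in tokens_a]
rel_b = [relative_representation(t, anchors_b) for t in tokens_b]

# Largest coordinate-wise difference between the two relative views.
max_gap = max(
    abs(p - q)
    for row_a, row_b in zip(rel_a, rel_b)
    for p, q in zip(row_a, row_b)
)
```

In this toy setting the raw coordinates of `tokens_a` and `tokens_b` differ, yet `max_gap` is zero up to floating-point error, because cosine similarity is unchanged by rotation. Real encoders are not related by an exact rotation, which is presumably why the summarized method learns its anchors from paired data instead of fixing them.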
Summary generated by Claude — human-verified