GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Signal
72
Hype
18
In three lines
GeoWorld-VLM enhances spatial reasoning in Vision-Language Models by distilling geometric structure from frozen camera-conditioned video world models. The method fine-tunes only the image encoder and multimodal projector, aligning post-projector features with world-model representations. It reports improvements of roughly 4% on the What'sUp and VSR benchmarks.
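To make the alignment idea concrete, here is a minimal sketch in plain Python. The summary does not state which loss GeoWorld-VLM uses, so the cosine-distance objective below is an assumption. The function names, the `TRAINABLE` map, and the toy feature values are all hypothetical, chosen only for illustration.

```python
import math

# Which components receive gradient updates, per the summary: only the image
# encoder and the multimodal projector are fine-tuned; the world model is frozen.
# (The key names are hypothetical labels, not identifiers from the paper.)
TRAINABLE = {
    "image_encoder": True,
    "projector": True,
    "language_model": False,
    "world_model": False,
}


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero norms


def alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over paired token features.

    student_tokens: post-projector VLM features, one vector per token.
    teacher_tokens: frozen world-model features for the same tokens.
    ASSUMPTION: cosine distance stands in for the unspecified loss.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(distances) / len(distances)


# Toy features: the first token pair is parallel, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
loss = alignment_loss(student, teacher)
```

In a real training loop this term would be minimised with respect to the components marked `True` in `TRAINABLE`, while the world model serves only as a fixed target.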
Summary generated by Claude — human-verified