arXiv cs.AI · 19 May 2026

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

In three lines

GeoWorld-VLM enhances spatial reasoning in Vision-Language Models by distilling geometric structure from frozen camera-conditioned video world models. The method fine-tunes only the image encoder and multimodal projector, aligning post-projector features with world-model representations. It achieves improvements of roughly 4% on the What'sUp and VSR benchmarks.
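
To make the alignment idea concrete, here is a minimal sketch in plain Python of what such a distillation objective might look like. It is an illustration under stated assumptions, not the paper's implementation: the summary does not name the loss, so a cosine-similarity objective is assumed, and the parameter prefixes `image_encoder.` and `projector.` as well as the helper names `is_trainable` and `alignment_loss` are hypothetical.

```python
import math

# Hypothetical parameter-name prefixes. The summary only states that the
# image encoder and multimodal projector are fine-tuned; everything else
# (language model, world model) is treated as frozen here.
TRAINABLE_PREFIXES = ("image_encoder.", "projector.")


def is_trainable(param_name: str) -> bool:
    """Return True for parameters that would receive gradient updates."""
    return param_name.startswith(TRAINABLE_PREFIXES)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over paired token features.

    student_tokens: post-projector VLM features, one vector per token.
    teacher_tokens: frozen world-model features for the same tokens.
    The cosine objective is an assumption; the source does not name the loss.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    if not student_tokens:
        raise ValueError("at least one token is required")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    )
    return total / len(student_tokens)
```

In this sketch the loss is zero when every student feature points in the same direction as its teacher feature, regardless of magnitude, and rises to 1 when they are orthogonal. A real training setup would compute this over batched tensors and backpropagate only into the parameters that `is_trainable` selects.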

Vision Reasoning Fine-tuning

Summary generated by Claude — human-verified
