Reddit r/LocalLLaMA

Nvidia LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding (10x faster than Qwen3-VL)

Signal: 75
Hype: 35
In three lines
Nvidia releases LocateAnything, a 3B vision-language grounding model.
It uses parallel box decoding and is claimed to be 10x faster than Qwen3-VL.
Code and a demo are available on Hugging Face.
Vision · Open source · Benchmarks

Summary generated by Claude — human-verified