Can LLMs Adhere to Strict 2D Spatial Constraints? (Testing with Sokoban)
Signal: 72
Hype: 35
In three lines
Spatial reasoning benchmark that tests LLMs on Sokoban under zero-shot conditions. ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking pass; Gemini 3.5-flash, Qwen 3.6/3.7-plus, GLM-5, and Gemma4 fail. Strict output formatting (UP/DOWN/LEFT/RIGHT only) prevents chain-of-thought cheating.
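The strict-format rule can be sketched as a small checker. This is a minimal illustration, not the benchmark's actual harness: the function names `parse_moves` and `solves`, the level encoding (the common `#`, `@`, `$`, `.`, `*`, `+` Sokoban symbols), the whitespace-or-comma separators, and the choice to treat any illegal move as a failure are all assumptions. A reply is rejected outright if it contains anything other than the four move tokens, so reasoning text in the answer counts as a format violation.

```python
import re

# Assumption: moves map to (row, column) offsets on a text grid.
DELTAS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def parse_moves(reply):
    """Return the list of moves, or None if the reply has any other token."""
    tokens = [t for t in re.split(r"[\s,]+", reply.strip()) if t]
    if not tokens or any(t not in DELTAS for t in tokens):
        return None  # extra words (e.g. reasoning text) break the format
    return tokens


def solves(level, moves):
    """Simulate the moves; True only if all are legal and every box ends on a goal."""
    walls, goals, boxes, player = set(), set(), set(), None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in ".*+":   # goal, box on goal, player on goal
                goals.add((r, c))
            if ch in "$*":    # box, box on goal
                boxes.add((r, c))
            if ch in "@+":    # player, player on goal
                player = (r, c)
    for move in moves:
        dr, dc = DELTAS[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False  # assumption: walking into a wall fails the run
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False  # the box cannot be pushed
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    return boxes == goals


# Hypothetical one-push level, for illustration only.
LEVEL = ["#####", "#@$.#", "#####"]
```

With this sketch, a reply of `RIGHT` on `LEVEL` parses and solves the puzzle, while a reply such as `I think RIGHT` is rejected by `parse_moves` before any simulation runs.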
Summary generated by Claude — human-verified