Reddit r/LocalLLaMA · 3 June 2026

Can LLMs Adhere to Strict 2D Spatial Constraints? (Testing with Sokoban)

In three lines: A zero-shot spatial reasoning benchmark for LLMs built on Sokoban. ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking pass; Gemini 3.5-flash, Qwen 3.6/3.7-plus, GLM-5, and Gemma4 fail. Strict output formatting (UP/DOWN/LEFT/RIGHT only) prevents chain-of-thought cheating.
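
To make the pass/fail criterion concrete, here is a minimal sketch of how such a check could work. It is not the original poster's harness: the function names (`parse_moves`, `solves`), the sample level, and the rule that any illegal move counts as a failure are all assumptions for illustration. The level uses the common Sokoban text notation (`#` wall, `@` player, `$` box, `.` goal, `*` box on goal, `+` player on goal) and is assumed to be fully enclosed by walls.

```python
import re

# Row/column deltas for the four allowed tokens.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

# Hypothetical one-push level: the player pushes the box right onto the goal.
LEVEL = [
    "#####",
    "#@$.#",
    "#####",
]


def parse_moves(reply):
    """Return the list of moves, or None if the reply contains any other text."""
    tokens = re.split(r"[,\s]+", reply.strip())
    if any(token not in MOVES for token in tokens):
        return None  # extra words, empty reply, or malformed token
    return tokens


def solves(level, moves):
    """Simulate the moves; True only if every move is legal and all boxes end on goals."""
    cells = {(r, c): ch for r, row in enumerate(level) for c, ch in enumerate(row)}
    walls = {pos for pos, ch in cells.items() if ch == "#"}
    goals = {pos for pos, ch in cells.items() if ch in ".*+"}
    boxes = {pos for pos, ch in cells.items() if ch in "$*"}
    player = next(pos for pos, ch in cells.items() if ch in "@+")

    for move in moves:
        dr, dc = MOVES[move]
        target = (player[0] + dr, player[1] + dc)
        if target in walls:
            return False  # assumption: walking into a wall fails the run
        if target in boxes:
            beyond = (target[0] + dr, target[1] + dc)
            if beyond in walls or beyond in boxes:
                return False  # a box cannot be pushed into a wall or another box
            boxes.remove(target)
            boxes.add(beyond)
        player = target
    return boxes == goals
```

Under these assumptions, `parse_moves("RIGHT")` is accepted and `solves(LEVEL, ["RIGHT"])` succeeds, while a reply such as `"Let me think. RIGHT"` is rejected before any simulation, which is how a strict format rules out visible reasoning in the answer.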

Summary generated by Claude — human-verified
