Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
Signal: 78
Hype: 15
In three lines
An interactive reasoning benchmark built on 474 executable games. An LLM receives only the task rules and must query a hidden environment, integrate partial observations, and decide when to submit an answer. The benchmark evaluates contextual robustness, metacognitive adaptation, and interaction efficiency across frontier models.
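The query-observe-submit protocol described in the summary can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `HiddenNumberGame` class, its `query` and `submit` methods, and the `bisecting_agent` policy are hypothetical and are not taken from the benchmark, whose actual games and interface are not described in this summary. A simple bisection routine stands in for the LLM.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Hypothetical toy game: the secret is never shown to the agent."""
    secret: int
    low: int = 1
    high: int = 100
    queries: int = 0  # number of interactions used so far

    def query(self, guess: int) -> str:
        """Return a partial observation: only the direction of the secret."""
        self.queries += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Score the final answer against the hidden state."""
        return answer == self.secret


def bisecting_agent(game: HiddenNumberGame, max_queries: int = 20) -> int:
    """Placeholder policy standing in for an LLM.

    It knows only the rules (the valid range), integrates each observation
    by narrowing that range, and stops querying once one candidate remains.
    """
    low, high = game.low, game.high
    while low < high and game.queries < max_queries:
        mid = (low + high) // 2
        observation = game.query(mid)
        if observation == "equal":
            return mid
        if observation == "higher":
            low = mid + 1
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    game = HiddenNumberGame(secret=37)
    answer = bisecting_agent(game)
    print("correct:", game.submit(answer), "queries used:", game.queries)
```

In this sketch the query count serves as a crude proxy for interaction efficiency, and the decision to stop querying is the step at which the agent judges it has enough evidence to submit.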
Summary generated by Claude — human-verified