arXiv cs.AI

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Signal: 78
Hype: 15
In three lines: An interactive reasoning benchmark built on 474 executable games. LLMs receive only the task rules, then must query a hidden environment, integrate partial observations, and decide when to submit an answer. The benchmark evaluates contextual robustness, metacognitive adaptation, and interaction efficiency across frontier models.
Reasoning, Evals, Benchmarks

Summary generated by Claude — human-verified