arXiv cs.AI · 2 June 2026

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

In three lines

An interactive reasoning benchmark built on 474 executable games. LLMs receive only the task rules and must query a hidden environment, integrate partial observations, and decide when to submit an answer. The benchmark evaluates contextual robustness, metacognitive adaptation, and interaction efficiency across frontier models.
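
To make the interaction loop concrete, here is a minimal sketch of the query, observe, and submit cycle the summary describes. It is not the paper's code or any of its games: `HiddenNumberGame`, `binary_search_agent`, and the number-guessing rules are hypothetical stand-ins, and a simple binary search plays the role of the model.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Hypothetical executable game: the secret stays hidden from the agent."""
    secret: int
    queries_used: int = 0

    def query(self, guess: int) -> str:
        """Return a partial observation and count the interaction."""
        self.queries_used += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Final answer check; the agent decides when to call this."""
        return answer == self.secret


def binary_search_agent(env: HiddenNumberGame, low: int, high: int) -> int:
    """Stand-in for a model (assumption): knows only the rules and the range.

    It integrates each observation by shrinking the candidate interval and
    stops querying once a single candidate remains.
    """
    while low < high:
        mid = (low + high) // 2
        observation = env.query(mid)
        if observation == "equal":
            low = high = mid
        elif observation == "higher":
            low = mid + 1
        else:
            high = mid - 1
    return low


env = HiddenNumberGame(secret=37)
answer = binary_search_agent(env, 1, 100)
print(env.submit(answer), env.queries_used)
```

In this toy setup, the query count is a rough proxy for interaction efficiency, and the stopping condition is the "decide when to submit" step. How the benchmark itself scores these is not stated in the summary.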

Reasoning Evals Benchmarks

Summary generated by Claude — human-verified
