arXiv cs.CL·19 May 2026

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

In three lines: STT-Arena is a benchmark of 227 interactive tasks that measures how well LLMs replan under spatio-temporal dynamics. Claude-4.6-Opus achieves under 40% accuracy. The authors identify three recurring failure modes and propose STT-Agent-4B, which combines iterative trajectory refinement with online RL.

AI Agents Benchmarks Reinforcement learning Reasoning Claude

Summary generated by Claude — human-verified
