STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Signal: 78
Hype: 25
In three lines: STT-Arena is a benchmark of 227 interactive tasks that measures how well LLMs replan under spatio-temporal dynamics. Claude-4.6-Opus achieves under 40% accuracy. The authors identify three recurring failure modes and propose STT-Agent-4B, which combines iterative trajectory refinement with online RL.
Summary generated by Claude — human-verified