arXiv cs.CL

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Signal: 78 | Hype: 25
In three lines: STT-Arena is a benchmark of 227 interactive tasks that measures how well LLMs replan under spatio-temporal dynamics. Claude-4.6-Opus achieves under 40% accuracy. The authors identify three recurring failure modes and propose STT-Agent-4B, which combines iterative trajectory refinement with online RL.
AI Agents, Benchmarks, Reinforcement learning, Reasoning, Claude

Summary generated by Claude — human-verified