arXiv cs.AI

Interactive Benchmarks

Signal: 75
Hype: 15
In three lines
Interactive Benchmarks is a new evaluation paradigm that assesses model reasoning through budgeted multi-turn interaction. It has two settings: Interactive Proofs (logic, UI2Html, mathematics with objective feedback) and Interactive Games (strategic reasoning). The results reveal substantial gaps in current models' interactive capabilities.
Benchmarks, Reasoning, Evals

Summary generated by Claude — human-verified