Back to feed
Reddit r/LocalLLaMA·

Agent Execution Tax: new procurement metric for browser agent benchmarks?

Signal
78
Hype
25
In three linesWebVoyager benchmark on 720 browser agent tasks: MiniMax M2.5 costs 2.3× less per successful task than Gemini 2.5 Flash. GLM-5 achieves 57.1% accuracy, Kimi K2.5 shows 0% parse retry rate. Open-weight models outperform Gemini not through intelligence but reliability. True cost exceeds per-token pricing once retries compound.
Read source
Your take?
AI AgentsBenchmarksOpen sourceEvals

Summary generated by Claude — human-verified