Reddit r/LocalLLaMA·23 May 2026

Apex-Testing: real-world, real repos, agentic coding benchmark (Update)


In three lines: Apex-Testing, an agentic coding benchmark based on 65-70 real GitHub repos, has been updated to 95% completion with recent models. Its 70 tasks across 8 categories test AI agents on production codebases. An ELO leaderboard, cost and time metrics, and model comparisons are available. Runs for Qwen 3.7 Max, Deepseek v4 and other models are still being completed.


Tags: AI Agents, Code generation, Benchmarks, Evals

Summary generated by Claude — human-verified
