Reddit r/LocalLLaMA

Apex-Testing: real-world, real repos, agentic coding benchmark (Update)

Signal: 78 · Hype: 25
In three lines: Apex-Testing, an agentic coding benchmark built on 65-70 real GitHub repos, has been updated, with runs for recent models about 95% complete. Its 70 tasks across 8 categories test AI agents on production codebases. An Elo leaderboard, cost/time metrics, and model comparisons are available. Runs for Qwen 3.7 Max, Deepseek v4, and other models are still being completed.
Tags: AI Agents · Code generation · Benchmarks · Evals

Summary generated by Claude — human-verified