arXiv cs.AI

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Signal: 78
Hype: 15
In three lines: DevBench is a telemetry-driven benchmark that evaluates LLMs on 1,800 realistic code completion tasks across 6 programming languages. Nine state-of-the-art models were tested, and the best reached 43.5% Pass@1. Scoring combines functional correctness, similarity metrics, and LLM-judge assessments of usefulness and contextual relevance.
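For readers unfamiliar with the metric: Pass@1 is commonly read as the share of tasks whose first generated completion passes that task's tests. The sketch below illustrates that reading only. The function name and the sample outcomes are hypothetical, and this card does not say how DevBench itself computes the score.

```python
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Fraction of tasks whose first generated completion passed its tests."""
    if not first_attempt_passed:
        raise ValueError("need at least one task result")
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical outcomes for 8 tasks (illustrative only, not DevBench data).
results = [True, False, False, True, False, True, False, False]
score = pass_at_1(results)
print(f"Pass@1 = {score:.1%}")  # prints "Pass@1 = 37.5%"
```

Under this reading, a best score of 43.5% would mean fewer than half of the tasks were solved on the first attempt.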
Tags: Code generation, Benchmarks, Evals

Summary generated by Claude — human-verified