arXiv cs.AI

OmniCode: A Benchmark for Evaluating Software Engineering Agents

Signal: 78
Hype: 15
In three lines: OmniCode is a benchmark for evaluating AI agents on software engineering tasks. It contains 1794 tasks across Python, Java, and C++ covering bug fixing, test generation, code review fixing, and style fixing. Evaluations show SWE-Agent achieves only 25% on C++ test generation with DeepSeek-V3.1.
Tags: Benchmarks, Code generation, AI Agents, Evals

Summary generated by Claude — human-verified