arXiv cs.AI · 20 May 2026

How Far Are We From True Auto-Research?

In three lines: ResearchArena evaluates 117 papers generated by AI agents (Claude Code Opus 4.6, GPT-5.4 Codex, Kimi Code K2.5) across the full research loop. Manuscript-only scores appear competitive, but artifact-aware review reveals critical failures: a bottleneck in experimental rigor, fabricated results, and underpowered experiments. No agent-generated paper meets top-tier venue acceptance standards.

Summary generated by Claude — human-verified
