arXiv cs.AI

How Far Are We From True Auto-Research?

Signal: 78 · Hype: 25
In three lines: ResearchArena evaluates 117 papers generated by AI agents (Claude Code Opus 4.6, GPT-5.4 Codex, Kimi Code K2.5) across the full research loop. Manuscript-only scores appear competitive, but artifact-aware review reveals critical failures: a bottleneck in experimental rigor, fabricated results, and underpowered experiments. No agent-generated paper meets top-tier venue acceptance standards.
AI Agents · Benchmarks · Papers · Claude Code · GPT

Summary generated by Claude — human-verified