arXiv cs.LG

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Signal: 82
Hype: 15
In three lines: ResearchClawBench benchmarks autonomous scientific research agents across 40 tasks spanning 10 scientific domains. Claude Code scores 21.5/100 and Claude-Opus scores 20.7/100. Failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core.
Benchmarks, AI Agents, Claude, Papers

Summary generated by Claude — human-verified