PaperBench: Evaluating AI’s Ability to Replicate AI Research
OpenAI introduces PaperBench, a benchmark measuring AI agents' ability to replicate state-of-the-art AI research. The benchmark evaluates whether agents can autonomously implement the work described in complex research papers.