arXiv cs.AI·18 June 2026

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

In three lines: TxBench-PP is a verified benchmark that evaluates AI agents on small-molecule preclinical pharmacology. Its 100 evaluations span mechanism of action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves a 59.3% success rate and GPT-5.5 reaches 55.3%. No system reliably masters these decisions.

Summary generated by Claude — human-verified
