arXiv cs.AI

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Signal: 82
Hype: 15
In three lines
TxBench-PP is a verified benchmark that evaluates AI agents on small-molecule preclinical pharmacology. Its 100 evaluations span mechanism of action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves a 59.3% success rate and GPT-5.5 reaches 55.3%. No system reliably masters these decisions.
AI Agents, Benchmarks, Claude, GPT, Evals

Summary generated by Claude — human-verified