arXiv cs.AI

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

In three lines: PSEBench is a 5,074-case benchmark for evaluating LLMs on patient safety event triage under Minnesota policy. The methodology uses clause cards to factorize regulatory text into auditable decision specifications, with closed-loop verification. Evaluation of 15 representative LLMs reveals capability trends and actionable gaps toward reliable LLM-based triage.
Benchmarks, Evals, AI safety, AI Agents

Summary generated by Claude — human-verified