arXiv cs.CL · 1 June 2026

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

In three lines

The paper evaluates the semantic stability of 16 LLMs (general-purpose and medical) under clinically equivalent prompt reformulations. It proposes an NLI-based verification framework and three sensitivity metrics (MVS, ΔC, WCI). Finding: domain specialization does not consistently improve robustness to meaning-preserving variations.

Evals, AI safety, Reasoning, Benchmarks

Summary generated by Claude — human-verified
