arXiv cs.LG · 1 June 2026

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

In three lines: A new counterfactual evaluation metric (CSS) reveals that six frontier models that rank similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.

## CSS: Clinical LLM Rankings Invert Under Counterfactual Evaluation

### 1. What Is Actually Shown

The paper introduces the **Causal Sensitivity Score (CSS)**, a pre-registered interventional metric that tests whether a model *updates* its oncology recommendations when patient data changes — not whether it *covers* the right therapeutic options. The distinction is fundamental: a model can list correct treatments for stage III cancer while producing identical output for stage I. The Consensus Match Score (CMS), a weighted recall metric dominant in current clinical benchmarks, does not detect this failure mode.

Protocol: 224 oncology tumor-board cases, 6 frontier models from 3 labs, 5 mutation types (biomarker flip, prior-treatment failure, biomarker removal, surgery-status change, stage perturbation), three-level scoring {0, 0.5, 1.0} based on update direction.
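As a rough illustration of how a three-level interventional score could be aggregated, the sketch below averages per-intervention scores overall and per mutation type. The level names (`no_update`, `partial_update`, `correct_update`), the `InterventionResult` record, and the use of a plain mean are assumptions made for illustration; the summary specifies only the score values {0, 0.5, 1.0} and that scoring depends on update direction.

```python
from dataclasses import dataclass
from statistics import mean

# Assumed names for the three scoring levels; only the values come from the summary.
LEVEL_SCORES = {"no_update": 0.0, "partial_update": 0.5, "correct_update": 1.0}


@dataclass(frozen=True)
class InterventionResult:
    """One scored intervention (hypothetical record layout)."""
    case_id: str
    mutation_type: str  # e.g. "surgery_status_change"
    level: str          # key into LEVEL_SCORES


def css(results):
    """Average the per-intervention scores (mean aggregation is an assumption)."""
    if not results:
        raise ValueError("at least one intervention result is required")
    return mean(LEVEL_SCORES[r.level] for r in results)


def css_by_mutation(results):
    """Group results by mutation type and score each group separately."""
    grouped = {}
    for r in results:
        grouped.setdefault(r.mutation_type, []).append(r)
    return {mutation: css(group) for mutation, group in grouped.items()}
```

Breaking the score down by mutation type is what would make a category-specific weakness, such as the surgery-status result discussed later, visible in the first place.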

### 2. The Central Result: Near-Total Rank Inversion

All six models change rank between CMS and CSS. The worst model by CMS becomes the best by CSS, and a model in the upper-middle CMS tier drops to last place on CSS. This is not a minor reshuffling but a structural inversion: selecting clinical LLMs by CMS tends to favor the models that are least responsive to changes in patient data.

Prior to this work, the state of the art in clinical evaluation relied on coverage metrics: does the model mention the relevant options? CSS shifts the question to: does the model *causally reason* about patient data? For the six models tested, these two questions yield nearly opposite rankings.
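One way to quantify how strongly two rankings disagree is Spearman's rank correlation: values near +1 mean the rankings agree, values near 0 mean they are unrelated, and values near -1 mean one is roughly the reverse of the other. The sketch below uses the standard no-ties formula. The ranks in it are placeholders chosen only to match the qualitative description (worst CMS model first on CSS, an upper-middle CMS model last on CSS); they are not the paper's data, and the paper may use a different statistic.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder ranks (1 = best) for six hypothetical models, NOT reported results.
cms_ranks = [1, 2, 3, 4, 5, 6]
css_ranks = [5, 6, 4, 3, 2, 1]

rho = spearman_rho(cms_ranks, css_ranks)
print(f"Spearman rho between CMS and CSS ranks: {rho:.3f}")
```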

### 3. The Universal Blind Spot: Surgery Status

Every frontier model fails on surgery-status interventions, with a maximum CSS of **17.2%** on case family D. This is particularly alarming clinically: surgical eligibility is one of the most structurally decisive pivots in oncology — it determines curative vs. palliative intent, neoadjuvant chemotherapy sequencing, and tumor board framing. A model that ignores this signal is not making a marginal error; it is operating in a reasoning space disconnected from clinical reality.

CMS does not expose this because a model can mention surgery, chemotherapy, and radiotherapy in its output regardless of the patient's surgical status — coverage is satisfied, causal sensitivity is zero.
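The gap between coverage and sensitivity can be made concrete with a toy example. The sketch below assumes a generic weighted-recall form for a coverage metric; the exact CMS formula, the option names, and the weights are all hypothetical. A model that lists every modality regardless of surgical status gets full coverage on both versions of the case while its output never changes.

```python
def weighted_recall(output_options, reference_weights):
    """Share of the reference weight covered by options the output mentions."""
    total = sum(reference_weights.values())
    covered = sum(w for option, w in reference_weights.items()
                  if option in output_options)
    return covered / total


def output_changed(output_before, output_after):
    """True if the set of recommended options differs between the two cases."""
    return set(output_before) != set(output_after)


# Hypothetical reference options and weights for the two versions of one case.
resectable_reference = {"surgery": 2.0, "chemotherapy": 1.0}
unresectable_reference = {"chemotherapy": 2.0, "radiotherapy": 1.0}

# The model lists everything in both cases, ignoring the surgery-status change.
model_output = {"surgery", "chemotherapy", "radiotherapy"}

coverage_before = weighted_recall(model_output, resectable_reference)
coverage_after = weighted_recall(model_output, unresectable_reference)
responded = output_changed(model_output, model_output)
```

Because recall does not penalize extra options, listing everything is a winning strategy for coverage, while an interventional score that compares the two outputs would register no update at all.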

### 4. Transfer to ReAct Agents and RL Implications

The agent experiment is instructive: tool access improves CSS for 5 of 6 models (+2.5 to +20.3 percentage points). The lowest-CSS model, however, retrieves the same chart sections and produces the same recommendations before and after the mutation, so tooling does not fix a structural responsiveness deficit. This points to the cause: the problem is not information access but causal integration.

The authors propose CSS as a dense reward signal for future agentic RL systems. This is the paper's most forward-looking contribution: models trained on coverage metrics optimize coverage. A CSS signal in the training loop would force learning of sensitivity to clinical perturbations.

### Who Loses Under This Framework

**Clinical benchmark providers** built on CMS or equivalent metrics see their evaluation infrastructure partially invalidated. **Product teams** that selected models based on high CMS scores may have deployed precisely the models least responsive to clinical changes. **Labs** whose models perform well on CMS but poorly on CSS face a positioning question: are their models suited for dynamic clinical environments where cases evolve between consultations?

Validation by three medical professional raters and multi-judge replication strengthen the findings. Pre-registration of the metric is a methodologically significant decision that guards against p-hacking in a domain where deployment stakes are high. The CSS framework also offers a concrete path toward evaluating agentic clinical systems beyond static QA — a gap that has been acknowledged but not operationalized until now.

Summary generated by Claude — human-verified
