Edition of 2026-06-01

Evaluation under pressure: annotation bias, GRPO collapse, and agents that lose the thread past 48% accuracy

Five papers published today share a common diagnosis: current evaluation metrics systematically underestimate real-world system failures. GLIDE (arXiv:2605.31278) attacks the problem at the root. LLM-as-judge annotations are biased, and GLIDE unifies PPI++, Stratified PPI, and Predict-Then-Debias in a single Python library that produces valid confidence intervals while cutting human annotation costs. This is eval infrastructure, not another benchmark. Meanwhile, a counterfactual study of clinical LLMs, built around the CSS metric, shows that six frontier models ranked nearly identically on traditional metrics reverse their ordering entirely when oncology cases are mutated. All of them fail uniformly on surgical status changes, a blind spot invisible to standard coverage metrics. LongDS-Bench drives the point home: across 68 multi-turn data analysis tasks built on real Kaggle notebooks, the best score is 48.45%, with a 47-point drop between early and late turns. Long-horizon errors account for 52–69% of total failures. Data analysis agents do not hold context.
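To make the debiasing idea concrete, here is a minimal sketch of the basic prediction-powered inference (PPI) estimator for a mean, the building block that PPI++ extends with a tuning weight. It is not GLIDE's API: the function name `ppi_mean_ci` and the toy scores are assumptions made for illustration. The judge's average on the large unlabeled pool is corrected by the average judge error measured on the small human-labeled set.

```python
from statistics import NormalDist, mean, variance

def ppi_mean_ci(human_labeled, judge_on_labeled, judge_on_unlabeled, alpha=0.05):
    """Basic PPI estimate of a mean with a normal-approximation interval.

    Hypothetical helper for illustration, not a function from GLIDE.
    """
    n = len(human_labeled)
    big_n = len(judge_on_unlabeled)
    # Rectifier: how far the judge is from the human label on the labeled set.
    rectifier = [y - f for y, f in zip(human_labeled, judge_on_labeled)]
    estimate = mean(judge_on_unlabeled) + mean(rectifier)
    # The two samples are independent, so their variances add.
    std_err = (variance(judge_on_unlabeled) / big_n + variance(rectifier) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)

# Toy data (made up): the judge approves more often than humans do.
human = [1, 0, 1, 1, 0, 1, 0, 1]
judge_labeled = [1, 1, 1, 1, 0, 1, 1, 1]
judge_unlabeled = [1] * 70 + [0] * 30

estimate, (low, high) = ppi_mean_ci(human, judge_labeled, judge_unlabeled)
print(f"debiased pass rate: {estimate:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

In this toy case the raw judge pass rate of 0.70 is pulled down by the measured judge bias of 0.25. The interval stays wide because only eight human labels back the correction, which is the trade-off between annotation cost and interval width that such libraries are meant to manage.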

On the training side, VeriGate fixes a structural flaw in GRPO: when all trajectories in a group receive the same reward, the group-relative advantages are all zero and the gradient collapses. By using a Process Reward Model to assign granular token-level credit, VeriGate gains ~20 accuracy points on MATH with Qwen2.5-Instruct 1.5B and ~12 points on the 7B. This is not a marginal improvement; it is a fix to a fundamental optimization failure in current RL pipelines for reasoning.
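The collapse is easy to reproduce on paper. The sketch below computes standard group-normalized advantages as used in GRPO, then shows one way step-level scores could restore a signal. The second function, `step_credit`, and its process scores are purely hypothetical: it illustrates the general idea of process-level credit and is not VeriGate's actual rule.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: reward minus group mean, over group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def step_credit(process_scores):
    """Hypothetical step-level credit: each step's score minus the group-wide
    average step score. An assumption for illustration only."""
    baseline = mean(s for trajectory in process_scores for s in trajectory)
    return [[s - baseline for s in trajectory] for trajectory in process_scores]

# Mixed outcomes: the group carries a learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Every trajectory fails: all advantages are zero, so the policy gradient vanishes.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])

# Two failed trajectories that differ at the second reasoning step (made-up scores).
credits = step_credit([[0.9, 0.8, 0.1], [0.9, 0.2, 0.1]])

print(mixed)
print(collapsed)
print(credits)
```

Even though both trajectories in the last example end in the same wrong answer, the step-level scores separate the one that reasoned well for longer, which is the kind of signal an outcome-only reward cannot provide.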

Finally, the study on linear representations of synthetic deception (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) confirms that linear probes detect lying with AUC ≥0.99 as early as layers 1–3, and that dishonesty representations consolidate in deeper layers. Operational takeaway: activation-based monitoring is feasible early in the network, before deceptive behavior is observable at output. For teams working on alignment or red-teaming, this is a concrete instrumentation direction.
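For readers who want to see what such a probe amounts to, here is a minimal sketch on synthetic data. Everything in it is an assumption: the activations are random vectors with an artificial shift along one dimension, and the probe is a simple difference-of-means direction, which may differ from the probes used in the study. With real models, the rows would be residual-stream activations captured at a given layer.

```python
import random

def make_activations(n, dim, shift, rng):
    """Synthetic stand-in for layer activations; the shift sits on dimension 0."""
    return [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]

def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]

def fit_probe(honest, deceptive):
    """Difference-of-means direction pointing from honest to deceptive."""
    return [d - h for h, d in zip(mean_vector(honest), mean_vector(deceptive))]

def score(probe, activation):
    return sum(w * v for w, v in zip(probe, activation))

def auc(positive_scores, negative_scores):
    """Probability that a positive example outscores a negative one (ties count half)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positive_scores for q in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

rng = random.Random(0)
train_honest = make_activations(200, 8, 0.0, rng)
train_deceptive = make_activations(200, 8, 3.0, rng)
test_honest = make_activations(200, 8, 0.0, rng)
test_deceptive = make_activations(200, 8, 3.0, rng)

probe = fit_probe(train_honest, train_deceptive)
probe_auc = auc([score(probe, x) for x in test_deceptive],
                [score(probe, x) for x in test_honest])
print(f"held-out probe AUC: {probe_auc:.3f}")
```

Running the same fit-and-score loop once per layer gives a layer-by-layer AUC curve, which is how a claim about early-layer detectability would typically be checked.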

Signal IA