Five papers published today share a common diagnosis: current evaluation metrics systematically underestimate real-world system failures. GLIDE (arXiv:2605.31278) attacks the problem at the root. LLM-as-judge annotations are biased, so GLIDE unifies PPI++, Stratified PPI, and Predict-Then-Debias in a single Python library that combines LLM judgments with human annotations to produce valid confidence intervals at lower annotation cost. This is eval infrastructure, not another benchmark. Meanwhile, a counterfactual study on clinical LLMs introduces the CSS metric and shows that six frontier models ranked nearly identically on traditional coverage metrics land in nearly opposite order when oncology cases are mutated. All six fail on surgical status changes, a blind spot invisible to standard coverage metrics. LongDS-Bench drives the point home with 68 multi-turn data analysis tasks built on real Kaggle notebooks: the best score is 48.45%, with a 47-point drop between early and late turns, and long-horizon errors account for 52–69% of total failures. Data analysis agents do not hold context.
On the training side, VeriGate fixes a structural flaw in GRPO: when all trajectories in a group receive the same reward, the group-relative advantages are all zero and the gradient collapses. By using a Process Reward Model to assign fine-grained, token-level credit, VeriGate gains ~20 accuracy points on MATH with Qwen2.5-Instruct 1.5B and ~12 points with the 7B model. This is not a marginal improvement. It is a fix to a fundamental optimization failure in current RL pipelines for reasoning.
Finally, the study on linear representations of synthetic deception (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) confirms that linear probes detect lying with AUC ≥0.99 as early as layers 1–3, and that dishonesty representations consolidate in deeper layers. Operational takeaway: activation-based monitoring is feasible early in the network, before deceptive behavior is observable at output. For teams working on alignment or red-teaming, this is a concrete instrumentation direction.
GLIDE is an open-source Python library unifying prediction-powered inference methods (PPI++, Stratified PPI, Predict-Then-Debias) for evaluating agentic systems. It combines human annotations and LLM judgments into unbiased estimates with valid confidence intervals, reducing annotation costs while maintaining precision.
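To make the idea concrete, here is a minimal sketch of the basic prediction-powered estimate of a mean (for example, a task success rate), written with the Python standard library only. It is not GLIDE's API, which is not shown in this digest. The function name `ppi_mean_ci`, its parameters, and the toy data are all assumptions made for illustration. The optional `lam` weight follows the PPI++ idea of down-weighting the judge; here it is a fixed input, whereas PPI++ tunes it from the data.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human, judge_labeled, judge_unlabeled, lam=1.0, alpha=0.05):
    """Prediction-powered estimate of a mean with a normal-approximation CI.

    human:           human scores on the small labeled set (size n)
    judge_labeled:   LLM-judge scores on the same n items
    judge_unlabeled: LLM-judge scores on the large unlabeled set (size N)
    lam:             weight on the judge (1.0 = classical PPI, 0.0 = humans only)
    """
    n, big_n = len(human), len(judge_unlabeled)
    # Rectifier: the judge's error, measured where humans also scored.
    rectifier = [y - lam * f for y, f in zip(human, judge_labeled)]
    estimate = lam * fmean(judge_unlabeled) + fmean(rectifier)
    # The two sample means are independent, so their variances add.
    std_err = math.sqrt(
        lam**2 * variance(judge_unlabeled) / big_n + variance(rectifier) / n
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Toy data, made up for illustration: 1 = judged successful, 0 = failed.
human = [1, 0, 1, 0]
judge_labeled = [1, 1, 1, 0]          # the judge is too lenient on one item
judge_unlabeled = [1, 1, 1, 0, 1, 1, 0, 1]

estimate, (low, high) = ppi_mean_ci(human, judge_labeled, judge_unlabeled)
print(f"judge-only mean: {fmean(judge_unlabeled):.2f}")
print(f"PPI estimate: {estimate:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

The interval relies on a large-sample normal approximation, so the four-item labeled set above only illustrates the mechanics. In practice the labeled set needs to be large enough for that approximation to hold.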
VeriGate extends GRPO by combining verifier rewards with step-level supervision. The method uses a Process Reward Model (PRM) to assign fine-grained credit to tokens, avoiding gradient collapse when all trajectories receive identical rewards. On MATH with Qwen2.5-Instruct (1.5B/7B), VeriGate improves accuracy by ~20 and ~12 points, respectively.
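The collapse is easy to reproduce. The sketch below computes GRPO-style group-relative advantages and shows that they are all zero when every trajectory in a group gets the same reward. The second function is a hypothetical way to blend in PRM step scores so that some signal survives. Its name, the `weight` parameter, and the blending rule are assumptions for illustration, not VeriGate's actual formulation, which this digest does not specify.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward standardized within its sampled group."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def step_level_advantages(outcome_rewards, prm_step_scores, weight=0.5):
    """Hypothetical blend (an assumption, not the paper's rule).

    Adds group-centered PRM step scores to each trajectory's outcome advantage,
    so steps still receive distinct credit when outcome rewards are identical.
    prm_step_scores holds one list of per-step scores per trajectory.
    """
    outcome_adv = group_relative_advantages(outcome_rewards)
    baseline = fmean(s for traj in prm_step_scores for s in traj)
    return [
        [adv + weight * (score - baseline) for score in traj]
        for adv, traj in zip(outcome_adv, prm_step_scores)
    ]


# All three sampled answers are wrong: identical verifier reward of 0.
rewards = [0.0, 0.0, 0.0]
print(group_relative_advantages(rewards))  # every advantage is zero

# Made-up PRM scores for two reasoning steps per trajectory.
prm_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.1]]
print(step_level_advantages(rewards, prm_scores))
```

With identical outcome rewards, the first call gives the policy gradient nothing to work with, while the blended version still separates better steps from worse ones.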
A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.
LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.
A multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) examines linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥0.99 as early as layers 1–3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.
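For readers unfamiliar with probing, the following is a minimal sketch of one simple linear probe, a mean-difference direction, scored by AUC on held-out examples. The probe type used in the study is not stated in this digest, so this choice is an assumption. The activations are synthetic Gaussian vectors generated on the spot, standing in for real per-layer activations, and all function names are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference_probe(honest, deceptive):
    """Direction pointing from the honest class mean to the deceptive class mean."""
    mu_h, mu_d = mean_vector(honest), mean_vector(deceptive)
    return [d - h for h, d in zip(mu_h, mu_d)]


def probe_score(w, activation):
    """Project one activation vector onto the probe direction."""
    return sum(wi * xi for wi, xi in zip(w, activation))


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def synthetic_activations(rng, count, dim, shift):
    """Toy stand-in for one layer's activations: unit Gaussian noise plus a shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
train_honest = synthetic_activations(rng, 200, 16, 0.0)
train_deceptive = synthetic_activations(rng, 200, 16, 1.0)
test_honest = synthetic_activations(rng, 50, 16, 0.0)
test_deceptive = synthetic_activations(rng, 50, 16, 1.0)

w = fit_mean_difference_probe(train_honest, train_deceptive)
test_auc = auc(
    [probe_score(w, x) for x in test_deceptive],
    [probe_score(w, x) for x in test_honest],
)
print(f"held-out AUC: {test_auc:.3f}")
```

In a monitoring setup, the same projection would be applied to activations captured from an early layer, one probe per layer, so that per-layer AUC can be compared across depth.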