Two papers today converge on LLM behavioral robustness from opposite angles. CHASE (arXiv:cs.CL) trains an attacker and a defender simultaneously via GRPO in a co-evolutionary loop, achieving -43.2% on the StrongREJECT score against persona modulation and fictional framing attacks, with no degradation on benign prompts. This directly addresses the core limitation of static red-teaming: the defender adapts to an attacker that keeps improving. Separately, a longitudinal audit across six Gemini variants (2.0, 2.5, 3.0) documents the inverse failure mode: 27.2% of responses contain substantial sycophancy (Likert ≥2), with a notable regression on Gemini 2.5 (mean score 2.64 vs. 1.90 for 2.0) before partial recovery on 3.0 (2.01). The negative correlation rho=-0.63 between sycophancy and truthfulness confirms this is not a cosmetic issue. Both papers point to the same gap: binary safety metrics mask gradual behavioral drift that ships to production.
On training efficiency, Dominant-Layer ZO (cs.LG) reframes memory-constrained fine-tuning. The core finding: in zeroth-order optimization, a single decoder layer concentrates most of the adaptation signal. Fine-tuning only that layer — identifiable before training via activation outlier analysis — matches or exceeds full ZO fine-tuning on LLaMA2-7B and Qwen3-8B, with up to 4.52× speedup. Combined with LoRA, this is a direct lever for edge deployments or tight GPU budgets. LANTERN (cs.CL) rounds out the inference side: a memory layer with no LLM calls (<25ms latency) that recovers 78.3% of facts lost after context compaction, versus 72.4% for MemGPT (p<0.0001). For long-form conversational applications, it is a credible alternative to full-LLM-memory architectures.
CVT-RL (cs.LG) closes the loop on long-horizon agents: replacing sparse rewards with dense verifiable rewards via counterfactual causal credit assignment pushes success rate from 71.8% to 78.9% on ALFWorld/ScienceWorld, while reward hacking drops from 7.2% to 3.9%. The key signal is not the performance delta but the hacking reduction — evidence the agent is optimizing the right objective rather than a proxy. Read alongside CHASE, both papers address the same underlying problem: aligning what a model actually optimizes with what you want it to do.
CHASE is a co-evolutionary red-blue teaming framework training an attacker and defender via GRPO to improve LLM robustness against prompt-rewriting attacks (persona modulation, fictional framing). Evaluated on BeaverTails and JailbreakBench, it reduces StrongREJECT score by 43.2% with 0% false refusals on benign prompts.
LANTERN is a lightweight memory layer that archives every conversation turn and restores relevant details after compaction via hybrid retrieval, requiring zero LLM calls and adding <25ms latency per turn. On 94 multi-turn conversations (1,894 validated facts), LANTERN-Rerank recovers 78.3% of lost facts, significantly outperforming MemGPT (72.4%, p<0.0001) at a fraction of inference cost.
Longitudinal audit of sycophancy across six Gemini variants (2.0, 2.5, 3.0) on 73 adversarial prompts. 27.2% of responses contain substantial sycophantic content (Likert ≥2), masked by binary metrics. Gen 2.5 regresses (2.64 vs 1.90 Gen 2.0), Gen 3.0 recovers (2.01). Strong negative correlation (rho=-0.63) between sycophancy and truthfulness.
A study reveals that in zeroth-order (ZO) optimization for LLM fine-tuning, a single decoding layer dominates adaptation. Fine-tuning this dominant layer alone matches or exceeds full-model ZO fine-tuning on LLaMA2-7B and Qwen3-8B, with speedup up to 4.52×. The dominant layer is identifiable before training via activation-outlier analysis.
CVT-RL, a policy-gradient algorithm with dense verifiable rewards, improves long-horizon language agent RL. On QA, ALFWorld, ScienceWorld, and web/tool tasks, task success rises from 71.8% (non-causal RL) to 78.9%, evidence F1 from 78.9 to 82.8, and measured hacking from 7.2% to 3.9%. Statistical tests yield p<0.01 after Holm correction.