arXiv cs.LG

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

CSA (Conformal Selective Acting) is a deployment wrapper for RLVR-fine-tuned LLMs that guarantees per-round risk control without pooling across deployments. It maintains a Ville e-process per threshold and achieves the selective-risk bound R_T^act ≤ α + O(N_T^{-1/2}) with anytime, pathwise validity. Tested on 480 specialist streams and 10,300 Expert-Iteration rounds with LoRA.
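
The summary does not spell out how CSA builds its e-processes, so the following is only a generic sketch of the underlying idea: a betting-style test supermartingale combined with Ville's inequality. The function names, the fixed bet size `lam`, and the assumption that losses lie in [0, 1] are illustrative choices, not details from the paper.

```python
def e_process(losses, alpha, lam=None):
    """Running wealth of a betting-style test supermartingale.

    Null hypothesis H0: each loss (in [0, 1]) has conditional mean >= alpha.
    Under H0 every factor has expectation <= 1, so the wealth is a
    nonnegative supermartingale that starts at 1.
    """
    if lam is None:
        # Hypothetical default bet size; any lam in [0, 1 / (1 - alpha)]
        # keeps every factor nonnegative.
        lam = 0.5 / (1.0 - alpha)
    wealth, path = 1.0, []
    for x in losses:
        wealth *= 1.0 + lam * (alpha - x)
        path.append(wealth)
    return path


def first_certified_round(losses, alpha, delta):
    """First round t at which the wealth reaches 1 / delta, else None.

    By Ville's inequality, under H0 the wealth ever reaches 1 / delta with
    probability at most delta, so the check is valid at any stopping time.
    """
    for t, wealth in enumerate(e_process(losses, alpha), start=1):
        if wealth >= 1.0 / delta:
            return t
    return None
```

A stream of zero losses grows the wealth until the risk level is certified, while a stream of unit losses shrinks it and never triggers certification. Running one such process per candidate threshold would be one way to read "an e-process per threshold", though that reading is an assumption.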

Reinforcement learning · AI safety · Evals
SIG 78 · HYP 15

arXiv cs.LG

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel is a trajectory selection method for offline training of web agents. It balances importance against diversity across states, websites, and interaction patterns, and applies target-centered AXTree pruning. On WebArena, WorkArena, and MiniWoB, it improves out-of-domain generalization with 9.7–12.5× training speedups over standard fine-tuning on Qwen2.5-7B, Gemma3-4B, and Qwen3-8B.
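
The summary does not give Weasel's actual objective, so the sketch below shows only one generic way to trade importance against diversity: a greedy, maximal-marginal-relevance-style selection. The `(importance, embedding)` item layout, the cosine redundancy term, and the `trade_off` parameter are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_trajectories(items, k, trade_off=0.5):
    """Greedily pick k indices from items = [(importance, embedding), ...].

    Each step picks the item with the best weighted importance minus its
    maximum similarity to anything already chosen (illustrative scoring,
    not the paper's objective).
    """
    chosen = []
    remaining = list(range(len(items)))
    while remaining and len(chosen) < k:

        def gain(i):
            importance, emb = items[i]
            redundancy = max(
                (cosine(emb, items[j][1]) for j in chosen), default=0.0
            )
            return trade_off * importance - (1.0 - trade_off) * redundancy

        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Two near-duplicate high-importance items and one distinct, less important one.
demo = [(0.9, [1.0, 0.0]), (0.85, [1.0, 0.01]), (0.5, [0.0, 1.0])]
print(select_trajectories(demo, k=2))                 # [0, 2]
print(select_trajectories(demo, k=2, trade_off=1.0))  # [0, 1]
```

With `trade_off=0.5` the near-duplicate is skipped in favor of the distinct item; with `trade_off=1.0` selection reduces to ranking by importance alone.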

AI Agents · Fine-tuning · Benchmarks
SIG 78 · HYP 18

arXiv cs.CL

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

A two-stage pipeline for captioning cultural images in Indigenous languages: Qwen2.5-VL generates an intermediate Spanish caption, then Gemini 2.5 Flash produces the target-language caption via retrieval-augmented prompting. It achieves improvements over the baseline of 164.1% (Bribri), 131.7% (Guaraní), and 122.6% (Orizaba Nahuatl), and was the overall winner of the AmericasNLP 2026 shared task.

Vision · RAG · Gemini
SIG 78 · HYP 25

arXiv cs.AI

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Self-evolving skill libraries suffer a silent degradation termed 'library drift': skills accumulate without bound because there is no lifecycle management. The study isolates the mechanism via ablations, provides trace-level diagnostics, and validates a fix (outcome-driven retirement + bounded active-cap + meta-skill prior) that lifts pass@1 from a 0.258 baseline to 0.584 on MBPP+ hard-100.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 15

arXiv cs.CL

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Stepwise Confidence Attribution (SCA) diagnoses multi-step reasoning failures in closed-source LLMs by assigning step-level confidence from generated traces alone. Two methods: NIBS (non-parametric) and GIBS (graph-based). On mathematical reasoning and multi-hop QA, SCA reliably identifies error-prone steps and improves self-correction success by up to 13.5%.

Reasoning · Evals · Papers
SIG 78 · HYP 15

arXiv cs.LG

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Study of literary primitives in Llama 3.1 8B-Instruct and Gemma 2 9B-IT using sparse autoencoders. Four feature classes identified: naming-gates (affect tokens), self cluster (first-person register), stylistic modulators, compositional emotions. Llama achieves 27/27 emotion coverage (Cowen-Keltner taxonomy), Gemma 23/27. Validated via 5-LLM judge panel.

Llama · Gemini · Fine-tuning
SIG 78 · HYP 15

arXiv cs.LG

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Automated framework for generating fine-grained evaluation benchmarks for foundation models. Multi-agent pipeline with solution-graph-driven strategy improves ground-truth solution reliability. Three benchmarks generated (ML, Corporate Finance, Personal Finance) show lower error rates than MMLU/GSM8K. Evaluation of 12 models reveals performance differences missed by existing benchmarks.

Benchmarks · Evals · Multi-agent
SIG 78 · HYP 25

arXiv cs.LG

Efficient Conditioning: Why Pseudo Observation Batch Bayesian Optimization Works When It Does not

A theoretical study unifying batch selection methods in parallel Bayesian Optimization (Constant Liar, Kriging Believer, fantasy models). The authors identify efficient conditioning as the key surrogate property of Gaussian Processes and prove that it generates distinct points with separation of order l. Experimental validation covers Hartmann 6D, Ackley 8D, Levy 10D, and SVM hyperparameter tuning.

Benchmarks · Papers
SIG 78 · HYP 15