Archives

June 2026

516 articles

arXiv cs.AI

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

SCALE is a self-improving framework for web agents using MLLMs. It employs three adversarial roles (Selector, Predictor, Judger) to autonomously explore agent limitations and expand cognitive boundaries. SCALE-Hop optimizes global planning via graph exploration. A SCALE-20k dataset from 19 real websites with 20k structured demonstrations validates the approach across multiple MLLMs.

AI Agents · Vision · Reinforcement learning
SIG 72 · HYP 35

arXiv cs.AI

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

Researchers reframe healthcare mechanism design as program synthesis for LLMs. Medi-Sim, a multi-agent simulator, evaluates rule programs against strategic provider responses (coding, selection, delay, effort, triage). LLM-guided evolutionary code search synthesizes a mixed-objective program that eliminates up-coding, halves rejections, and retains baseline profitability.

AI Agents · Multi-agent · Code generation
SIG 72 · HYP 25

arXiv cs.AI

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL is an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge extraction. The system produces versioned packages with two coordinated tracks: capability (practices, mental models, decision heuristics) and bounded behavior (communication style, interaction rules). The project reports 18.5k GitHub stars and 215 skills from 165 contributors.

AI Agents · Prompt engineering · Open source
SIG 72 · HYP 25

arXiv cs.LG

Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

Unicorn, a multi-dataset pretraining framework, bridges the trade-off between channel-independent models (scalable but ignoring dependencies) and channel-dependent models (expressive but dimension-bounded). Using a latent prototype codebook, it projects heterogeneous channels into a shared space to learn identity-agnostic, reusable correlation patterns transferable across domains.

Papers · Benchmarks · Fine-tuning
SIG 72 · HYP 28

arXiv cs.LG

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.

Benchmarks · Evals · AI Agents
SIG 82 · HYP 18

arXiv cs.LG

Gait2Hip-60: A Unified Deep Learning Benchmark for Predicting Hip Muscle Forces and Joint Moments from Multi-Cadence Gait Kinematics

Unified Gait2Hip-60 benchmark comparing LSTM, Transformer, and Mamba to predict hip muscle forces and joint moments from gait kinematics. Transformer outperforms other models (R²=0.819 for forces, R²=0.862 for moments). External validation on 9 femoral head osteonecrosis patients shows moderate generalization (R²=0.537–0.569).

Benchmarks · Reasoning
SIG 72 · HYP 18

arXiv cs.AI

Gradient-Free Training of Spiking Neural Networks via Low-Rank Evolution Strategies

EGGROLL, a low-rank factorization of Evolution Strategies perturbations, reduces memory complexity from O(mn) to O(r(m+n)) for gradient-free training of Spiking Neural Networks. On N-MNIST, the method achieves 79.21% test accuracy with 2.23× speedup versus full-rank ES, enabling on-chip learning on neuromorphic hardware without surrogate gradients.
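The memory figures quoted for EGGROLL can be checked with simple arithmetic. The sketch below is not the paper's code: it only counts stored values per perturbation, assuming the full-rank perturbation is a dense m×n matrix and the low-rank one is held as two factors of shapes m×r and n×r. The function names and the example layer size are illustrative.

```python
def full_rank_entries(m: int, n: int) -> int:
    """Values stored for one dense m x n perturbation: O(mn)."""
    return m * n


def low_rank_entries(m: int, n: int, r: int) -> int:
    """Values stored for two factors, A (m x r) and B (n x r): O(r(m+n))."""
    return r * (m + n)


def memory_ratio(m: int, n: int, r: int) -> float:
    """How many times smaller the low-rank storage is than the dense one."""
    return full_rank_entries(m, n) / low_rank_entries(m, n, r)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    m, n, r = 1024, 1024, 4
    print(full_rank_entries(m, n))    # 1048576
    print(low_rank_entries(m, n, r))  # 8192
    print(memory_ratio(m, n, r))      # 128.0
```

The saving grows with layer size for a fixed rank, which is why the low-rank form suits memory-limited targets such as the neuromorphic hardware mentioned above.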

Papers · Benchmarks · Reinforcement learning
SIG 72 · HYP 15

arXiv cs.AI

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

arXiv study on iterative refinement of LLM-generated reward functions for sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven refinement guided by failure-mode taxonomy. Results: DoorKey-8x8 improves from 2.3% to 97.6%, KeyCorridor from 31.2% to 86.7%. Limitations: method restricted to PPO and sparse structured tasks.

Reinforcement learning · Llama · Prompt engineering
SIG 72 · HYP 18

arXiv cs.AI

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

CoSee, an auditing framework, analyzes failure modes of modular visual reasoning systems using shared working memory. On 4B–8B models, two dominant failure modes emerge: Noise Reinforcement (reusing ungrounded notes) and Policy Collapse (under-specified answers). The study shows naive shared workspaces amplify hallucinations without explicit verification.

Vision · AI Agents · Multi-agent
SIG 72 · HYP 18

arXiv cs.CL

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

MASA (Model-Aware Skill Alignment) adapts procedural skills for LLM agents to each model backbone without weight modification. A hierarchical evolution pipeline rewrites skills via hill climbing and UCB-driven tree search, then a lightweight rewriter trained on trajectories reproduces adaptation in a single forward pass. Gains up to 25.8 points across three interactive environments and four backbones.

AI Agents · Prompt engineering · Reasoning
SIG 78 · HYP 25

arXiv cs.LG

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

Untrained neural networks match early visual cortex better than trained networks. Study on 720 THINGS images and fMRI from 3 subjects shows one training epoch reduces V1 alignment by 25-90% depending on learning rule. Backpropagation degrades most (Δr = -0.080), while predictive coding and STDP preserve alignment better (Δr ~ -0.04).

Papers · Reasoning · Alignment
SIG 75 · HYP 15

arXiv cs.AI

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

Novel transformer-based architecture for autonomous resource management in heterogeneous satellite clusters (optical and SAR). Uses model-free reinforcement learning for real-time decision-making in Earth Observation missions. Demonstrates significant performance improvements and transferability across varying cluster sizes.

Multi-agent · Reinforcement learning · Reasoning
SIG 72 · HYP 15

arXiv cs.LG

Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

Theoretical paper positioning diffusion models as part of a family of learning techniques that withhold information and train models to recover it. Author argues destruction-based information withholding is more flexible than hand-crafted techniques, especially in data-scarce settings. Raises exploration challenges and proposes diffusion-native research directions.

Papers · Reasoning
SIG 45 · HYP 25

arXiv cs.CL

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

COFT is a training-free decoding method that reduces biases in LLM chain-of-thought generation. It uses masked counterfactual prompts and logit fusion to attenuate attribute-driven biases, with distribution-free marginal validity guarantees. Evaluation across 6 models: 30-55% bias reduction (median 38%) with negligible utility loss and ≤11% computational overhead.

Reasoning · AI safety · Alignment
SIG 78 · HYP 15

arXiv cs.AI

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

LLM-FACETS is an open-source framework for evaluating LLM factuality, epistemic calibration, and reproducibility. Features include a web interface, a plugin architecture, locally run deterministic metrics (BLEU, ROUGE, BERTScore), log-probability visualization, multi-judge consensus, and RAG Triad metrics. It is designed for technical experts, domain experts, and compliance officers working to the EU AI Act and NIST standards.

Evals · AI safety · Alignment
SIG 78 · HYP 15

arXiv cs.CL

Auditing LLM Benchmarks with Item Response Theory

Item Response Theory-based method detects mislabels in 7 LLM benchmarks at 95% precision on top 200 examples across 114 models. Analysis reveals errors from mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with detected mislabels at 78% accuracy versus 38% for peers.

Benchmarks · Evals · Papers
SIG 78 · HYP 15

arXiv cs.CL

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.

AI safety · Alignment · Papers
SIG 82 · HYP 25

arXiv cs.AI

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Study on harness self-evolution (prompts, skills, memories, tools) in LLM agents. Analyzes two capabilities: harness-updating (producing useful updates) and harness-benefit (benefiting from them). Findings: harness-updating is capability-agnostic (Qwen3.5-9B matches Claude Opus gains), while harness-benefit is non-monotonic (mid-tier models benefit most).

AI Agents · Prompt engineering · Benchmarks
SIG 75 · HYP 15

arXiv cs.LG

Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction

Scientific ML framework for turbine Remaining Useful Life (RUL) prediction. Shared encoder (CNN + bidirectional LSTM + attention pooling) with task-specific heads predicts turbine gas temperature, Delta TGT, and RUL with quantified uncertainty intervals. Evaluated on heterogeneous real-world fleet data using MAE, PICP, MPIW, and coverage-width criterion metrics.
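Two of the interval metrics named above have short standard definitions: PICP (prediction interval coverage probability) is the fraction of true values that fall inside their predicted intervals, and MPIW (mean prediction interval width) is the average interval width. The sketch below is a minimal illustration of those definitions, not the paper's evaluation code; the function names and the toy numbers are made up for the example.

```python
from typing import Sequence


def picp(y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> float:
    """Fraction of targets that lie inside [lower, upper] (inclusive)."""
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return covered / len(y_true)


def mpiw(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    y_true = [1.0, 2.0, 3.0, 4.0]
    lower = [0.5, 2.5, 2.0, 3.0]
    upper = [1.5, 3.5, 4.0, 5.0]
    print(picp(y_true, lower, upper))  # 0.75
    print(mpiw(lower, upper))          # 1.5
```

The two metrics pull against each other: wider intervals raise coverage, so they are usually read together, which is the role of a combined coverage-width criterion.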

Reasoning · Multi-agent · Benchmarks
SIG 72 · HYP 15

arXiv cs.LG

DisasterLex: An Expert Concept-to-Schema Knowledge Graph for Geospatial Reasoning in Disaster Analytics

DisasterLex is a knowledge-graph-mediated text-to-SQL framework for querying geospatial disaster-analytics databases. It uses an Expert Knowledge Graph (107 concepts, 117 causal edges) to route natural-language queries across 36 heterogeneous tables. On 75 test queries, it outperforms 4 baselines (LightRAG, HippoRAG 2, ReFoRCE, CHESS) by 1.4x to 2.75x.

RAG · Reasoning · Benchmarks
SIG 78 · HYP 15