Archives

May 2026

3148 articles

arXiv cs.CL·

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Study on how representation geometry organization in language models depends on scale. Subspace PGA metric tests alignment of intermediate geometry with unembedding matrix readout. Small models (≤1024) progressively lose organization at late layers during training, while large models (≥2048) preserve it throughout. Scale determines how geometry organizes for prediction.

Papers, Reasoning, Evals
SIG
78
HYP
15
arXiv cs.AI·

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

Attractor-Vascular Coupling Theory (AVCT): mathematical framework showing cardiac attractor geometry encodes blood pressure information. Calibrated LightGBM model on smartphone PPG achieves MAE 2.05 mmHg (SBP) and 1.67 mmHg (DBP) in strict leave-one-subject-out cross-validation (46 subjects, 29,684 windows), meeting AAMI/IEEE SP10 criteria. PPG-only ablation matches ECG+PPG within 0.05 mmHg.

Papers, Benchmarks, Evals
SIG
78
HYP
15
arXiv cs.AI·

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Asynchronous RL pipelines for LLM agents lose historical old logits required for PPO off-policy correction, entangling discrepancy repair with staleness correction. The paper proposes three acquisition strategies (snapshot, dedicated model, interruption) and a revised PPO-EWMA method to preserve decoupled correction semantics.

AI Agents, Reinforcement learning, Reasoning
SIG
72
HYP
15
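To illustrate why the old logits matter in the entry above, here is a minimal sketch of the standard PPO clipped surrogate term for a single action. It is the textbook objective, not the paper's revised PPO-EWMA method; the function name `ppo_clipped_term` and the example probabilities are assumptions made for illustration.

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, clip=0.2):
    """Standard PPO clipped surrogate for one action (textbook form).

    logp_old is the log-probability recorded when the action was sampled.
    If it is missing, the importance ratio below cannot be computed.
    """
    ratio = math.exp(logp_new - logp_old)              # importance ratio
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))  # keep ratio in [1-clip, 1+clip]
    return min(ratio * advantage, clipped * advantage)

# Hypothetical values: the action's probability doubled since sampling.
term = ppo_clipped_term(math.log(0.5), math.log(0.25), advantage=1.0)
```

With a positive advantage the term is capped by the clipped ratio, which is how PPO limits the size of an off-policy update.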
arXiv cs.CL·

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Large Reasoning Models generate traces aligned with human reaction times, but this alignment persists regardless of inference-time reasoning budget. Study across GPT-OSS-20B and GPT-OSS-120B: token allocation tracks human difficulty patterns and remains invariant across effort levels, suggesting cognitive cost alignment is crystallized at training time.

Reasoning, Benchmarks, Papers
SIG
72
HYP
15
arXiv cs.CL·

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX reveals that 4 of 6 major hallucination detection benchmarks embed the ground-truth answer in the prompt, allowing a naive baseline (TxTemb) to achieve near-perfect detection without access to model internals. Evaluation of 22 methods across 12 open-source models: most fail under controlled conditions, except SAPLMA and DRIFT (supervised probes on upper-layer hidden states).

Benchmarks, Evals, AI safety
SIG
82
HYP
15
arXiv cs.CL·

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

VLMs struggle with planning from complex visual inputs. This paper proposes Pattern Induction, an online inductive learning strategy that discovers and optimizes reusable visual patterns as composite experts. Pattern Inference enables VLMs to recognize these patterns and directly infer world model structures. Evaluated on FrozenLake, Crafter, and CubeBench.

Vision, Reasoning, Papers
SIG
45
HYP
35
arXiv cs.CL·

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

AgentRevive introduces a Markov state-aware framework for resilient multi-agent LLM system evolution. Instead of aggressively pruning failing agents, the method uses soft state transitions (Active/Standby/Terminated) with a hallucination risk estimator. Results: outperforms baselines on general reasoning, domain-specific tasks, and hallucination challenges while reducing token consumption.

Multi-agent, AI Agents, Reasoning
SIG
72
HYP
25
arXiv cs.CL·

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA is an adaptive multi-agent trajectory alignment framework for knowledge-intensive question answering. Six specialized agents collaboratively perform structured actions to improve factual consistency and reduce hallucinations. The system formalizes multi-agent collaboration as a trajectory preference alignment problem with intra-trajectory and inter-agent dependency learning.

AI Agents, Multi-agent, Reasoning
SIG
72
HYP
28
arXiv cs.CL·

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

BELIEF combines structured evidence modeling and uncertainty-aware fusion for biomedical question answering. The framework converts retrieved documents into evidence objects (clinical attributes, source quality, relevance, support strength) and fuses two reasoning paths: symbolic (Dempster-Shafer theory) and neural (LLM). The authors report state-of-the-art results on PubMedQA, MedQA, and MedMCQA across 5 LLM backbones.

RAG, Reasoning, Evals
SIG
78
HYP
15
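For readers unfamiliar with the Dempster-Shafer theory mentioned in the BELIEF entry above, the following is a minimal sketch of Dempster's rule of combination for two evidence sources. It shows the general rule only, not BELIEF's fusion pipeline; the mass values and the names `combine`, `doc_a`, and `doc_b` are hypothetical.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its belief mass.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

YES, NO = frozenset({"yes"}), frozenset({"no"})
EITHER = YES | NO  # mass left on the whole frame expresses uncertainty

# Hypothetical masses for two retrieved documents.
doc_a = {YES: 0.6, EITHER: 0.4}
doc_b = {YES: 0.5, NO: 0.3, EITHER: 0.2}
fused = combine(doc_a, doc_b)
```

Mass placed on `EITHER` lets a source stay uncommitted, which is the property that makes the rule attractive for fusing evidence of uneven quality.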
arXiv cs.CL·

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Systematic evaluation of synthetic clinical notes generated by LLMs at million-note scale from MIMIC databases. Study shows synthetic notes preserve core clinical information for coarse-grained tasks but lose fine-grained details for ICD coding. Chunk-based rephrasing mitigates detail loss but reduces factual precision under incomplete context.

Benchmarks, Evals, AI safety
SIG
78
HYP
15
arXiv cs.CL·

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.

Multi-agent, AI Agents, Prompt engineering
SIG
78
HYP
18
arXiv cs.CL·

Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

Formal verification framework for LLM pipelines using Lean 4 certificates. Three certificate families (conflict-aware bilattice, embedding sensitivity, Hoare-style agent action) plus two operators (Maximal Certifiable Residue, Compositional Stability) for high-stakes deployments (regulated finance, clinical support, agentic systems). Compiled artifact covers 22 certificate types, 17/46 declarations axiom-free.

Reasoning, AI safety, AI Agents
SIG
78
HYP
15
arXiv cs.CL·

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

Study on using LLMs to compose personalized and persuasive push notifications. Authors define 6 quality dimensions (contextual relevance, clarity, actionability, etc.) and demonstrate +8% to +14.5% CTR gains vs static templates. Proposes architectural framework with budget-aware routing, grounded generation, and online learning.

Prompt engineering, RAG, Business
SIG
72
HYP
28
arXiv cs.CL·

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a benchmark evaluating agents' memory management in long contexts (up to 1.8M tokens) with multi-target interference. 15.6k QA pairs across 4 domains (state tracking, dialogue, Wikipedia revisions, GitHub commits). 7 systems tested (long-context LLMs, RAG, agent frameworks) achieve 27.9% average accuracy, bottlenecked by retrieval and memory construction.

AI Agents, Benchmarks, RAG
SIG
78
HYP
15
arXiv cs.AI·

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow integrates quantum amplitude estimation (Grover) into stochastic policy optimization via a Fokker-Planck formulation. It claims a provable quadratic speedup: O(1/ε) queries versus O(1/ε²) classically. On a continuous multimodal task, it outperforms SAC (1295.7 vs 1284.0 reward) and finds the global optimum 10.4% more often in relative terms (33.9% vs 30.7% of runs).

Reinforcement learning, Reasoning, Benchmarks
SIG
72
HYP
25
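The figures in the QuantFPFlow entry above can be checked with a few lines of arithmetic. The sketch below recomputes the relative gain from the two reported success rates and compares the two query scalings with constants ignored; the helper names and the choice of ε = 0.01 are assumptions for illustration.

```python
import math

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

def query_scaling(eps):
    """Leading-order query counts for target error eps, constants ignored."""
    return {"classical": 1.0 / eps**2, "quantum": 1.0 / eps}

# Reported global-optimum rates: 33.9% vs 30.7% of runs.
gain_pct = round(relative_gain(33.9, 30.7) * 100, 1)

# Hypothetical target error of 0.01.
scaling = query_scaling(0.01)
ratio = scaling["classical"] / scaling["quantum"]
```

The gap between the two success rates is 3.2 percentage points; the 10.4% figure is that gap relative to the 30.7% baseline. The ratio of the two scalings is 1/ε, so the advantage grows as the target error shrinks.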
arXiv cs.AI·

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

STRIDE is a self-reflective agent framework for LLM-based symbolic equation discovery. It improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving semantic memory. Experiments on symbolic regression benchmarks show gains in accuracy, OOD robustness, and structural recovery across multiple LLM backbones.

AI Agents, Reasoning, Benchmarks
SIG
72
HYP
25
arXiv cs.AI·

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

The study reveals a dataset visibility asymmetry in multilingual NLP: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets in the LRE Map and LDC. Using LLM-assisted citation mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, of which 356 are publicly accessible. Data scarcity is therefore partly a documentation and discoverability problem, not only a production problem.

Benchmarks, Open source, Papers
SIG
78
HYP
15
arXiv cs.AI·

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

Closed-loop intelligent tutoring system using XGBoost to assess oral presentation skills via multimodal analysis (facial, vocal, textual, oculomotor). Trained on 10,360 MOOC videos, generates feedback aligned to 7-dimensional BARS scale. Study with 204 learners over 30 days: significant improvements (Cohen's d = 0.39-0.90), strong correlation between practice frequency and performance.

Evals, Vision, Voice
SIG
72
HYP
18
arXiv cs.AI·

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

Comparative study of 6 EEG foundation models across 8 datasets beyond clean accuracy. Robustness analysis (noise, channel dropout), interpretability via Attention-Aware Layer-Wise Relevance Propagation, and expressiveness through block-wise probing. Findings: no single model dominates all failure modes; models focus on task-appropriate brain regions but decode corrupted content poorly.

Benchmarks, Evals, AI safety
SIG
75
HYP
15