May 2026

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

DISA is an offline RL method for LLMs that decouples partition-function estimation (via importance sampling) from policy optimization. On 9 benchmarks (math and code), it matches or exceeds FlowRL, outperforms GRPO/GSPO, and retains substantially more strategy-level diversity than reward-maximization baselines.

Reinforcement learning Reasoning Code generation

arXiv cs.CL·May 19

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

FishBack proposes activation steering for transformers based on pullback Fisher geometry. The authors show that activation space is non-Euclidean (>97% deviation on GPT-2) and derive a closed-form optimal steering equation. The method outperforms CAA, ActAdd, and ITI by 1.3×–2.5× on off-target KL reduction.

Reasoning Papers Benchmarks

arXiv cs.CL·May 19

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

The ChemVA framework advances LLM understanding of chemical reaction diagrams through a Visual Anchor mechanism for functional group detection and a semantic alignment step that translates visual features into entity names. It achieves 92.0% structural recognition accuracy on the OCRD-Bench dataset and a performance gain of roughly 20 percentage points across 9 diverse LLMs.

Vision Reasoning Benchmarks

arXiv cs.CL·May 19

Responsible Agentic AI Requires Explicit Provenance

An arXiv paper argues that responsible agentic AI requires explicit, traceable provenance across the full lifecycle. The authors formalize this through a causal attribution function and a responsibility tensor, and demonstrate that provenance can be estimated and intervened on online, before irreversible harm accumulates.

AI Agents AI safety Alignment

arXiv cs.CL·May 19

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED introduces density-weighted residual alignment for distilling vision-language models (e.g., Qwen3-VL-8B) into hybrid Mamba-2/attention architectures. The method targets high-density patches (text, fine details), which experience 3.6× larger residual drift. Results: +8.7 points on OCRBench v2, +5.13 points on average, 4.12× throughput, and 68% memory savings.

Vision Fine-tuning Benchmarks

arXiv cs.CL·May 19

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Study on how representation geometry organization in language models depends on scale. Subspace PGA metric tests alignment of intermediate geometry with unembedding matrix readout. Small models (≤1024) progressively lose organization at late layers during training, while large models (≥2048) preserve it throughout. Scale determines how geometry organizes for prediction.

Papers Reasoning Evals

arXiv cs.CL·May 19

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D²Evo is an RL framework to enhance LLM reasoning. It addresses scarcity of medium-difficulty samples by mining anchors matched to model capability and training a Questioner to generate diverse questions at appropriate difficulty. Results: outperforms existing methods on math benchmarks with <2K real samples.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

Attractor-Vascular Coupling Theory (AVCT): mathematical framework showing cardiac attractor geometry encodes blood pressure information. Calibrated LightGBM model on smartphone PPG achieves MAE 2.05 mmHg (SBP) and 1.67 mmHg (DBP) in strict leave-one-subject-out cross-validation (46 subjects, 29,684 windows), meeting AAMI/IEEE SP10 criteria. PPG-only ablation matches ECG+PPG within 0.05 mmHg.

Papers Benchmarks Evals

arXiv cs.CL·May 19

The IsalProgram Programming Language

IsalProgram is a regular assembly-like language in which every finite string is a valid program. Executed on a virtual machine with a circular doubly linked list and three data pointers, it eliminates memory addresses and variable names. It is proposed as a target for neural program synthesis.

Code generation Papers

arXiv cs.CL·May 19

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe is a framework for risk assessment in autonomous driving scenarios. It generates spatially grounded captions enriched with motion and depth cues, then fine-tunes a lightweight adapter to identify hazardous objects and suggest safety actions. Achieves SOTA on DRAMA benchmark.

Vision Reasoning AI safety

arXiv cs.AI·May 19

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

SkyPart, a lightweight head for vision transformers, improves cross-view drone-to-satellite geo-localization by explicitly separating layout and texture via learnable prototypes. At 26.95M parameters, it achieves state-of-the-art on SUES-200, University-1652, and DenseUAV with enhanced robustness under weather corruptions.

Vision Benchmarks Papers

arXiv cs.AI·May 19

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Asynchronous RL pipelines for LLM agents lose historical old logits required for PPO off-policy correction, entangling discrepancy repair with staleness correction. The paper proposes three acquisition strategies (snapshot, dedicated model, interruption) and a revised PPO-EWMA method to preserve decoupled correction semantics.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus unifies autoregressive LLM fidelity with parallel diffusion token generation via a dual-architecture framework. A lightweight trainable module augments a frozen Transformer to enable parallel generation while maintaining exact autoregressive quality. Achieves up to 7.8x speedup with O(1) memory overhead.

Reasoning Code generation Infrastructure

arXiv cs.AI·May 19

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

LLM-based multi-agent pipelines flip to incorrect answers under simulated peer disagreement, a behavior referred to here as "yield". Contrary to common attribution, RLHF is not responsible: pretrained base models exhibit the same substitution pattern. Activation patching localizes the corruption to a narrow mid-layer window. A single correctly-arguing dissenter reduces yield by 54-73 percentage points.

Multi-agent Alignment Reasoning

arXiv cs.AI·May 19

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

Complementary-label learning (CLL) methods remain limited to 10-class classification. This paper proposes BICL, a framework that deliberately uses biased (non-uniform) transition matrices to restrict complementary labels to class subsets. On CIFAR-100 and TinyImageNet-200, BICL achieves 7× accuracy improvements over traditional methods.

Papers Benchmarks Evals

arXiv cs.AI·May 19

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

Study showing that two protocols for evaluating layer redundancy in transformers (replacement and interchange) yield divergent results for identifying layers to prune. On Pythia, Qwen3-8B, and Llama-3.1-8B, the protocol gap dramatically changes which layers appear safe to remove, even under the same KL evaluator.

Papers Benchmarks Reasoning

arXiv cs.CL·May 19

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

Computational tool to classify manner and result verbs at scale. Uses linguistically informed prompts with LLMs to generate annotations over MASC and InterCorp data (436 VerbNet classes). RoBERTa-based classifier achieves 89.6% accuracy on three held-out gold-standard datasets. Applicable to developmental research on verb semantics.

Papers Benchmarks Fine-tuning

arXiv cs.CL·May 19

Language Acquisition Device in Large Language Models

Researchers propose LAD-inspired pre-pretraining (PPT) on MP-STRUCT, a formal language encoding hierarchical composition and long-distance displacement. After 500 steps, this approach matches formal-language baselines in token efficiency while giving LLMs human-like resistance to structurally implausible languages.

Papers Reasoning Fine-tuning

arXiv cs.CL·May 19

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena is an open-source benchmark for evaluating AI agents on GPU kernel optimization. It contains 196 tasks (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) and tests generalization on unseen configurations. Tested agents (Cursor Agent, Claude Code, Codex) achieve speedups up to 6.89x, but show generalization weaknesses on PyTorch-to-HIP.

AI Agents Code generation Benchmarks

arXiv cs.CL·May 19

Constrained Code Generation with Discrete Diffusion

Discrete diffusion models generate code through iterative refinement. CDC (Constrained Diffusion for Code) integrates constraints directly into the denoising process without additional training, combining mathematical optimization and program analysis to improve functional correctness, security, and syntax.

Code generation Reasoning AI safety

arXiv cs.CL·May 19

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

RTI-Bench is a structured dataset of 1,516 Indian Central Information Commission (CIC) decisions with outcome labels, exemption citations, and IRAC-style reasoning components. Mistral 7B achieves 57.3% accuracy on outcome prediction (baseline 14.3%). First publicly released structured dataset for Indian RTI administrative decision analysis.

Benchmarks Papers

arXiv cs.CL·May 19

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD is an external-teacher-free fine-tuning method that injects knowledge by dynamically mixing tokens from two model conditionals: an expert branch observing the injected fact, and a naive branch reflecting original priors. On QA and knowledge-editing benchmarks, MixSD retains up to 100% of base model capabilities versus 1% for standard SFT.

Fine-tuning Reasoning Papers

arXiv cs.CL·May 19

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

JSPG introduces a dynamic filtering framework for Chinese contextual ASR combining semantic, pinyin, and glyph features. The approach uses an extended Smith-Waterman algorithm to score N-best hypothesis sequences against keywords. Experiments on Aishell-1 and RWCS-NER datasets show significant improvements in keyword recognition accuracy.
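
For readers unfamiliar with Smith-Waterman, the sketch below shows the classic local-alignment scoring recurrence applied to two sequences of pinyin syllables. It is not the extended variant described in the paper: the function name `smith_waterman_score`, the scoring values, and the example sequences are all illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Classic Smith-Waterman with a linear gap penalty. The scoring values
    are placeholders chosen for illustration, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


# Hypothetical example: one ASR hypothesis scored against one keyword.
hypothesis = ["wo", "zai", "bei", "jing", "da", "xue"]
keyword = ["bei", "jing", "da", "xue"]
print(smith_waterman_score(hypothesis, keyword))
```

A keyword that appears contiguously in the hypothesis gets the maximum score for its length, while a near miss (one substituted syllable) still scores above an unrelated keyword, which is the property that makes local alignment a natural fit for filtering a keyword dictionary.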

Voice Benchmarks Papers

arXiv cs.CL·May 19

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Large Reasoning Models generate traces aligned with human reaction times, but this alignment persists regardless of inference-time reasoning budget. Study across GPT-OSS-20B and GPT-OSS-120B: token allocation tracks human difficulty patterns and remains invariant across effort levels, suggesting cognitive cost alignment is crystallized at training time.

Reasoning Benchmarks Papers

arXiv cs.CL·May 19

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

Multilingual coreference resolution system based on Gemma-3-27b with two-stage adaptation (multilingual base adapters then dataset-specific adapters). CoNLL F1 score of 74.32 on CRAC 2026 test set, ranked first in LLM track. Mention spans represented by headword using XML-inspired format with local reindexing.

Fine-tuning Benchmarks Papers

arXiv cs.CL·May 19

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

SkillTTA synthesizes task-specific textual skills by retrieving relevant training trajectories, with no parameter updates to the solver model. Evaluated on SpreadsheetBench, ALFWorld, and BigCodeBench: SpreadsheetBench improves from 0.397 to 0.505 Pass@1, BigCodeBench from 0.517 to 0.651.

AI Agents Prompt engineering Reasoning

arXiv cs.CL·May 19

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX reveals that 4 of 6 major hallucination detection benchmarks embed the ground-truth answer in the prompt, allowing a naive baseline (TxTemb) to achieve near-perfect detection without access to model internals. Evaluation of 22 methods across 12 open-source models: most fail under controlled conditions, except SAPLMA and DRIFT (supervised probes on upper-layer hidden states).

Benchmarks Evals AI safety

arXiv cs.CL·May 19

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule is a multimodal multilingual benchmark for moderating pluralistic communities on social media. It covers 13,371 rule violations across 1,989 Reddit communities (9 languages, 2,885 rules). State-of-the-art vision-language models, including GPT-4.5 with advanced reasoning, only marginally outperform a trivial baseline.

Benchmarks Vision AI safety

arXiv cs.CL·May 19

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG is a multi-agent framework for retrieval-augmented generation applied to medical reasoning. It decouples the process into three specialist agents: clinical interpretation, iterative document exploration, and evidence adjudication. Tested on 5 benchmarks and 5 LLM backbones, it improves baselines by +6.46 accuracy points on average.

Multi-agent RAG Reasoning

arXiv cs.CL·May 19

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

Tutorial on multilingual multimodal LLMs for low-resource languages. Covers recent models (PALO, Maya), speech-text-vision pipelines, low-cost data creation, tri-modal alignment via adapters, and culture-aware evaluation beyond English.

Vision Voice

arXiv cs.CL·May 19

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

VLMs struggle with planning from complex visual inputs. This paper proposes Pattern Induction, an online inductive learning strategy that discovers and optimizes reusable visual patterns as composite experts. Pattern Inference enables VLMs to recognize these patterns and directly infer world model structures. Evaluated on FrozenLake, Crafter, and CubeBench.

Vision Reasoning Papers

arXiv cs.CL·May 19

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

arXiv study demonstrating that 9 frontier LLMs inherit stigmatizing language bias from clinical notes. Models reduce treatment aggressiveness when exposed to a single stigmatizing sentence. Chain-of-Thought and self-debiasing show limited mitigation efficacy.

AI safety Alignment Evals

arXiv cs.CL·May 19

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

FIM-LoRA optimizes rank allocation in LoRA by using 8 calibration passes to estimate gradient variance per layer. This parameter-free approach matches standard LoRA performance (88.6 vs 88.7 on GLUE with DeBERTa-v3-base) while reducing memory costs by 256x compared to full Fisher estimation.

Fine-tuning Papers Benchmarks

arXiv cs.CL·May 19

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B), mismatched to the current problem, into the GRPO context of a stronger learner (Mathstral-7B) outperforms standard on-policy GRPO. On MATH-500, the mismatched-wrong variant reaches 71.98% (highest published result for this model), +1.62pp over the matched-wrong variant, without SFT or reward models.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 19

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

AgentRevive introduces a Markov state-aware framework for resilient multi-agent LLM system evolution. Instead of aggressively pruning failing agents, the method uses soft state transitions (Active/Standby/Terminated) with a hallucination risk estimator. Results: outperforms baselines on general reasoning, domain-specific tasks, and hallucination challenges while reducing token consumption.

Multi-agent AI Agents Reasoning

arXiv cs.CL·May 19

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA is an adaptive multi-agent trajectory alignment framework for knowledge-intensive question answering. Six specialized agents collaboratively perform structured actions to improve factual consistency and reduce hallucinations. The system formalizes multi-agent collaboration as a trajectory preference alignment problem with intra-trajectory and inter-agent dependency learning.

AI Agents Multi-agent Reasoning

arXiv cs.CL·May 19

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Parameter-efficient vocabulary adaptation method to improve LLM tokenization on specialized domains (legal, medical). Tested on Llama-3.1-8B and Qwen2.5-7B: reduces training time by 35-55% vs continual pretraining, decreases parameters by 37% vs expansion-only, improves summary quality through domain-specific tokens.

Fine-tuning Llama Qwen

arXiv cs.CL·May 19

MiniGPT: Rebuilding GPT from First Principles

MiniGPT is a compact from-scratch PyTorch implementation of GPT-style autoregressive language modeling in a single notebook. The 10.77M-parameter model achieves validation loss of 1.4780 on Tiny Shakespeare with character-level tokenization and generates text with recognizable Shakespeare-style dialogue structure.

GPT Code generation Papers

arXiv cs.CL·May 19

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

BELIEF combines structured evidence modeling and uncertainty-aware fusion for biomedical question answering. The framework converts retrieved documents into evidence objects (clinical attributes, source quality, relevance, support strength) and fuses two reasoning paths: symbolic (Dempster-Shafer theory) and neural (LLM). SOTA results on PubMedQA, MedQA, MedMCQA across 5 LLM backbones.

RAG Reasoning Evals

arXiv cs.CL·May 19

The Unlearnability Phenomenon in RLVR for Language Models

Study reveals an 'unlearnability' phenomenon in Reinforcement Learning with Verifiable Reward (RLVR) for LLMs. Some hard examples remain unlearnable even with correct rollouts. Cross-example gradient analysis shows fundamental representation flaws: low gradient similarity and ungeneralizable reasoning patterns. Data augmentation fails to improve gradient similarity.

Reinforcement learning Reasoning Papers

arXiv cs.CL·May 19

Mixture of Experts for Low-Resource LLMs

Analysis of routing dynamics in two MoE architectures (Qwen3-30B-A3B and Nemotron-3-Nano-30B-A3B) reveals deep-layer routing collapse for underrepresented languages (Hebrew, Japanese). Continual pre-training on balanced bilingual data corrects this imbalance better than supervised fine-tuning alone.

Benchmarks Fine-tuning

arXiv cs.CL·May 19

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

Paper introduces Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) for personalized language systems. CBEA+LCV achieves zero failures at 0.49-0.60 availability versus 0.003-0.092 for baselines, with 74-75% median input payload reduction.

Reasoning RAG Evals

arXiv cs.CL·May 19

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

Comparative study of 10 annotation workflows for conversational speech summarization. Audio-based summaries are less informative than transcript-based ones, but iterative peer-editing with audio mitigates this gap. Validates this approach for creating benchmarks incorporating lexical and prosodic information.

Benchmarks Voice Evals

arXiv cs.CL·May 19

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

Benchmark of LLMs on multi-label legal precedent treatment classification. Expert-annotated dataset of 239 real-world citations. Gemini 2.5 Flash achieves 79.1% on high-level classification, GPT-5-mini 67.7% on fine-grained schema. Novel Average Severity Error metric to measure practical impact of misclassifications.

Benchmarks Gemini GPT

arXiv cs.CL·May 19

From Documents to Segments: A Contextual Reformulation for Topic Assignment

Novel topic modeling approach (SBTA) assigns topics to text segments rather than entire documents, reducing topic contamination. Authors construct SemEval-STM, a dataset annotated via LLM + human refinement, and validate improved clustering quality and interpretability across multiple models.

Papers Benchmarks RAG

arXiv cs.CL·May 19

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

Researchers demonstrate that small models (Gemma 4 E4B, Qwen3-4B) fine-tuned with 8-bit QLoRA internalize tool knowledge without requiring tool schemas in prompts. On AssetOpsBench, fine-tuned models outperform non-fine-tuned baselines: input length drops by 82.6%, AT-F1 rises to 0.65 from 0.47, and Qwen3 runs 2.5× faster.

Fine-tuning AI Agents Qwen

arXiv cs.CL·May 19

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings, Except In Heavy Truncation Scenarios

An arXiv study compares Matryoshka Representation Learning (MRL) with simple embedding truncation. Results show that non-MRL embeddings remain robust up to 80% dimensionality reduction. MRL provides an advantage only under heavy truncation (>80%), which calls into question whether its training cost is justified in every case.
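
As a rough illustration of what "simple truncation" can look like in practice, the sketch below keeps the leading dimensions of an embedding and rescales the result to unit length before comparing vectors. The renormalization step, the helper names, and the toy vectors are assumptions made for this example; the summary does not specify the study's exact procedure.

```python
import math


def truncate_and_normalize(vec, k):
    """Keep the first k dimensions of vec and rescale to unit L2 norm."""
    head = list(vec[:k])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy 4-dimensional "embeddings" (made up for illustration).
query = [0.6, 0.8, 0.0, 0.0]
doc = [0.6, 0.8, 0.1, -0.1]

full_similarity = cosine(query, doc)
truncated_similarity = cosine(
    truncate_and_normalize(query, 2),
    truncate_and_normalize(doc, 2),
)
print(round(full_similarity, 3), round(truncated_similarity, 3))
```

Comparing retrieval rankings computed from `full_similarity` against those from `truncated_similarity` at several values of `k` is one simple way to check how much truncation a given embedding model tolerates.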

Embeddings Papers Benchmarks

arXiv cs.CL·May 19

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Systematic evaluation of synthetic clinical notes generated by LLMs at million-note scale from MIMIC databases. Study shows synthetic notes preserve core clinical information for coarse-grained tasks but lose fine-grained details for ICD coding. Chunk-based rephrasing mitigates detail loss but reduces factual precision under incomplete context.

Benchmarks Evals AI safety

arXiv cs.CL·May 19

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

SynPro, a synthetic data generation framework, helps LLMs learn more thoroughly from limited organic corpora via rephrasing and reformatting. Optimized with RL, it unlocks 3.7-5.2x more effective tokens than simple repetition on 400M and 1.1B models, even surpassing the non-data-bound oracle at 1.1B scale. Code open-sourced.

Reinforcement learning Benchmarks Open source

arXiv cs.CL·May 19

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

Pilot benchmark for translating natural language to First-Order Logic (FOL) in planetary exploration. Dataset built from NASA mission documentation (2003-2013), manually annotated with FOL representations capturing temporal structure, agent roles, and operational dependencies. Structured predicate vocabularies provided.

Reasoning Benchmarks Robotics

arXiv cs.CL·May 19

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

AutoVecCoder teaches LLMs to generate explicitly vectorized SIMD code. The framework combines VecPrompt (data synthesis to inject knowledge of intrinsics) and VecRL (reinforcement learning aligned with execution efficiency). AutoVecCoder-8B achieves state-of-the-art results on the SSE and AVX subsets of SimdBench and sometimes surpasses -O3 optimizations.

Code generation Reinforcement learning Benchmarks

arXiv cs.CL·May 19

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.

Multi-agent AI Agents Prompt engineering

arXiv cs.CL·May 19

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

Study on 'alignment drift': gradual process where LLM outputs become less constrained by user's current message and more shaped by interaction history, while remaining helpful. Mechanism-oriented framework distinguishes signal A/B, feedback loops, and interactive regimes to control this cumulative drift.

Alignment AI Agents AI safety

arXiv cs.CL·May 19

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

MINE (Mechanistically Interpretable Neural Encoding) applies mechanistic interpretability tools to neural encoding models to localize visual features driving voxel-level activity in human visual cortex. Validated via image generation and counterfactual editing: inserting/removing predicted features shifts neural activation as expected.

Vision Papers

arXiv cs.CL·May 19

Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

Formal verification framework for LLM pipelines using Lean 4 certificates. Three certificate families (conflict-aware bilattice, embedding sensitivity, Hoare-style agent action) plus two operators (Maximal Certifiable Residue, Compositional Stability) for high-stakes deployments (regulated finance, clinical support, agentic systems). Compiled artifact covers 22 certificate types, 17/46 declarations axiom-free.

Reasoning AI safety AI Agents

arXiv cs.CL·May 19

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

An arXiv paper proposing a formal framework for combining LLM and human evaluations. It uses a doubly robust estimator from the missing-data literature to determine the optimal number of human reviews needed, shifting the LLM's role from substitutive to auxiliary in a two-stage sampling design.
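
To make the idea concrete, here is a minimal sketch of a doubly robust (augmented inverse-probability-weighted) estimate of the mean quality score, in which the LLM judge scores every item and humans review a random subset. The paper's exact estimator and sampling design are not given in this summary, so the function name `doubly_robust_mean`, the constant sampling probability, and the data are assumptions for illustration only.

```python
def doubly_robust_mean(llm_scores, human_scores, sample_prob):
    """Estimate the mean human score from LLM scores plus a reviewed subset.

    llm_scores:   LLM judge scores for all N items (stage one).
    human_scores: dict mapping item index -> human score for the items
                  that were selected for human review (stage two).
    sample_prob:  probability that any item was selected for review,
                  assumed known and identical for all items.
    """
    n = len(llm_scores)
    # Term 1: the LLM judge's average over every item.
    model_term = sum(llm_scores) / n
    # Term 2: inverse-probability-weighted correction from human residuals.
    correction = sum(
        (human - llm_scores[i]) / sample_prob
        for i, human in human_scores.items()
    ) / n
    return model_term + correction


# Hypothetical data: 6 items scored by the LLM, 3 of them reviewed by humans.
llm = [0.9, 0.8, 0.4, 0.7, 0.6, 0.5]
human = {0: 1.0, 3: 0.5, 4: 0.6}
estimate = doubly_robust_mean(llm, human, sample_prob=0.5)
print(round(estimate, 3))
```

The human reviews only correct the LLM judge's systematic error, so the better the judge tracks human ratings, the smaller the residuals and the fewer human reviews are needed for a given precision.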

Evals Benchmarks AI safety

arXiv cs.CL·May 19

Linguistic Uncertainty and Reply Engagement on X: A Cross-Domain Replication of the Uncertainty-Reply Asymmetry

Study of 2,258 English-language posts (April 2026) shows uncertain posts receive 82% more replies than certain posts. Regression confirms positive association (β=0.126, p=0.011), ~13% higher reply engagement. Replicates asymmetry observed in Arabic, suggesting universal interactional mechanism across languages.

Papers Evals

arXiv cs.CL·May 19

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

Study on using LLMs to compose personalized and persuasive push notifications. Authors define 6 quality dimensions (contextual relevance, clarity, actionability, etc.) and demonstrate +8% to +14.5% CTR gains vs static templates. Proposes architectural framework with budget-aware routing, grounded generation, and online learning.

Prompt engineering RAG Business

arXiv cs.CL·May 19

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Study of 38 models on 8,900 scholarly references: factual recall quality follows a sigmoid combining model size and topic frequency in training data. These two variables explain 60-94% of variance. The authors propose that recall is gated by a signal-to-noise ratio that scales with concept frequency and model capacity.

Benchmarks Papers Reasoning

arXiv cs.CL·May 19

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory automates creation of executable environments and synthesis of multi-turn trajectories for Agentic RL training. Using 85 verified environments across 7 domains, the framework generates 2,575 SFT/RL trajectories and improves Qwen3-series models by +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks.

AI Agents Reinforcement learning Code generation

arXiv cs.CL·May 19

Language-Switching Triggers Take a Latent Detour Through Language Models

Circuit analysis of a backdoor in an 8B model: a 3-word Latin trigger redirects English output to French. The circuit operates in 3 phases via attention heads, propagates through a subspace orthogonal to natural language-identity directions, then converts via MLP. A single serial bottleneck position controls the entire flow.

AI safety Alignment Papers

arXiv cs.CL·May 19

MA²P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

MA²P is a multi-agent autonomous framework for complex persuasion. It coordinates perception management, mental-state inference, strategy execution, and performance evaluation. A meta-cognitive configurator selects domain-appropriate meta-strategies from a knowledge base to improve generalization and persuasion success rates.

AI Agents Multi-agent Reasoning

arXiv cs.CL·May 19

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

Data-driven approach to integrate constructs and their relations in information systems. Combines task-adapted text embeddings and clustering to group constructs from structural equation models. Optimizes trade-off between semantic purity and parsimony through explicit loss function.

Embeddings Papers

arXiv cs.CL·May 19

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a benchmark evaluating agents' memory management in long contexts (up to 1.8M tokens) with multi-target interference. 15.6k QA pairs across 4 domains (state tracking, dialogue, Wikipedia revisions, GitHub commits). 7 systems tested (long-context LLMs, RAG, agent frameworks) achieve 27.9% average accuracy, bottlenecked by retrieval and memory construction.

AI Agents Benchmarks RAG

arXiv cs.CL·May 19

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Investigation of the internal representations of large reasoning models (LRMs) through probe trajectories. The authors show that the continuous evolution of concept probability during reasoning predicts final behavior better than static predictions. Max-pooling achieves 95% AUROC across 4 datasets (safety, mathematics).

Reasoning AI safety Evals

arXiv cs.CL·May 19

Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

DiSP, a demonstration selection framework for in-context learning, predicts whether a query-context pair will succeed rather than searching for optimal context. On 5 classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP improves accuracy by 3.4% and achieves 23× end-to-end speedup.

Prompt engineering Reasoning Benchmarks

arXiv cs.CL·May 19

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

New IH-GRPO algorithm decouples tool invocation from execution to enhance LLM mathematical reasoning. Achieves 1.87–2.53% improvements on mathematical benchmarks with Qwen3 (1.7B–8B). Code released.

Reasoning AI Agents Reinforcement learning

arXiv cs.CL·May 19

Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research

Preregistered study comparing Vector RAG and LLM-compiled markdown wiki on 13 questions over 24 papers. Wiki excels at cross-paper synthesis and claim-level citation accuracy, but uses more query tokens. A decomposition-based RAG variant recovers most wiki advantages at lower LLM-token cost.

RAG Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 19

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

EvoMemBench is a unified benchmark evaluating LLM agent memory along two axes: scope (in-episode vs cross-episode) and content (knowledge vs execution-oriented). Comparison of 15 memory methods: long-context baselines remain highly competitive, retrieval-based methods dominate knowledge-intensive tasks, procedural methods excel at execution-oriented tasks.

AI Agents Benchmarks RAG

SIG

HYP

arXiv cs.CL·May 19

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

Comparative study of human judgments and the predictions of 4 LLMs on presupposition projection in conditionals. 120 participants were evaluated in parallel with the models. Humans integrate probabilistic and pragmatic cues; LLMs show variable alignment. Models matching human ratings lack coherent pragmatic reasoning.

Benchmarks Reasoning Papers

SIG

HYP

arXiv cs.CL·May 19

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Infini-News indexes 1.35B CC-News articles (August 2016–present) with metadata extraction, language detection (GlotLID, lingua, CommonLingua), and geographic attribution (83.4% coverage). Infini-gram suffix-array indexes enable sub-second full-text pattern search across the entire archive.

RAG Vector search Benchmarks

SIG

HYP

arXiv cs.CL·May 19

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

EPIC is an on-device RAG index construction method optimized for personal AI agents. It reduces indexing memory by 2,404× by focusing on user preferences, improves preference-following accuracy by 20.17 percentage points, and decreases retrieval latency by 33.33×. Memory footprint under 1 MB with 29.35 ms/query latency.

RAG AI Agents Embeddings

SIG

HYP

arXiv cs.CL·May 19

Machine Unlearning for Masked Diffusion Language Models

First machine unlearning framework for masked diffusion language models (LLaDA, Dream). MDU minimizes KL divergence from prompt-conditional to prompt-masked unconditional distribution at each masked position, with temperature scaling for privacy-utility trade-off. Code released.

Papers AI safety Fine-tuning

SIG

HYP

arXiv cs.AI·May 19

QuantFPFlow: Quantum Amplitude Estimation for Fokker–Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow integrates quantum amplitude estimation (Grover) into stochastic policy optimization via a Fokker–Planck formulation. Provable quadratic speedup: O(1/ε) vs O(1/ε²) classical. On a continuous multimodal task, it outperforms SAC (1295.7 vs 1284.0 reward) and finds the global optimum 10.4% more often in relative terms (33.9% vs 30.7%).
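The scaling claim and the relative-frequency figure can be checked with a few lines of arithmetic. This is a sketch of the asymptotic orders only: constant factors are dropped, and the function names are hypothetical.

```python
def classical_samples(inverse_eps: int) -> int:
    """Classical Monte Carlo: on the order of 1/eps^2 samples for error eps."""
    return inverse_eps ** 2


def amplitude_estimation_queries(inverse_eps: int) -> int:
    """Quantum amplitude estimation: on the order of 1/eps queries for error eps."""
    return inverse_eps


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# inverse_eps = 1/eps, kept as an integer to avoid floating-point noise.
for inverse_eps in (10, 100, 1000):
    print(
        f"eps=1/{inverse_eps}: classical ~{classical_samples(inverse_eps)}, "
        f"quantum ~{amplitude_estimation_queries(inverse_eps)}"
    )

# 33.9% vs 30.7% global-optimum rate, the figures quoted in the summary.
print(f"relative gain: {relative_gain(33.9, 30.7):.1%}")
```

The last line shows that 10.4% is a relative gain; the absolute difference is 3.2 percentage points.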

Reinforcement learning Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

STRIDE is a self-reflective agent framework for LLM-based symbolic equation discovery. It improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic-executor repair, and diversity-preserving semantic memory. Experiments on symbolic regression benchmarks show gains in accuracy, OOD robustness, and structural recovery across multiple LLM backbones.

AI Agents Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

arXiv study using wearable sensors (accelerometry, EDA, temperature) and multimodal foundation models to predict challenging behaviors in 9 profoundly autistic children in classroom settings. Prediction up to 10 minutes in advance with AUC-ROC 0.78 on 110.7 hours of real-world data.

Benchmarks Papers AI safety

SIG

HYP

arXiv cs.CL·May 19

Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4

arXiv continues to develop its HTML Papers offering, available for TeX/LaTeX submissions since 2023. 2025–2026 highlights: ~3,000 of 6,000 user reports resolved, a target of 90% error-free HTML conversion (currently 75%), initial MathML 4 Intent annotations for accessible speech output, and an in-progress Rust port of LaTeXML to reduce compute costs.

Infrastructure Open source

SIG

HYP

arXiv cs.AI·May 19

From Prediction to Intervention: The Evolution of AI in Biomedicine

Theoretical paper arguing AI in biomedicine must transition from predictive systems based on historical data to interventional models capable of simulating novel treatment effects. Current architectures remain observational and cannot generalize to unobserved interventions.

Reasoning Papers AI safety

SIG

HYP

arXiv cs.AI·May 19

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

Study reveals a dataset visibility asymmetry in multilingual NLP: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets per LRE Map and LDC. Using LLM-assisted citation-mining on Semantic Scholar, authors identify 609 unique datasets across 53 low-visibility languages, of which 356 are publicly accessible. Data scarcity is a documentation and discoverability issue, not only a production issue.

Benchmarks Open source Papers

SIG

HYP

arXiv cs.AI·May 19

Spatial Blindness in Whole-Slide Multiple Instance Learning

Whole-slide MIL models suffer from 'spatial blindness': they predict accurately while ignoring tissue architecture. ResTopoMIL fixes this by separating appearance statistics (prototype histogram) from spatial relations (graph branch with permutation constraint). Improvements across 9 WSI benchmarks with 1.15M parameters.

Papers Benchmarks Vision

SIG

HYP

arXiv cs.AI·May 19

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix is an agentic framework for automated vulnerability repair combining differential runtime evidence and skill reuse. On SEC-Bench (C/C++) and PatchEval (Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates with GPT-4-mini, outperforming baselines while costing less than one-third of comparable approaches.

AI Agents Code generation Reasoning

SIG

HYP

arXiv cs.AI·May 19

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

FactorizedHMR introduces a two-stage hybrid framework for video human mesh recovery. A deterministic regression module anchors torso and root, while a probabilistic flow-matching module completes ambiguous distal articulations (arms, legs). Geometry-aware supervision and classifier-free guidance improve recovery under occlusion.

Vision Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Differentiable Optimization Layered Safety-Critical Control for Risk-Aware Navigation via Conformal Prediction

Safety-critical control method for autonomous navigation in unknown environments. Uses conformal prediction to generate risk-aware obstacle ellipsoids accounting for sensor uncertainty, then two nested differentiable optimization layers to build control barrier functions. Validated through numerical simulations.
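As a rough illustration of how conformal prediction can turn sensor error into a safety margin, the sketch below computes a standard split-conformal quantile from calibration errors and uses it to inflate an obstacle radius. This is a simplification under stated assumptions: the paper works with ellipsoids and nested optimization layers, whereas here the obstacle is a circle, the calibration errors are invented toy values, and all names are hypothetical.

```python
import math
from typing import Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested
    coverage level 1 - alpha.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def inflated_radius(nominal_radius: float, margin: float) -> float:
    """Grow the obstacle by the conformal margin so the true obstacle is
    covered with probability at least 1 - alpha (under exchangeability)."""
    return nominal_radius + margin


# Toy calibration set: absolute errors (metres) between sensed and true
# obstacle boundaries. Values are illustrative only.
calibration_errors = [0.05, 0.10, 0.02, 0.30, 0.12, 0.08, 0.20]
margin = conformal_quantile(calibration_errors, alpha=0.25)
print(margin, inflated_radius(1.0, margin))
```

A controller would then treat the inflated shape as the obstacle when building its barrier constraint.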

Robotics AI safety Reasoning

SIG

HYP

arXiv cs.AI·May 19

Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers

Deep learning models analyzing facial expressions and head movements in asynchronous video job interviews explain 91% and 84% of the variance in honesty and deception, respectively. They outperform human evaluators on N=121 applicants.

Vision Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 19

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

Closed-loop intelligent tutoring system using XGBoost to assess oral presentation skills via multimodal analysis (facial, vocal, textual, oculomotor). Trained on 10,360 MOOC videos, generates feedback aligned to 7-dimensional BARS scale. Study with 204 learners over 30 days: significant improvements (Cohen's d = 0.39–0.90), strong correlation between practice frequency and performance.

Evals Vision Voice

SIG

HYP

arXiv cs.AI·May 19

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

SaaSBench is the first benchmark to evaluate AI agents in enterprise SaaS engineering. It contains 30 complex tasks across 6 SaaS domains with 8 programming languages, 6 databases, and 13 frameworks. Experiments show >95% of failures occur before business logic: agents struggle to configure and integrate multi-component systems.

AI Agents Code generation Benchmarks

SIG

HYP

arXiv cs.AI·May 19

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

KairosHope is a time-series foundation model replacing quadratic attention with dual-memory architecture (Titans modules + Continuum Memory System). Pre-trained on Monash archive via MTSM and contrastive learning, it fuses latent representations with statistical features. Superior results on UCR for HAR and sensor data.

Benchmarks Papers Reasoning

SIG

HYP

arXiv cs.AI·May 19

Few-Shot Network Intrusion Detection Using Online Triplet Mining

Network intrusion detection system using triplet networks with online triplet mining and a KNN classifier. The few-shot approach detects attacks with as few as 10 malicious samples per class, outperforming both standard supervised methods on small datasets and anomaly detection models, which suffer from high false-positive rates.
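For readers unfamiliar with online triplet mining, the sketch below shows one common variant, batch-hard mining, followed by a KNN vote in the embedding space. It is an assumption that the paper uses this particular mining strategy. The embeddings here are fixed toy points rather than network outputs, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def batch_hard_triplet_loss(
    embeddings: Sequence[Sequence[float]], labels: Sequence[int], margin: float = 1.0
) -> float:
    """For each anchor, pick the farthest same-class point and the closest
    other-class point within the batch, then apply the triplet hinge loss."""
    losses: List[float] = []
    for i, (anchor, anchor_label) in enumerate(zip(embeddings, labels)):
        pos = [math.dist(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == anchor_label and j != i]
        neg = [math.dist(anchor, e)
               for e, l in zip(embeddings, labels) if l != anchor_label]
        if pos and neg:  # skip anchors that cannot form a triplet
            losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses)


def knn_predict(train, train_labels, query, k: int = 3) -> int:
    """Majority vote among the k nearest training embeddings."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]


# Toy 2-D embeddings: class 0 = benign, class 1 = attack (illustrative only).
emb = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
lab = [0, 0, 1, 1]
print(batch_hard_triplet_loss(emb, lab, margin=2.0))
print(knn_predict(emb, lab, [0.4, 0.0], k=3))
```

In a real system the loss would be backpropagated through the embedding network at every batch, and the KNN classifier would run on the learned embeddings at inference time.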

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels

New method to control False Discovery Rate (FDR) in structured hypothesis spaces using reproducing kernels. Reformulates the problem as regularized learning in RKHS, unifying continuous domains, graphs, and hierarchies. Validated on spatial data and differential gene expression tasks.
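For context on what FDR control means, the sketch below implements the classical Benjamini–Hochberg step-up procedure for an unstructured list of p-values. This is the textbook baseline, not the paper's RKHS-based method, and the p-values are toy numbers.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Return the indices of hypotheses rejected at FDR level alpha.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy p-values (illustrative only).
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9], alpha=0.05)
print(rejected)
```

The procedure treats hypotheses as an unordered list. The summary describes the paper as extending FDR control to hypotheses with spatial, graph, or hierarchical structure.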

Papers Evals

SIG

HYP

arXiv cs.AI·May 19

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

Comparative study of 6 EEG foundation models across 8 datasets beyond clean accuracy. Robustness analysis (noise, channel dropout), interpretability via Attention-Aware Layer-Wise Relevance Propagation, and expressiveness through block-wise probing. Findings: no single model dominates all failure modes; models focus on task-appropriate brain regions but decode corrupted content poorly.

Benchmarks Evals AI safety

SIG

HYP

arXiv cs.AI·May 19

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA is a modular toolkit for generating vascular meshes and synthesizing anatomically consistent intracranial aneurysms. Combines flow-matching methods for healthy vessels with anatomy-conditioned approaches for aneurysm generation. Releases a dataset of 50,000 fully labeled mesh samples for medical vision tasks.

Vision Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Hidden in Memory: Sleeper Memory Poisoning in LLM Agents

Study of a 'sleeper memory poisoning' attack against stateful LLM agents with persistent memory. An adversary corrupts external documents to inject false user memories. Success rates: 99.8% (GPT-5.5), 95% (Kimi-K2.6). Poisoned memories trigger attacker-intended actions in 60–89% of cases.

AI Agents AI safety Alignment

SIG

HYP

arXiv cs.AI·May 19

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM introduces training-free structured sparsification for Segment Anything Model's ViT encoders. Using Stripe-Sort Attention (Z-order permutation) and Residual-Consistency MLP, it achieves 2× inference speedup and 2.8× memory reduction with only 0.004 mIoU loss at 0.4 density.
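The Z-order permutation mentioned above can be illustrated on its own. The sketch below reorders the patches of a row-major grid along a Morton (Z-order) curve, which keeps spatially nearby patches close together in the sequence. How Stripe-Sort Attention uses this ordering is not described in the summary, so the code shows only the permutation, and the function names are hypothetical.

```python
from typing import List


def morton_code(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col): column bits go to even positions,
    row bits to odd positions."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def z_order_permutation(height: int, width: int) -> List[int]:
    """Row-major patch indices, listed in Z-order traversal order."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return sorted(range(len(cells)), key=lambda i: morton_code(*cells[i]))


perm = z_order_permutation(4, 4)
print(perm)
# The first four entries are the top-left 2x2 block of the 4x4 grid.
```

Consecutive runs of four entries correspond to 2×2 spatial blocks, so contiguous chunks of the permuted sequence map to compact image regions.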

Vision Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Training Infinitely Deep and Wide Transformers

Theoretical paper on transformer training in mean-field regime (infinite depth and width). Authors model training as controlling a neural PDE (vs ODE for ResNets), establish well-posedness of forward pass, derive explicit formulas for Wasserstein gradients, and prove gradient flow convergence to global minima under NTK injectivity conditions.

Reasoning Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance is a lightweight unified multimodal model supporting understanding, generation, and editing of images and videos. Built on dual-stream mixture-of-experts architecture with modality-aware rotary positional encoding, it combines collaborative multi-task training and adaptive data scheduling to outperform existing open-source unified models in visual generation.

Vision Video generation Image generation

SIG

HYP

arXiv cs.AI·May 19

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

StructLens analyzes the internal organization of representations in language models through maximum spanning trees built on residual streams. The framework reveals that middle layers strongly organize nearby tokens, and that smaller local units emerge before larger units during pre-training.
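To show what a maximum spanning tree over token representations looks like, here is a minimal sketch using Kruskal's algorithm on pairwise cosine similarity. The choice of cosine similarity is an assumption, and the vectors are toy stand-ins for residual-stream states. Names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def maximum_spanning_tree(vectors: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Kruskal's algorithm with edges visited from most to least similar."""
    n = len(vectors)
    edges = sorted(
        ((cosine(vectors[i], vectors[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components, so it adds no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Toy "token" vectors: tokens 0-1 point one way, tokens 2-3 another.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = maximum_spanning_tree(tokens)
print(tree)
```

The tree links each token to its most similar neighbours first, so tightly related tokens form subtrees before the clusters are joined.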

Papers Reasoning

SIG

HYP

arXiv cs.AI·May 19

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR quantizes KV caches to INT2 for long-context LLMs by estimating attention-aware covariance structures offline. Tested on Qwen3 (4B–32B) and GLM-4.7 (358B), it narrows the accuracy gap to 1.42–3.78 points vs BF16, cuts memory by 8× and improves throughput by 7×. Custom INT2 kernel compatible with vLLM/SGLang.
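The 8× memory figure follows from storing 2 bits per value instead of 16. The sketch below shows plain min-max 2-bit quantization with four codes packed per byte. It deliberately omits OSCAR's offline covariance-aware rotation and per-group metadata overhead, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int2(values: Sequence[float]) -> Tuple[List[int], float, float]:
    """Map floats to the 4 levels {0, 1, 2, 3} using a min-max scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_int2(codes: Sequence[int], scale: float, lo: float) -> List[float]:
    return [lo + c * scale for c in codes]


def pack_int2(codes: Sequence[int]) -> bytes:
    """Pack four 2-bit codes into each byte, lowest bits first."""
    padded = list(codes) + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


codes, scale, lo = quantize_int2([0.0, 1.0, 2.0, 3.0])
packed = pack_int2(codes)
restored = dequantize_int2(codes, scale, lo)

bf16_bytes = 2 * 4          # four values at 16 bits each
print(codes, packed.hex(), restored, bf16_bytes // len(packed))
```

With only four levels per group, naive min-max quantization loses a lot of precision on real KV tensors.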

Reasoning Benchmarks Infrastructure

SIG

HYP

arXiv cs.AI·May 19

Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence?

Longitudinal study (2020–2026) of 471 AI students reveals a preference shift: from 2024 to 2026, valuation of human intelligence rises from 53% to 65% in technical courses and to 90% in design courses. Authors identify four phases (hype, distrust, trust, dependency) and conclude AI is being reappraised as a routine tool.

Evals AI safety Alignment

SIG

HYP

arXiv cs.AI·May 19

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

Studies rearrangement of densely packed tabletop objects with parallel grippers. Introduces a directional knock primitive to overcome infeasible direct picks. Formulates optimal knock-pick problem and proposes abstractions with maximum-weight perfect matching for polynomial-time computation of action-minimizing plans. Validated in simulation (IsaacSim).

Robotics Reasoning

SIG

HYP

arXiv cs.AI·May 19

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

Study of tabletop stack rearrangement using nonprehensile toppling actions. A novel graphical abstraction models interleaved pick-and-place and topple operations. IsaacSim physics simulation benchmarks demonstrate faster execution than pick-and-place alone.

Robotics Papers

SIG

HYP