May 2026

3149 articles

Flowing with Confidence

Flow Matching with Confidence (FMwC) adds per-sample confidence scores to generative models at standard sampling cost. By injecting input-dependent multiplicative noise and propagating variance through the ODE, the method enables filtering, trajectory editing, and adaptive stepping. The confidence score correlates with the divergence magnitude of the learned velocity field.

Reasoning Evals

arXiv cs.AI·May 19

ChartDesign: Towards LLM Designer of Data Visualization

ChartDesign fine-tunes LLMs (Phi3, Qwen3, InternVL2.5) via LoRA to automatically generate chart design attributes from tabular data. Trained on a curated corpus (PewResearch, CharXiv), the system achieves 84% accuracy on a held-out test set vs. a 53% baseline and generalizes to unseen domains.

Fine-tuning Vision Benchmarks

arXiv cs.AI·May 19

Beyond Compliance: How AI Could Help Creative Writers by Refusing Them

Qualitative study with 22 creative writers on intentional AI refusals in writing assistance. Researchers explore how refusals (saying "no") could introduce reflective friction rather than blind compliance, depending on context (planning, drafting, reviewing) and individual preferences.

Prompt engineering AI safety Alignment

arXiv cs.AI·May 19

Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

A bank develops a conversational AI agent to triage fraud, scam, and dispute cases. The agent uses LLMs to ask targeted questions and route customers to appropriate specialist teams. Evaluation relies on synthetic digital twins that simulate realistic dialogues. Result: a 30.6% improvement in classification accuracy with compliance guardrails.

AI Agents Reasoning AI safety

arXiv cs.AI·May 19

Actionable World Representation

WorldString is a neural architecture modeling the state manifold of real-world objects from point clouds or RGB-D video streams. Designed as a differentiable digital twin, it serves as a foundational building block for physical world models integrating policy learning and neural dynamics.

Vision Reinforcement learning

arXiv cs.AI·May 19

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench is a benchmark for evaluating skill generation pipelines for LLM agents. It covers two regimes: task-conditioned generation and task-agnostic generation, with procedural sources grounded in repositories or documents. Experiments reveal substantial performance variation and distinct failure modes between software repositories and long-form documents.

AI Agents Benchmarks Code generation

arXiv cs.AI·May 19

Learning Quantifiable Visual Explanations Without Ground-Truth

New metric to evaluate XAI methods without ground-truth, based on continuous input perturbation. Measures sufficiency and necessity of attributed information. Also proposes trainable XAI method as adapter on black-box models, generating causal explanations without degrading performance.

Evals AI safety Alignment

arXiv cs.AI·May 19

AI for Auto-Research: Roadmap & User Guide

Comprehensive study of AI-assisted research systems through April 2026. LLMs excel at structured, retrieval-grounded, and tool-mediated tasks but remain fragile for genuinely novel ideas and scientific judgment. End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. Human-governed collaboration is the most credible deployment paradigm.

AI Agents Reasoning Evals

arXiv cs.CL·May 19

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

KVDrive is a multi-tier KV cache management system for long-context LLM inference, orchestrating cache placement across GPU/DRAM/SSD, pipeline scheduling, and cross-tier coordination. The prototype achieves 1.74x higher throughput than state-of-the-art systems while preserving accuracy.

Infrastructure Reasoning

arXiv cs.AI·May 19

Latent Action Reparameterization for Efficient Agent Inference

LAR (Latent Action Reparameterization) compresses LLM agent action spaces by learning semantic multi-step latent actions. This reduces effective decision horizon and inference costs while preserving expressiveness. Across benchmarks, LAR decreases action tokens and wall-clock inference time without degrading task success rates.

AI Agents Code generation Reasoning

arXiv cs.AI·May 19

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

Paper introducing trace-based evaluation to detect when agents hit business KPIs while violating behavioral constraints. In hotel pricing with hidden competitor state, authors show PPO variants fail trace alignment while behavior cloning and Trace-Prior RL better preserve price/bid distributions and rate discipline.

Reinforcement learning Evals AI Agents

arXiv cs.AI·May 19

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

TOBench is a benchmark for omni-modal tool-using agents in real-world workflows. It comprises 100 executable tasks (customer service, intelligent creation), 27 MCP servers, and 324 tools. Verification is closed-loop and multimodal: agents execute, inspect, and self-correct. Claude Opus 4.6 achieves 32% success vs. a 94% human baseline.

AI Agents MCP Benchmarks

arXiv cs.AI·May 19

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD introduces asymmetric meta-reflective self-distillation to improve token-level credit assignment in LLM reinforcement learning. The method compresses diagnostic signals into self-generated Socratic hints and uses Causal Information Gain with asymmetric ReLU-gated threshold for sparse token-level advantage modulation, preventing late-stage training collapse.

Reinforcement learning Reasoning Alignment

arXiv cs.AI·May 19

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

OCCAM is a framework for explaining black-box image classifier decisions through causal visual concepts. It discovers concepts in open-set manner, localizes them via text-guided segmentation, and measures causal contribution through object-level interventions. OCCAM aggregates interventional evidence to induce a structured ontology revealing concept dependencies and systematic model biases.

Vision Evals Reasoning

arXiv cs.AI·May 19

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

Brain tumor segmentation method using multimodal MRI with virtual nodes and dynamic graph neural networks. One-stage framework handling missing modalities through adaptive adjacency matrices and heterogeneous weight matrices. SOTA results on BRATS-2018/2020 with incomplete modalities.

Vision Benchmarks Papers

arXiv cs.AI·May 19

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench is a benchmark evaluating LLMs' ability to reason with qualitative spatial and temporal reasoning (QSTR). It covers 9 calculi (Point Algebra, Allen's Interval Algebra, RCC-5/8/22, etc.) with composition tables, converse relations, and conceptual neighbourhoods. Tested models outperform guessing but none answer all questions correctly. RCC-22 proves most difficult.

Benchmarks Reasoning Evals

arXiv cs.AI·May 19

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-Search introduces on-policy hindsight self-distillation for search-augmented reasoning agents. A single model acts as both student and teacher: the teacher, conditioned on past rollout outcomes, guides the student via token-level Jensen-Shannon divergence at query positions. No external teacher model or additional annotations needed.

Reasoning Reinforcement learning RAG

arXiv cs.AI·May 19
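
The SD-Search summary above mentions a token-level Jensen-Shannon divergence applied at query positions. A minimal sketch of that quantity follows, assuming the student and teacher each expose a per-position probability distribution over the vocabulary; the function names and the `query_mask` argument are hypothetical and not taken from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is, so kl() never divides by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def query_position_loss(student, teacher, query_mask):
    """Mean JS divergence over the positions flagged in query_mask (hypothetical helper)."""
    vals = [
        js_divergence(s, t)
        for s, t, keep in zip(student, teacher, query_mask)
        if keep
    ]
    return sum(vals) / len(vals) if vals else 0.0
```

The divergence is symmetric and bounded by ln 2, which is one plausible reason to prefer it over a one-sided KL term when student and teacher share weights; the summary does not state the authors' motivation.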

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

ProRL is a programmatic reinforcement learning framework for combinatorial optimization (job shop scheduling). It generates interpretable policies as human-readable programs via a domain-specific language (DSL-S), exploring the program space through local search and Bayesian optimization. Outperforms classical heuristics and DRL baselines with minimal training episodes.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

arXiv paper on spatial limitations of MLLMs in multi-agent environments. Models suffer from a "Cartesian Illusion": they lack grounded 3D topological understanding. Authors propose an Epistemic Sensory Bottleneck module with Anchor-Based Embodied Spatial Decomposition CoT to improve second-order spatial inference (Theory of Mind). Zero-shot baseline: 42% accuracy.

Vision Multi-agent Reasoning

arXiv cs.AI·May 19

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

PPR-GDE, an RL method for open-ended generation, uses pairwise preference rewards and group-based diversity to prevent diversity collapse. Without scalar rewards, it preserves subjective evaluations and encourages semantic dispersion within response groups.

Reinforcement learning Reasoning Evals

arXiv cs.AI·May 19
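
The PPR-GDE summary above describes rewards built from pairwise preferences inside a response group rather than from scalar scores. One simple way to picture this is a within-group win rate, sketched below. The `prefer` callback stands in for whatever preference judge is used and is an assumption, as is the win-rate aggregation itself; the diversity term from the summary is not modeled here.

```python
def pairwise_win_rates(responses, prefer):
    """Reward each response by the fraction of group peers it beats.

    prefer(a, b) is a hypothetical judge returning True when a is preferred to b.
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and prefer(responses[i], responses[j]):
                wins[i] += 1
    # Each response is compared against n - 1 peers.
    return [w / (n - 1) for w in wins]
```

With a toy judge that prefers longer strings, `pairwise_win_rates(["a", "bbb", "cc"], lambda a, b: len(a) > len(b))` ranks the three responses without ever assigning an absolute score to any of them.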

Scalable Environments Drive Generalizable Agents

Position paper arguing that agent generalization requires environment scaling—expanding the distribution of executable rule-sets agents interact with, beyond trajectory or task scaling. Proposes unified taxonomy separating trajectory scaling, task scaling, and environment scaling. Contrasts programmatic generators with generative world models for constructing scalable environments.

AI Agents Reasoning Benchmarks

arXiv cs.AI·May 19

Global Automation Atlas

Study of 124 countries covering 99% of global GDP. Task-based automation exposure measure: 3.3% in South Sudan to 61.6% in China. Distinguishes labor-substituting vs labor-augmenting automation. AI more prevalent in substitution in low-income countries, augmentation in high-income. Women disproportionately exposed to substitution.

Benchmarks Papers Regulation

arXiv cs.AI·May 19

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

A randomized controlled experiment shows LLM access significantly increases average performance, but gains are unevenly distributed. AI Interaction Competence (ability to elicit, filter, and verify outputs) predicts benefits, not GPA. A scaffolding intervention (conceptual maps) reduces outcome variance.

Reinforcement learning Evals Alignment

arXiv cs.AI·May 19

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

POST introduces an adversarial learning framework for multivariate time series anomaly detection. The model combines graph neural networks with minimax optimization over adjacency matrices to address spatial over-generalization. Evaluation on public and synthetic benchmarks with channel-wise anomaly localization.

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround is a task-inference framework for household agents operating on complete scenes. It structures reasoning in three steps: grounding (extracting relevant context), inference (executable structure), execution (action sequences). Evaluated on FullHome (400 tasks), it improves success rates and makes Qwen3.5-9B competitive with GPT-4 while reducing token costs by 18x.

AI Agents Reasoning Robotics

arXiv cs.AI·May 19

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

ConsumerSimBench, a benchmark built from 1,553 Chinese social-media topics and 23,122 reaction criteria, evaluates whether LLMs can reconstruct real consumer reaction patterns. Gemini-3.1-Pro covers only 47.8% of criteria, revealing a major gap between technical performance and consumer intuition. A multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6%.

Benchmarks Evals Multi-agent

arXiv cs.AI·May 19

Learning to Solve Compositional Geometry Routing Problems

Study of Compositional Geometry Routing Problem (CGRP), a generalization of routing problems covering points, lines, areas, and hybrid geometries. Proposes DiCon, a solver with differential attention and contrastive learning to handle asymmetry and enlarged action spaces. Results show strong performance, versatility, and superior generalization across diverse instances.

Papers Reasoning

arXiv cs.AI·May 19

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

DocOS is a benchmark evaluating GUI agents capable of proactively searching online documentation to solve long-tailed tasks. Experiments reveal two bottlenecks: difficulty reliably locating relevant information and faithfully grounding retrieved instructions into precise GUI actions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·May 19

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

New zeroth-order hard-thresholding algorithm with variance reduction for ℓ0-constrained optimization. Addresses SZOHT's limitation on random directions by mitigating conflict between ZO gradient deviation and hard-thresholding expansivity. Improved convergence rates validated on ridge regression and black-box adversarial attacks.

Reinforcement learning

arXiv cs.AI·May 19

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

LGBO (LLM-Guided Bayesian Optimization) embeds LLM semantic reasoning into every Bayesian Optimization iteration via a region-lifted preference mechanism. Tested on physics, chemistry, biology, and materials science benchmarks, LGBO reaches 90% of best observed value in 6 iterations for Fe-Cr battery electrolyte optimization, versus 10+ for standard BO.

Reasoning Benchmarks Papers

arXiv cs.AI·May 19

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

New approach for learning generalized policies in classical planning using Relational Graph Neural Networks (R-GNNs). Authors introduce efficient lookahead search encoding and relational abstraction to improve scalability on IPC 2023 benchmark. Results outperform classical planner LAMA.

Reasoning Benchmarks Papers

arXiv cs.AI·May 19

Ethical Hyper-Velocity (EHV): A Provably Deterministic Governance-Aware JIT Compiler Architecture for Agentic Systems

EHV is a JIT compiler architecture embedding formal verification of AI governance policies directly into the inference pipeline. Using CRDTs and TEEs, it achieves sub-millisecond formal determinism (SMFD) and reduces governance latency from days to O(1), eliminating the trade-off between deployment velocity and compliance.

AI Agents AI safety Alignment

arXiv cs.AI·May 19

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

GVG (Generative Visual Grounding) uses an EEG-to-image generative model to translate brain activity into visual images, bypassing text-only alignment. Tested on GVG-X-Omni (170M tuned params) and GVG-Janus (trimodal), the framework improves EEG understanding and visual generation by leveraging MLLMs' visual priors.

Vision Multi-agent Embeddings

arXiv cs.AI·May 19

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG proposes a method for selecting stochastic degradation models to estimate remaining useful life (RUL). It combines observed trajectories and domain context via retrieval from a local evidence bank, with RCRUS mechanism to prevent premature model elimination. Experiments show outperformance versus statistical and prognostic baselines.

RAG Reasoning Benchmarks

arXiv cs.AI·May 19

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

LMAC leverages LLM reasoning to design communication protocols in MARL, enabling agents to reconstruct the underlying state uniformly and accurately. The approach iteratively refines protocols using an explicit state-awareness criterion. Experiments on MARL benchmarks demonstrate substantial performance gains over prior baselines.

Multi-agent Reinforcement learning Reasoning

arXiv cs.AI·May 19

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2 is an LLM-based Interactive Voice Response system deployed at Baidu Maps for large-scale POI attribute acquisition. Using FSM-guided data augmentation, selective generation, and Chain-of-Thought mechanisms, the system processes 0.4 million calls daily with 83.9% Task Success Rate and 130ms latency.

AI Agents Reasoning Voice

arXiv cs.AI·May 19

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

PAIR is an internal reward model for multi-step LLM training via GRPO. It combines a hidden-state probe (belief consistency) with a lightweight attention head to generate dense step-level reward signals without external model calls or ground-truth dependencies.

Reinforcement learning Reasoning AI Agents

arXiv cs.AI·May 19

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS introduces a Knowledge Infrastructure (KI) enabling AI agents to execute complex Earth science simulations. On 3,000 trials, KI-equipped agents produced physically plausible simulations in 84% of cases vs. <40% without KI. An automated Knowledge Dissection Toolkit (KDT) generated 119 KIs across 14 Earth-science domains, showing operational expertise is structured and extractable rather than ad hoc.

AI Agents Reasoning Benchmarks

arXiv cs.AI·May 19

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

arXiv paper on supervised fine-tuning (SFT) effectiveness for LLMs. Authors show SFT primarily removes noise-like token interactions but rarely acquires reliable new ones. The denoising phase is extremely brief; continued fine-tuning introduces overfitted interactions. Implications for early stopping and LLM training.

Fine-tuning Reasoning Papers

arXiv cs.AI·May 19

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

PuppyChatter is a software framework designed to simplify development of LLM-based AI applications. It combines the simplicity of vendor-specific SDKs with vendor-neutrality principles of abstraction frameworks, reducing complexity and vendor lock-in risks.

Tools Infrastructure

arXiv cs.AI·May 19

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

Amazon Music deploys a robust neural sparse retrieval system for large-scale music search. The system handles misspellings, transpositions, and phonetic variations with 91.4% recall@10 on 6M documents, outperforming trigrams (57.7%). Inference-free architecture with granular subword tokenization (max 3 chars) and zero online latency.

RAG Embeddings Vector search

arXiv cs.AI·May 19
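
To illustrate why short character-level units tolerate misspellings and transpositions, here is a plain character n-gram overlap matcher with units capped at 3 characters. This is a purely lexical stand-in in the spirit of the trigram baseline named in the summary above, not the neural sparse model; all function names and the Jaccard scoring are assumptions made for the example.

```python
def char_ngrams(text, max_n=3):
    """All character n-grams of length 1..max_n, lowercased."""
    s = text.lower()
    return {s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)}


def overlap(query, doc, max_n=3):
    """Jaccard overlap between the n-gram sets of query and doc."""
    q, d = char_ngrams(query, max_n), char_ngrams(doc, max_n)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def search(query, titles, k=1):
    """Return the k titles with the highest n-gram overlap with the query."""
    return sorted(titles, key=lambda t: overlap(query, t), reverse=True)[:k]
```

A transposed query such as `"beatels"` still shares most of its short n-grams with `"The Beatles"`, so `search("beatels", ["The Beatles", "Beyonce", "Metallica"])` ranks that title first even though no whole word matches.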

Divergence-Suppressing Couplings for Rectified Flow

Authors identify that trajectory entanglement in Rectified Flow stems from nonzero divergence regions in the learned velocity field. They propose an offline correction that attenuates the divergent component during coupling generation, with no deployment overhead. Improvements validated on 2D benchmarks and image generation.

Image generation Papers Benchmarks

arXiv cs.AI·May 19

EXG: Self-Evolving Agents with Experience Graphs

EXG is an experience graph framework for self-evolving LLM-based agents. It organizes successes and failures into structured, relational representations, enabling real-time cross-task experience reuse and offline reuse as external memory. Tested on code generation and reasoning benchmarks, EXG outperforms reflection and memory-based baselines.

AI Agents Reasoning Code generation

arXiv cs.AI·May 19

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI is a multimodal framework to monitor unconscious emotions of Scrum Masters in real-time. The system combines speech-to-text transcription (WER 10%), prosody analysis, emotional vocabulary matching, and context-aware suggestions via open-source multi-module API. Testing shows significant improvement in emotional awareness during simulated agile meetings.

Voice AI Agents AI safety

arXiv cs.AI·May 19

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

Novel approach to extend Knowledge Graphs for French cultural heritage. Authors introduce WJoconde, a multimodal KG integrating text and images, with three variants and a benchmark for Knowledge Graph Completion. They propose a framework combining LLMs and Vision-Language Models for automated data extraction and validation, improving KG reliability.

Vision RAG Benchmarks

arXiv cs.AI·May 19

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO improves generative recommendation by aligning reinforcement learning optimization to individual reasoning steps. Instead of assigning a single advantage to the entire response, SAPO computes separate group-relative advantages for each reasoning step and SID token, stabilizing training and outperforming baselines across three real-world datasets.

Reinforcement learning Reasoning Code generation

arXiv cs.AI·May 19
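
The SAPO summary above contrasts one advantage per response with separate group-relative advantages per reasoning step. The sketch below normalizes rewards column by column across a group. It assumes every response has the same number of steps and that a per-step reward is already available; the summary specifies neither, and the `step_rewards` layout is hypothetical.

```python
from statistics import mean, pstdev


def step_group_advantages(step_rewards, eps=1e-8):
    """Group-relative advantages computed independently for each step.

    step_rewards[i][k] is the reward of response i at step k (hypothetical layout).
    """
    n_steps = len(step_rewards[0])
    adv = [[0.0] * n_steps for _ in step_rewards]
    for k in range(n_steps):
        column = [row[k] for row in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            # eps keeps the division defined when all rewards at a step are equal.
            adv[i][k] = (r - mu) / (sigma + eps)
    return adv
```

A step on which all group members score the same yields zero advantage for everyone, so only the steps that actually separate good responses from bad ones contribute a learning signal.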

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Dual-process memory architecture for scientific agents: decouples episodic window (10 messages) from semantic consolidation (3 tokens/message). Evaluation on 15,000 messages across 6 LLMs (OpenAI, Anthropic, Google): maintains 70-85% accuracy at 10,000 messages with 62% fewer tokens. Identifies trade-offs: Dual Process excels at numeric/temporal queries, RAG for historical retrieval.

AI Agents Reasoning RAG

arXiv cs.AI·May 19
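
The dual-process design summarized above keeps a short episodic window verbatim and consolidates older messages into a compact semantic store. A minimal sketch follows, with the window size taken from the summary. The consolidation rule is a placeholder assumption (keep the first three whitespace-separated words), and the class and method names are hypothetical.

```python
from collections import deque


class DualProcessMemory:
    """Episodic window plus a compressed semantic store (illustrative only)."""

    def __init__(self, window=10, summarize=None):
        self.window = window
        self.episodic = deque()
        self.semantic = []
        # Placeholder consolidation: keep the first three words of a message.
        self.summarize = summarize or (lambda msg: " ".join(msg.split()[:3]))

    def add(self, message):
        self.episodic.append(message)
        # Messages that fall out of the window are consolidated, not dropped.
        while len(self.episodic) > self.window:
            evicted = self.episodic.popleft()
            self.semantic.append(self.summarize(evicted))

    def context(self):
        """Compressed history first, then the verbatim recent window."""
        return self.semantic + list(self.episodic)
```

In practice the `summarize` callback would be a model call or a structured extractor; passing it in keeps the windowing logic testable on its own.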

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind automates complex operational workflows by extracting action graphs from human resolution traces, then executes them via a multi-agent engine with LLM reasoning. An adaptive reinforcement mechanism (ATR) optimizes successful paths. Deployed across 4 cloud services, the system outperforms a Trace-RAG baseline with a 4.95/5 expert review score.

Multi-agent RAG Reinforcement learning

arXiv cs.AI·May 19

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

NeuSymMS is a hybrid neuro-symbolic memory system for LLM agents. It couples neural fact extraction from dialogue with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts. Knowledge is stored as subject-relation-value triples in a relational database, with short/long-term memory and access-based promotion.

AI Agents RAG Reasoning

arXiv cs.AI·May 19
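
The NeuSymMS summary above describes subject-relation-value triples in a relational database with deduplication, reconciliation, and access-based promotion. The sketch below shows one way those pieces could fit together using SQLite from the Python standard library. The schema, the "newest value wins" reconciliation rule, the single-value-per-relation constraint, and the promotion threshold are all assumptions for illustration; the actual system uses a CLIPS-based expert system for these decisions.

```python
import sqlite3


class TripleMemory:
    """Toy triple store with short/long-term tiers (illustrative only)."""

    def __init__(self, promote_after=3):
        self.promote_after = promote_after
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE facts (subject TEXT, relation TEXT, value TEXT, "
            "hits INTEGER DEFAULT 0, tier TEXT DEFAULT 'short', "
            "PRIMARY KEY (subject, relation))"
        )

    def store(self, subject, relation, value):
        # Reconcile: an existing (subject, relation) pair is overwritten, never duplicated.
        cur = self.db.execute(
            "UPDATE facts SET value = ? WHERE subject = ? AND relation = ?",
            (value, subject, relation),
        )
        if cur.rowcount == 0:
            self.db.execute(
                "INSERT INTO facts (subject, relation, value) VALUES (?, ?, ?)",
                (subject, relation, value),
            )

    def recall(self, subject, relation):
        row = self.db.execute(
            "SELECT value, hits FROM facts WHERE subject = ? AND relation = ?",
            (subject, relation),
        ).fetchone()
        if row is None:
            return None
        hits = row[1] + 1
        # Promote to long-term memory once the fact has been read often enough.
        tier = "long" if hits >= self.promote_after else "short"
        self.db.execute(
            "UPDATE facts SET hits = ?, tier = ? WHERE subject = ? AND relation = ?",
            (hits, tier, subject, relation),
        )
        return row[0]

    def tier(self, subject, relation):
        row = self.db.execute(
            "SELECT tier FROM facts WHERE subject = ? AND relation = ?",
            (subject, relation),
        ).fetchone()
        return row[0] if row else None
```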

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

MEMOIR, a memory-guided tree-search framework, automatically synthesizes solvers for combinatorial optimization using LLMs. With a two-level memory hierarchy (branch-local and global), it achieves 96.7% solution validity across 7 problems (scheduling, routing, packing), outperforming baselines by 9.2 points and reducing run-to-run validity variance by over an order of magnitude.

AI Agents Reasoning Code generation

arXiv cs.AI·May 19

Self-supervised Hierarchical Visual Reasoning with World Model

ResDreamer, a hierarchical world model, reconstructs residuals at each layer to progressively abstract visual dynamics in a self-supervised manner. Without domain-specific knowledge, it achieves state-of-the-art sample and parameter efficiency for RL in 3D adversarial environments. Code released.

Reinforcement learning Reasoning Vision

arXiv cs.AI·May 19

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

Theoretical study of multi-objective evolutionary algorithms for multi-party optimization (MPMOP). On MP-JCG benchmark, payoff-guided mutation requires Θ(n²) fitness evaluations to cross a gap region, while CPR-NSGA-II achieves O(n log n) via cross-party recombination. Runtime analysis on BPBOMST (multi-party minimum spanning tree) with instance-parameterized bounds.

Multi-agent Benchmarks Papers

arXiv cs.AI·May 19

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

Theoretical paper on computational challenges of token economics in LLM systems. Introduces the "Token Economics Trilemma": tensions between fine-grained valuation, low-latency execution, and allocation optimality. Identifies three technical areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture.

Infrastructure Benchmarks Reasoning

arXiv cs.AI·May 19

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D²Evo is an RL framework to enhance LLM reasoning through self-evolution. The method generates medium-difficulty training samples by mining anchors matched to model capability, then jointly optimizes a Questioner and Solver. Results: outperforms existing methods on mathematical reasoning benchmarks with <2K real examples.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

Algorithmic Cultivation: How Social Media Feeds Shape User Language

Longitudinal study of 235M posts from 4M Bluesky users showing algorithmic feed exposure (News, Science, Blacksky) measurably shapes user language: semantic alignment, register formalization, psycholinguistic restructuring. Reposting is the strongest predictor of linguistic convergence across feeds.

Papers Evals AI safety

arXiv cs.AI·May 19

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ is an evaluation framework for generative AI combining expert-designed multi-dimensional rubrics and LLM evaluator calibration on small high-quality annotation sets. Tested on text and image generation, QQJ shows stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators.

Evals Benchmarks Alignment

arXiv cs.AI·May 19

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

CBT-Audio is a dataset of 1,802 patient turns from 96 public CBT recordings with expert-validated distress labels. Evaluation of 10 open-source audio language models shows audio improves distress estimation over text alone in 8/10 model families, with strongest gains when verbal content and vocal delivery diverge.

Benchmarks Voice Evals

arXiv cs.AI·May 19

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

Learning-Zone Energy (LZE) is an online data selection framework for RL post-training of LLMs. Tested on Qwen 1.5B-8B across GSM8K and MATH, it retains 40% of training data per step while matching full-data baselines, with OOD gains of +45.9% on AIME25 and 36% FLOP reduction.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

BoLT is an open-source benchmark for black-box optimization applied to LLMs. It covers hyperparameters, data mixtures, and prompts via lightweight surrogate models fitted to thousands of real experiments. Benchmarking Bayesian Optimization and BBO methods reveals gaps in existing approaches.

Benchmarks Open source Papers

arXiv cs.AI·May 19

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

CardioThink, a physician-inspired MLLM framework, structures ECG diagnosis through explicit reasoning stages (rhythm, conduction, morphology, impression) to enhance interpretability. Structured Set Policy Optimization (SSPO) aligns clinical reasoning without manual annotations, outperforming direct prediction approaches across ECG benchmarks.

Reasoning Vision Reinforcement learning

arXiv cs.AI·May 19

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect formalizes LLM self-correction as a closed-loop control system. A tri-modal error detector (self-consistency, verbalized confidence, logic-chain verification) and type-directed correction controller achieve 79.8% accuracy on CyberCorrect-Bench (440 reasoning tasks), +6.2pp over existing methods, reducing overshoot by 41% via convergence control.

Reasoning Evals Papers

arXiv cs.AI·May 19

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Response-free item difficulty modelling for multiple-choice questions using fine-tuned transformers. End-to-end approach on item wording eliminates manual feature engineering. Multi-task variant with auxiliary QA objective delivers significant improvements in small-sample regimes.

Fine-tuning Benchmarks

arXiv cs.AI·May 19

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

A2RBench is an automated pipeline for generating formally verifiable abstract reasoning benchmarks. Using programmatic verification (cycle consistency), it eliminates hallucinations and scales task variations. Evaluations show current LLMs score 39.8% vs 68.5% for humans, and struggle with complex 3D tasks.

Benchmarks Reasoning Evals

arXiv cs.AI·May 19

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation

Systematic study of reasoning faithfulness in Vision-Language-Action (VLA) driving models. Analysis of 300 Alpamayo-R1-10B inferences across 100 PhysicalAI-AV scenarios reveals: reasoning fidelity 42.5%, 94 missed pedestrians, 97.7% trajectory fragility under visual perturbations, 48.3% reasoning-action consistency. Proposes four-component safety architecture.

Vision Reasoning AI safety

arXiv cs.AI·May 19

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

CAM-Bench is a Lean 4 benchmark of 1,000 computational and applied mathematics problems (optimization, numerical linear algebra, numerical analysis) adapted from textbooks with locally recovered context via dependency-recovery pipeline. Evaluation of LLMs and formalization agents reveals failures in tracking local assumptions and long-horizon control in Lean.

Benchmarks Reasoning Code generation

arXiv cs.AI·May 19

CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

QE-Catalytic-V2 is a unified graph-text multimodal LLM for catalytic materials. It integrates property prediction and inverse design in a single shared representation space, eliminating distribution shifts between decoupled models. Demonstrates superior performance on relaxed-energy prediction and inverse design tasks.

Papers Benchmarks Vision

arXiv cs.AI·May 19

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

ChemVA framework advances LLM understanding of chemical reaction diagrams by addressing visual and semantic bottlenecks. Uses Visual Anchor mechanism for functional group detection and semantic alignment to activate chemical reasoning. Achieves 92.0% structural recognition accuracy on OCRD-Bench with ~20 percentage point gains across 9 diverse LLMs.

Vision Benchmarks Papers

arXiv cs.AI·May 19

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

CAREBench is a benchmark evaluating LLMs' emotion understanding through cognitive appraisal reasoning. Tested on 6 models with complete inferential chain annotations (first/third-person perspectives), it shows stronger models match humans on some tasks but fall short on appraisal reasoning and positive emotion recognition.

Benchmarks Evals Reasoning

arXiv cs.AI·May 19

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Shallow neural network agents master the card game Schnapsen through reinforcement learning. RLBot, trained via asynchronous Monte Carlo updates, outperforms MLPBot (supervised imitation) and achieves statistically significant wins against RdeepBot, a search-based baseline. Combining learned value functions with deeper lookahead during gameplay improves performance.

Reinforcement learning Benchmarks Papers

arXiv cs.AI·May 19

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

MADP is a multi-agent architecture for enterprise document automation, combining deep learning classification and LLM extraction with human validation. Deployed on 955 real documents, it achieves 97% full-pipeline automation and reduces FTE requirements by 70%. 98.5% document-level accuracy with human-in-the-loop; 69% CO2 reduction vs manual processing.

Multi-agent AI Agents Code generation

arXiv cs.AI·May 19

Dynamics of collective creativity in AI art competitions

Analysis of 130,882 images from 368 remix parties on Artbreeder (13 months). Images converged toward common thematic attractors (steampunk, alien architecture) while becoming simpler. Paradox: more novel parents produced more complex, liked children, yet users preferred remixing less novel images.

Image generation Papers Evals

arXiv cs.AI·May 19

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

Automated heuristic discovery via continuous optimization in latent space. Encoder maps discrete programs to continuous embeddings, differentiable surrogate model predicts performance, invertible normalizing flow regularizes optimization trajectory. Evaluation on TSP, CVRP, KSP, and Online Bin Packing shows competitive results against evolutionary baselines.

AI Agents Reasoning Benchmarks

arXiv cs.AI·May 19

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

ECC, a query clustering algorithm, calibrates semantic embeddings through model comparisons to align surface semantics with latent LLM capabilities. Using a Bradley-Terry model, it improves capability ranking by 17.64 points over human-labeled baselines and 18.02 points over embedding-based baselines, with applications to query routing.
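
As a rough illustration of the Bradley-Terry component mentioned above, the sketch below fits per-model strengths from pairwise outcomes using the classic minorization-maximization update. The function name, the model labels, and the toy comparison data are hypothetical; the summary does not describe how ECC itself fits or uses the model.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    followed by normalization so the strengths sum to 1.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical outcomes on one query cluster: (winner, loser).
comparisons = (
    [("A", "B")] * 3 + [("B", "A")] * 1
    + [("B", "C")] * 3 + [("C", "B")] * 1
    + [("A", "C")] * 2 + [("C", "A")] * 1
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), so a per-cluster ranking falls out of sorting the fitted strengths.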

Evals Benchmarks Reasoning

arXiv cs.AI·May 19

Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

Systematic investigation of logicality in LLM scientific reasoning. Authors develop a logicality-enriched methodology with assessment criteria and data sampling methods for logicality-guided training. Experiments on three backbone LLMs using physics problems extracted from academic literature. Code released.

Reasoning Fine-tuning Papers

arXiv cs.AI·May 19

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

RAGA is an LLM-based autonomous agent for knowledge graph construction and retrieval-augmented generation. It combines CRUD operations, a ReAct loop with Read-Search-Verify-Construct constraint, and KG-vector synchronization for hybrid retrieval. QASPER experiments show gains in answer and evidence quality.

AI Agents RAG Reasoning

arXiv cs.AI·May 19

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Large Reasoning Models generate traces aligned with human reaction times, but this alignment persists regardless of inference-time reasoning budget. Study across GPT-OSS-20B and GPT-OSS-120B: three effort levels, six cognitive tasks. Token allocation tracks fine-grained human difficulty patterns and reflects a structure crystallized at training time, not modulated in real-time.

Reasoning Benchmarks Papers

arXiv cs.AI·May 19

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

Critical study of LLM-based trading agents (FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, FLAG-Trader). Reported Sharpe ratios do not constitute deployment evidence: temporal contamination, unmodeled frictions, and insufficient predictive calibration invalidate claims. Proposes P1-P6 protocol and modular architecture with LLM as audit interface.
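
To make the point about unmodeled frictions concrete, here is a minimal sketch of how a reported Sharpe ratio shrinks once trading costs are subtracted. It assumes a zero risk-free rate, 252 trading periods per year, and a flat proportional cost per unit of turnover; the return series and cost level are invented for illustration and are not taken from the paper.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)  # sample standard deviation
    return (mean / stdev) * math.sqrt(periods_per_year)


def apply_costs(returns, turnover, cost_per_unit_turnover):
    """Subtract a proportional trading cost from each period's return."""
    return [r - t * cost_per_unit_turnover for r, t in zip(returns, turnover)]


# Hypothetical daily gross returns and daily turnover (fraction of portfolio traded).
gross = [0.004, -0.002, 0.003, 0.001, -0.001, 0.002]
turnover = [1.0] * len(gross)

net = apply_costs(gross, turnover, cost_per_unit_turnover=0.0005)
gross_sharpe = sharpe_ratio(gross)
net_sharpe = sharpe_ratio(net)
```

With constant turnover the volatility is unchanged while the mean return drops, so the net Sharpe ratio is strictly lower than the gross one.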

AI Agents Benchmarks Evals

arXiv cs.AI·May 19

Towards Human-Level Book-Writing Capability

Researchers present a framework for book-scale creative writing. Starting from public-domain novels, they build a multi-resolution scaffold (summary → chapters → scenes → full text) and train a long-context model on prompt-to-book trajectories. Goal: generate human literary prose rather than generic assistant-style text.

Fine-tuning Reasoning Code generation

arXiv cs.AI·May 19

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

PersonaArena is a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. It leverages a filtered corpus of user-generated social content, constructs a nuanced persona bank, and simulates multi-turn interactions in social environments. A multi-agent debating judge provides holistic and unbiased assessment.

AI Agents Multi-agent Evals

arXiv cs.AI·May 19

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe is a framework for risk detection in autonomous driving scenarios. It generates spatially grounded captions enriched with motion and depth cues, then assesses risks using a fine-tuned adapter module on caption-risk pairs. Achieves SOTA on DRAMA benchmark.

Vision Reasoning AI safety

arXiv cs.AI·May 19

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

PA-BDM improves Block Diffusion Models for document recognition by replacing bidirectional denoising with causal prefix-to-suffix denoising. Using Confidence-gated Structural Loss and Progressive Prefix Commitment, the 3B model achieves 71.6% higher inference throughput than MinerU-Diffusion 2.5B.

Papers Code generation Benchmarks

arXiv cs.AI·May 19

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

Study of autonomous AI agents in multi-echelon supply chains using MIT Beer Game. Reasoning models reduce costs by 67% vs human teams, but reveal an 'agent bullwhip effect': amplification of decision unreliability across echelons. A GRPO-based reinforcement-learning post-training framework using system-level rewards improves reliability and reduces tail events.
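
As a rough illustration of the GRPO ingredient mentioned above, the sketch below computes group-relative advantages from a system-level reward. The function names, the cost figures, and the choice of negative total chain cost as the reward are assumptions made for illustration, not details taken from the paper.

```python
import statistics


def system_level_reward(echelon_costs):
    """Reward shared by all agents: negative total cost across the chain (assumed)."""
    return -sum(echelon_costs)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-echelon costs (retailer, wholesaler, distributor, factory)
# for three rollouts sampled from the same starting state.
rollouts = [
    [10, 12, 15, 20],
    [8, 9, 9, 10],
    [30, 25, 28, 40],
]
rewards = [system_level_reward(costs) for costs in rollouts]
advantages = group_relative_advantages(rewards)
```

Because every echelon is scored on the whole chain's cost, a rollout in which one agent's erratic order inflates downstream costs receives a negative advantage for all agents in that rollout.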

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 19

Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

Study using Transcranial Doppler (TCD) and MOCAIP algorithm to predict brain vascular age. 168 healthy and 277 diseased subjects (stroke, Alzheimer's, MCI) analyzed. Model predicts accelerated aging in patients, with 3.69-year overestimation in healthy subjects.

Benchmarks Evals

arXiv cs.AI·May 19

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

EEG study of 27 participants analyzing neural mechanisms for detecting AI hallucinations. Researchers recorded brain activity during verification of image descriptions generated by an MLLM. Results show that misjudged hallucinations fail to trigger standard fact-verification neural pathways.

Vision AI safety Alignment

arXiv cs.AI·May 19

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

Unified framework for disease trajectory modeling in clinical AI, integrating factual forecasting, counterfactual estimation, and policy evaluation. Addresses treatment assignment bias, time-varying confounding, and observation bias to transform static predictions into treatment-sensitive dynamic estimates.

Reasoning Evals AI safety

arXiv cs.AI·May 19

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

NGM is a training-free memory module for LLMs using a Causal N-Gram Encoder and Cosine-Gated Memory Injector. Tested on Qwen3 (0.6B-14B), it improves average performance by 0.5-1.2 points, with notable gains on code generation (+3.0 LiveCodeBench) and knowledge-intensive tasks (+3.03 GPQA).

Qwen Code generation Reasoning

arXiv cs.AI·May 19

Reasoning Can Be Restored by Correcting a Few Decision Tokens

Reasoning models outperform base LLMs on complex benchmarks. Study shows the advantage stems from a small set of early decision tokens (~8% on Qwen3-0.6B), concentrated in planning phases. Selective intervention by the reasoning model at these critical tokens restores performance without major computational overhead.

Reasoning Benchmarks Qwen

arXiv cs.AI·May 19

Learning to Learn from Multimodal Experience

New paradigm for experience-driven learning in multimodal settings: agents learn to dynamically construct and organize memory based on task requirements and interaction history, rather than relying on fixed memory schemas. Adaptive memory design improves performance and generalization across multimodal tasks.

AI Agents Reasoning Vision

arXiv cs.AI·May 19

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

HT-GRPO, a hierarchical reinforcement learning method for diffusion multi-modal models, organizes optimization into three stages (global, structure, refinement). It solves multiple unmasking sequences and assigns differentiated rewards based on token importance. Tests on MMaDA and Lumina-DiMOO show gains on GenEval and DPG benchmarks.

Reinforcement learning Image generation Benchmarks

arXiv cs.AI·May 19

Voices in the Loop: Mapping Participatory AI

Study of an open-source interactive atlas mapping 200+ participatory AI initiatives. Reproducible protocol for discovery, vetting, and harmonization of cases. Findings: initiatives concentrated in few countries, participation mostly in problem formulation and evaluation, rarely in model development.

AI safety Regulation

arXiv cs.AI·May 19

The Lattice Representation Hypothesis of Large Language Models

A hypothesis proposes that LLMs encode concept lattices in their embedding geometry. The framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions induce lattices via half-space intersections. Experiments on WordNet validate that embeddings capture logical and hierarchical structures.

Reasoning Papers Embeddings

arXiv cs.AI·May 19

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

GRID is an end-to-end framework for constructing security knowledge graphs from cyber threat intelligence articles. Using Qwen3-4B-Instruct, it combines graph extraction, text revision, and a task bank (multi-choice questions + regex) to generate stable rewards. On 249 CTI articles, the Task-bank Reward model achieves 84.62% precision, 64.91% recall, and 68.53% Avg F1.
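
The reported "Avg F1" is lower than the F1 one would compute directly from the reported precision and recall. One possible explanation, assumed here rather than confirmed by the summary, is that F1 is computed per article and then averaged. The sketch below shows the difference on invented per-article scores.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 implied by the reported aggregate precision and recall.
f1_from_reported = f1(0.8462, 0.6491)

# Hypothetical per-article (precision, recall) pairs, for illustration only.
per_article = [(1.0, 0.2), (0.7, 0.9)]

mean_precision = sum(p for p, _ in per_article) / len(per_article)
mean_recall = sum(r for _, r in per_article) / len(per_article)

f1_of_means = f1(mean_precision, mean_recall)                       # F1 of averaged P and R
avg_f1 = sum(f1(p, r) for p, r in per_article) / len(per_article)   # average of per-article F1
```

The harmonic mean of the reported precision and recall is about 73.5%, above the reported 68.53%. In the toy data, averaging per-article F1 likewise gives a lower value than taking the F1 of the averaged precision and recall, because articles with unbalanced precision and recall pull the average down.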

Reinforcement learning Benchmarks

arXiv cs.AI·May 19

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

VGGT-CD is a training-free pipeline for 3D change detection from multi-view images. It decouples cross-temporal registration from dynamic-change interference via joint keyframe inference and dense reconstruction purification. On the World Across Time benchmark, it reduces Absolute Trajectory Error by 44% outdoors and 59% indoors while running 6× faster.

Vision Benchmarks Papers

arXiv cs.AI·May 19

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

LLM-based behavioral planning framework for autonomous vehicles to anticipate pedestrian behavior. Evaluated on SUMO: 68% collision-free success rate zero-shot (vs 17.7% deep RL), 96% with few-shot episodic memory. Interpretable decisions with cross-behavior transfer across scenarios.

Reasoning Reinforcement learning AI safety

arXiv cs.AI·May 19

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

On-device AI system for ecological monitoring in remote areas. Architecture separates visual perception from reasoning using dynamic knowledge base, eliminating cloud dependency and continuous retraining. Collaboration with biologists and Indigenous communities for ethical AI co-development.

AI Agents Vision RAG

arXiv cs.AI·May 19

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Post-hoc multimodal alignment method using relative representations at token level to match separately pre-trained encoders with limited paired data. Learns anchors in each modality space to induce consistent cross-modal similarity patterns. Outperforms existing methods on zero-shot classification, cross-modal retrieval, and zero-shot segmentation.
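
The general idea of a relative representation can be sketched as follows: each vector is re-expressed as its cosine similarities to a set of anchors, which makes the result invariant to rotations of the embedding space. This toy version uses fixed 2-D anchors and whole vectors; the method summarized above learns its anchors and works at token level, neither of which is reproduced here. All names and values are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(x, anchors):
    """Describe x by its similarity to each anchor instead of raw coordinates."""
    return [cosine(x, a) for a in anchors]


def rotate(v, angle):
    """Rotate a 2-D vector; stands in for a second encoder's different basis."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


x = [1.0, 2.0]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rel_original = relative_representation(x, anchors)
rel_rotated = relative_representation(
    rotate(x, 0.7), [rotate(a, 0.7) for a in anchors]
)
```

The two relative representations agree up to floating-point error even though the raw coordinates differ, which is why similarity patterns to shared anchors can serve as a common space for separately trained encoders.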

Embeddings Vision RAG

arXiv cs.AI·May 19

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

F³A is a training-free router for visual token pruning in vision-language models. It selects relevant visual tokens via question-conditioned cues without extra LLM forward passes, reducing inference costs while maintaining performance across model scales.
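
A generic form of question-conditioned token pruning can be sketched as scoring each visual token against a question embedding and keeping the top-k. This is not F³A's actual routing rule, which the summary does not describe; the dot-product score, the function name, and the toy vectors are assumptions for illustration.

```python
def prune_visual_tokens(visual_tokens, question_embedding, keep):
    """Keep the `keep` visual tokens most similar to the question embedding.

    Returns the kept indices (in original order) and the kept tokens.
    """
    scores = [
        sum(a * b for a, b in zip(token, question_embedding))
        for token in visual_tokens
    ]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore spatial/sequence order
    return kept, [visual_tokens[i] for i in kept]


# Hypothetical 2-D token embeddings and a pooled question embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
question = [1.0, 0.0]

kept_indices, kept_tokens = prune_visual_tokens(tokens, question, keep=2)
```

Scoring happens before the language model sees the image tokens, so the sequence passed to the LLM is shorter and no additional LLM forward pass is needed to decide what to drop.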

Vision Reasoning Infrastructure

arXiv cs.AI·May 19

Data-driven and distributed governance of building facilities management using decentralized autonomous organization, digital twin, and large language models

Decentralized building management framework integrating DAOs, digital twins, LLMs, and blockchain for transparent governance. System evaluated on cost efficiency, scalability, data security, and usability via System Usability Scale and expert interviews.

AI Agents Reasoning Infrastructure

arXiv cs.AI·May 19

Harnessing LLM Agents with Skill Programs

HASP converts textual skills for LLM agents into executable Program Functions that actively intervene in the agent loop at failure-prone states. The framework achieves a 25% improvement on web search vs ReAct and a 30.4% gain on math/coding vs Search-R1 through inference-time intervention, post-training, or self-improvement mechanisms.

AI Agents Reasoning Code generation

arXiv cs.CL·May 19

Multilingual jailbreaking of LLMs using low-resource languages

arXiv paper demonstrating that multi-turn conversations in low-resource African languages (Afrikaans, Kiswahili, isiXhosa, isiZulu) bypass safety mechanisms in commercial LLMs. Testing ChatGPT, Claude, DeepSeek, Gemini, and Grok shows jailbreak rates from 52.7% to 83.6% depending on model. Translation quality is the critical success factor.

AI safety Alignment Benchmarks
