AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph is a graph-native clustering algorithm that overcomes the curse of dimensionality by operating on kNN topology rather than Euclidean metrics. Without specifying k a priori, it identifies gene modules in genomics (GSE14520, 10k genes), achieves ARI=0.751 on text clustering (20NG-6cat vs HDBSCAN 0.464), and outperforms Silhouette/Davies-Bouldin on 10 benchmarks up to d=5000.

Benchmarks Papers

arXiv cs.CL·May 19

iPOE: Interpretable Prompt Optimization via Explanations

iPOE is a prompt optimization method guided by automatically generated or human explanations. It creates annotation guidelines that direct optimization through removal, addition, shuffling, and merging operations. On 4 datasets, iPOE improves performance by up to 31% over prompts without guidelines and 35% over random guidelines.

Prompt engineering Evals Papers

arXiv cs.AI·May 19

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Study of adversarial attacks via action removal in self-play reinforcement learning. An attacker selectively removes legal actions from the victim's available set. Across poker games (6 to 5,531 states) and two non-poker domains, learned masking causes more damage than random masking. The attack persists across Q-learning, PPO, NFSP, and DQN, and the victim shows no recovery under extended masked training.
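
To make the attack surface concrete, here is a minimal sketch of action removal against a greedy agent. It is not the paper's learned attacker: the Q-values, the `victim_choice` helper, and the oracle rule "remove the victim's best action" are illustrative assumptions, used as a stand-in for a learned mask and compared against random removal.

```python
import random

def victim_choice(q_values, available):
    """Greedy victim: pick the available action with the highest Q-value."""
    return max(available, key=lambda a: q_values[a])

def random_mask(available, rng):
    """Baseline attacker: remove one legal action uniformly at random."""
    removed = rng.choice(sorted(available))
    return available - {removed}

def targeted_mask(q_values, available):
    """Stand-in for a learned attacker: remove the victim's best legal action."""
    removed = victim_choice(q_values, available)
    return available - {removed}

# Hypothetical Q-values for a single state with four legal actions.
q = {"fold": -1.0, "call": 0.2, "raise": 0.9, "check": 0.0}
legal = set(q)
rng = random.Random(0)

unmasked_value = q[victim_choice(q, legal)]
targeted_value = q[victim_choice(q, targeted_mask(q, legal))]
random_values = [q[victim_choice(q, random_mask(legal, rng))] for _ in range(1000)]
random_mean = sum(random_values) / len(random_values)

print(unmasked_value, targeted_value, round(random_mean, 3))
```

Under these assumptions the targeted mask always forces the victim onto its second-best action, while random removal only does so when it happens to hit the best one.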

Reinforcement learning AI safety Benchmarks

arXiv cs.AI·May 19

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

MANTA is a multi-turn evaluation framework on Inspect AI that stress-tests LLMs (Claude Sonnet 4, GPT-4o) against adversarial follow-up arguments on animal welfare alignment. Results show models capitulate at Turn 2 under economic/social pressure, and evidence-based capacity attribution is the weakest dimension across all models.

Claude GPT Evals

arXiv cs.AI·May 19

A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

ConfSleepNet, an evidential framework, resolves inter-view conflicts for sleep stage classification. The method extracts category-related evidence from different modalities and aggregates view-specific opinions via a conflict-aware mechanism. Code available on GitHub.

Evals Reasoning

arXiv cs.AI·May 19

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

MusicSynth is an open-source web tool that automatically converts violin sheet music (photo or file) into animated videos showing finger positioning on the fingerboard. The system combines optical music recognition (OMR), MusicXML parsing, and video rendering. Tested on 110 scores: 91.2% note recognition accuracy on printed music, 99.1% finger position accuracy on digital files.

Vision Code generation Open source

arXiv cs.AI·May 19

Task-Level AI Readiness Assessment for Business Process Management: The T-IPO Model and LARA Matrix in Financial-Services IT Operations

arXiv paper introducing T-IPO and LARA, tools to assess LLM agent readiness for business tasks. LARA is a 5-dimension rubric scoring tasks into 4 levels (L1-L4), with 1.5× weight on compliance sensitivity. Validated on 127 tasks (κ=0.80), replicated across 3 institutions (κ=0.73). Auto-completion decays from 95% (L1) to 40% (L3).

AI Agents Evals Papers

arXiv cs.CL·May 19

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

Comparative study of human judgments and the predictions of 4 LLMs on presupposition projection in conditionals. 120 participants were evaluated in parallel with the models. Humans integrate probabilistic and pragmatic cues; LLMs show variable alignment. Even models that match human ratings lack coherent pragmatic reasoning.

Benchmarks Reasoning Papers

arXiv cs.AI·May 19

ANVIL: Analogies and Videos for Lecturers

ANVIL is a multimodal generative system automating production of analogy-based instructional animations for computer science. Given a concept definition, it generates textual analogies, compiles them into structured visual screenplays, and produces executable manim code. Evaluation includes teacher studies and user adoption assessment.

Video generation Code generation Evals

arXiv cs.AI·May 19

AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence

Theoretical framework grounded in social choice theory to incorporate collective control throughout AI development, from data collection to alignment. Proposes axiomatic criteria for evaluating democratic control mechanisms across multiple stages of the ML pipeline.

Alignment AI safety Regulation

arXiv cs.AI·May 19

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

AI4BayesCode translates natural-language Bayesian model descriptions into validated, modular MCMC samplers. The system decomposes models into sampling blocks mapped to built-in components, with pre- and post-generation validation. A novel recursively stateful architecture enables coherent composition of independently developed sampling components.

Code generation AI Agents Reasoning

arXiv cs.CL·May 19

MA²P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

MA²P is a multi-agent autonomous framework for complex persuasion. It coordinates perception management, mental-state inference, strategy execution, and performance evaluation. A meta-cognitive configurator selects domain-appropriate meta-strategies from a knowledge base to improve generalization and persuasion success rates.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 19

From Reactive to Proactive: A Multi-Regulatory Empirical Analysis of 480 AI Incidents and a Data-Driven Governance Compliance Framework

Analysis of 480 real-world AI incidents from AIID against EU AI Act, NIST AI Risk Management Framework, and GDPR post-deployment provisions. Reveals substantial governance gaps in post-deployment accountability. Proposes Proactive AI Governance Compliance Framework (PAGCF), a four-phase lifecycle methodology shifting from reactive incident response to pre-deployment compliance assurance.

Regulation AI safety Alignment

arXiv cs.AI·May 19

Flowing with Confidence

Flow Matching with Confidence (FMwC) adds per-sample confidence scores to generative models at standard sampling cost. By injecting input-dependent multiplicative noise and propagating variance through the ODE, the method enables filtering, trajectory editing, and adaptive stepping. The confidence score correlates with the divergence magnitude of the learned velocity field.

Reasoning Evals

arXiv cs.AI·May 19

ChartDesign: Towards LLM Designer of Data Visualization

ChartDesign fine-tunes LLMs (Phi3, Qwen3, InternVL2.5) via LoRA to automatically generate chart design attributes from tabular data. Trained on a curated corpus (PewResearch, CharXiv), the system achieves 84% accuracy on a held-out test set vs 53% for the baseline, and generalizes to unseen domains.

Fine-tuning Vision Benchmarks

arXiv cs.CL·May 19

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

Study on using LLMs to compose personalized and persuasive push notifications. Authors define 6 quality dimensions (contextual relevance, clarity, actionability, etc.) and demonstrate +8% to +14.5% CTR gains vs static templates. Proposes architectural framework with budget-aware routing, grounded generation, and online learning.

Prompt engineering RAG Business

arXiv cs.CL·May 19

Linguistic Uncertainty and Reply Engagement on X: A Cross-Domain Replication of the Uncertainty-Reply Asymmetry

Study of 2,258 English-language posts (April 2026) shows uncertain posts receive 82% more replies than certain posts. Regression confirms a positive association (β=0.126, p=0.011), corresponding to ~13% higher reply engagement. Replicates the asymmetry observed in Arabic, suggesting a universal interactional mechanism across languages.
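
The step from β=0.126 to "~13%" is consistent with reading the coefficient on a log scale. That reading is an assumption here, since the summary does not state the model's link function. Under it, the multiplicative effect on replies is exp(β):

```python
import math

beta = 0.126  # regression coefficient reported in the summary

# Assumption: the outcome is modeled on a log scale (for example
# log-transformed reply counts or a log-link count model), so that
# exp(beta) can be read as a rate ratio.
rate_ratio = math.exp(beta)
percent_increase = (rate_ratio - 1) * 100

print(f"rate ratio = {rate_ratio:.3f}, increase = {percent_increase:.1f}%")
```

This gives a rate ratio of about 1.134, i.e. roughly 13% more replies, matching the figure quoted in the summary.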

Papers Evals

arXiv cs.AI·May 19

Learning Quantifiable Visual Explanations Without Ground-Truth

New metric to evaluate XAI methods without ground-truth, based on continuous input perturbation. Measures sufficiency and necessity of attributed information. Also proposes trainable XAI method as adapter on black-box models, generating causal explanations without degrading performance.

Evals AI safety Alignment

arXiv cs.AI·May 19

Latent Action Reparameterization for Efficient Agent Inference

LAR (Latent Action Reparameterization) compresses LLM agent action spaces by learning semantic multi-step latent actions. This reduces effective decision horizon and inference costs while preserving expressiveness. Across benchmarks, LAR decreases action tokens and wall-clock inference time without degrading task success rates.

AI Agents Code generation Reasoning

arXiv cs.CL·May 19

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

Study on 'alignment drift': gradual process where LLM outputs become less constrained by user's current message and more shaped by interaction history, while remaining helpful. Mechanism-oriented framework distinguishes signal A/B, feedback loops, and interactive regimes to control this cumulative drift.

Alignment AI Agents AI safety

arXiv cs.CL·May 19

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings, Except In Heavy Truncation Scenarios

An arXiv study compares Matryoshka Representation Learning (MRL) with simple embedding truncation. Results show non-MRL embeddings remain robust up to 80% dimensionality reduction. MRL provides an advantage only under heavy truncation (>80%), which calls into question whether its training cost is justified by default.
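
As a sketch of what simple embedding truncation means in practice: keep the leading dimensions of each vector and re-normalize before computing cosine similarity. The random vectors, the 768-dimension size, and the `truncate` and `cosine` helpers are illustrative assumptions, not the paper's setup. Synthetic data like this only shows the mechanics and says nothing about how real embedding models behave under truncation.

```python
import math
import random

def truncate(vec, keep_fraction):
    """Keep the leading dimensions of a vector and L2-normalize the result."""
    k = max(1, round(len(vec) * keep_fraction))
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a, b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
dim = 768  # hypothetical embedding size
docs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
# Query is a noisy copy of document 2, so document 2 should rank first.
query = [x + 0.1 * rng.gauss(0, 1) for x in docs[2]]

top_doc_by_keep = {}
for keep in (1.0, 0.5, 0.2):
    q = truncate(query, keep)
    scores = [cosine(truncate(d, keep), q) for d in docs]
    top_doc_by_keep[keep] = max(range(len(docs)), key=scores.__getitem__)
    print(f"keep={keep:.0%} dims={len(q)} top_doc={top_doc_by_keep[keep]}")
```

Keeping 20% of the dimensions here corresponds to the 80% reduction mentioned above. No retraining is involved, which is the contrast with MRL that the study examines.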

Embeddings Papers Benchmarks

arXiv cs.CL·May 19

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

Paper introduces Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) for personalized language systems. CBEA+LCV achieves zero failures at 0.49-0.60 availability versus 0.003-0.092 for baselines, with 74-75% median input payload reduction.

Reasoning RAG Evals

arXiv cs.AI·May 19

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

Paper introducing trace-based evaluation to detect when agents hit business KPIs while violating behavioral constraints. In hotel pricing with hidden competitor state, authors show PPO variants fail trace alignment while behavior cloning and Trace-Prior RL better preserve price/bid distributions and rate discipline.

Reinforcement learning Evals AI Agents

arXiv cs.CL·May 19

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

FIM-LoRA optimizes rank allocation in LoRA by using 8 calibration passes to estimate gradient variance per layer. This parameter-free approach matches standard LoRA performance (88.6 vs 88.7 on GLUE with DeBERTa-v3-base) while reducing memory costs by 256x compared to full Fisher estimation.

Fine-tuning Papers Benchmarks

arXiv cs.CL·May 19

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe is a framework for risk assessment in autonomous driving scenarios. It generates spatially grounded captions enriched with motion and depth cues, then fine-tunes a lightweight adapter to identify hazardous objects and suggest safety actions. Achieves SOTA on DRAMA benchmark.

Vision Reasoning AI safety

arXiv cs.AI·May 19

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD introduces asymmetric meta-reflective self-distillation to improve token-level credit assignment in LLM reinforcement learning. The method compresses diagnostic signals into self-generated Socratic hints and uses Causal Information Gain with asymmetric ReLU-gated threshold for sparse token-level advantage modulation, preventing late-stage training collapse.

Reinforcement learning Reasoning Alignment

arXiv cs.AI·May 19

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

Brain tumor segmentation method using multimodal MRI with virtual nodes and dynamic graph neural networks. One-stage framework handling missing modalities through adaptive adjacency matrices and heterogeneous weight matrices. SOTA results on BRATS-2018/2020 with incomplete modalities.

Vision Benchmarks Papers

arXiv cs.CL·May 19

Responsible Agentic AI Requires Explicit Provenance

An arXiv paper argues that responsible agentic AI requires explicit, traceable provenance across the full lifecycle. Authors formalize this through a causal attribution function and responsibility tensor, demonstrating provenance is estimable and interventionable online before irreversible harm accumulates.

AI Agents AI safety Alignment

arXiv cs.AI·May 19

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

arXiv paper on spatial limitations of MLLMs in multi-agent environments. Models suffer from a "Cartesian Illusion": they lack grounded 3D topological understanding. Authors propose an Epistemic Sensory Bottleneck module with Anchor-Based Embodied Spatial Decomposition CoT to improve second-order spatial inference (Theory of Mind). Zero-shot baseline: 42% accuracy.

Vision Multi-agent Reasoning

arXiv cs.AI·May 19

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

PPR-GDE, an RL method for open-ended generation, uses pairwise preference rewards and group-based diversity to prevent diversity collapse. Without scalar rewards, it preserves subjective evaluations and encourages semantic dispersion within response groups.

Reinforcement learning Reasoning Evals

arXiv cs.AI·May 19

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

POST introduces an adversarial learning framework for multivariate time series anomaly detection. The model combines graph neural networks with minimax optimization over adjacency matrices to address spatial over-generalization. Evaluation on public and synthetic benchmarks with channel-wise anomaly localization.

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

ConsumerSimBench, a benchmark built from 1,553 Chinese social-media topics and 23,122 reaction criteria, evaluates whether LLMs can reconstruct real consumer reaction patterns. Gemini-3.1-Pro covers only 47.8% of criteria, revealing a major gap between technical performance and consumer intuition. A multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6%.

Benchmarks Evals Multi-agent

arXiv cs.CL·May 19

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ is an evaluation framework for generative AI that combines human judgment and LLMs. It uses expert-designed multi-dimensional rubrics and calibrates LLM evaluators on a small high-quality annotation set. Experiments on text and image generation show stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM evaluators.

Evals Llama Vision

arXiv cs.CL·May 19

Medical Context Distorts Decisions in Clinical Vision Language Models

arXiv study identifies three critical failure modes of vision-language models (VLMs) in clinical settings: over-reliance on text over images, dependence on irrelevant clinical history, and prompt sensitivity across semantically equivalent inputs. Testing on MIMIC-CXR shows VLM decisions are dominated by the text modality even when visual evidence is available.

Vision AI safety Evals

arXiv cs.AI·May 19

Learning to Solve Compositional Geometry Routing Problems

Study of Compositional Geometry Routing Problem (CGRP), a generalization of routing problems covering points, lines, areas, and hybrid geometries. Proposes DiCon, a solver with differential attention and contrastive learning to handle asymmetry and enlarged action spaces. Results show strong performance, versatility, and superior generalization across diverse instances.

Papers Reasoning

arXiv cs.AI·May 19

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

DocOS is a benchmark evaluating GUI agents capable of proactively searching online documentation to solve long-tailed tasks. Experiments reveal two bottlenecks: difficulty reliably locating relevant information and faithfully grounding retrieved instructions into precise GUI actions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·May 19

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

New approach for learning generalized policies in classical planning using Relational Graph Neural Networks (R-GNNs). Authors introduce efficient lookahead search encoding and relational abstraction to improve scalability on IPC 2023 benchmark. Results outperform classical planner LAMA.

Reasoning Benchmarks Papers

arXiv cs.AI·May 19

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

GVG (Generative Visual Grounding) uses an EEG-to-image generative model to translate brain activity into visual images, bypassing text-only alignment. Tested on GVG-X-Omni (170M tuned params) and GVG-Janus (trimodal), the framework improves EEG understanding and visual generation by leveraging MLLMs' visual priors.

Vision Multi-agent Embeddings

arXiv cs.AI·May 19

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG proposes a method for selecting stochastic degradation models to estimate remaining useful life (RUL). It combines observed trajectories and domain context via retrieval from a local evidence bank, with an RCRUS mechanism to prevent premature model elimination. Experiments show it outperforms statistical and prognostic baselines.

RAG Reasoning Benchmarks

arXiv cs.CL·May 19

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD proposes targeted self-distillation for training long-horizon LLM agents. The method uses full-trajectory hindsight to identify failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. On BFCL v3 and AppWorld, it improves over dense per-turn feedback baselines by up to 18.80% while achieving 2.26× lower time per training step.

AI Agents Reinforcement learning Reasoning
