7679 articles

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

AASIST3 enhances speech deepfake detection by integrating Kolmogorov-Arnold Networks (KAN) into the AASIST framework. The model achieves minDCF=0.5357 (closed) and 0.1414 (open) on ASVspoof 2024, doubling prior performance. Code released on HuggingFace.

Voice AI safety Benchmarks

arXiv cs.AI·May 19

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward is a document reward model evaluating structure and style of professional documents, independent of textual quality. Trained on DocPair (117K document pairs, 32 domains), it outperforms GPT-4 by 14.6 percentage points and effectively guides agents via RL toward higher structural and stylistic professionalism.

Reinforcement learning AI Agents Evals

arXiv cs.AI·May 19

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

Attractor-Vascular Coupling Theory (AVCT): mathematical framework showing cardiac attractor geometry encodes blood pressure information. Calibrated LightGBM model on smartphone PPG achieves MAE 2.05 mmHg (SBP) and 1.67 mmHg (DBP) in strict leave-one-subject-out cross-validation (46 subjects, 29,684 windows), meeting AAMI/IEEE SP10 criteria. PPG-only ablation matches ECG+PPG within 0.05 mmHg.

Papers Benchmarks Evals

arXiv cs.CL·May 19

Language-Switching Triggers Take a Latent Detour Through Language Models

Circuit analysis of a backdoor in an 8B model: a 3-word Latin trigger redirects English output to French. The circuit operates in 3 phases via attention heads, propagates through a subspace orthogonal to natural language-identity directions, then converts via MLP. A single serial bottleneck position controls the entire flow.

AI safety Alignment Papers

arXiv cs.AI·May 19

Property-Guided LLM Program Synthesis for Planning

Property-guided program synthesis approach reduces LLM costs by replacing simple numeric scores with formal property verification. When a property is violated, the system provides concrete counterexamples to guide repair. On PDDL planning domains, this method generates 7× fewer programs and drastically reduces evaluation costs while improving solution quality.

Code generation Reasoning Reinforcement learning

arXiv cs.AI·May 19

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Dual-process memory architecture for scientific agents: decouples an episodic window (10 messages) from semantic consolidation (3 tokens/message). Evaluated on 15,000 messages across 6 LLMs (OpenAI, Anthropic, Google), it maintains 70-85% accuracy at 10,000 messages with 62% fewer tokens. The study also identifies a trade-off: the dual-process design excels at numeric and temporal queries, while RAG is better for historical retrieval.
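The split between a short episodic window and a compressed semantic store can be sketched roughly as below. This is only an illustration under stated assumptions: the class name `DualProcessMemory`, its methods, and the truncation-based `_consolidate` placeholder are hypothetical and are not taken from the paper, which presumably uses a learned or LLM-based consolidation step.

```python
from collections import deque
from typing import Deque, List


class DualProcessMemory:
    """Toy memory: recent messages verbatim, older ones compressed (hypothetical)."""

    def __init__(self, window_size: int = 10, tokens_per_message: int = 3) -> None:
        self.window_size = window_size
        self.tokens_per_message = tokens_per_message
        self.episodic: Deque[str] = deque()  # most recent messages, kept verbatim
        self.semantic: List[str] = []        # compressed traces of evicted messages

    def _consolidate(self, message: str) -> str:
        # Placeholder assumption: keep only the first few whitespace tokens.
        # A real system would likely use a summarizer here instead.
        return " ".join(message.split()[: self.tokens_per_message])

    def add(self, message: str) -> None:
        """Append a message; consolidate the oldest one once the window is full."""
        self.episodic.append(message)
        if len(self.episodic) > self.window_size:
            evicted = self.episodic.popleft()
            self.semantic.append(self._consolidate(evicted))

    def context(self) -> List[str]:
        """Return compressed history followed by the verbatim recent window."""
        return self.semantic + list(self.episodic)
```

With the defaults, each message beyond the tenth costs roughly three tokens of context instead of its full length, which is the kind of saving the summary describes.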

AI Agents Reasoning RAG

arXiv cs.CL·May 19

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Red-Bandit is a red-teaming framework that adapts specialized LoRA experts for different attack styles (manipulation, slang) in real time via reinforcement learning. A multi-armed bandit algorithm dynamically selects the best expert based on the safety of the target model's responses. It reports state-of-the-art results on AdvBench with more readable prompts.
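To make the expert-selection idea concrete, here is a minimal sketch of a multi-armed bandit choosing among attack-style experts. It assumes the standard UCB1 rule and a binary reward (1 when the target's response is judged unsafe); the summary does not say which bandit algorithm or reward the paper uses, so `UCB1Selector`, `simulate`, and the success probabilities are all hypothetical.

```python
import math
import random
from typing import List, Sequence


class UCB1Selector:
    """UCB1 bandit over a fixed set of arms (here: hypothetical LoRA experts)."""

    def __init__(self, n_arms: int) -> None:
        self.counts: List[int] = [0] * n_arms      # pulls per arm
        self.totals: List[float] = [0.0] * n_arms  # summed reward per arm

    def select(self) -> int:
        # Try every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total_pulls = sum(self.counts)
        scores = [
            self.totals[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(total_pulls) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward


def simulate(success_probs: Sequence[float], rounds: int, seed: int = 0) -> List[int]:
    """Simulate attacks; reward is 1.0 when the (simulated) response is unsafe."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(success_probs))
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        selector.update(arm, reward)
    return selector.counts
```

Calling `simulate([0.2, 0.5, 0.8], 2000)` returns the number of times each simulated expert was chosen; over enough rounds the selector should concentrate on the expert with the highest success rate while still occasionally exploring the others.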

AI safety Fine-tuning Reinforcement learning

arXiv cs.AI·May 19

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind automates complex operational workflows by extracting action graphs from human resolution traces, then executes them via a multi-agent engine with LLM reasoning. An adaptive reinforcement mechanism (ATR) optimizes successful paths. Deployed across 4 cloud services, the system outperforms a Trace-RAG baseline with a 4.95/5 expert review score.

Multi-agent RAG Reinforcement learning

arXiv cs.AI·May 19

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

ClawForge is a benchmark framework for CLI agents that tests persistent state and conflict handling across 17 scenarios in 6 ability categories. Seven frontier models were evaluated: the best scores 45.3%, and the widest gap (17-90%) is driven by whether agents inspect existing state before acting.

AI Agents Benchmarks Evals

arXiv cs.AI·May 19

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

MEMOIR, a memory-guided tree-search framework, automatically synthesizes solvers for combinatorial optimization using LLMs. With a two-level memory hierarchy (branch-local and global), it achieves 96.7% solution validity across 7 problems (scheduling, routing, packing), outperforming baselines by 9.2 points and reducing run-to-run validity variance by over an order of magnitude.

AI Agents Reasoning Code generation

arXiv cs.CL·May 19

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling is a hardware-aware analytical framework modeling carbon footprint of frontier LLM training. It integrates neural scaling laws, distributed training strategies, accelerator modeling, and operational/embodied carbon accounting. Source code released on GitHub.

Benchmarks Papers Infrastructure

arXiv cs.CL·May 19

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs across 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via a Summarizer, SQL Retriever, and Python Coder. Evaluation on 23 mainstream LLMs.

Benchmarks Multi-agent Code generation

arXiv cs.AI·May 19

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Systematic study of MoE model compression (Qwen3-Next-80A3B → 23A2B) via pruning and distillation at pretraining scale. Pruning outperforms training from scratch, multi-token prediction (MTP) distillation improves performance, and progressive schedules beat one-shot compression.

Qwen Fine-tuning Benchmarks

arXiv cs.CL·May 19

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Researchers propose the Refusal Index (RI), a metric measuring LLMs' ability to refuse questions beyond their knowledge. RI correlates refusal probability with error probability using Spearman's rank correlation. Testing across 16 models and 5 datasets shows LLMs refuse unreliably despite high factual accuracy.
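The rank-correlation idea behind the metric can be illustrated with a small, dependency-free computation of Spearman's rho. This is a sketch only: the function names and the per-question probabilities are made up for illustration, and the paper's actual RI estimator may differ from a plain Spearman correlation.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-question probabilities, for illustration only.
refusal_prob = [0.1, 0.4, 0.7, 0.9]
error_prob = [0.2, 0.3, 0.8, 0.95]
rho = spearman_rho(refusal_prob, error_prob)
```

A value of rho near 1 would mean the model tends to refuse exactly the questions it is most likely to get wrong; values near 0 correspond to the unreliable refusal behavior the summary reports.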

Evals AI safety Alignment

arXiv cs.AI·May 19

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM is an evolving benchmark with 300 multi-agent embodied tasks in 3D household environments with partial observability. It tests functional Theory of Mind—acting optimally on implicit beliefs—rather than literal belief questions. All seven frontier models score 0.0% on hard task completion, with 93% of failures traced to epistemic coordination breakdowns.

Multi-agent Reasoning Benchmarks

arXiv cs.CL·May 19

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

GiG, a planning framework for embodied agents, uses a Graph-in-Graph architecture with a GNN to encode environmental states and structure experience memory. A bounded lookahead module enhances planning via symbolic transition logic. Evaluated on Robotouille and ALFWorld, GiG outperforms baselines with +22% to +37% Pass@1 gains.

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

Language models fail at extended rule following

Language models fail to reliably apply simple rules over long sequences. In tests on 126 model variants, no model can count beyond a model-dependent threshold. Failures are abrupt and persist despite increasing model size and computation. Mechanistic probing shows that models simulate counting with a finite set of internal states, which are exhausted beyond the threshold.

Reasoning Benchmarks AI Agents

arXiv cs.AI·May 19

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw is an autonomous agent with a three-tier hierarchical architecture (70 skills) for drug molecule evaluation, screening, and optimization. It integrates 30+ specialized resources and achieves state-of-the-art performance on MolBench, a benchmark spanning 8 to 50+ sequential tool calls. Gains concentrate on structured workflow orchestration rather than ad hoc scripting.

AI Agents Multi-agent Benchmarks

arXiv cs.AI·May 19

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

CBT-Audio is a dataset of 1,802 patient turns from 96 public CBT recordings with expert-validated distress labels. Evaluation of 10 open-source audio language models shows audio improves distress estimation over text alone in 8/10 model families, with strongest gains when verbal content and vocal delivery diverge.

Benchmarks Voice Evals

arXiv cs.AI·May 19

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench evaluates 6 open-weight LLMs (3B-72B) on 9 behavioral neuroscience paradigms (Morris water maze, T-maze, etc.). Qwen2.5-VL-7B achieves 52.6% success on ASCII vs 32.1% random and 78.9% rodent baselines. Scaling >7B yields diminishing returns; longer context and chain-of-thought degrade performance.

Benchmarks Reasoning Vision

arXiv cs.AI·May 19

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

Learning-Zone Energy (LZE) is an online data selection framework for RL post-training of LLMs. Tested on Qwen 1.5B-8B across GSM8K and MATH, it retains 40% of training data per step while matching full-data baselines, with OOD gains of +45.9% on AIME25 and 36% FLOP reduction.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

BoLT is an open-source benchmark for black-box optimization applied to LLMs. It covers hyperparameters, data mixtures, and prompts via lightweight surrogate models fitted to thousands of real experiments. Benchmarking Bayesian Optimization and BBO methods reveals gaps in existing approaches.

Benchmarks Open source Papers

arXiv cs.CL·May 19

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

This study shows that unlearning in LLMs merely suppresses information at the surface level: models recover their original behavior after minimal fine-tuning. The authors introduce a representation-level analysis framework (PCA, CKA, Fisher information) to assess genuine data erasure and identify four forgetting regimes based on reversibility and catastrophicity.

Papers AI safety Alignment

arXiv cs.AI·May 19

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Neurosymbolic architecture with ontologies (Role, Domain, Interaction) for enterprise LLM agents. Controlled experiment (1,800 runs, Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B): ontology-constrained agents outperform ungrounded agents on metric accuracy and role consistency (p < .001). 2x greater lift in localized domains (Vietnam) where LLM training coverage is weak.

AI Agents Claude Reasoning

arXiv cs.CL·May 19

Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

Self-improving CAD agents using finite element analysis (FEA) as feedback. Codex (GPT-5.5) and Claude Code (Opus-4.7) models produce no valid artifacts on first attempt; only ~20% of requirements met. Two supervision signals (text blueprint schema and 21-view renderer) improve iterative loops: Box-IoU rises from 0.444 to 0.592 on S2O.

AI Agents Code generation Reasoning

arXiv cs.AI·May 19

SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors

SLEIGHT-Bench is a benchmark of 40 evasion attacks against LLM-based coding agent monitors. Claude Opus 4.6 with extended thinking catches only 23% of attacks (24/40 never detected). Evasion strategies exploit model priors, instruction ambiguity, and state manipulation.

AI Agents AI safety Benchmarks

arXiv cs.CL·May 19

Supervising the search process produces reliable and generalizable information-seeking agents

RAG-Gym, a framework supervising the search process rather than final answers, improves autonomous search agents. Re²Search++ uses process supervision and reasoning reflection to generate higher-quality queries, achieving significant gains on multi-hop benchmarks with better out-of-domain generalization.

RAG AI Agents Reasoning

arXiv cs.AI·May 19

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect formalizes LLM self-correction as a closed-loop control system. A tri-modal error detector (self-consistency, verbalized confidence, logic-chain verification) and type-directed correction controller achieve 79.8% accuracy on CyberCorrect-Bench (440 reasoning tasks), +6.2pp over existing methods, reducing overshoot by 41% via convergence control.

Reasoning Evals Papers

arXiv cs.CL·May 19

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging is a benchmark for evaluating LLMs on extracting and tagging financial data with XBRL. It decomposes the task into two stages: FinNI (extracting numeric entities) and FinCL (mapping to the full US GAAP taxonomy). Testing shows models extract well but struggle with fine-grained concept linking across 10k+ concepts.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat is a corpus of 142,808 conversations (660,293 turns) collected from ChatGPT, Perplexity, Grok, Gemini, and Claude between April 2023 and October 2025. The dataset preserves native affordances (citations, reasoning traces, code artifacts) across 95 languages and enables analysis of cross-platform differences in intent satisfaction, citation strategies, and response latency.

Benchmarks Evals Papers

arXiv cs.CL·May 19

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena is an open-source benchmark for evaluating AI agents on GPU kernel optimization. It contains 196 tasks (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) and tests generalization on unseen configurations. Tested agents (Cursor Agent, Claude Code, Codex) achieve speedups up to 6.89x, but show generalization weaknesses on PyTorch-to-HIP.

AI Agents Code generation Benchmarks

arXiv cs.AI·May 19

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

CAM-Bench is a Lean 4 benchmark of 1,000 computational and applied mathematics problems (optimization, numerical linear algebra, numerical analysis) adapted from textbooks with locally recovered context via dependency-recovery pipeline. Evaluation of LLMs and formalization agents reveals failures in tracking local assumptions and long-horizon control in Lean.

Benchmarks Reasoning Code generation

arXiv cs.CL·May 19

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL is a two-stage RL framework for LLM-based machine translation. It uses post-editing as an auxiliary task to stabilize training and guide optimization. Tests on EN→FI, EN→TR, EN↔ZH show consistent gains; EN→TR achieves performance comparable to DeepSeek-V3.2 on COMET-KIWI.

Reinforcement learning Code generation Benchmarks

arXiv cs.AI·May 19

Enhancing Table Reasoning with Deterministic Table-State Rewards

TABROUGE, a deterministic reward metric based on Longest Common Subsequence, improves LLM table reasoning without training. RE-TAB, a plug-and-play framework using TABROUGE, gains 26.7 pp across six backbones and three benchmarks, reducing test-time scaling samples by 33%.
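As a rough illustration of how a Longest Common Subsequence can yield a deterministic score between two table states, the sketch below computes a ROUGE-L-style F1 over flattened cells. The row-major flattening, the F1 combination, and the names `lcs_length`, `flatten`, and `lcs_f1` are assumptions made for this example; the summary does not give the exact TABROUGE definition.

```python
from typing import List, Sequence

Table = List[List[str]]


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def flatten(table: Table) -> List[str]:
    """Flatten a table into a row-major list of cells (an assumed encoding)."""
    return [cell for row in table for cell in row]


def lcs_f1(predicted: Table, reference: Table) -> float:
    """ROUGE-L-style F1 between two table states, based on cell-level LCS."""
    pred_cells, ref_cells = flatten(predicted), flatten(reference)
    if not pred_cells or not ref_cells:
        return 0.0
    overlap = lcs_length(pred_cells, ref_cells)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_cells)
    recall = overlap / len(ref_cells)
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted state is missing the last reference row.
reference = [["city", "pop"], ["Oslo", "0.7"], ["Rome", "2.8"]]
predicted = [["city", "pop"], ["Oslo", "0.7"]]
score = lcs_f1(predicted, reference)
```

Because the score depends only on the two tables, the same pair always receives the same reward, which is what makes a metric of this kind usable for ranking sampled candidates without training a reward model.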

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

The ChemVA framework advances LLM understanding of chemical reaction diagrams by addressing visual and semantic bottlenecks. It uses a Visual Anchor mechanism for functional group detection and semantic alignment to activate chemical reasoning. It achieves 92.0% structural recognition accuracy on OCRD-Bench, with gains of about 20 percentage points across 9 diverse LLMs.

Vision Benchmarks Papers

arXiv cs.AI·May 19

Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Stream2LLM is an LLM serving system that reduces time-to-first-token (TTFT) by overlapping context retrieval with inference. It handles two modes: append (progressive accumulation) and update (iterative refinement). Evaluation on real workloads shows up to 11x TTFT improvement.

Infrastructure Reasoning RAG

arXiv cs.AI·May 19

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Study reveals a safety vulnerability in personalized dialogue agents: long-term memory biases intent inference and legitimizes harmful queries. PS-Bench benchmark shows personalization increases attack success rates by 15.8%–243.7% versus stateless baselines. A lightweight detection-reflection method is proposed to mitigate this safety degradation.

AI safety AI Agents Benchmarks

arXiv cs.AI·May 19

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym is a framework for developing agents capable of executing multi-step workflows over local files and persistent tools. It includes ClawGym-SynData (13.5K synthesized tasks), models fine-tuned via supervised learning, and ClawGym-Bench (200 evaluation instances). Code and resources released.

AI Agents Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

RL-trained Lean theorem provers suffer mode-collapse at inference: doubling sampling from k=32 to k=64 on miniF2F-test with DeepSeek-Prover-V1.5-RL solves zero additional theorems (42/244). Fixed structural diversity of 15 tactic skeletons recovers +45% relative improvement at k=16 (+12.3±4.2 theorems). Phenomenon is RL-specific and orthogonal to scaling.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

MADP is a multi-agent architecture for enterprise document automation, combining deep learning classification and LLM extraction with human validation. Deployed on 955 real documents, it achieves 97% full-pipeline automation and reduces FTE requirements by 70%. 98.5% document-level accuracy with human-in-the-loop; 69% CO2 reduction vs manual processing.

Multi-agent AI Agents Code generation
