Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

Framework for generating targeted synthetic errors with LLMs, aligned to a cognitive taxonomy (revised Bloom's). A Generation Agent drafts erroneous solutions, and an Examination Agent validates their consistency with the specified error mode. Tested on TheoremQA; results show that generating authentic errors is substantially harder than producing arbitrary wrong answers.

AI Agents Multi-agent Reasoning

arXiv cs.CL·May 29

Large language models reorganize representational geometry during in-context learning

arXiv paper on representational geometry during in-context learning (ICL) in LLMs. Researchers show that ICL performance correlates with task representational structure and that successful ICL involves a geometric reorganization that increases online separability. LLM behavior follows a prototype-like algorithm.

Reasoning Papers

arXiv cs.CL·May 29

A comparative study of transformer-based embeddings for topic coherence

Comparative study of 7 transformer models (MiniLM to LLaMA-2, 22M to 13B parameters) for topic modeling via BERTopic. Finding: model size has negligible impact on topic quality measured by coherence and divergence metrics. Smaller models achieve comparable performance to larger ones.

Embeddings Benchmarks Papers

arXiv cs.CL·May 29

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

M2R (Micro-Macro Retrieval) is a retrieve-while-generate framework reducing hallucinations in long-form LLM generation. It combines macro retrieval (external evidence) and micro retrieval (key information from reasoning) to maintain proximity between factual data and outputs. Trained via reinforcement learning with rule-based rewards.

RAG Reinforcement learning

arXiv cs.CL·May 29

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

Defect grading framework for power transmission equipment using MLLM. In-context learning on commercial models, chain-of-thought Q&A generation to reduce manual annotation, then fine-tuning Qwen3-VL-8B via LoRA. SOTA on three grading tasks.

Qwen Vision Fine-tuning

arXiv cs.LG·May 29

Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules

KOFF decomposes LLMs into sparse shared backbones and domain-specific external memory modules. On Llama and Qwen (3B-8B), the framework preserves performance at 12% global sparsity using LoRA adapters and learned KV caches, while pruning without memories degrades sharply.

Llama Qwen Fine-tuning

arXiv cs.LG·May 29

Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data

Baymex, a multi-objective evolutionary algorithm, learns discretized Bayesian networks for clinical classification. Parallelized on 16 cores (54× speedup), it optimizes cross-entropy and BIC complexity. On real datasets (RADCURE, SUPPORT), it matches or outperforms decision trees, logistic regression, and random forests while producing interpretable models.

Benchmarks

arXiv cs.LG·May 29

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Q-ALIGN DT aligns conditioned sequence models by ensuring the Q-value of the output policy matches the input return-to-go (RTG). The method uses a Q function for dense guidance and RTG-perturbation fine-tuning. Results: improved controllability on D4RL benchmark and generalization to velocity-tracking tasks where prior methods fail.
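For readers new to the term: return-to-go at step t is the sum of the rewards from t to the end of the episode, and it is the value a return-conditioned model is prompted with. A minimal sketch of that quantity only, not of Q-ALIGN DT itself; the function name `returns_to_go` and the optional discount `gamma` are illustrative assumptions, not taken from the paper.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step: R_t = r_t + gamma * R_{t+1}."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step reuses the suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

With `gamma=1.0` this is the plain undiscounted suffix sum, so the first entry equals the total episode return.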

Reinforcement learning Reasoning Benchmarks

arXiv cs.LG·May 29

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM is a compact model with a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. The model learns when to halt based on input complexity, combining high-level and low-level reasoning cycles with Grouped Query Attention, RoPE, and SwiGLU. Results show non-uniform reasoning behavior adapted to tasks and inputs.

Reasoning Fine-tuning Benchmarks

arXiv cs.LG·May 29

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

Detection of False Data Injection Attacks on power systems. The authors propose a topology-aware Cycle-Space Detector (CSD) that is robust against autoencoder-based attacks exploiting the Jacobian null space. It leverages network topology and a Minimum Cycle Basis to enhance detection, with optimal generalization error reported on IEEE 14-, 30-, 57-, and 118-bus systems.

AI safety Benchmarks Papers

arXiv cs.AI·May 29

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

RACE-Sched, an asynchronous multi-agent framework, solves dynamic scheduling by decoupling real-time execution (symbolic heuristics) from long-horizon reasoning (LLM). A semantic rule repository of validated heuristics improves transferability across problem scales. Outperforms Deep RL and LLM baselines on GEN-Bench, MK-Bench, JMS-Bench.

AI Agents Multi-agent Reasoning

arXiv cs.LG·May 29

Context Distillation as Latent Memory Management

Reformulates context distillation as a latent memory management problem. Each context is distilled into an independent LoRA adapter, forming a modular memory bank. A Self-Gating mechanism decides whether to activate latent memories, and cache sharing reduces inference overhead.

Fine-tuning Reasoning Infrastructure

arXiv cs.AI·May 29

Differentiable Belief-based Opponent Shaping

D-BOS (Differentiable Belief-based Opponent Shaping) is a MARL method that shapes opponents by differentiating through k-step softmax-Bayes belief dynamics. Unlike existing approaches, it treats belief state as the shaping target rather than parameters or policies. Results: outperforms PPO and BBM in hidden-role games, with largest gains in mixed-motive settings.

Multi-agent Reinforcement learning Reasoning

arXiv cs.AI·May 29

Mind Your Tone: Does Tone Alter LLM Performance?

Study on prompt tone impact on LLM performance. Tests on ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash/Lite using 50 base questions and 570 MMLU questions (57 subjects) in 5-7 tone variants. Results: tonal effects are systematic but highly model-dependent, with significant accuracy variations across subjects.

Prompt engineering Benchmarks Evals

arXiv cs.CL·May 29

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

Method to forecast conversational derailment (personal attacks) in real-time. Decouples alert triggering from derailment likelihood estimation using forward-looking simulations to assess plausible recovery paths. Reduces false positives without sacrificing forecasting accuracy.

Papers Reasoning AI safety

arXiv cs.CL·May 29

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

Diagnostic tool for measuring constructs like "entrepreneurial spirit" in Chinese state-owned enterprise speeches. On 80 speeches from SOE leaders, authors test LDA, dictionary scorers, and Qwen3.5:9b. The LLM reaches d=1.09 in paired contrast, but half the effect stems from speaker idiolect. Corpus of 2,190 segments and slogan lexicon released.

Benchmarks Evals Qwen

arXiv cs.CL·May 29

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Study of persona effects on explanations generated by multimodal LLM agents in urban perception. Analysis of 59,808 annotations from 1,200 persona-conditioned agents: captions show strong convergence, justifications display systematic variation tied to socioeconomic and political attributes, perception tags show no significant persona-related differences.

Vision AI Agents Prompt engineering

arXiv cs.AI·May 29

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Redpanda introduces an Agentic Data Plane architecture using out-of-band metadata channels to enforce security policies, data classifications, and behavioral constraints outside the agent's read/write path. These channels prevent hallucinations and adversarial manipulation while maintaining tamper-proof audit trails. Demonstrated with a multi-agent portfolio rebalancing system.

AI Agents Multi-agent AI safety

arXiv cs.CL·May 29

Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

SovSim, a multi-agent simulation framework, evaluates how 11 state-of-the-art LLMs manage shared resources under asymmetric power structures. Finding: introducing an agent with disproportionate power (boss/king) causes an 87.3% degradation in survival rate, along with cooperation breakdowns, compared to symmetric settings.

Multi-agent AI Agents Benchmarks

arXiv cs.CL·May 29

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

LLMBridge is an LLM-based system for end-to-end referential bridging resolution in English. The pipeline combines heuristic pre/post-processing with LLM natural language inference capabilities. Evaluated on ISNotes, BASHI, and GUMBridge, it outperforms previous state-of-the-art systems on all three datasets in both end-to-end and gold anaphor settings.

Papers Benchmarks Reasoning

Vercel AI Blog·May 29

Protecting against token theft

Vercel warns of AI inference theft: a single frontier model request costs ~$2, creating high-margin attack opportunities. Rate limits and session-based auth are insufficient; Vercel proposes BotID to verify every AI request individually and prevent tens of thousands of dollars in losses.

AI safety Infrastructure Business

arXiv cs.CL·May 29

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Researchers introduce a Behavioral Specification as an interpretive layer to align AI decisions with user preferences. Tested on 14 autobiographical corpora, it improves representational accuracy at ~25x lower context cost than raw corpus while reducing model hedging. Effective on interpretation-required questions; less helpful on recall-based tasks.

Alignment RAG AI Agents

arXiv cs.CL·May 29

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Qualitative study of 8 AI researchers reveals a paradox: they distrust LLM leaderboards yet use them as decision aids. Peer networks dominate model selection. NLP researchers face SOTA pressure absent in HCI/Systems. Universal demand: cost transparency.

Benchmarks Evals

arXiv cs.CL·May 29

From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization

Paper introduces Program-of-Thoughts prompting for chart summarization: VLMs generate Python programs to derive valid summary statistics instead of direct text. Proposes chart-to-dictionary auxiliary task. Results match existing methods on semantic and factual metrics.

Prompt engineering Vision Reasoning

arXiv cs.CL·May 29

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews is a streaming evaluation protocol to audit how LLMs frame emerging news events for different audiences. Tested on 23 models across 12 monitoring runs, it measures semantic and sentiment variations across 42 identity labels. Results show Policy/Action prompts produce strongest semantic movement, while sentiment variation remains flat across dimensions.

Evals AI safety Alignment

arXiv cs.CL·May 29

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Thoughts-as-Planning formalizes reasoning chain optimization as sequential decision-making over latent semantic space. The framework learns a latent world model simulating effects of reasoning chain edits on outputs, supporting multi-scale edits (token, segment, instruction) via gradient descent or reinforcement learning planning.

Reasoning Reinforcement learning Prompt engineering

arXiv cs.AI·May 29

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

STHTD-MP, a new off-policy temporal-difference method, replaces the covariance metric with the behavior-policy Bellman matrix in the primal-dual saddle-point formulation. Formal convergence analysis and spectral comparison with GTD2-MP show potential gains on benchmarks (Random Walk, Boyan Chain).

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 29

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Theoretical paper on stabilizing off-policy temporal-difference learning with function approximation. Proposes BA-TDC and BA-TDRC, replacing TDC's auxiliary matrix with behavior Bellman matrix. Linear analysis with convergence proof under Hurwitz stability condition; experiments on Markov chains and classical counterexamples.

Reinforcement learning Papers Benchmarks

arXiv cs.CL·May 29

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Comparative assessment of four Dutch syllabification algorithms (Brandt Corstius, Liang, Trogkanis-Elkan CRF, and a novel deep learning model). The deep learning model combining phonetic and orthographic information achieves 99.65% word accuracy (+0.14% improvement over literature). Data-driven algorithms outperform knowledge-based approaches.

Papers Benchmarks Code generation

arXiv cs.CL·May 29

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Comparative study of 9 ASR models (Whisper, Parakeet, Wav2Vec2) on child speech in Dutch. Fine-tuned Whisper-medium achieves 5.54% WER on JASMIN and 70.37% on DART. An utterance-level selection method identifies 42% (JASMIN) and 18.1% (DART) of utterances as correctly pronounced with ≥98.3% precision, reducing manual verification needs.
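The WER figures above are word error rates: the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. A minimal sketch of the standard definition, assuming whitespace tokenization and no text normalization (the paper's own scoring setup is not described in this summary); `word_error_rate` is an illustrative name.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.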

Benchmarks Voice Evals

arXiv cs.AI·May 29

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

BEAMS establishes benchmarks to evaluate AI tools for modeling and simulation. The open-source sd ai project tests multiple LLMs on tasks including causal translation, model iteration, and causal reasoning. Results show AI tools perform better at qualitative discussion than causal reasoning and quantitative error fixing.

Benchmarks Evals Reasoning

arXiv cs.AI·May 29

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Researchers propose a disagreement-based audit pipeline to evaluate LLMs deployed by federal agencies for categorizing public comments. Analyzing 1,260 USDA comments across four LLMs, inter-model thematic divergence exceeds within-model prompt variation, and human annotators introduce interpretive framings absent from the ensemble's collective output.

Evals Reasoning Regulation

arXiv cs.LG·May 29

Molecular Lead Optimization via Agentic Tool Planning

TRACE, an LLM-reasoning agent for drug lead optimization, formulates tool selection as sequential decision-making over action trajectories. The approach improves ADMET properties while preserving critical molecular substructures, outperforming baselines on multiple optimization tasks.

AI Agents Reasoning Papers

arXiv cs.LG·May 29

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Self-play RL study in Big 2, a four-player imperfect-information card game. PPO outperforms Q-learning, SARSA, and Monte Carlo Q-approximation against random, greedy, and heuristic opponents. Moderate entropy regularization and current-policy self-play improve performance in this controlled multiplayer setting.

Reinforcement learning Multi-agent Benchmarks

arXiv cs.AI·May 29

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

arXiv paper proposing a multi-agent architecture with semantic memory and caching to mitigate LLM hallucinations. Three-stage pipeline (FrontEndAgent, SecondLevelReviewer, ThirdLevelReviewer) evaluated on 310 prompts. Results: THS reduced by 31.3% to 35.9%, 47.3% cache hit rate, 47% reduction in LLM calls. No retraining required.
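To illustrate how a semantic cache can cut LLM calls, here is a minimal sketch: a prompt is answered from the cache when it is similar enough to one seen before, and only otherwise sent to the model. Everything here is an assumption for illustration (the class `SemanticCache`, the bag-of-words `toy_embed` standing in for a real embedding model, and the 0.8 threshold); it does not reproduce the paper's agents or its cache design.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache keyed by prompt similarity rather than exact text."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt: str) -> Optional[str]:
        query = toy_embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def answer(self, prompt: str, call_llm: Callable[[str], str]) -> str:
        cached = self.lookup(prompt)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: pay for one model call and remember the result.
        self.misses += 1
        response = call_llm(prompt)
        self.entries.append((toy_embed(prompt), response))
        return response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this sketch every hit is one avoided model call, so the hit rate and the reduction in calls move together, which is consistent with the two figures reported above being close.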

AI Agents Multi-agent AI safety

arXiv cs.CL·May 29

A Modular Architecture for Typologically Controlled Lexicon Generation

Modular framework for generating pronounceable, typologically plausible artificial lexicons. Samples phoneme inventories from PHOIBLE, applies three phonological grammars (deterministic, OT, MaxEnt), and assigns meanings via Swadesh-Leipzig-Jakarta ontology. Evaluation on character n-gram perplexity and KL divergence: probabilistic grammars outperform baselines on 100-5,000 word forms.
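As a rough illustration of the KL-divergence side of such an evaluation, the sketch below compares the character distribution of a generated lexicon against a reference lexicon. It assumes character unigrams with additive smoothing over a fixed alphabet; the paper's actual n-gram order, smoothing, and reference data are not given in this summary, and both function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def char_distribution(
    words: Iterable[str], alphabet: str, alpha: float = 1.0
) -> Dict[str, float]:
    """Additively smoothed character distribution over a fixed alphabet."""
    counts = Counter(ch for word in words for ch in word if ch in alphabet)
    total = sum(counts.values()) + alpha * len(alphabet)
    return {ch: (counts[ch] + alpha) / total for ch in alphabet}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D_KL(P || Q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

Smoothing keeps every probability positive, so the divergence stays finite even when a character never appears in one of the two lexicons.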

Papers Benchmarks

arXiv cs.CL·May 29

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

Method to create linear probes detecting concepts in LLM embeddings. Authors define a process: concept delineation via contrastive datasets, layer-wise probe training, tracking across large contexts. Tested on 4 concepts and 3 different LLMs. Goal: scalable monitoring of new models.

Embeddings Evals

arXiv cs.LG·May 29

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

COM, a strategy for time series LLMs, integrates geometric constraints into token embedding initialization and training. It preserves inherent continuity and ordinality of time series, improving performance across multiple time series analysis benchmarks.

Reasoning Benchmarks Papers

arXiv cs.LG·May 29

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill applies knowledge distillation to improve metagenomic taxonomic annotation. GenomeOcean, a 500M-parameter genomic foundation model, generates soft labels to train a lightweight student network, reducing noise from initial retrieval tools. On 7 CAMI2 datasets, TaxDistill improves MMseqs2's F1 score from 0.763 to 0.941 on the Gastrointestinal dataset.

Papers Fine-tuning Benchmarks

arXiv cs.LG·May 29

Balancing Multimodal Learning through Label Space Reshaping

New BMLR method balances multimodal learning by reshaping label space to equalize mapping difficulty across modalities. Addresses modality imbalance where faster-converging modalities dominate optimization. Novel label-side approach rather than gradient adjustment.

Papers Benchmarks Vision
