Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

SovSim, a multi-agent simulation framework, evaluates how 11 state-of-the-art LLMs manage shared resources under asymmetric power structures. Finding: introducing an agent with disproportionate power (boss/king) causes 87.3% degradation in survival rate and cooperation breakdowns compared to symmetric settings.

Multi-agent AI Agents Benchmarks

arXiv cs.CL·May 29

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

LLMBridge is an LLM-based system for end-to-end referential bridging resolution in English. The pipeline combines heuristic pre/post-processing with LLM natural language inference capabilities. Evaluated on ISNotes, BASHI, and GUMBridge, it outperforms previous state-of-the-art systems on all three datasets in both end-to-end and gold anaphor settings.

Papers Benchmarks Reasoning

Vercel AI Blog·May 29

Protecting against token theft

Vercel warns of AI inference theft: a single frontier model request costs ~$2, creating high-margin attack opportunities. Rate limits and session-based auth are insufficient; Vercel proposes BotID to verify every AI request individually and prevent tens of thousands of dollars in losses.

AI safety Infrastructure Business

arXiv cs.CL·May 29

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Researchers introduce a Behavioral Specification as an interpretive layer to align AI decisions with user preferences. Tested on 14 autobiographical corpora, it improves representational accuracy at ~25x lower context cost than raw corpus while reducing model hedging. Effective on interpretation-required questions; less helpful on recall-based tasks.

Alignment RAG AI Agents

arXiv cs.CL·May 29

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Qualitative study of 8 AI researchers reveals a paradox: they distrust LLM leaderboards yet use them as decision aids. Peer networks dominate model selection. NLP researchers face SOTA pressure absent in HCI/Systems. Universal demand: cost transparency.

Benchmarks Evals

arXiv cs.CL·May 29

From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization

Paper introduces Program-of-Thoughts prompting for chart summarization: VLMs generate Python programs to derive valid summary statistics instead of direct text. Proposes chart-to-dictionary auxiliary task. Results match existing methods on semantic and factual metrics.

Prompt engineering Vision Reasoning

arXiv cs.CL·May 29

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews is a streaming evaluation protocol to audit how LLMs frame emerging news events for different audiences. Tested on 23 models across 12 monitoring runs, it measures semantic and sentiment variations across 42 identity labels. Results show Policy/Action prompts produce strongest semantic movement, while sentiment variation remains flat across dimensions.

Evals AI safety Alignment

arXiv cs.CL·May 29

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Thoughts-as-Planning formalizes reasoning chain optimization as sequential decision-making over latent semantic space. The framework learns a latent world model simulating effects of reasoning chain edits on outputs, supporting multi-scale edits (token, segment, instruction) via gradient descent or reinforcement learning planning.

Reasoning Reinforcement learning Prompt engineering

arXiv cs.CL·May 29

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Comparative assessment of four Dutch syllabification algorithms (Brandt Corstius, Liang, Trogkanis-Elkan CRF, and a novel deep learning model). The deep learning model combining phonetic and orthographic information achieves 99.65% word accuracy (+0.14% improvement over literature). Data-driven algorithms outperform knowledge-based approaches.

Papers Benchmarks Code generation

arXiv cs.CL·May 29

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Comparative study of 9 ASR models (Whisper, Parakeet, Wav2Vec2) on child speech in Dutch. Fine-tuned Whisper-medium achieves 5.54% WER on JASMIN and 70.37% on DART. An utterance-level selection method identifies 42% (JASMIN) and 18.1% (DART) of utterances as correctly pronounced with ≥98.3% precision, reducing manual verification needs.

Benchmarks Voice Evals

arXiv cs.CL·May 29

A Modular Architecture for Typologically Controlled Lexicon Generation

Modular framework for generating pronounceable, typologically plausible artificial lexicons. Samples phoneme inventories from PHOIBLE, applies three phonological grammars (deterministic, OT, MaxEnt), and assigns meanings via Swadesh-Leipzig-Jakarta ontology. Evaluation on character n-gram perplexity and KL divergence: probabilistic grammars outperform baselines on 100-5,000 word forms.

Papers Benchmarks

arXiv cs.CL·May 29

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

Method to create linear probes detecting concepts in LLM embeddings. Authors define a process: concept delineation via contrastive datasets, layer-wise probe training, tracking across large contexts. Tested on 4 concepts and 3 different LLMs. Goal: scalable monitoring of new models.

Embeddings Evals

arXiv cs.LG·May 29

Ensemble Score Filtering for Real-Data Energy Consumption Forecast Correction

Energy consumption forecast correction method combining a pretrained spatio-temporal model with Ensemble Score Filter (EnSF). EnSF uses score-based diffusion models to assimilate partial and noisy observations. Real-data experiments show EnSF outperforms Ensemble Kalman Filter under nonlinear observation settings.

Benchmarks Papers Reasoning

arXiv cs.LG·May 29

Moment Matching Q-Learning

MoMa QL leverages maximum mean discrepancy (MMD) to accelerate inference of score-based and flow-based generative models in RL. The method guarantees distribution-level convergence and shows superior performance in offline-to-online RL tasks on D4RL benchmarks.

Reinforcement learning Reasoning Benchmarks
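
For readers unfamiliar with the maximum mean discrepancy mentioned in the MoMa QL summary, the following is a minimal sketch of a standard biased MMD² estimate between two samples. It is not the paper's method: the RBF kernel, the `bandwidth` parameter, and the function names `rbf_kernel` and `mmd_squared` are assumptions chosen for illustration.

```python
import math
from typing import Sequence

Point = Sequence[float]


def rbf_kernel(x: Point, y: Point, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two points; the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a: Sequence[Point], b: Sequence[Point]) -> float:
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy samples: identical samples give zero, shifted samples give a positive value.
same = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
print(mmd_squared(same, same))
print(mmd_squared(same, shifted))
```

The estimate is zero when both samples coincide and grows as the two empirical distributions move apart, which is the property that makes MMD usable as a distribution-matching objective.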

arXiv cs.LG·May 29

Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization

Active tether-net system for space debris capture using Graph Neural Network (GNN) to jointly optimize net morphology, thruster masses of maneuverable units, and controller aiming points. GNN reduces mixed combinatorial nonlinear programming (MCNLP) to nonlinear programming (NLP) solved via Particle Swarm Optimization with gradient-based refinement, achieving faster convergence than direct MCNLP solving.

Papers Reasoning

arXiv cs.LG·May 29

Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions

COAST is a causal-intelligence approach for designing constrained interventions that induce state transitions. The system learns context-specific causal graphs, attributes distributional shifts to mechanism-level causal drivers, and uses multi-objective optimization balancing transition efficacy, intervention complexity, and target-state stability. Validated on synthetic benchmarks and real biological datasets.

Reasoning Benchmarks

arXiv cs.LG·May 29

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe is a training-free inference-time wrapper optimizing diffusion-based neural solvers for combinatorial optimization. It enforces per-step interaction-evaluation budgeting, dynamically routing computation to high-conflict/high-uncertainty interactions. On MIS, LoRe achieves an 8× speedup and a 12× peak-memory reduction; on TSP (n=1000), a 15× speedup and a 44× memory reduction.

Reasoning Benchmarks Papers

arXiv cs.LG·May 29

Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning

BrainSimSiam, a lightweight self-supervised learning framework, learns robust representations from fMRI data without labels. Using positive-only pairs, it generalizes across multiple tasks (classification, regression) and outperforms supervised baselines, reducing computational requirements for foundation models in neuroimaging.

Benchmarks

arXiv cs.AI·May 29

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Study on long chain-of-thought traces used for LLM supervised fine-tuning. Researchers identify "harmful continuation": when reasoning continues after the answer is sufficiently supported. Removing these continuations improves fine-tuning outcomes. They propose HCC (Harmful Continuation Cut), a lightweight proxy to detect these boundaries.

Reasoning Fine-tuning Papers

arXiv cs.LG·May 29

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Study on LLM reward design failures in sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven iterative refinement. On MiniGrid, DoorKey-8x8 improves from 2.3% to 97.6% success; KeyCorridor from 31.2% to 86.7%. Failure-mode taxonomy is the primary mechanism.

Reinforcement learning Llama Prompt engineering

arXiv cs.AI·May 29

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

Theoretical study of foundation models trained on synthetic data from other model iterations. Authors show that human curation of one model can degrade alignment of other models through cross-model interactions, unlike isolated settings where it always improves alignment.

Alignment Reinforcement learning Papers

arXiv cs.AI·May 29

Provably Secure Agent Guardrail

New arXiv paper proposing ePCA (Proof-Constrained Action), a formal verification security framework for AI agents. Agents must formalize intentions into first-order logical constraints before executing physical operations, bypassing empirical semantic guardrails. Evaluations show 0% attack success rate and 0% false positive rate across tested scenarios.

AI Agents AI safety Alignment

arXiv cs.AI·May 29

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

DenseSteer is a training-free inference-time steering method that enhances mathematical reasoning in small models (≤3B parameters) by modulating internal representations toward dense reasoning patterns. On Qwen-2.5, the approach shows that more proficient reasoning uses fewer steps but higher information density per step.

Reasoning Fine-tuning Benchmarks

arXiv cs.AI·May 29

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

New Data-Model Compatibility (DMC) metric to assess dataset suitability for reasoning distillation to smaller models. DMC jointly considers data quality, relative difficulty, and student model capability. Validation across multiple student models and tasks shows strong correlation with distillation performance and improvements via dynamic dataset selection during training.

Reasoning Fine-tuning Benchmarks

arXiv cs.LG·May 29

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Study of LoRA-induced representation geometry using Sparse Autoencoders on Gemma-2-9B. Researchers observe weak geometric alignment between LoRA feature dictionaries and pretrained SAEs, suggesting LoRA creates distinct representational structures in the residual stream.

Fine-tuning AI safety Papers

arXiv cs.LG·May 29

Towards Continuous-time Causal Foundation Models

Paper proposing continuous-time causal foundation models for time series using stochastic differential equations (SDEs). Introduces trajectory-law invariance criterion and three-tier taxonomy. Validation on pharmacokinetic and physical-system data shows fine-grid integration outperforms naive approach on 8/8 configurations (p<1/256).

Papers Reasoning Benchmarks

arXiv cs.LG·May 29

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill applies knowledge distillation to improve metagenomic taxonomic annotation. GenomeOcean, a 500M-parameter genomic foundation model, generates soft labels to train a lightweight student network, reducing noise from initial retrieval tools. On 7 CAMI2 datasets, TaxDistill improves MMseqs2's F1 score from 0.763 to 0.941 on the Gastrointestinal dataset.

Papers Fine-tuning Benchmarks

arXiv cs.LG·May 29

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

COM, a strategy for time series LLMs, integrates geometric constraints into token embedding initialization and training. It preserves inherent continuity and ordinality of time series, improving performance across multiple time series analysis benchmarks.

Reasoning Benchmarks Papers

arXiv cs.AI·May 29

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

arXiv paper proposing a multi-agent architecture with semantic memory and caching to mitigate LLM hallucinations. Three-stage pipeline (FrontEndAgent, SecondLevelReviewer, ThirdLevelReviewer) evaluated on 310 prompts. Results: THS reduced by 31.3% to 35.9%, 47.3% cache hit rate, 47% reduction in LLM calls. No retraining required.

AI Agents Multi-agent AI safety

arXiv cs.LG·May 29

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Self-play RL study in Big 2, a four-player imperfect-information card game. PPO outperforms Q-learning, SARSA, and Monte Carlo Q-approximation against random, greedy, and heuristic opponents. Moderate entropy regularization and current-policy self-play improve performance in this controlled multiplayer setting.

Reinforcement learning Multi-agent Benchmarks

arXiv cs.LG·May 29

Molecular Lead Optimization via Agentic Tool Planning

TRACE, an LLM-reasoning agent for drug lead optimization, formulates tool selection as sequential decision-making over action trajectories. The approach improves ADMET properties while preserving critical molecular substructures, outperforming baselines on multiple optimization tasks.

AI Agents Reasoning Papers

arXiv cs.AI·May 29

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Researchers propose a disagreement-based audit pipeline to evaluate LLMs deployed by federal agencies for categorizing public comments. Analyzing 1,260 USDA comments across four LLMs, inter-model thematic divergence exceeds within-model prompt variation, and human annotators introduce interpretive framings absent from the ensemble's collective output.

Evals Reasoning Regulation

arXiv cs.AI·May 29

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

BEAMS establishes benchmarks to evaluate AI tools for modeling and simulation. The open-source sd ai project tests multiple LLMs on tasks including causal translation, model iteration, and causal reasoning. Results show AI tools perform better at qualitative discussion than causal reasoning and quantitative error fixing.

Benchmarks Evals Reasoning

arXiv cs.AI·May 29

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Theoretical paper on stabilizing off-policy temporal-difference learning with function approximation. Proposes BA-TDC and BA-TDRC, replacing TDC's auxiliary matrix with behavior Bellman matrix. Linear analysis with convergence proof under Hurwitz stability condition; experiments on Markov chains and classical counterexamples.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 29

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

STHTD-MP, a new off-policy temporal-difference method, replaces the covariance metric with the behavior-policy Bellman matrix in the primal-dual saddle-point formulation. Formal convergence analysis and spectral comparison with GTD2-MP show potential gains on benchmarks (Random Walk, Boyan Chain).

Reinforcement learning Papers Benchmarks

arXiv cs.CL·May 29

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

eXTC combines structured prompt optimization and reinforcement learning for text classification. The system learns a natural language rulebook first, then distills reasoning from a teacher LLM into a compact model, then expands capabilities via RL. Result: fast inference with local reasoning traces and global modular explanations of learned domain rules.

Prompt engineering Reinforcement learning Reasoning

arXiv cs.AI·May 29

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Redpanda introduces an Agentic Data Plane architecture using out-of-band metadata channels to enforce security policies, data classifications, and behavioral constraints outside the agent's read/write path. These channels prevent hallucinations and adversarial manipulation while maintaining tamper-proof audit trails. Demonstrated with a multi-agent portfolio rebalancing system.

AI Agents Multi-agent AI safety

arXiv cs.CL·May 29

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

Framework to generate targeted synthetic errors with LLMs, aligned to a cognitive taxonomy (revised Bloom's). A Generation Agent drafts erroneous solutions; an Examination Agent validates consistency with the specified error mode. Tested on TheoremQA, the framework shows that generating authentic errors is substantially harder than producing arbitrary wrong answers.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 29

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT), a 306M-parameter model augmenting GPT-2 Small, incorporates category-theoretic and cognitive-science-inspired components. On WikiText-103, CCT achieves 21.27 validation perplexity versus 24.19 for GPT-2 Small baseline, a 12% relative reduction (2.92 PPL). Ablations show simplicial message passing accounts for 84% of the improvement.

GPT Papers Benchmarks
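
The perplexity figures quoted in the CCT summary can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers as reported in the summary; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


baseline_ppl = 24.19  # GPT-2 Small validation perplexity, as quoted in the summary
cct_ppl = 21.27       # CCT validation perplexity, as quoted in the summary

absolute_drop = baseline_ppl - cct_ppl
relative_drop = relative_reduction(baseline_ppl, cct_ppl)

print(f"absolute: {absolute_drop:.2f} PPL, relative: {relative_drop:.1%}")
# prints "absolute: 2.92 PPL, relative: 12.1%"
```

The result agrees with the 2.92 PPL and roughly 12% relative reduction stated in the summary.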

arXiv cs.CL·May 29

Large language models reorganize representational geometry during in-context learning

arXiv paper on representational geometry during in-context learning (ICL) in LLMs. Researchers show ICL performance correlates with task representational structure and successful ICL involves geometric reorganization increasing online separability. LLM behavior follows a prototype-like algorithm.

Reasoning Papers
