Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Learn-by-Wire Guard (LBW-Guard) is an autonomous governance layer that supervises the AdamW optimizer during language-model training. Tested on Qwen2.5-7B with WikiText-103, LBW-Guard reduces final perplexity from 13.21 to 10.74 (−18.7%) and accelerates training by 1.10×. Under extreme learning-rate stress (LR=3e-3), AdamW fails (perplexity 1885.24) while LBW-Guard remains stable (11.57).

Qwen Reinforcement learning Benchmarks

arXiv cs.CL·May 20

EmbGen: Teaching with Reassembled Corpora

EmbGen is a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them via embedding similarity, then generates QA pairs with proximity and cluster-specialized sampling. On three datasets, EmbGen improves Binary Accuracy by 12.5% (5M tokens) to 88.9% (20M tokens) on the most heterogeneous dataset versus baselines.

Fine-tuning RAG Embeddings

arXiv cs.LG·May 20

VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

VCR is a self-supervised framework learning robust representations from incomplete wearable sensor signals. It uses an orthogonal tokenizer to disentangle shared semantics from modality-specific residuals, combined with a missing-aware mixture-of-experts backbone. VCR improves performance on health monitoring tasks under single and multiple missing modalities.

Papers Embeddings Reinforcement learning

arXiv cs.CL·May 20

Retrieval-Augmented Linguistic Calibration

New RALC method to calibrate linguistic confidence expressions in LLMs. Models confidence as distribution of perceived probabilities, introduces Faithfulness Divergence (FD) metric, and uses retrieval-augmented rewriting. Improves faithfulness up to 66% and calibration up to 58% across three QA benchmarks and five LLM families.

RAG Evals Prompt engineering

arXiv cs.CL·May 20

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

MAFIG, a multi-agent framework, uses multiple LLM agents and feature-specific evaluators to generate reading comprehension items with robust difficulty control. The method constructs sequences of feature constraints yielding monotonically increasing difficulty, outperforming existing single-agent approaches.

Multi-agent AI Agents Code generation

arXiv cs.AI·May 20

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

AgentNLQ, a multi-agent method, achieves 78.1% semantic accuracy on the BIRD benchmark for natural language to SQL conversion. The system uses an optimized orchestrator to plan, reflect, and self-correct queries, enriches schema with context-aware metadata, and incorporates user-provided business rules.

AI Agents Multi-agent Benchmarks

arXiv cs.AI·May 20

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

Formal framework describing knowledge graph capabilities for agents. Extends VoID/DCAT standards with Agentic Affordance Profile (AAP) to specify what an agent can prove, closure assumptions, and vocabulary grounding. Identifies divergence between schema and entailment regime as epistemic failure mode.

AI Agents RAG Papers

arXiv cs.LG·May 20

StampFormer: A Physics-Guided Material-Geometry-Coupled Multimodal Model for Rapid Prediction of Physical Fields in Sheet Metal Stamping

StampFormer is a multimodal deep learning model predicting physical fields in sheet metal stamping by fusing geometry and material properties. Tested on steel/aluminium panels, it achieves <8.5% relative error in <1 second, replacing costly FEA analyses.

Papers Vision Reasoning

arXiv cs.LG·May 20

Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance

arXiv paper proposing an adaptive framework to improve spatiotemporal forecasting by harmonizing spatial and temporal feature representations. Uses low-rank matrix embedding for spatial compression and extended temporal horizon. Demonstrates substantial accuracy gains on urban traffic, meteorology, and epidemic datasets. Code available on GitHub.

Benchmarks Papers

arXiv cs.AI·May 20

Agentic Trading: When LLM Agents Meet Financial Markets

Systematic review of 77 studies on LLM agents in financial trading. Only 19 studies meet the minimum criteria (action output + closed-loop evaluation). Key findings: a lack of comparable protocols, insufficient reproducibility (no R3 studies), and missing documentation on transaction costs and universe handling.

AI Agents Papers Evals

arXiv cs.CL·May 20

K-Quantization and its Impact on Output Performance

Empirical study of quantization impact (2-6 bits) on 8 LLMs evaluated on MMLU-Pro, CRUXEval, and MuSR. Results: 8-bit precision (Q8_0) optimal, aggressive quantization (Q2_K) acceptable but with variable losses across models/tasks. 7-9B models offer best efficiency/performance trade-off.

Benchmarks Fine-tuning

arXiv cs.CL·May 20

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

arXiv paper proposing LONSREX, a data synthesis pipeline to fine-tune LLMs for explainable misinformation detection. Authors identify two issues: rationales filtered on binary labels are insufficient, and stronger LLMs produce unnecessarily verbose rationales. LONSREX introduces a metric quantifying the necessity and sufficiency of each verification step.

Llama Fine-tuning Evals

arXiv cs.AI·May 20

Efficient Elicitation of Collective Disagreements

Theoretical paper on efficient elicitation of collective disagreement among voters. Introduces the plurality matrix, a generalization of pairwise comparisons, to identify minimal aggregated preference information needed for disagreement measures. Shows that certain measures (rank-variance, divisiveness) require subset size 3, not just pairwise comparisons.

Papers Evals

arXiv cs.AI·May 20

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

Formalizes trust calibration for autonomous agents as preference learning. A policy gateway maintains a Gaussian-process posterior over human risk tolerance from binary approve/deny feedback, escalating uncertain decisions to humans. Structured as Preferential Bayesian Optimization with uncertainty-targeted querying.

AI Agents Reasoning AI safety

arXiv cs.CL·May 20

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

A position paper critiques Uncertainty Quantification (UQ) methods for LLMs, arguing they are merely unsupervised clustering algorithms. These approaches quantify internal consistency of generations rather than external correctness, failing to detect 'confident hallucinations.' The author proposes a paradigm shift toward UQ grounded in objective truth.

AI safety Alignment Evals

arXiv cs.LG·May 20

Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin

Multi-pedestrian safety warning system for urban intersections, built on a tightly coupled digital twin framework that combines camera and ultra-wideband sensors with trajectory prediction modeling. Deployed on the COSMOS testbed in New York City, it delivers real-time alerts via edge-cloud computing and significantly reduces response times for vulnerable road users.

Vision Infrastructure AI safety

arXiv cs.AI·May 20

Evaluating the Utility of Personal Health Records in Personalized Health AI

Study evaluating Gemini 3.0 Flash on 2,257 patient queries with Personal Health Records (PHR) context. Significant improvement in answer helpfulness with PHR data (p<0.001). Identified gaps: temporal disorientation, rare confabulations. Evaluation framework developed to monitor LLM answer quality based on PHR context.

Gemini RAG Evals

arXiv cs.AI·May 20

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

EMO-BOOST combines RGB/audio-focused deepfake detectors with EmoForensics, an emotion-based detector using audio-visual emotion recognition. The method models temporal consistency of emotions and improves cross-manipulation generalization by 2.1% AUC on FakeAVCeleb.

Vision Voice AI safety

arXiv cs.CL·May 20

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

MMoA proposes a recurrent Mixture-of-Agents architecture with LSTM-based routing for dynamic agent selection. On AlpacaEval 2.0, MT-Bench, and Arena-Hard, MMoA achieves 58.0% win rate (vs 59.8% for MoA) while reducing computational overhead by 4.6% through selective agent activation.

Multi-agent AI Agents Reasoning

arXiv cs.AI·May 20

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

arXiv paper introduces Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. On a two-stage adaptive assessment, the model's scores correlate with the intended levels at r=0.698 (r² ≈ 0.49, so roughly half of the intended variance is recovered), with systematic positive bias. GEA is strong (r>0.7) for syntactically verifiable skills but near zero for design-level skills.

Evals Reasoning AI safety

arXiv cs.AI·May 20

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

SIGMA is a signed graph-based multi-agent framework that explicitly models trust, conflict, and neutral relations among LLM agents. Through conflict-aware message passing and weighted aggregation, it suppresses conflicting signals and reinforces trustworthy agents. Experiments on 6 benchmarks demonstrate accuracy and conflict-resilience gains over baselines.

Multi-agent Reasoning AI Agents

arXiv cs.AI·May 20

Generative Recursive Reasoning

GRAM (Generative Recursive reAsoning Models) extends deterministic recursive reasoning models by introducing multiple stochastic latent trajectories. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, with unconditional generation capability.

Reasoning Papers

arXiv cs.AI·May 20

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

BLINKG is a benchmark to evaluate LLM capabilities in generating Knowledge Graphs from heterogeneous data sources. Testing multiple state-of-the-art LLMs shows promising but limited performance in complex scenarios. The benchmark defines requirements for semi-automated LLM-driven KG construction.

Benchmarks Papers RAG

arXiv cs.CL·May 20

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

Study on cross-modal skill injection: transferring domain-expert LLM capabilities to VLMs via model merging. Systematic analysis of 3 aspects: scenarios (strong in instruction-following and cross-lingual, weak in mathematical reasoning), methods (TA and DARE outperform alternatives), hyperparameters. Avoids expensive SFT.

Fine-tuning Vision Reasoning

arXiv cs.CL·May 20

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

LambdaPO introduces pairwise preference-based policy optimization for reasoning model alignment. Unlike GRPO's monolithic baseline, LambdaPO decomposes advantage estimation into pairwise reward differentials between trajectories, weighted by policy confidence. A semantic density reward augments the optimization signal on math reasoning and QA tasks.

Reinforcement learning Reasoning Alignment

arXiv cs.AI·May 20

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

ReElicit is a Bayesian optimization framework for tuning system prompts using only aggregate feedback. An LLM dynamically elicits a compact, interpretable feature space, then a Gaussian process selects optimized target vectors refined into deployable prompts. Across 10 tasks with 30-evaluation budget, ReElicit outperforms aggregate-only prompt optimization baselines.

Prompt engineering Reasoning

arXiv cs.AI·May 20

KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

Comparative study of Kolmogorov-Arnold Networks (KANs) vs MLPs for IMU-based Human Activity Recognition (HAR). KANs excel on clean data but fail on noisy real-world datasets. Proposed hybrid KAN-MLP architecture achieves +5.33% macro F1-score improvement across 8 public datasets, outperforming pure baselines.

Benchmarks Papers

arXiv cs.CL·May 20

Drifting Objectives for Refining Discrete Diffusion Language Models

TokenDrift applies drifting methods (objective correction) to discrete diffusion language models. The technique lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates to DDLM logits. On MDLM and DUO, TokenDrift reduces generation perplexity by 89% and 86% at 4 NFE.

Papers Code generation Reasoning

arXiv cs.AI·May 20

Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

Analysis of power imbalances in stake-weighted governance of Proof-of-Stake blockchains using the Penrose-Banzhaf power index. Shows how few large-stake users can control decision-making despite not owning majority stakes. Provides theoretical and empirical findings on Project Catalyst data.

Benchmarks Papers
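
To make the power-imbalance point in the summary above concrete, here is a minimal sketch of the normalized Penrose-Banzhaf index for a small weighted voting game. The function name `banzhaf_index`, the stake values, and the quota are hypothetical and chosen for illustration only; they are not taken from the paper or from Project Catalyst data, and the brute-force enumeration is only practical for a handful of voters.

```python
from itertools import combinations


def banzhaf_index(weights, quota):
    """Normalized Penrose-Banzhaf index by brute-force coalition enumeration.

    A voter is "critical" (a swing) in a winning coalition if removing
    that voter drops the coalition's total weight below the quota.
    """
    n = len(weights)
    swings = [0] * n
    for size in range(1, n + 1):
        for coalition in combinations(range(n), size):
            total = sum(weights[i] for i in coalition)
            if total < quota:
                continue  # losing coalition, nobody can be critical
            for i in coalition:
                if total - weights[i] < quota:
                    swings[i] += 1
    total_swings = sum(swings)
    if total_swings == 0:
        return [0.0] * n  # quota unreachable: no voter has any power
    return [s / total_swings for s in swings]


# Hypothetical stakes (in percent) with a strict-majority quota of 51.
stakes = [50, 49, 1]
power = banzhaf_index(stakes, quota=51)
for stake, p in zip(stakes, power):
    print(f"stake {stake:>2}% -> power {p:.2f}")
```

In this toy game the 50% holder is critical in three of the five swings, while the 49% and 1% holders are critical in one each, so voting power can diverge sharply from stake share even when nobody holds a majority.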

arXiv cs.AI·May 20

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Microservice architecture for Document AI pipelines in production: classification, OCR, and structured field extraction via LLM. Processes thousands of multi-page documents per hour. Key findings: OCR dominates end-to-end latency (not LLM parsing), system saturation determined by shared GPU capacity. Concrete architectural patterns for production deployment.

Infrastructure Code generation RAG

arXiv cs.CL·May 20

Base Models Look Human To AI Detectors

Commercial AI detectors (GPTZero, Pangram) classify base model text as human-written, unlike instruction-tuned versions. Researchers propose HIP (Humanization by Iterative Paraphrasing), a pipeline that minimally fine-tunes a base model into an iterative paraphraser. Tested on Llama-3 and Qwen-3 (0.6B-70B), HIP improves human-likeness while preserving semantics.

Llama Qwen Fine-tuning

arXiv cs.LG·May 20

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

LILAC+ proposes a framework for safe continual reinforcement learning in nonstationary environments. The system combines three adaptive mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state enforcement. Evaluated in simulated driving, it reduces safety violations under distribution shift while maintaining competitive task performance.

Reinforcement learning AI safety Alignment

arXiv cs.LG·May 20

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

CPSS (Constraint Projection Safety Shield) converts cumulative safety budgets into adaptive state-level control constraints for nonstationary reinforcement learning. The mechanism dynamically adjusts risk thresholds based on context, guarantees per-state threshold satisfaction, and reduces safety violations in highway merging scenarios.

Reinforcement learning AI safety Reasoning

arXiv cs.LG·May 20

An Integrated Forecasting Prototype for Emergency Department Boarding Time to Support Proactive Operational Decision Making

Forecasting prototype for emergency department boarding time using time series models (DLinear, NLinear) on real hospital data. Integrates weather, holidays, and local events. Prediction horizons: 6, 8, 10, 12, and 24 hours. MLOps web application developed for operational deployment.

Benchmarks Infrastructure Tools

arXiv cs.CL·May 20

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Study on LLM-based error correction for low-resource West Frisian ASR. GPT models improve WER, including on an offline dataset that controls for data contamination. Detailed error analysis reveals the models' correction patterns.

GPT Llama Evals

arXiv cs.LG·May 20

Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation

Spectral Gradient Surgery (SGS) improves dataset distillation for out-of-distribution generalization. The method disentangles class-discriminative from domain-specific information in compressed synthetic data via spectral analysis of cross-domain gradients. SGS integrates as a plug-and-play module with existing Distribution Matching methods.

Benchmarks Fine-tuning

arXiv cs.CL·May 20

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

NLP framework for Arabic financial sentiment analysis on the Saudi market. Uses an 84K-sample corpus combining official news and social media, with Transformer-based NER and a five-class classification scheme. Demonstrates reliable and scalable Arabic financial sentiment analysis.

arXiv cs.LG·May 20

Robust Basis Spline Decoupling for the Compression of Transformer Models

New transformer compression method using B-spline-based decoupling. R-CMTF-BSD algorithm employs constrained coupled matrix-tensor factorization to reduce parameters while maintaining accuracy. Validated on Vision Transformer and Swin Transformer architectures with substantial parameter reduction.

Benchmarks Vision

arXiv cs.LG·May 20

Simply Stabilizing the Loop via Fully Looped Transformer

Fully Looped Transformer addresses training instability in looped models, which reuse Transformer blocks across iterations. It introduces two parameter-free modifications: inter-loop signal distribution and attention injection. Stable up to 12 iterations, it improves downstream performance by 13.2% and enables adjustable inference-time compute via loop iteration control.

Reasoning Papers Benchmarks

arXiv cs.LG·May 20

Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis

Non-parametric estimators (KM-ARL and KM-ADD) for evaluating quickest changepoint detectors on finite and irregular sequence lengths. Uses survival analysis analogy to model detection probabilities under truncation. Python code provided.

Benchmarks Papers
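
As background for the survival-analysis analogy in the summary above, the following is a minimal sketch of the textbook Kaplan-Meier (product-limit) estimator. It is not the paper's KM-ARL or KM-ADD estimator. The function name `kaplan_meier` and the toy data are hypothetical, and treating a run that reaches the end of a finite sequence without an alarm as right-censored is an assumption made here for illustration.

```python
def kaplan_meier(durations, observed):
    """Textbook Kaplan-Meier survival curve.

    durations: time until the event (e.g. an alarm) or until censoring
    observed:  True if the event was seen, False if the run was censored
               (e.g. the sequence ended before any alarm was raised)
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all runs that end (by event or censoring) at time t.
        while i < len(data) and data[i][0] == t:
            events += int(data[i][1])
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy run lengths: the second run of length 3 hit the end of its
# sequence without an alarm, so it is censored rather than discarded.
run_lengths = [2, 3, 3, 5]
alarm_seen = [True, True, False, True]
for t, s in kaplan_meier(run_lengths, alarm_seen):
    print(f"t={t}: S(t)={s:.3f}")
```

The censored run still counts toward the at-risk set up to time 3, which is what keeps the estimate from being biased toward short run lengths when sequences are truncated.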
