May 2026

3149 articles

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging is a benchmark for evaluating LLMs on extracting and tagging financial data with XBRL. It decomposes the task into two stages: FinNI (financial numeric identification) and FinCL (mapping to full US GAAP taxonomy). Testing shows models generalize well in extraction but struggle significantly with fine-grained concept linking.

Benchmarks Reasoning Evals

arXiv cs.AI·May 19

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

SIPO stabilizes diffusion model alignment to human preferences by addressing training instability and off-policy bias. The method introduces DPO-C&M to clip uninformative timesteps and applies timestep-aware importance reweighting. Experiments on SD1.5, SDXL, CogVideoX-2B/5B, and Wan2.1-1.3B demonstrate improvements over Diffusion-DPO.

Image generation Video generation Reinforcement learning

arXiv cs.CL·May 19

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Controlled study comparing human vs synthetic soft-labels on MNIST. Human soft-labels improve model calibration and alignment with human uncertainty, beyond mere correction of mislabeled data. Shows primary value lies in regularization and stable convergence across training runs.

Evals Alignment AI safety

arXiv cs.LG·May 19

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Study of adversarial action removal attacks in self-play reinforcement learning. An attacker selectively masks legal actions from the victim's action set. Experiments on poker (6 to 5,531 states) and two non-poker domains: learned masking causes substantially more damage than random masking, persists across Q-learning/PPO/NFSP/DQN, transfers between agents, and is amplified by self-play.

Reinforcement learning AI safety Benchmarks

arXiv cs.LG·May 19

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Researchers introduce IBPO (Implicit Behavior Policy Optimization), a credit assignment method for reinforcement learning with LLMs. By comparing multiple reasoning trajectories, the framework transforms sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.

Reinforcement learning Reasoning Code generation

arXiv cs.LG·May 19

Mirror Descent-Type Algorithms for the Variational Inequality Problem with Functional Constraints

Mirror descent-type algorithms for variational inequality problems with functional constraints. Proposed methods alternate between productive and non-productive steps based on constraint values, with optimal convergence rates for bounded monotone operators and Lipschitz convex constraints. Applicable to GANs, reinforcement learning, and adversarial training.

Reinforcement learning Papers Alignment

arXiv cs.AI·May 19

Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

New REED physical-layer primitive for noncoherent over-the-air federated learning aggregation. Maps positive/negative parts of model updates to paired transmit energies, removing need for synchronization and instantaneous CSI. Derives exact variance expressions for Rayleigh fading channels.

Infrastructure Papers

arXiv cs.AI·May 19

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

RAG pipeline for EEG-to-text decoding combining an EEG encoder aligned with semantic embeddings, vector retrieval, and an LLM. On ZuCo dataset, the method outperforms random baseline with cosine similarity of 0.181±0.022 vs 0.139±0.029 (30.45% improvement), without teacher forcing at inference.

RAG Embeddings Vector search

arXiv cs.AI·May 19

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

Study on multi-agent systems: 'semantic hijacking' attacks exploit agent confidence. Paradox identified: increasing Worker capability raises attack success rate from 18.4% to 63.9%. Mediation analysis reveals 'linguistic certainty' of stronger agents drives vulnerability. Proposed solution: heterogeneous ensemble verification reduces attack success rate to 2%.

Multi-agent AI Agents AI safety

arXiv cs.CL·May 19

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

Multi-stage framework for detecting reclaimed slurs in multilingual social media (English, Spanish, Italian). Uses XLM-RoBERTa with GPT-4o-mini back-translation data augmentation (×3 corpus), dynamic undersampling, and language-specific threshold optimization. Achieves 2-5% absolute F1 improvement without model retraining.

Fine-tuning RAG Benchmarks

arXiv cs.CL·May 19

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

TABOM, a post-training method for Diffusion Language Models, aligns optimization with the multi-step easy-to-hard decoding trajectory observed at inference. Via Boltzmann modeling of unmasking preferences, it derives a tractable pairwise ranking objective that reduces training-inference discrepancy and improves performance on new domains.

Fine-tuning Reasoning Papers

arXiv cs.CL·May 19

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

DISCA, a training-free inference-time method, culturally aligns LLMs via within-country sociodemographic disagreement. Tested on 20 countries and 7 backbones (2B–70B), it reduces cultural misalignment by 10–24% on MultiTP without modifying model weights.

Alignment AI safety Papers

arXiv cs.CL·May 19

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

New MoLF (Mixture of LoRA and Full) method combines full fine-tuning and LoRA via dynamic optimizer-level routing. Tested on Gemma-3-1B, Qwen2.5-1.5B/3B across SQL, Medical QA, Counterfactual Knowledge. MoLF-Efficient outperforms adaptive LoRA approaches by 20% (Fact) and 9% (Med/SQL). Code open-sourced.

Fine-tuning Benchmarks Papers

arXiv cs.CL·May 19

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

An arXiv study examines the relationship between language-model surprisal and metaphor novelty. Across 8 Pythia model sizes and 154 checkpoints, lexical frequency predicts metaphor novelty better than surprisal. The surprisal-novelty association peaks at early training stages then declines, mirroring the surprisal-frequency association timing.

Papers Benchmarks Evals

arXiv cs.CL·May 19

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym is a framework for developing agents capable of executing multi-step workflows over local files and persistent tools. Authors construct ClawGym-SynData (13.5K synthesized tasks), train ClawGym-Agents via supervised fine-tuning and RL, and propose ClawGym-Bench (200 instances) for evaluation.

AI Agents Reinforcement learning Benchmarks

arXiv cs.CL·May 19

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

Cross-cultural study of 4,641 participants across 7 countries shows LLM emotional support adoption ranges from 20% to 59%. Users who are aged 25-44, religious, married, or of higher socioeconomic status report greater trust. Requests focus on loneliness, stress, relationship conflicts, and mental health. Corpus of 731 multilingual prompts collected.

AI safety Alignment Regulation

arXiv cs.CL·May 19

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

Comparative study of fine-tuning vs. in-context learning on LLMs using formal language tasks. Fine-tuning outperforms ICL on in-distribution generalization, but both perform equally on out-of-distribution generalization. Inductive biases diverge at higher proficiency levels. ICL shows sensitivity to vocabulary and model size.

Fine-tuning Prompt engineering Benchmarks

arXiv cs.CL·May 19

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

STEM proposes a framework for Knowledge Graph-based Question Answering (KGQA) that reframes multi-hop reasoning as schema-guided graph search. Uses a Semantic-to-Structural Projection pipeline and Triple-Dependent GNN to generate a Global Guidance Subgraph. Achieves SOTA on multiple multi-hop benchmarks.

RAG Reasoning Benchmarks

arXiv cs.CL·May 19

NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

NaviRAG introduces a RAG framework shifting from passive segment retrieval to active knowledge navigation. The system structures documents into semantic hierarchies and uses an LLM agent to iteratively navigate, identify information gaps, and retrieve content at appropriate granularity levels. Results show improved retrieval recall and QA performance on long-document benchmarks over conventional RAG.

RAG AI Agents Reasoning

arXiv cs.CL·May 19

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Comparative study of interpretability in Mixture-of-Experts (MoE) architectures vs dense networks. MoE experts show lower neuronal polysemanticity than dense FFNs, especially with sparse routing. Experts function as fine-grained linguistic task specialists (e.g., closing LaTeX brackets), not broad domain specialists. Code released.

arXiv cs.CL·May 19

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Researchers localize 'entity cells' in MLP neurons across language models (Qwen2.5-7B, etc.). These selectively activated neurons encode entity-specific facts. Suppressing one cell erases recall for that entity alone; activating it recovers knowledge even without context. Cells remain stable across aliases, acronyms, and multilingual forms.

Reasoning Papers Benchmarks

arXiv cs.CL·May 19

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Novel DSKD-CMA-GA method for knowledge distillation between LLMs with mismatched vocabularies. Uses generative adversarial learning to align key-query distributions. Modest but consistent ROUGE-L gains (+0.37 average on out-of-distribution data).

Fine-tuning Benchmarks

arXiv cs.CL·May 19

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

PCFJudge, an inference-time method, evaluates factuality by rerunning a listwise prompt across multiple candidate orderings and aggregating scores. On RewardBench 2 Factuality with K=7 permutations, top-1 accuracy improves from 86% to 91.33% (GPT-5.4) and 86.33% to 89.67% (Claude Sonnet 4.6).
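The permutation-and-aggregate idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `judge` callable, the mean aggregation, and the toy position-biased judge are all assumptions made for the example.

```python
import random
from typing import Callable, Sequence


def permutation_consensus(
    candidates: Sequence[str],
    judge: Callable[[Sequence[str]], Sequence[float]],  # hypothetical listwise judge
    k: int = 7,
    seed: int = 0,
) -> list[float]:
    """Average each candidate's score over k random orderings of the list."""
    rng = random.Random(seed)
    totals = [0.0] * len(candidates)
    for _ in range(k):
        order = list(range(len(candidates)))
        rng.shuffle(order)
        # The judge sees the candidates in shuffled order and returns one score per slot.
        scores = judge([candidates[i] for i in order])
        # Map each slot's score back to the candidate's original index.
        for slot, original_index in enumerate(order):
            totals[original_index] += scores[slot]
    return [total / k for total in totals]


def position_biased_judge(shown: Sequence[str]) -> list[float]:
    """Toy stand-in for an LLM judge: scores by length, with a bonus for the first slot."""
    return [len(text) + (5.0 if slot == 0 else 0.0) for slot, text in enumerate(shown)]


answers = ["short", "a longer answer", "mid answer"]
consensus = permutation_consensus(answers, position_biased_judge, k=7)
best = max(range(len(answers)), key=consensus.__getitem__)
```

Averaging over orderings spreads the first-slot bonus across candidates, so the ranking depends less on where a candidate happened to appear in any single prompt.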

Evals GPT Claude

arXiv cs.CL·May 19

Locally Coherent Parallel Decoding in Diffusion Language Models

CoDiLA combines diffusion with local autoregressive decoding for parallel code generation. A compact auxiliary AR model (0.6B parameters) ensures syntactic coherence on diffusion latents, eliminating artifacts while preserving bidirectional generation and sub-linear latency.

Code generation Papers Benchmarks

arXiv cs.AI·May 19

Orthologic for SAT Solving

New algorithm for formula entailment in orthologic (a sound approximation of classical logic) that avoids a costly preprocessing phase and has O(n²(1+|A|)) worst-case complexity. Synthetic SAT benchmarks built via Tseitin encoding yield instances that are hard for SOTA solvers but efficiently solved by orthologic. Orthologic normalization improves solving time on hard problems.

Benchmarks Reasoning

arXiv cs.CL·May 19

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

CounterRefine adds a lightweight repair layer for RAG: after an initial answer, the system issues answer-conditioned queries to retrieve candidate-specific counterevidence, then applies a deterministically-validated KEEP/REVISE refinement step. On SimpleQA, improves baseline by up to 5.8 correct-rate points; modifies 5.6% of outputs with 180 beneficial changes versus 8 harmful ones.

RAG Reasoning Evals

arXiv cs.CL·May 19

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

SPOT (Surgical Post-Training) is an on-policy distillation framework that injects reasoning capabilities into LLMs while preserving prior knowledge. With 4k rectified math pairs, it improves Qwen3-8B by 6.2% on average in 16 minutes on 8x H800, using KL-constrained reward formulation and minimal-edit error correction pipeline.

Reinforcement learning Reasoning Fine-tuning

arXiv cs.CL·May 19

Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

LLMs exhibit highly anisotropic internal representations with massive activations. Rather than treating them as artifacts, the authors identify them as interpretable functional units using a magnitude-based criterion. Steering applied to these critical dimensions outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.

AI safety

arXiv cs.CL·May 19

AI Alignment Breaks at the Edge

arXiv paper showing AI alignment fails on edge cases: value conflicts, multi-stakeholder disagreement, epistemic ambiguity. Scalar rewards and average-case evaluation hide these failures. Authors propose 'Edge alignment': detection, evaluation, and governance to surface critical cases. Tested on 91 edge cases across 4 contemporary models.

Alignment AI safety Evals

arXiv cs.CL·May 19

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

STING is an automated red-teaming framework measuring multi-turn illicit assistance in LLM agents. It constructs step-by-step illicit plans grounded in benign personas and uses judge agents to track completion. Multilingual evaluation across six non-English languages shows attack success does not consistently increase in lower-resource languages, diverging from chatbot findings.

AI Agents AI safety Evals

arXiv cs.CL·May 19

Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Comparative study on humans' and LLMs' ability to distinguish anomalous sentences from truly nonsensical ones. Analysis of five semantically deviant datasets with and without context. Finding: most sentences rated as anomalous can be interpreted with context; LLMs effectively generate plausible contexts.

Benchmarks Evals Papers

arXiv cs.CL·May 19

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Experimental study testing Chomsky's critique of LLMs: GPT-2 small and an LSTM are trained on syntactically impossible languages (reversed sentences, parity-based negations). GPT-2 shows lower perplexity on natural language (loss ratios up to 2.25× on reversed conditions), while the LSTM shows minimal differences. Authors propose a functionalist paradigm against Chomsky's rationalist perspective.

Papers Reasoning Benchmarks

arXiv cs.CL·May 19

Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

ComprExIT, a new context compression framework, addresses structural bottlenecks in existing LLM-based compressors through explicit information transmission. On 12 datasets, it improves average F1 by 18.5%, adds ~1% trainable parameters, and achieves 2x faster compression than baselines.

Reasoning Benchmarks Papers

arXiv cs.CL·May 19

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL is a two-stage RL framework for LLM-based machine translation. It uses post-editing as an auxiliary task to stabilize training and guide optimization. Tests on EN→FI, EN→TR, EN↔ZH show consistent gains; EN→TR achieves performance comparable to DeepSeek-V3.2 on COMET-KIWI.

Reinforcement learning Code generation Benchmarks

arXiv cs.CL·May 19

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

GiG, a planning framework for embodied agents, uses Graph-in-Graph architecture with GNN to encode environmental states and structure experience memory. A bounded lookahead module enhances planning via symbolic transition logic. Evaluated on Robotouille and ALFWorld, GiG outperforms baselines with +22% to +37% Pass@1 gains.

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

COMPACT, a multi-teacher CoT distillation framework, adaptively fuses supervisions from multiple LLMs into compact student models. It dynamically weights teacher gradients using three metrics: graph-based consensus, mutual-information-based adaptability, and loss-based difficulty. Achieves SOTA results across benchmarks while mitigating catastrophic forgetting.

Reasoning Fine-tuning Papers

arXiv cs.CL·May 19

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

DoublyCal, a double-calibration framework, improves LLM reliability by quantifying epistemic uncertainty in retrieved evidence and reasoning. A lightweight proxy model generates Knowledge Graph evidence with calibrated confidence, guiding a black-box LLM toward more accurate and well-calibrated predictions.

Reasoning RAG Evals

arXiv cs.CL·May 19

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Annotation framework using multilingual Llama3.1 as teacher model to distill expert knowledge for medical text tagging in Polish. DistilBERT achieves F1 > 0.80 across 5 clinical categories (Radiology, Oncology, Cardiology, Hypertension, Pathology) with 500× fewer parameters and 300× lower GPU VRAM than LLMs.

Llama Fine-tuning Code generation

arXiv cs.CL·May 19

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Study evaluating 8 multimodal models (Gemini-2.5-Pro, o3, etc.) on robustness against cognitive biases in Chinese short-video misinformation. Manually annotated dataset of 200 videos across 4 health domains. Gemini-2.5-Pro achieves 71.5/100, while o3 scores 35.2. Models are susceptible to social cues like authoritative channel IDs.

Vision Benchmarks AI safety

arXiv cs.CL·May 19

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

TabTrim, a novel table pruning framework for TableQA, replaces sequential revisions with gold trajectory-supervised parallel search. The system uses intermediate sub-tables from gold SQL queries to train a pruner and verifier. TabTrim-8B achieves 73.5% average accuracy, outperforming strongest baselines by 3.2% (79.4% on WikiTQ, 61.2% on TableBench).

Benchmarks Reasoning Papers

arXiv cs.CL·May 19

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

QuCo-RAG proposes dynamic RAG grounded in pre-training corpus statistics rather than model-internal signals. It identifies low-frequency entities and verifies their co-occurrence in 4 trillion tokens using Infini-gram. On multi-hop QA benchmarks, it gains 5–12 EM points over baselines with OLMo-2, and up to 14 points on Llama-3, Qwen2.5, GPT-4.

RAG Reasoning Benchmarks

arXiv cs.CL·May 19

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat is a corpus of 142,808 conversations (660,293 turns) collected from ChatGPT, Perplexity, Grok, Gemini, and Claude between April 2023 and October 2025. The dataset preserves native affordances (citations, reasoning traces, code artifacts) across 95 languages and enables analysis of cross-platform differences in intent satisfaction, citation strategies, and response latency.

Benchmarks Evals Papers

arXiv cs.CL·May 19

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind combines GNN and LLM for multi-step mathematical reasoning. The framework models reasoning as an evolving heterogeneous graph where nodes (conditions, theorems, conclusions) and edges (logical dependencies) enable dynamic theorem selection and iterative conclusion generation. Improved results on QA benchmarks.

Reasoning AI Agents Benchmarks

arXiv cs.CL·May 19

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

TAQ (Task-Aware Quantization) is a training-free post-training quantization method that dynamically allocates precision to task-relevant layers using unlabeled calibration prompts. Three variants (TAQ-IS, TAQ-KL, TAQ-O) estimate layer importance from hidden representations. Significant gains in accuracy-memory ratio validated on real hardware throughput and latency.

Fine-tuning Benchmarks Papers

arXiv cs.CL·May 19

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

LISTEN is an agentic LLM framework for selecting among multiple options with competing objectives. Two iterative algorithms: LISTEN-U refines a parametric utility function, LISTEN-T uses tournament-style selection on small batches. Evaluated on flight booking, shopping, exam scheduling. Code available.
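Tournament-style selection over small batches, as described for LISTEN-T, might look roughly like the sketch below. The `pick_best` callable stands in for the LLM call and is hypothetical, as are the batch size and the flight data; the paper's actual prompting and batching details are not given in this summary.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(
    options: Sequence[T],
    pick_best: Callable[[Sequence[T]], T],  # hypothetical: an LLM choosing one item per batch
    batch_size: int = 4,
) -> T:
    """Repeatedly pick a winner from each small batch until one option remains."""
    if not options:
        raise ValueError("options must not be empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    remaining = list(options)
    while len(remaining) > 1:
        remaining = [
            pick_best(remaining[start : start + batch_size])
            for start in range(0, len(remaining), batch_size)
        ]
    return remaining[0]


# Made-up flights: (name, price in dollars, duration in hours).
flights = [("A", 320, 9.5), ("B", 410, 6.0), ("C", 290, 14.0), ("D", 350, 7.5), ("E", 500, 5.5)]


def cheapest_weighted(batch):
    # Stand-in for the LLM: the lowest value of price + 20 * hours wins the batch.
    return min(batch, key=lambda flight: flight[1] + 20 * flight[2])


winner = tournament_select(flights, cheapest_weighted, batch_size=2)
```

With a consistent scorer like the stand-in above, the tournament returns the same option a full comparison would. An LLM chooser may be less consistent, and keeping batches small is the part of the design this sketch is meant to show.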

AI Agents Prompt engineering Reasoning

arXiv cs.CL·May 19

Tongyi DeepResearch Technical Report

Tongyi DeepResearch is an agentic LLM with 30.5 billion parameters (3.3 billion activated per token) designed for long-horizon deep research tasks. Trained via agentic mid-training and post-training with automatic data synthesis, it achieves state-of-the-art on 7 benchmarks including Humanity's Last Exam and BrowseComp. Model and framework are open-sourced.

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon is a diagnostic benchmark measuring sycophancy (LLMs' tendency to prioritize user agreement over factual accuracy) across 12 SOTA models. Authors identify stable linguistic and affective sub-biases scaling with model capacity, and propose prompt-level and activation-level interventions to modulate them.

Alignment AI safety Evals

arXiv cs.CL·May 19

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR is a framework enabling LLM agents to learn from their own experiences through a closed-loop lifecycle. It combines offline self-distillation (extracting strategic principles from interaction trajectories) and online interaction (retrieving principles to guide decisions). Tested on multi-hop QA benchmarks, it outperforms existing agentic baselines.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·May 19

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

History-Echoes framework investigates how conversational history biases LLM outputs. Using Markov chain modeling and geometric analysis of hidden representations, the study shows behavioral persistence manifests as a geometric trap in latent space, validated across 3 model families and 6 datasets.

Papers Reasoning Alignment

arXiv cs.CL·May 19

Unlocking the Potential of Diffusion Language Models through Template Infilling

Template Infilling (TI) is a conditioning methodology for Diffusion Language Models that aligns structural anchors across the entire response space, replacing prefix prompting. Evaluated on mathematical reasoning, code generation, and trip planning, TI achieves improvements of 9.40% and accelerates multi-token generation.

Prompt engineering Code generation Reasoning

arXiv cs.CL·May 19

Evaluating Language Models' Evaluations of Games

arXiv paper evaluating how language and reasoning models assess board games. In tests on 100+ games with 450 human judgments, reasoning models align better with humans than standard LLMs when evaluating game fairness and fun. Paradox: as models approach game-theoretic optimality, their fit to human judgments weakens.

Reasoning Evals Benchmarks

arXiv cs.CL·May 19

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

arXiv study evaluating ChatGPT's consistency in coding communication data across demographic groups (gender, race). Authors adapt an automated scoring framework and test ChatGPT on three collaborative task types. Result: ChatGPT coding shows consistency comparable to human raters across groups.

GPT Evals Benchmarks

arXiv cs.CL·May 19

Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

Guided Topology Diffusion (GTD) uses graph diffusion models to dynamically generate optimal communication topologies for multi-agent LLM systems. The iterative framework, guided by a proxy model predicting multi-objective rewards (accuracy, utility, cost), adapts topologies to tasks without gradient-based optimization, outperforming static approaches.

Multi-agent AI Agents Benchmarks

arXiv cs.CL·May 19

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Researchers propose the Refusal Index (RI), a metric measuring LLMs' ability to refuse questions beyond their knowledge. RI correlates refusal probability with error probability using Spearman's rank correlation. Testing across 16 models and 5 datasets shows LLMs refuse unreliably despite high factual accuracy.
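Spearman's rank correlation, which this summary says underlies the Refusal Index, is the Pearson correlation of the two rank vectors. The sketch below computes it in plain Python. The per-question probabilities are made up, and how the paper estimates them is not stated here, so this only illustrates the statistic, not the full metric.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up per-question estimates, for illustration only.
error_prob = [0.05, 0.20, 0.40, 0.70, 0.90]
refusal_prob = [0.01, 0.10, 0.05, 0.60, 0.80]
rho = spearman(refusal_prob, error_prob)  # 0.9 for these values
```

A value near 1 would mean the model refuses more often exactly where it is more likely to be wrong; a value near 0 would mean refusals are unrelated to error.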

Evals AI safety Alignment

arXiv cs.CL·May 19

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

AMBS (Adaptive Multi-Branch Steering) aligns LLMs on three simultaneous objectives (Helpfulness, Harmlessness, Honesty) via a 1-to-N Transformer framework. A shared representation is replicated into N objective-specific pathways with constrained transformations. Results: 56.5% average win rate (WR) on LLaMA-2-7B at 189 tokens/s.

Alignment AI safety Reasoning

arXiv cs.CL·May 19

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

EnoTab is a dual denoising framework for TableQA addressing complex questions and large-scale tables. It decomposes questions into minimal semantic units and prunes tables via an explicit evidence tree with post-order node rollback mechanism for abnormal states. Achieves strong performance on complex TableQA tasks.

Reasoning RAG Benchmarks

arXiv cs.CL·May 19

Early Stopping Chain-of-thoughts in Large Language Models

ES-CoT detects answer convergence during chain-of-thought generation to stop inference early. The method reduces inference tokens by 16.08% on average across six reasoning benchmarks while maintaining comparable accuracy to standard CoT.
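One simple way to picture answer-convergence stopping is a run-length check on the tentative answer after each reasoning step. The rule below (stop once the same answer has appeared `patience` times in a row) is a simplified assumption for illustration; the paper's actual convergence criterion and answer-extraction step may differ.

```python
from typing import Iterable, Optional


def early_stop_step(step_answers: Iterable[Optional[str]], patience: int = 3) -> Optional[int]:
    """Return the 0-based step index at which to stop, or None if no convergence.

    step_answers holds the tentative answer extracted after each reasoning step
    (None when no answer could be extracted). How answers are extracted is
    outside this sketch.
    """
    run_length = 0
    previous = None
    for index, answer in enumerate(step_answers):
        if answer is not None and answer == previous:
            run_length += 1
        else:
            # A new answer starts a fresh run; a missing answer resets it.
            run_length = 1 if answer is not None else 0
        previous = answer
        if run_length >= patience:
            return index
    return None


trace = ["12", "15", None, "18", "18", "18", "18", "18"]
stop_at = early_stop_step(trace, patience=3)  # stops at index 5, skipping the last two steps
```

A larger `patience` trades fewer saved tokens for more confidence that the answer has actually settled.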

Reasoning Prompt engineering Benchmarks

arXiv cs.CL·May 19

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Novel 1-bit LLM quantization method leveraging pre-trained models. Uses consistent progressive training (forward/backward) with binary-aware initialization and dual-scaling compensation to convert weights to binarized representation. Reduces training costs and accuracy degradation versus existing approaches.

Fine-tuning Benchmarks Infrastructure

arXiv cs.CL·May 19

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Novel data selection strategy for LLM alignment based on DPO implicit reward gap. By targeting harder preference examples (smaller gap), the method achieves superior performance with only 10% of original data across multiple benchmarks.
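In DPO, the implicit reward of a response is β times the log-ratio of policy to reference likelihood, and the gap is the chosen reward minus the rejected reward. The sketch below ranks pairs by that gap and keeps the smallest 10%. The field names, the β value, and the assumption that sequence log-probabilities are precomputed are all illustrative choices, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Sequence log-probabilities; all four values are assumed to be precomputed.
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """Implicit reward is beta * (log pi(y|x) - log pi_ref(y|x)); gap is chosen minus rejected."""
    reward_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    reward_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return reward_chosen - reward_rejected


def select_hardest(
    pairs: list[PreferencePair], fraction: float = 0.10, beta: float = 0.1
) -> list[PreferencePair]:
    """Keep the given fraction of pairs with the smallest gap (at least one pair)."""
    keep = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=lambda pair: implicit_reward_gap(pair, beta))[:keep]


# Ten made-up pairs whose gap grows with i, so the first pair is the hardest.
pairs = [PreferencePair(-10.0, -10.0 - i, -10.0, -10.0) for i in range(10)]
hardest = select_hardest(pairs, fraction=0.10)
```

Sorting by the signed gap is one reading of "smaller gap"; sorting by absolute value would be another, and the summary does not say which the authors use.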

Reinforcement learning Alignment Evals

arXiv cs.CL·May 19

LaPA²: Length-Aware Prefix and Prompt Attention Augmentation for Long-Form Controllable Text Generation

LaPA² addresses attention dilution in long-form controllable text generation. The method applies length-aware logarithmic scaling to amplify prefix attention weights, counteracting the natural decay of control signals. Training-free framework compatible with both soft and hard prefixes.

Prompt engineering Code generation Reasoning

arXiv cs.AI·May 19

Geometry-aware 4D Video Generation for Robot Manipulation

4D video generation model for robot manipulation enforcing multi-view 3D consistency through cross-view pointmap alignment supervision. Generates spatio-temporally aligned video sequences from single RGB-D image per view without camera poses as input. Demonstrates superior visual stability and robot end-effector trajectory recovery on simulated and real-world datasets.

Robotics Video generation Vision

arXiv cs.CL·May 19

Factual Inconsistencies in Multilingual Wikipedia Tables

Study of factual inconsistencies in multilingual Wikipedia tables. Researchers developed methodology to collect and analyze tables across 300+ language versions of Wikipedia, identifying inconsistency categories. Implications for fact verification and reliability of AI systems trained on Wikipedia.

Benchmarks Evals RAG

arXiv cs.CL·May 19

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging is a benchmark for evaluating LLMs on extracting and tagging financial data with XBRL. It decomposes the task into two stages: FinNI (extracting numeric entities) and FinCL (mapping to the full US GAAP taxonomy). Testing shows models extract well but struggle with fine-grained concept linking across 10k+ concepts.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

Sustainability via LLM Right-sizing

Empirical study comparing 11 LLMs (GPT-4o, Gemma-3, Phi-4, etc.) across 10 everyday occupational tasks. GPT-4o delivers superior performance but at higher cost; smaller models achieve strong results with better efficiency. Proposes task-aware sufficiency assessments over performance-maximizing benchmarks.

Benchmarks Evals Open source

arXiv cs.CL·May 19

Responsible Federated LLMs via Safety Filtering and Constitutional AI

Research integrating safety filtering and Constitutional AI into federated LLM training (FedLLM). Authors demonstrate these techniques improve safety by over 20% on AdvBench, mitigating risks of unsafe model aggregation and redistribution across clients.

AI safety Alignment Reinforcement learning

arXiv cs.CL·May 19

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

SEDD is a GPU-accelerated deduplication framework using MinHash LSH. It outperforms SlimPajama's CPU tool by 158× and NVIDIA NeMo Curator's GPU tool by 7.8× on 30M documents. MinHash signature generation is 375× faster. It deduplicates 1.2T tokens in 3 hours on a 32-GPU V100 cluster.
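For readers unfamiliar with MinHash LSH, the underlying technique can be sketched on the CPU in plain Python. This shows only the general idea (signatures, then banding), not SEDD's GPU implementation; the shingle width, hash count, band count, and choice of BLAKE2b are arbitrary assumptions.

```python
import hashlib


def shingles(text: str, width: int = 5) -> set[str]:
    """Character shingles of a fixed width."""
    return {text[i : i + width] for i in range(max(1, len(text) - width + 1))}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), "little"
                )
                for item in items
            )
        )
    return signature


def lsh_band_keys(signature: list[int], bands: int = 16) -> list[tuple]:
    """Split the signature into bands; documents sharing any band key are candidate duplicates."""
    rows = len(signature) // bands
    return [(band, tuple(signature[band * rows : (band + 1) * rows])) for band in range(bands)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
estimate = estimated_jaccard(sig_a, sig_b)
```

Banding is what keeps the comparison cheap: only documents that collide in at least one band need a full similarity check, rather than every pair.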

Benchmarks Infrastructure Open source

arXiv cs.CL·May 19

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer converts language models (LLaMA, Mistral, QwQ-STILL) into hybrid architectures without training. The method identifies lazy layers and replaces full attention with streaming attention, reducing KV cache costs. Results: up to 2.17× throughput improvement with <1.5% performance loss on LongBench, and a score of 53.3% on AIME24.

Llama Mistral Qwen

arXiv cs.CL·May 19

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

AdaSwitch proposes a cloud-local collaborative paradigm where a local agent (small LLM) handles simple tasks and requests assistance from a cloud agent (large LLM) for complex reasoning. The adaptive mechanism detects local errors and dynamically switches. Evaluation on 7 benchmarks (mathematical reasoning, complex QA) shows performance improvement with reduced computational overhead.

AI Agents Multi-agent Reasoning

SIG

HYP

arXiv cs.CL·May 19

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench is a benchmark for embodied spatial intelligence spanning 10 task categories on OmniGibson. Experiments show active exploration outperforms passive approaches, but models fail primarily from "action blindness": poor action choices lead to poor observations and cascading errors. Models lack metacognition compared to humans.

Benchmarks Vision Reasoning

SIG

HYP

arXiv cs.CL·May 19

GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM is a benchmark of 820 original problems evaluating LLMs via integration of multiple cognitive domains (constraint satisfaction, state tracking, epistemic vigilance) rather than memorization or pure abstract reasoning. It includes IRT calibration over more than 200k prompt-response pairs and 28 models, plus an extensive study of the compute vs. capability trade-off across 11 models and 35 configurations.

Benchmarks Evals Reasoning

SIG

HYP

arXiv cs.CL·May 19

Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

Claim-level specificity control method for agentic systems. CSS decomposes answers into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated admissible level. On LongFact, improves utility from 0.846 to 0.913 while retaining 0.938 specificity.

AI Agents Reasoning Evals

SIG

HYP

arXiv cs.AI·May 19

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

The Loupe is a lightweight spatial gating module for hierarchical Vision Transformers designed for fine-grained visual classification. Inserted at an intermediate feature stage, it predicts a single-channel spatial mask via a small CNN and reweights activations. On CUB-200-2011, it improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61% with <0.1% additional parameters.

Vision Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

PROF, a data curation method, combines process reward models (PRM) and outcome reward models (ORM) to improve reinforcement learning on reasoning tasks. It filters training samples by keeping correct responses with strong process support and incorrect responses with weak process support, avoiding instability from direct PRM optimization.
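The filtering rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Rollout` fields, the `consistency_filter` name, and the fixed `keep_per_group` budget are hypothetical, and "process support" is taken to mean a single averaged PRM score per response.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    correct: bool      # outcome reward: did the final answer verify?
    prm_score: float   # hypothetical averaged process-reward score


def consistency_filter(rollouts, keep_per_group=2):
    """Keep rollouts whose process score agrees with their outcome label."""
    # Correct answers: prefer the ones the PRM supports most strongly.
    correct = sorted(
        (r for r in rollouts if r.correct),
        key=lambda r: r.prm_score,
        reverse=True,
    )
    # Incorrect answers: prefer the ones the PRM supports least.
    incorrect = sorted(
        (r for r in rollouts if not r.correct),
        key=lambda r: r.prm_score,
    )
    return correct[:keep_per_group] + incorrect[:keep_per_group]


batch = [
    Rollout("a", True, 0.91),
    Rollout("b", True, 0.35),   # right answer, weakly supported reasoning
    Rollout("c", False, 0.12),
    Rollout("d", False, 0.80),  # wrong answer, strongly supported reasoning
]
kept = consistency_filter(batch, keep_per_group=1)
print([r.text for r in kept])  # ['a', 'c']
```

In this sketch the PRM only selects which samples reach the policy update; the reward used for training stays the outcome signal, which is how the summary's point about avoiding direct PRM optimization is read here.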

Reinforcement learning Reasoning Evals

SIG

HYP

arXiv cs.AI·May 19

The threat of analytic flexibility in using large language models to simulate human data

arXiv study demonstrating that analytic choices (model selection, sampling parameters, prompt format, demographic data) materially affect the fidelity of "silicon samples" (synthetic datasets generated by LLMs). Across 252 configurations tested, correlations with human data range from r=.23 to r=.84, revealing a major risk of analytic flexibility.

Llama Evals AI safety

SIG

HYP

arXiv cs.AI·May 19

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

A new metric, the Refusal Index (RI), measures LLMs' ability to refuse questions beyond their knowledge. RI correlates refusal probability with error probability using Spearman's rank correlation. Testing across 16 models and 5 datasets shows LLM refusal behavior remains fragile despite high factual accuracy.
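As a rough illustration of the statistic involved, the sketch below computes a plain Spearman rank correlation between per-question refusal and error probabilities. It is not the paper's RI estimator; the input lists and function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-question estimates: the model refuses more where it errs more.
refusal_prob = [0.05, 0.10, 0.40, 0.70]
error_prob = [0.02, 0.20, 0.55, 0.90]
print(spearman(refusal_prob, error_prob))  # 1.0: the two rankings agree exactly
```

A value near 1 means the model tends to refuse exactly the questions it would otherwise get wrong; a value near 0 means refusals are unrelated to what the model actually knows.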

Evals AI safety Alignment

SIG

HYP

arXiv cs.LG·May 19

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

Diagnostic study of catastrophic forgetting in continual learning using Sparse Autoencoders. The framework analyzes how task-specific information evolves at the latent concept level. Finding: most apparent concept-level forgetting is recoverable under a linearity assumption; degradation stems from changes in representational accessibility rather than complete information erasure.

Papers Reasoning Vision

SIG

HYP

arXiv cs.CL·May 19

General Preference Reinforcement Learning

GPRL (General Preference Reinforcement Learning) replaces scalar reward models with a General Preference Model (GPM) that uses k skew-symmetric subspaces. On Llama-3-8B-Instruct, it reaches a 56.51% win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by preventing single-axis reward hacking.

Reinforcement learning Llama Alignment

SIG

HYP

arXiv cs.CL·May 19

Post-Trained MoE Can Skip Half Experts via Self-Distillation

ZEDA converts post-trained static MoE models into dynamic variants via self-distillation. On Qwen3-30B-A3B and GLM-4.7-Flash, the method eliminates 50% of expert FLOPs with marginal accuracy loss and achieves 1.20× end-to-end inference speedup.

Qwen Fine-tuning Infrastructure

SIG

HYP

arXiv cs.AI·May 19

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

SSL4RL leverages self-supervised learning tasks (image rotation, masked patch reconstruction) as reward signals for reinforcement learning fine-tuning of vision-language models. The framework eliminates the need for human preference data and improves performance on vision-centric and vision-language reasoning benchmarks.

Vision Reinforcement learning Reasoning

SIG

HYP

arXiv cs.AI·May 19

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward is a document reward model evaluating structure and style of professional documents, independent of textual quality. Trained on DocPair (117K document pairs, 32 domains), it outperforms GPT-4 by 14.6 percentage points and effectively guides agents via RL toward higher structural and stylistic professionalism.

Reinforcement learning AI Agents Evals

SIG

HYP

arXiv cs.AI·May 19

Unlocking the Potential of Diffusion Language Models through Template Infilling

Template Infilling (TI) is a conditioning methodology for Diffusion Language Models that aligns structural anchors across the entire target response space, replacing prefix prompting. Evaluated on mathematical reasoning, code generation, and trip planning, TI improves performance by 9.40% and accelerates multi-token generation.

Code generation Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon is a diagnostic benchmark measuring sycophancy (bias toward user agreement) across 12 SOTA models. Authors decompose this bias into stable linguistic and affective sub-biases, proposing prompt-level and activation-level interventions to modulate them. Sycophancy emerges from a structural trade-off between truthfulness and polite submission.

Alignment AI safety Evals

SIG

HYP

arXiv cs.AI·May 19

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

LiRA, a lightweight fine-tuning framework, improves multilingual LLM adaptation for low-resource languages. It combines Arca (anchor-based alignment to English) and LaSR (language-aware semantic head) to stabilize representations and cross-lingual consistency. It reports positive results on retrieval, ranking, QA, and reasoning. A multilingual dataset covering 7 Asian languages and the code are released as open source.

Fine-tuning RAG Embeddings

SIG

HYP

arXiv cs.AI·May 19

Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

Semi-supervised semantic segmentation model for predicting undiscovered archaeological site locations. It uses dynamic pseudolabeling and CRF-RNN to handle severe label scarcity. The model matches LAMAP performance on DEM data and improves Dice scores on raw satellite imagery.

Vision Fine-tuning Evals

SIG

HYP

arXiv cs.AI·May 19

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0 is an open-source clinical deep learning toolkit reducing barriers to medical AI research. It unifies 15+ datasets, 20+ clinical tasks, 25+ models, and 5+ interpretability methods in a single framework supporting signals, imaging, and electronic health records. It delivers a 39× speedup and 20× memory reduction, and has 400+ community members.

Open source Code generation Evals

SIG

HYP

arXiv cs.AI·May 19

Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

SAL-T (Spatially Aware Linear Transformer) reduces the quadratic complexity of transformers for jet tagging at the LHC. The linear architecture incorporates spatial partitioning based on kinematic features and convolutional layers. It achieves results comparable to full-attention transformers with lower latency and resource usage.

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep fuses sonar and vision for underwater 3D reconstruction via plane sweeping. The end-to-end deep learning framework overcomes single-modality limitations by adapting the plane sweep algorithm for cross-modal fusion. It is evaluated in simulation and real-world environments, with a public release of the code and the first synchronized stereo-camera and sonar dataset.

Vision Papers Benchmarks

SIG

HYP

arXiv cs.CL·May 19

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

OverEager-Gen is a benchmark measuring out-of-scope actions by autonomous coding agents on benign tasks. On Claude Code, removing the consent declaration raises the overeager rate from 0% to 17.1% (p=2.4×10⁻⁴). Across 500 validated scenarios and 4 products (Claude Code, OpenHands, Codex CLI, Gemini CLI), overeager rates are 5.4–27.7% in permissive mode versus 0.2–4.5% under the ask-to-continue framework.

AI Agents Code generation AI safety

SIG

HYP

arXiv cs.LG·May 19

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

ReTAMamba proposes a Mamba-based architecture for predicting irregular clinical time series. The model estimates observation reliability from missingness and elapsed time, integrates short/long-term information via Chronological Weaving, and uses a budgeted token router. On MIMIC-IV, eICU, and PhysioNet 2012, it reports AUPRC gains of 7.51%, 7.80%, and 10.15% respectively.

Benchmarks Reasoning Papers

SIG

HYP

arXiv cs.LG·May 19

A Theory of Training Profit-Optimal LLMs

Economic model combining scaling laws and microeconomic theory to characterize profit optimization in LLM training. It analyzes how model size, token budget, and computational costs interact. In the compute-bound regime, optimal spending tracks hardware efficiency (FLOPs/$) near-linearly. In the data-bound regime, it scales as D²/E.

Benchmarks Papers Business

SIG

HYP

arXiv cs.AI·May 19

Convergence of Multiagent Learning Systems for Traffic control

Theoretical analysis of convergence for MARL algorithms in urban traffic control. Authors formalize stability of multi-agent systems using independent Q-learning on each traffic signal, extending single-agent asynchronous value iteration convergence proofs to the multi-agent case via stochastic approximation methods.

Multi-agent Reinforcement learning Papers

SIG

HYP

arXiv cs.AI·May 19

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory is a benchmark for evaluating self-evolving memory in LLM agents. It structures data into sequential task streams and tests 10+ memory modules across 10 datasets. Authors propose ExpRAG for experience reuse and ReMem, an action-think-memory refine pipeline for continuous improvement.

AI Agents Benchmarks RAG

SIG

HYP

arXiv cs.AI·May 19

Two-Dimensional Quantization for Geometry-Aware Audio Coding

Q2D2 (Two-Dimensional Quantization) is a novel quantization scheme for neural audio codecs. It projects feature pairs onto structured 2D grids (hexagonal, rhombic, rectangular) to improve compression efficiency, token rates and codebook utilization while maintaining state-of-the-art reconstruction quality across speech, audio and music domains.

Code generation Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

BlendedNet++: A dataset and benchmark for field-resolved aerodynamics and inverse design of blended wing body aircraft

BlendedNet++ is a dataset of 12,492 Blended Wing Body (BWB) aircraft geometries with RANS simulations for aerodynamic field prediction. Authors benchmark 5 deep learning architectures (Transolver best) and propose a generative inverse design pipeline using conditional diffusion models, validated by CFD with R² > 0.99.

Benchmarks Papers Code generation

SIG

HYP

arXiv cs.LG·May 19

GPU-Accelerated Deep Learning for Heatwave Prediction and Urban Heat Risk Assessment

GPU-accelerated deep learning framework for next-day urban thermal prediction and heatwave risk assessment. A ConvLSTM trained with a mixed loss function on MODIS and Open-Meteo data for Sarajevo achieves MAE=0.2293, RMSE=0.3089, and R²=0.8877. The framework also generates city heat risk maps.

Benchmarks Papers

SIG

HYP

arXiv cs.CL·May 19

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

CodeBind introduces a multimodal alignment framework using shared-specific compositional codebooks. The method decomposes representations into semantic shared components and modality-unique components, validated across 9 modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG) achieving state-of-the-art performance in classification and retrieval tasks.

Embeddings Vision Robotics

SIG

HYP

arXiv cs.CL·May 19

Scalable Environments Drive Generalizable Agents

Position paper arguing that generalizable agents require environment scaling—expanding the distribution of executable rule-sets agents interact with—beyond trajectory or task scaling within fixed benchmarks. Proposes unified taxonomy separating trajectory, task, and environment scaling; synthesizes construction paradigms (programmatic generators vs generative world models) for scalable environments.

AI Agents Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·May 19

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

Complete first-order analysis of gradient dynamics in transformer attention heads under cross-entropy training. Authors establish an advantage-based routing law and responsibility-weighted value updates, showing that optimization creates Bayesian manifolds implementing in-context probabilistic reasoning.

Reasoning Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 19

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

An arXiv study challenges the assumption that Mixture of Experts models achieve domain specialization through sparse routing. The COMMITTEEAUDIT framework reveals a domain-invariant "Standing Committee": a compact coalition of experts that captures most routing mass across domains, layers, and budgets. Only the peripheral experts handle domain-specific knowledge.

Benchmarks Papers

SIG

HYP

arXiv cs.AI·May 19

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

DoublyCal is a framework that improves LLM reliability by combining knowledge graphs (KGs) and uncertainty calibration. A lightweight proxy model generates KG evidence with calibrated confidence, guiding a black-box LLM toward more accurate and well-calibrated predictions. It is tested on knowledge-intensive benchmarks, with reduced token costs.

RAG Reasoning AI safety

SIG

HYP