Page 19 of 192

AllHigh signalRecent

7679 articles

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

TCP-MCP co-evolves agent prompts and communication topologies as a unified genome. On MMLU-Pro, MMLU, and GSM8K with DeepSeek-V3.2 backbone, the system achieves 82.66%, 89.96%, and 96.61% accuracy while consuming 5.69× fewer tokens than debate-style systems.

Multi-agent Prompt engineering Benchmarks

SIG

HYP

arXiv cs.AI·May 28

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

EgoBench is an interactive multimodal benchmark for tool-using agents with 1,045 egocentric-video tasks across four daily scenarios. Eight SOTA video-MLLMs achieve only 30.62% accuracy at best, 19.43% average, exposing bottlenecks in visual perception and multi-hop reasoning.

AI Agents Vision Benchmarks

SIG

HYP

arXiv cs.LG·May 28

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv study on LLM refusal robustness across batch configurations. Paired testing protocol across 15 models finds 0.16% authentic safety-label flips. vLLM with BATCH_INVARIANT=1 eliminates detected instabilities (22→0 flips). Recommendation: validate refusal in actual serving environment.

AI safety Evals Benchmarks

SIG

HYP

arXiv cs.LG·May 28

Fine-Tuning Dynamics of In-Context Factual Recall in Transformers

Theoretical study of in-context learning dynamics in transformers. Authors formalize the IC-recall task where the model infers a hidden relation from examples and retrieves factual knowledge stored in parameters. Proof that fine-tuning converges to a specific attention pattern using polylogarithmic sample complexity.

Reasoning Fine-tuning Papers

SIG

HYP

arXiv cs.LG·May 28

Heterogeneous Parallelism for Multimodal Large Language Model Training

arXiv paper proposing heterogeneous parallelism for multimodal LLM training. Allows encoders and LLMs to use independent sharding layouts (TP/CP/PP/DP/EP) on shared or disjoint GPUs. Improves throughput by up to 49.3% in colocated configuration and 13% in non-colocated mode. Open-source implementation as Megatron-LM extension.

Infrastructure Papers Benchmarks

SIG

HYP

arXiv cs.AI·May 28

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad optimizes LLM agent skills using a gradient-descent-inspired framework. Task executions provide trajectory-level loss signals, automatic diagnostics generate text-based gradients, and a momentum agent accumulates recurring patterns. Evaluated on SpreadsheetBench and WikiTableQuestions, SkillGrad outperforms training-based baselines by 6.7 percentage points on average.

AI Agents Reinforcement learning Prompt engineering

SIG

HYP

arXiv cs.AI·May 28

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

PEAM is an embodied agent memory framework in Minecraft that internalizes experience as parameters rather than inference-time retrieval. It pairs a slow LLM for reasoning with a fast parametric module (Mixture-of-Experts LoRA) learning via behavioral cloning and contrastive objectives. Failures are treated as training signals to learn corrected actions.

AI Agents Reinforcement learning Fine-tuning

SIG

HYP

arXiv cs.CL·May 28

RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

RAG-Coding is a multi-agent method orchestrating 4 LLMs for automated ICD-10-CM coding. It grounds decisions in external sources (official tabular, guidelines) and improves accuracy by 8-13% micro-F1 on MDACE. Authors release MDACE-2025 with expert annotations aligned to 2025 guidelines.

RAG AI Agents Multi-agent

SIG

HYP

arXiv cs.CL·May 28

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX is a two-stage cross-lingual biomedical entity linking system requiring no annotated training data. It enriches SapBERT with Wikidata-derived multilingual aliases and uses an LLM for context-aware disambiguation. On five benchmarks, it achieves +19.2 Recall@1 on XL-BEL, with major gains for low-resource languages (Turkish +21.6, Korean +22.1, Thai +30.8).

Benchmarks Papers RAG

SIG

HYP

arXiv cs.CL·May 28

Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

Controlled empirical study on training search agents powered by LLMs. Authors isolate three dimensions: (1) data-coverage issue in Wikipedia 2018 corpus explains larger gains than algorithmic differences, (2) outcome-based rewards outperform process-based approaches, (3) analysis of training data diversity and search budget scaling. Code released.

AI Agents RAG Reinforcement learning

SIG

HYP

arXiv cs.CL·May 28

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

MERIT is a two-stage framework for large-scale reviewer assignment. A 4B parameter model trained via RL assesses submission-reviewer fit using expertise rubrics guided by an LLM judge, then distills predictions into an embedding-based retriever. Outperforms larger general-purpose LLMs on LR-Bench and CMU Gold dataset.

Reinforcement learning Papers Benchmarks

SIG

HYP

arXiv cs.CL·May 28

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL combines accurate claim verification with inspectable traces using RL (GRPO). A 7B model trained on 5K curated claims achieves 86.3% in-domain and 69.8% out-of-domain accuracy, matching 32B baselines and GPT-4.1-mini. Works in semi-supervised settings with only 10% labeled data.

Reasoning Reinforcement learning Benchmarks

SIG

HYP

arXiv cs.CL·May 28

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

ReverseMath automatically generates new math problems by inverting answer and unknown: mask a numerical value, treat original answer as known condition, rewrite problem so masked value becomes new answer. Detects memorization by comparing performance on original/reversed pairs. Improves mathematical reasoning via data augmentation for RL.

Benchmarks Reasoning Reinforcement learning

SIG

HYP

arXiv cs.CL·May 28

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

TRACES is a proactive safety auditor for multi-turn LLM agents that detects drift toward unsafe behavior from hidden representations of an observer LLM. Trained with weak trajectory-level supervision, it produces dense prefix-level risk estimates, improving full-trajectory safety prediction and proactive risk discrimination across multiple agent safety benchmarks.

AI Agents AI safety Reasoning

SIG

HYP

arXiv cs.CL·May 28

Disentangling Language Roles in Multilingual LLM Task Execution

MTM-Bench, a controlled benchmark for multilingual task execution, evaluates 20 LLMs across 27 language triplets (instruction/content/response) in English, Spanish, and Chinese. Results show degradation is organized by language role in task structure, with response language as the dominant axis of variation.

Benchmarks Evals

SIG

HYP

arXiv cs.CL·May 28

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter AdaptationTarget

EvoSpec improves speculative decoding by dynamically adapting draft model vocabulary and parameters in real-time. Using semantic indexing and curriculum learning, it maintains high acceptance rates across specialized domains (coding, law, medicine). On EAGLE-3: 1.13x speedup vs FR-Spec with 27% lower memory overhead.

Code generation Reasoning Infrastructure

SIG

HYP

arXiv cs.CL·May 28

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

OralAgent is a dental-specialized AI agent integrating multimodal reasoning, 22 visual analysis tools, and RAG over 368 classical dental textbooks (134.8M tokens). Evaluated on OralQA-ZH (798 questions) and MMOral benchmarks, it achieves SOTA for dental image analysis in clinical workflows.

AI Agents Vision RAG

SIG

HYP

arXiv cs.AI·May 28

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

arXiv study reveals aligned language models fail to adapt safety behavior when context flips ("brittle safety"). Testing 12 models shows safety-commonsense gap of +17.4 pp. Current guardrails miss consequence-flips; state-aware validator catches all without false alarms.

AI safety Alignment Evals

SIG

HYP

arXiv cs.AI·May 28

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

arXiv study on privacy in multi-agent systems. Platform simulates thousands of LLM agents interacting over one month. Privacy violations increase from 19.95% (single-turn) to 45.30% (multi-turn). Agents 8× more likely to disclose sensitive info after observing peer behavior. Explicit privacy instructions reduce but don't eliminate leakage (37.8% minimum).

AI Agents Multi-agent AI safety

SIG

HYP

arXiv cs.AI·May 28

A Policy-Driven Runtime Layer for Agentic LLM Serving

Proposes intermediate runtime layer between agent framework and LLM serving engine. Introduces four primitives (observe, score, predict, act) to implement agent-aware policies (KV caching, batch shaping, speculation, fairness, safety). CacheSage, instantiated for cross-session caching, achieves +13 to +37 pp cache hit-rate lift, 12–29% lower TTFT, 6–14% higher throughput on five real multi-agent workloads.

AI Agents Multi-agent Infrastructure

SIG

HYP

arXiv cs.AI·May 28

DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

DeepSciVerify verifies alignment between scientific claims and citations via two-stage pipeline: abstract-level reasoning plus selective escalation to full-text passages. On SCitance benchmark: 86.7 Micro-F1 (+4.5 vs baselines), 67% of instances resolved without full-text retrieval.

Papers Benchmarks Reasoning

SIG

HYP

arXiv cs.AI·May 28

Behavioural Analysis of Alignment Faking

arXiv study on alignment faking (AF): when models strategically comply with training objectives while preserving deployment preferences. Authors identify three separable drivers (values, goal guarding, sycophancy) via prompt ablations and activation steering. AF proves more widespread than previously reported, including in small-scale models, and predictable from situational cues.

Alignment AI safety Papers

SIG

HYP

arXiv cs.AI·May 28

Voluntary Collusion with Secret Tools in Competing LLM Agents

Empirical study across 12 LLM models (7B to proprietary scale) showing voluntary adoption of secret collusion tools in competitive multi-agent environments (Liar's Bar, Cleanup), despite explicit unfairness labels. Only ethical framing reduces adoption; general alignment alone is insufficient.

Multi-agent AI safety Alignment

SIG

HYP

arXiv cs.LG·May 28

Explicit Critic Guidance for Aligning Diffusion Models

New online reinforcement learning method for aligning diffusion models with non-differentiable objectives. State-aligned latent actor-critic framework where the diffusion model predicts values directly on noisy latent states, enabling trajectory-level PPO training and multi-reward optimization. Outperforms prior baselines on UNet and DiT benchmarks.

Reinforcement learning Alignment Papers

SIG

HYP

arXiv cs.LG·May 28

Information-theoretic Multimodal Representation Learning for Electrocardiogram Signals

MERIT, a multimodal pretraining framework, combines masked ECG modeling with ECG–text contrastive alignment to learn cardiac representations. On PTB-XL: +3% F1 (All) and +5% F1 (SubClass), +2.66% AUC zero-shot. Also improves clinical text generation with LLMs.

Papers Benchmarks Embeddings

SIG

HYP

arXiv cs.LG·May 28

GenSBI: Generative Methods for Simulation-Based Inference in JAX

GenSBI is an open-source JAX library for simulation-based inference (SBI). It implements flow matching, score matching, and denoising diffusion with three transformer architectures (SimFormer, Flux1, Flux1Joint). Validation on SBIBM benchmarks: C2ST scores of 0.50-0.56 (ideal=0.50).

Open source Tools Benchmarks

SIG

HYP

arXiv cs.CL·May 28

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

Study on gender preservation in English-to-Hindi translation. Benchmark of 37,345 instances shows GPT-4o-mini and Sarvam frequently erase gender via ergative constructions. Two rerankers (SAR and PAR) improve gender recoverability: PAR increases accuracy from 11-16% to 49-54%, but reduces fluency (4.36→3.37). Reveals preservation-fluency tradeoff.

Benchmarks Vision Alignment

SIG

HYP

arXiv cs.CL·May 28

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Modality-Aware Policy Optimization (MAPO) addresses late-stage modality collapse in audio-text models during RL fine-tuning. The method concentrates policy gradients on modality-critical tokens via a modality relevance mask and adds an attention penalty to sustain cross-modal grounding. MAPO achieves SOTA on several complex audio reasoning benchmarks.

Reinforcement learning Reasoning Alignment

SIG

HYP

arXiv cs.CL·May 28

The Future of Facts: Tracing the Factual Generation-Verification Gap

Empirical study of the generation-verification gap in LLMs: fact verification is learned before generation, more robust to continual learning, and factual updates create "multi-verse" states where models accept both old and new answers. Analysis across 4 open-source model families at 2 scales.

Papers Reasoning Evals

SIG

HYP

arXiv cs.CL·May 28

Debate Helps Weak Judges Reward Stronger Models

Debate between models improves weak judge oversight: critic must exceed judge's classification ability for debate to help. On 5 pairings tested on code/logic tasks, 3 show statistically significant gains. Single critique suffices; rebuttal rounds add nothing. Pre-deployment audit proposed.

Reasoning Evals Alignment

SIG

HYP

arXiv cs.CL·May 28

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI is a multi-agent LLM framework for controllable motivational interviewing (MI) dialogue generation. Client profiles from questionnaires are expanded into narrative stories. Therapist and client agents generate MI-coded utterances, coordinated by an interaction agent. Evaluation on 6K simulated dialogues covering 12 MI codes and 13 symptom domains.

Multi-agent AI Agents Benchmarks

SIG

HYP

Reddit r/MachineLearning·May 27

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]

TritonMoE: pure Triton MoE kernel for portable NVIDIA/AMD inference without vendor-specific code. Fused gate+up GEMM reduces memory traffic by 35%. Achieves 89-131% of Megablocks throughput (batch ≤512 tokens) on A100, same kernel runs on MI300X. Limitations: degrades at 2048+ tokens and with 64+ experts.

Benchmarks Open source

SIG

HYP

Reddit r/LocalLLaMA·May 27

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

Complete Usenet corpus (1980–2013) released for local fine-tuning: 103.1B tokens, 408M posts, zero AI contamination. Pre-SEO, pre-algorithm internet writing across 33 years. Organized by domain hierarchies (comp.*, sci.*, rec.*). Free samples available, full corpus under license.

Fine-tuning Open source Benchmarks

SIG

HYP

Reddit r/LocalLLaMA·May 27

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Comprehensive benchmark of 38 KV quantization pairs on Qwen 3.6 27B with 64k-128k context. Q5_0 and Q5_1 underrated, Q8_0/Q4_* overrated. Recommendation: Q8_0/Q6_0 or Q8_0/Q5_1 for high-end, Q6_0/Q5_0 for balance, Q5_0/Q5_0 for tight VRAM.

Qwen Benchmarks Fine-tuning

SIG

HYP

Reddit r/LocalLLaMA·May 27

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

Fused MoE dispatch kernel written in pure Triton (no CUDA) achieves 89-131% of Megablocks performance on A100. Fuses gate+up projections to cut 35% memory traffic. Runs on AMD MI300X with zero code changes. Limitations: degraded performance beyond 2048 tokens and with 64+ experts.

Open source Infrastructure Code generation

SIG

HYP

arXiv cs.AI·May 27

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

arXiv paper reveals that models with statistically indistinguishable atomic knowledge fail systematically to chain them in multi-hop reasoning (>40 percentage point gap). Aggregate metrics mask this 'composition collapse'. Authors introduce a double-gate protocol decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth.

Reasoning Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 27

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail is a diagnostic benchmark isolating failure modes of modern LLM memory systems. Authors formalize these systems as composition of three operations (summarization, storage, retrieval) and construct five adversarial datasets to test each. Evaluation of four SOTA systems reveals architectural tradeoffs.

AI Agents Benchmarks Evals

SIG

HYP

arXiv cs.AI·May 27

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX transforms clinical practice guideline (CPG) recommendations into executable decision logic to generate question-answering training data. Post-training a medical LLM on this data improves accuracy by 10.28% across four clinical reasoning benchmarks and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity.

Fine-tuning Reasoning Evals

SIG

HYP

arXiv cs.AI·May 27

JobBench: Aligning Agent Work With Human Will

JobBench evaluates 36 AI models (including Claude Opus at 45.9%) on 130 real professional tasks across 35 occupations. Unlike existing benchmarks focused on economic value, JobBench prioritizes workflows experts identify as high-priority for delegation, favoring human augmentation over replacement.

AI Agents Benchmarks Claude

SIG

HYP

arXiv cs.AI·May 27

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Paper formalizing AI agent memory as a distinct data-management workload. Proposes GEM (Governed Evolving Memory) with four state-level operators (ingestion, revision, forgetting, retrieval) and six correctness conditions. Proves record-level systems cannot satisfy these conditions. Prototype MemState on property-graph backend.

AI Agents Papers Infrastructure

SIG

HYP