arXiv cs.AI

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

DecisionBench is a benchmark for evaluating emergent delegation in long-horizon multi-agent workflows. The substrate includes 11 models (7 vendor families), GAIA/tau-bench/BFCL tasks, and multi-axis metrics (quality, cost, latency, routing fidelity). Results show that measuring quality alone masks orchestration signals, and that delivery channel dominates description content.

AI Agents · Multi-agent · Benchmarks
SIG 82 · HYP 15

arXiv cs.LG

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

A study of 63 base models reveals a hidden phase transition: below ~3.5B parameters, reasoning and truthfulness are anticorrelated; above that threshold, they correlate positively. Architecture, data curation, and training recipe each independently shift the critical threshold. Width normalization eliminates the anticorrelation; frontier models reach r=+0.72. An open-source steering tool and diagnostic dashboard are released.

Benchmarks · Alignment · Reasoning
SIG 82 · HYP 25

arXiv cs.AI

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus is the first embodied AI quantum materials experimentalist: an autonomous robotic mini-laboratory capable of hypothesis generation, protocol planning, and experimental execution on 2D quantum materials. It achieved the first AI-driven creation of graphene and fabrication of atomically thin field-effect transistors via van der Waals stacking, with closed-loop error correction.

AI Agents · Multi-agent · Robotics
SIG 82 · HYP 35

arXiv cs.AI

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES is a diagnostic framework for evaluating expert specialization in Mixture-of-Experts models. Five theoretically grounded metrics measure domain isolation and routing specialization. Testing on Qwen, DeepSeek, and GLM reveals distinct specialization paradigms. Targeted post-training on specialized expert paths improves performance by 66–94% using only 15% of original training resources.

Benchmarks · Qwen · DeepSeek
SIG 82 · HYP 18

arXiv cs.AI

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS is an LLM-based multi-agent framework for gene expression analysis. Six specialized agents, orchestrated via typed message-passing protocols, combine structured workflows with autonomous adaptability. On the GenoTEX benchmark, it reaches 89.13% correlation for preprocessing and an F1 of 60.48% for gene identification (+10.61% and +16.85% vs. prior art).

Multi-agent · AI Agents · Code generation
SIG 82 · HYP 18

arXiv cs.AI

An AI system to help scientists write expert-level empirical software

ERA, an AI system combining LLM and Tree Search, automatically generates expert-level scientific software. It discovered 40 novel bioinformatics methods outperforming top human-developed approaches, generated 14 epidemiological models surpassing the CDC ensemble for COVID-19 hospitalization forecasting, and produced expert-level solutions for geospatial analysis and neural activity prediction.

AI Agents · Reasoning · Code generation
SIG 82 · HYP 28

arXiv cs.AI

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD is the first emotion dataset in a naturalistic workplace context spanning 30.1 months (Nov 2021–May 2024) with 49 employees from a Chinese software company. Its 733,780 seven-class facial-expression probability vectors validate three established phenomena and reveal six asymmetric emotional regimes. The dataset also exposes FER model bias: over-prediction of 'angry' on neutral Asian faces (0.194 vs 0.05).

Vision · Evals · AI safety
SIG 82 · HYP 15

arXiv cs.AI

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

ScaleLogic, a synthetic logical reasoning framework, demonstrates that RL can teach long-horizon reasoning to LLMs. Training compute follows a power law with proof depth (T ∝ D^γ, R² > 0.99), with exponent γ increasing from 1.04 to 2.60 as logical expressiveness grows. Models trained on more expressive logics transfer better (+10.66 points on downstream benchmarks).
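
A power law of the form T ∝ D^γ is commonly estimated by a straight-line fit in log-log space, where the slope is the exponent γ. The sketch below illustrates that general technique only; it is not ScaleLogic's code, the function name `fit_power_law` is hypothetical, and the data points are synthetic values invented for illustration, not results from the paper.

```python
import math

def fit_power_law(depths, compute):
    """Fit T = c * D**gamma by ordinary least squares on (log D, log T).

    Returns (gamma, c, r_squared), with r_squared computed in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope of the log-log line is the power-law exponent gamma.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    log_c = mean_y - gamma * mean_x

    # Coefficient of determination of the log-log fit.
    ss_res = sum((y - (gamma * x + log_c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return gamma, math.exp(log_c), r_squared

# Synthetic example (assumed values): compute generated exactly as 3 * D**2.
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.0 for d in depths]
gamma, c, r_squared = fit_power_law(depths, compute)
print(f"gamma={gamma:.2f}, c={c:.2f}, R^2={r_squared:.4f}")
```

On real measurements the fit would be noisy, and an R² computed in log space is not directly comparable to one computed on the raw values.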

Reinforcement learning · Reasoning · Benchmarks
SIG 82 · HYP 18

arXiv cs.AI

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF combines human feedback and verifiable rewards for reward model training. The method extracts binary principles from natural language feedback (e.g., accuracy, code readability) and uses them as entailment tasks. Models achieve 86.2% on RM-Bench and 81.4% on JudgeBench (#1 as of September 2025). Qwen3-32B aligned with RLBFF matches o3-mini and DeepSeek R1 at <5% inference cost.

Reinforcement learning · Evals · Alignment
SIG 82 · HYP 25

arXiv cs.AI

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba, a Mamba2-based model, performs online surgical phase recognition with O(d) per-frame cost. Three components address domain-specific challenges: dual-path SSD separating long/short-term regimes, intensity-modulated stepping adapting effective rate, and state regramming enabling cross-channel mixing. SOTA results: 94.6%/82.7% on Cholec80, 89.5%/68.9% on AutoLaparo, 238.74 fps on single GPU.

Reasoning · Benchmarks · Vision
SIG 82 · HYP 15

arXiv cs.CL

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair is a memory-augmented agentic framework for repository-level vulnerability repair. It combines three memory layers (History-Fix, Security-Pattern, Refinement-Trajectory) with an iterative refinement loop. Evaluated on SEC-Bench, PatchEval, and Multi-SWE-bench, MemRepair achieves 58.0%, 58.2%, and 30.58% resolution rates, outperforming OpenHands, SWE-agent, and InfCode-C++.

AI Agents · Code generation · AI safety
SIG 82 · HYP 18