ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench introduces a benchmark of 7000+ response-criterion pairs evaluated by domain experts (Physics/Chemistry PhDs, Finance/Consulting MBAs). Top models like GPT-5-high achieve only 65.9% performance. Authors develop robust LLM-Judges reducing evaluation costs by 2-3 orders of magnitude.

Benchmarks Evals GPT

arXiv cs.AI·May 19

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF combines human feedback and verifiable rewards for reward model training. The method extracts binary principles from natural language feedback (e.g., accuracy, code readability) and uses them as entailment tasks. Models achieve 86.2% on RM-Bench and 81.4% on JudgeBench (#1 as of September 2025). Qwen3-32B aligned with RLBFF matches o3-mini and DeepSeek R1 at <5% inference cost.

Reinforcement learning Evals Alignment

arXiv cs.AI·May 19

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

arXiv paper shows Mixture-of-Experts (MoE) models outperform dense architectures under strictly equal resource constraints (identical total parameters, training compute, data budget). Researchers identify an optimal activation rate region consistent across model sizes. Validated on ~200 2B-scale and 50 7B-scale models (50 trillion tokens processed).

Benchmarks Papers Reasoning

arXiv cs.CL·May 19

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench is a benchmark of 7000+ response-criterion pairs evaluated by human experts in physics, chemistry, finance, and consulting. Authors propose robust LLM-judges reducing evaluation cost by 2-3 orders of magnitude. GPT-5-high achieves 65.9% performance, revealing significant gaps between proprietary and open-weight models.

Benchmarks Evals GPT

arXiv cs.CL·May 19

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

FinAuditing is a financial auditing benchmark built from 1,102 real XBRL instances (33k tokens average). It evaluates 13 LLMs on three tasks: Financial Semantic Matching, Financial Relationship Extraction, and Financial Mathematical Reasoning. Results reveal substantial gaps in concept retrieval and cross-document reasoning.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

Scaling Laws for Code: A More Data-Hungry Regime

Empirical study of 117 experiments (0.2B–3.8B parameters, 2B–128B tokens) on scaling laws for Code LLMs. Code requires higher data-to-parameter ratio than natural language. Farseer law outperforms Chinchilla. Code-NL mixtures benefit NL under resource constraints but harm it at higher compute budgets.

Code generation Benchmarks Papers

arXiv cs.AI·May 19

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

ScaleLogic, a synthetic logical reasoning framework, demonstrates that RL can teach long-horizon reasoning to LLMs. Training compute follows a power law with proof depth (T ∝ D^γ, R² > 0.99), with exponent γ increasing from 1.04 to 2.60 as logical expressiveness grows. Models trained on more expressive logics transfer better (+10.66 points on downstream benchmarks).
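A relation of the form T ∝ D^γ is usually checked by a straight-line fit in log-log space, where the slope is the exponent γ. The sketch below shows that fit with ordinary least squares. The function name `fit_power_law` and the data points are made up for illustration; they are not taken from the paper.

```python
import math

def fit_power_law(depths, compute):
    """Fit log T = log c + gamma * log D by least squares.

    Returns (c, gamma, r2), with r2 measured in log-log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope = power-law exponent
    log_c = mean_y - gamma * mean_x        # intercept = log of the prefactor
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_c), gamma, r2

# Synthetic data (an assumption for this example): T = 3 * D^2 exactly.
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.0 for d in depths]
c, gamma, r2 = fit_power_law(depths, compute)
print(f"c={c:.3f} gamma={gamma:.3f} r2={r2:.4f}")
```

With real measurements the points scatter around the line, and R² in log-log space indicates how well a single exponent describes the trend.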

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 19

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

arXiv paper demonstrates that Mixture-of-Experts (MoE) models can outperform dense architectures under strictly equal resource constraints (parameters, training compute, data). Researchers identify an optimal activation rate region consistent across model sizes. Validated on ~200 2B-scale and 50 7B-scale models (50 trillion tokens processed).

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

BioProAgent combines LLMs with finite state machines to plan biological experiments in wet-labs. The system enforces a Design-Verify-Rectify workflow and reduces token consumption by ~6× through symbolic abstraction. On BioProBench, it achieves 95.6% physical compliance versus 21.0% for ReAct.
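A Design-Verify-Rectify loop can be pictured as a small finite state machine in which a plan is only accepted once verification returns no issues. The sketch below is a toy illustration of that idea, not BioProAgent's implementation: the state names, the `run_workflow` function, the retry limit, and the centrifuge rule are all hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    DESIGN = auto()
    VERIFY = auto()
    RECTIFY = auto()
    DONE = auto()
    FAILED = auto()

def run_workflow(design, verify, rectify, max_rectifications=3):
    """Drive a plan through DESIGN -> VERIFY -> (RECTIFY -> VERIFY)* -> DONE."""
    state, plan, issues, attempts = State.DESIGN, None, [], 0
    trace = []
    while state not in (State.DONE, State.FAILED):
        trace.append(state)
        if state is State.DESIGN:
            plan = design()
            state = State.VERIFY
        elif state is State.VERIFY:
            issues = verify(plan)
            if not issues:
                state = State.DONE
            elif attempts < max_rectifications:
                state = State.RECTIFY
            else:
                state = State.FAILED       # give up after too many repairs
        else:  # State.RECTIFY
            plan = rectify(plan, issues)
            attempts += 1
            state = State.VERIFY
    trace.append(state)
    return plan, trace

# Hypothetical stand-ins for the LLM planner and the symbolic checker.
def design():
    return ["load tubes", "centrifuge"]

def verify(plan):
    # Toy physical constraint: tubes must be balanced before centrifuging.
    if "centrifuge" in plan:
        before = plan[:plan.index("centrifuge")]
        if "balance tubes" not in before:
            return ["centrifuge run without balanced tubes"]
    return []

def rectify(plan, issues):
    i = plan.index("centrifuge")
    return plan[:i] + ["balance tubes"] + plan[i:]

plan, trace = run_workflow(design, verify, rectify)
print(plan)
print([s.name for s in trace])
```

The point of the structure is that the transition into DONE is gated by the checker rather than by the planner's own judgment.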

AI Agents Reasoning Benchmarks

arXiv cs.CL·May 19

Reinforcement Learning for LLM Post-Training: A Survey

Comprehensive survey of reinforcement learning post-training methods for LLMs. Unifies RLHF (DPO), RLVR (PPO, GRPO) and SFT within a single policy gradient framework. Detailed technical analysis of offline and iterative approaches with standardized notation for direct comparison.

Reinforcement learning Alignment Papers

arXiv cs.AI·May 19

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD is the first emotion dataset collected in a naturalistic workplace context, spanning 30.1 months (Nov 2021–May 2024) with 49 employees from a Chinese software company. Its 733,780 seven-class facial-expression probability vectors validate three established phenomena and reveal six asymmetric emotional regimes. The dataset also exposes FER model bias: over-prediction of 'angry' on neutral Asian faces (0.194 vs 0.05).

Vision Evals AI safety

arXiv cs.AI·May 19

An AI system to help scientists write expert-level empirical software

ERA, an AI system combining an LLM with tree search, automatically generates expert-level scientific software. It discovered 40 novel bioinformatics methods outperforming top human-developed approaches, generated 14 epidemiological models surpassing the CDC ensemble for COVID-19 hospitalization forecasting, and produced expert-level solutions for geospatial analysis and neural activity prediction.

AI Agents Reasoning Code generation

arXiv cs.AI·May 19

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS is an LLM-based multi-agent framework for gene expression analysis. Six specialized agents orchestrated via typed message-passing protocols combine structured workflows with autonomous adaptability. On the GenoTEX benchmark it reaches 89.13% correlation for preprocessing and an F1 of 60.48% for gene identification (+10.61% and +16.85% vs prior art).

Multi-agent AI Agents Code generation

arXiv cs.AI·May 19

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

Study on latency of computer-use agents on OSWorld: LLM calls for planning and reflection dominate total time. 16 agents tested require 2.7–4.3× more steps than optimal human trajectories. Each successive step takes 3× longer than initial steps.

AI Agents Benchmarks Evals

arXiv cs.AI·May 19

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Distillation of tabular foundation models (TabICLv2) into boosted trees (XGBoost/CatBoost) for ultra-fast CPU inference. Solves soft target collapse via stratified out-of-fold labeling. Across 153 datasets: 0.882 macro-mean AUC (96.5% of teacher) at 1.9 ms on CPU, 38–860x speedup. Open-sourced as TabTune library.

Fine-tuning Benchmarks Open source

arXiv cs.AI·May 19

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

GAMMA is a quantizer-agnostic mixed-precision framework that automatically allocates bit precision per module without quantization-aware training. Using teacher-forced hidden-state reconstruction and integer programming, it achieves +12.99 Avg. over fixed baselines on Llama/Qwen 8B-32B, matching 3-bit quality at 2.5-bit average.

Llama Qwen Benchmarks

arXiv cs.AI·May 19

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus is the first embodied AI quantum materials experimentalist: an autonomous robotic mini-laboratory capable of hypothesis generation, protocol planning, and experimental execution on 2D quantum materials. It achieved first-time AI creation of graphene and fabrication of atomically thin field-effect transistors via van der Waals stacking, with closed-loop error correction.

AI Agents Multi-agent Robotics

arXiv cs.AI·May 19

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

Large-scale study of 64,380 SWE-bench runs across 126 agent configurations (43 frameworks × LLMs). Behavioral rules derived from single frameworks do not transfer: the same signal (e.g., error rate) correlates positively with issue resolution in 47 configs and negatively in 48. Framework identity explains 64% of variance vs. 10% for LLM family.

AI Agents Benchmarks Code generation

arXiv cs.CL·May 19

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

Firefly generates verified tool-call data for training agents from real MCP servers. The pipeline inverts standard synthesis: it explores real APIs via DAG structures, then generates tasks backward from observed outcomes. The result is 5,144 verified tasks across 240 servers and 993 tools. A 4B model trained with GRPO matches Claude Sonnet on a held-out test set.

AI Agents MCP Code generation

arXiv cs.CL·May 19

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo introduces a unified formal framework for solid geometry via Parse2Reason: parsing into CDL (Conditional Description Language), then reasoning with a theorem bank. Achieves 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid, outperforming Gemini-2.5-pro (54.2%) and GPT-5 (62.9%). Two annotated datasets: SolidFGeo2k and PlaneFGeo3k.

Reasoning Vision Benchmarks

arXiv cs.AI·May 19

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair is a memory-augmented agentic framework for repository-level vulnerability repair. It combines three memory layers (History-Fix, Security-Pattern, Refinement-Trajectory) with an iterative refinement loop. Evaluated on SEC-Bench, PatchEval, and Multi-SWE-bench, MemRepair achieves 58.0%, 58.2%, and 30.58% resolution rates, outperforming OpenHands, SWE-agent, and InfCode-C++.

AI Agents Code generation AI safety

arXiv cs.AI·May 19

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES is a diagnostic framework for evaluating expert specialization in Mixture-of-Experts models. Five theoretically grounded metrics measure domain isolation and routing specialization. Testing on Qwen, DeepSeek, and GLM reveals distinct specialization paradigms. Targeted post-training on specialized expert paths improves performance by 66–94% using only 15% of original training resources.

Benchmarks Qwen DeepSeek

arXiv cs.CL·May 19

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

SynPro, a synthetic data generation framework, helps LLMs learn more thoroughly from limited organic corpora via rephrasing and reformatting. Optimized with RL, it unlocks 3.7-5.2x more effective tokens than simple repetition on 400M and 1.1B models, even surpassing the non-data-bound oracle at 1.1B scale. Code open-sourced.

Reinforcement learning Benchmarks Open source

arXiv cs.CL·May 19

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Injecting mathematically wrong drafts from a smaller model (Qwen2.5-Math-1.5B) mismatched to the current problem into a stronger learner's (Mathstral-7B) GRPO context outperforms standard on-policy GRPO. On MATH-500, the mismatched-wrong variant reaches 71.98% (highest published result for this model), +1.62pp vs matched-wrong variant, without SFT or reward models.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 19

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX reveals that 4 of 6 major hallucination detection benchmarks embed the ground-truth answer in the prompt, allowing a naive baseline (TxTemb) to achieve near-perfect detection without access to model internals. Evaluation of 22 methods across 12 open-source models: most fail under controlled conditions, except SAPLMA and DRIFT (supervised probes on upper-layer hidden states).

Benchmarks Evals AI safety

arXiv cs.AI·May 19

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

LLM-based multi-agent pipelines flip to incorrect answers under simulated peer disagreement (yield). Contrary to common attribution, RLHF is not responsible: pretrained base models exhibit the same substitution pattern. Activation patching localizes corruption to a narrow mid-layer window. A single correctly-arguing dissenter reduces yield by 54-73 percentage points.

Multi-agent Alignment Reasoning

arXiv cs.AI·May 19

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus unifies autoregressive LLM fidelity with parallel diffusion token generation via a dual-architecture framework. A lightweight trainable module augments a frozen Transformer to enable parallel generation while maintaining exact autoregressive quality. Achieves up to 7.8x speedup with O(1) memory overhead.

Reasoning Code generation Infrastructure

arXiv cs.AI·May 19

BlendedNet++: A dataset and benchmark for field-resolved aerodynamics and inverse design of blended wing body aircraft

BlendedNet++ is a dataset of 12,492 Blended Wing Body (BWB) aircraft geometries with RANS simulations for aerodynamic field prediction. Authors benchmark 5 deep learning architectures (Transolver best) and propose a generative inverse design pipeline using conditional diffusion models, validated by CFD with R² > 0.99.

Benchmarks Papers Code generation

arXiv cs.CL·May 19

GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM is a benchmark of 820 original problems evaluating LLMs via integration of multiple cognitive domains (constraint satisfaction, state tracking, epistemic vigilance) rather than memorization or pure abstract reasoning. IRT calibration covers >200k prompt-response pairs and 28 models; an extensive study examines the compute vs capability trade-off across 11 models and 35 configurations.

Benchmarks Evals Reasoning

arXiv cs.AI·May 19

LEGO: An LLM Skill-Based Front-End Design Generation Platform

LEGO is a modular platform for LLM-based digital front-end design generation. It decomposes the flow into 6 steps and extracts 42 reusable circuit skills. On 41 hard VerilogEval v2 problems where GPT-5.2-codex fails, LEGO achieves 80.5% Pass@1 vs 0% baseline, outperforming hierarchy-verilog (+14.6%) and VerilogCoder (+2.5%).

Code generation AI Agents Benchmarks

arXiv cs.AI·May 19

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

EnterpriseArena, a 132-month CFO simulator, benchmarks LLM agents' ability to allocate resources over long horizons under uncertainty. Tests across 23 models and 4 frameworks: only 15.4% of trials complete the full horizon. Larger models do not reliably outperform smaller ones. The results reveal a critical capability gap in managing binding commitments under partial observability.

AI Agents Benchmarks Reasoning

arXiv cs.AI·May 19

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

LLMs used as autonomous agents select unauthorized tools despite explicit instructions. Study across Qwen 2.5 7B, Llama 3.1 8B, and Claude Haiku 3.5 shows an MCP proxy with attribute-based access control (ABAC) reduces unauthorized invocation rate to 0%, versus 11-18% for prompt-based restrictions. Architectural enforcement, not prompting, is required for reliable tool access control.
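The contrast between prompt-based and architectural enforcement can be illustrated with a proxy that checks agent attributes before any tool runs, so a disallowed call never reaches the tool regardless of what the model decides. This is a minimal sketch, not the paper's proxy and not the MCP wire protocol: `ToolPolicy`, `ToolProxy`, the attribute names `role` and `clearance`, and the two example tools are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: frozenset       # roles permitted to invoke the tool
    required_clearance: int        # minimum clearance attribute needed

@dataclass
class ToolProxy:
    policies: dict                 # tool name -> ToolPolicy
    tools: dict                    # tool name -> callable
    denied: list = field(default_factory=list)

    def call(self, agent_attrs, tool_name, *args):
        """Run the tool only if the agent's attributes satisfy its policy."""
        policy = self.policies.get(tool_name)
        permitted = (
            policy is not None     # default-deny for tools with no policy
            and agent_attrs.get("role") in policy.allowed_roles
            and agent_attrs.get("clearance", 0) >= policy.required_clearance
        )
        if not permitted:
            self.denied.append(tool_name)
            raise PermissionError(f"{tool_name} is not permitted for this agent")
        return self.tools[tool_name](*args)

# Hypothetical tools and policies for illustration.
proxy = ToolProxy(
    policies={
        "read_file": ToolPolicy(frozenset({"analyst", "admin"}), 1),
        "delete_file": ToolPolicy(frozenset({"admin"}), 3),
    },
    tools={
        "read_file": lambda path: f"contents of {path}",
        "delete_file": lambda path: f"deleted {path}",
    },
)

analyst = {"role": "analyst", "clearance": 1}
result = proxy.call(analyst, "read_file", "report.txt")
try:
    proxy.call(analyst, "delete_file", "report.txt")
    blocked = False
except PermissionError:
    blocked = True
print(result, blocked, proxy.denied)
```

Because the check sits outside the model, the unauthorized invocation rate is determined by the policy table rather than by how well the model follows instructions.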

AI Agents MCP AI safety

arXiv cs.AI·May 19

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo introduces a unified formal framework for solid geometry via Parse2Reason: parsing into Conditional Description Language (CDL), then reasoning with a theorem bank. Achieves 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid, outperforming Gemini-2.5-pro (54.2%) and GPT-5 (62.9%). Two expert-annotated datasets: SolidFGeo2k and PlaneFGeo3k.

Reasoning Vision Benchmarks

arXiv cs.AI·May 19

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench is a 22,678-sample benchmark evaluating 8 LLMs on real telecom tasks (intent recognition, entity extraction, root cause analysis, solution generation). Models achieve 90% on linguistic tasks but collapse to 30% on procedural execution, revealing an 'Execution Wall': LLMs diagnose well but fail as field engineers.

Benchmarks Reasoning AI Agents

arXiv cs.AI·May 19

Stable Audio 3

Stable Audio 3 is a family of latent diffusion models (small, medium, large) for variable-length audio generation and editing. Models use a novel semantic-acoustic autoencoder and adversarial post-training to generate music and sounds in under 2s on H200 or seconds on MacBook Pro M4. Small and medium weights are released.

Open source

arXiv cs.AI·May 19

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

SynPro, a synthetic data generation framework, helps LLMs learn more thoroughly from limited organic corpora through rephrasing and reformatting operations. Optimized via reinforcement learning, it unlocks 3.7-5.2x more effective tokens than simple repetition on 400M and 1.1B models, even surpassing the non-data-bound oracle at 1.1B scale.

Reinforcement learning Benchmarks Open source

arXiv cs.AI·May 19

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR quantizes KV caches to INT2 for long-context LLMs by estimating attention-aware covariance structures offline. Tested on Qwen3 (4B–32B) and GLM-4.7 (358B), it reduces accuracy gap to 1.42–3.78 points vs BF16, cuts memory by 8x and improves throughput by 7x. Custom INT2 kernel compatible with vLLM/SGLang.
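The 8x memory figure follows from the bit widths alone: BF16 stores 16 bits per value and INT2 stores 2, ignoring the small overhead of scales and offsets. The sketch below shows plain uniform min/max 2-bit quantization with four codes packed per byte. It deliberately omits OSCAR's covariance-aware rotation, and all function names here are hypothetical.

```python
def quantize_int2(values):
    """Map floats to codes 0..3 using a single scale and offset for the group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0           # 3 steps between the 4 levels
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def pack_int2(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_int2(packed, n):
    """Recover the first n 2-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n]

def dequantize_int2(codes, scale, lo):
    return [lo + c * scale for c in codes]

# Toy values chosen to sit exactly on the four quantization levels.
values = [0.0, 1.0, 2.0, 3.0, 0.0, 3.0, 1.0, 2.0]
codes, scale, lo = quantize_int2(values)
packed = pack_int2(codes)
restored = dequantize_int2(unpack_int2(packed, len(values)), scale, lo)

bf16_bytes = 2 * len(values)               # 16 bits per value
ratio = bf16_bytes / len(packed)
print(codes, len(packed), ratio)
```

Real KV-cache values do not sit on four evenly spaced levels, which is why methods in this space add rotations or other transforms before rounding.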

Reasoning Benchmarks Infrastructure

arXiv cs.AI·May 19

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

SaaSBench is the first benchmark to evaluate AI agents in enterprise SaaS engineering. It contains 30 complex tasks across 6 SaaS domains with 8 programming languages, 6 databases, and 13 frameworks. Experiments show >95% of failures occur before business logic: agents struggle to configure and integrate multi-component systems.

AI Agents Code generation Benchmarks

arXiv cs.AI·May 19

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix is an agentic framework for automated vulnerability repair combining differential runtime evidence and skill reuse. On SEC-Bench (C/C++) and PatchEval (Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates with GPT-4-mini, outperforming baselines while costing less than one-third of comparable approaches.

AI Agents Code generation Reasoning

arXiv cs.AI·May 19

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench is a benchmark of 18 ML tasks across 10 domains evaluating 6 AI research agents. Key findings: strategy complexity alone does not ensure performance (a greedy hill-climber matches tree search); effectiveness depends on the structure of improvement opportunities; an adaptive agent that detects stagnation outperforms the others. Includes 12 process-level behavioral metrics.

AI Agents Benchmarks Reasoning
