Topic

#Benchmarks

In AI, benchmarks are standardized test suites that objectively measure and compare model performance across defined tasks. For example, MMLU evaluates language models on question answering across more than 50 academic subjects.
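As a rough illustration of what such a suite computes, the sketch below scores a multiple-choice benchmark by per-subject accuracy and then averages across subjects. The record fields (`subject`, `answer`, `prediction`) and the sample items are hypothetical and do not reflect MMLU's actual schema or data.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """Return {subject: accuracy} for a list of graded multiple-choice items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["subject"]] += 1
    return {s: correct[s] / total[s] for s in total}


def macro_average(scores):
    """Average the per-subject scores so every subject counts equally."""
    return sum(scores.values()) / len(scores)


# Hypothetical graded items (illustrative only).
records = [
    {"subject": "physics", "answer": "B", "prediction": "B"},
    {"subject": "physics", "answer": "C", "prediction": "A"},
    {"subject": "law", "answer": "D", "prediction": "D"},
]
scores = per_subject_accuracy(records)
overall = macro_average(scores)
```

Averaging per subject first keeps a large subject from dominating the headline number; a plain item-level average is an equally common choice.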

40 Articles

3 Sources

74 Avg. signal

arXiv cs.CL·Jun 18

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs preserve uncertainty expressions in fewer than 50% of cases and struggle with nuanced distinctions between adjacent levels. Reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework for multimodal emotion-cause pair extraction, improves pair-confidence robustness. Using margin constraints and contextual corruption, it increases Pair F1 by 2.58–2.83 points on ECF/MECAD/MEC4 without changing inference.

Papers Benchmarks Vision

arXiv cs.AI·Jun 18

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench evaluates agents' ability to handle complex long-horizon tasks by simulating a 500-day startup operation. The agent manages pricing, marketing, and budgeting through a Python interface. Only Claude Opus 4.8 and GPT-5.5 exceed the $1M starting balance, and neither is consistently profitable.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 18

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. Models receive a structured game state and predict hidden future states; the benchmark continues the simulation to score forecasts. Enables questions at arbitrary time horizons, counterfactual worlds, and rare events.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Local de-identification framework for educational dialogues. Two-stage cascade: union proposer (lightweight encoders + deterministic rules) generates PII candidates, then binary Redact/Keep reviewer uses dialogue context and speaker role. Achieves 0.958 macro F1 on math tutoring transcripts, outperforms commercial API (0.706) and local LLM baseline (0.767), runs on single laptop.
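The two-stage shape described above can be sketched roughly as follows. Everything here is a toy stand-in: the regex rules, the capitalized-word "encoder", the cue phrases, and the `[REDACTED]` placeholder are assumptions for illustration, not the paper's components.

```python
import re

# Stage 1a: deterministic rules (hypothetical patterns).
RULES = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-like strings
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),        # phone-like strings
]

# Hypothetical per-role cue phrases the toy reviewer looks for.
CUES = {
    "student": ("my name is ", "i'm "),
    "tutor": ("my name is ", "this is "),
}


def rule_proposer(text):
    return {(m.start(), m.end()) for p in RULES for m in p.finditer(text)}


def encoder_proposer(text):
    """Stand-in for a lightweight encoder: flags capitalized words."""
    return {(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)}


def union_propose(text):
    """Stage 1: union of all proposers, favoring recall over precision."""
    return rule_proposer(text) | encoder_proposer(text)


def review(text, span, speaker_role):
    """Stage 2: toy binary Redact/Keep decision using context and role."""
    start, end = span
    candidate = text[start:end]
    if "@" in candidate or candidate[0].isdigit():
        return "Redact"
    left_context = text[max(0, start - 12):start].lower()
    if any(cue in left_context for cue in CUES.get(speaker_role, ())):
        return "Redact"
    return "Keep"


def deidentify(text, speaker_role):
    spans = sorted(s for s in union_propose(text)
                   if review(text, s, speaker_role) == "Redact")
    out, cursor = [], 0
    for start, end in spans:
        if start < cursor:  # skip spans overlapping one already redacted
            continue
        out.append(text[cursor:start])
        out.append("[REDACTED]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


example = deidentify(
    "Hi, my name is Dana. Email me at dana@example.org about Question 3.",
    "student",
)
```

The proposer deliberately over-generates ("Hi", "Email", and "Question" are all candidates), and the reviewer is what keeps them, which is the division of labor the summary describes.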

RAG AI safety Papers

arXiv cs.CL·Jun 18

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow improves speculative decoding by combining parallel drafting efficiency with branch-wise causal conditioning. On H100 GPUs, it achieves 9.64x speedup on MATH-500 and 4.58x on open-ended conversations, outperforming existing tree-based methods on dense and MoE Qwen3 models.

Benchmarks Code generation Open source

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by +15.3 points on CUA-World and OSExpert-Eval (0.456 vs 0.303 baseline). Visual figures outperform text-only descriptions (+8.3 points).

Claude AI Agents MCP

arXiv cs.CL·Jun 18

LLM Parameters for Math Across Languages: Shared or Separate?

Mechanistic analysis of mathematical reasoning in multilingual LLMs. Math-associated parameters exhibit partial cross-lingual overlap, concentrated in intermediate layers. English produces the largest set of math-relevant parameters, while lower-resource languages yield smaller parameter sets.

Reasoning Papers Benchmarks

arXiv cs.CL·Jun 18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 3.0, the reference tool since 2016 for forced speech-to-text alignment, achieves state-of-the-art performance on English, Japanese, and Korean with boundary errors <15ms. New capabilities: model adaptation, cross-language phone remapping, expanded language/dialect coverage, harmonized IPA dictionaries.

Voice Benchmarks Open source

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

arXiv cs.CL·Jun 18

Dual Dimensionality for Local and Global Attention

Researchers propose Distance-Adaptive Representation (DAR): reduce key/value dimensionality beyond a local window in decoder-only Transformers. Nearby tokens require full representations for next-token prediction, while distant tokens can use 1/4 original dimensionality without performance loss. Tested on 70M–410M models and 1B fine-tuning.

Reasoning Infrastructure Benchmarks

arXiv cs.CL·Jun 18

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

CDDTLDA framework for Chinese dialect discrimination with scarce annotation resources. Uses transfer learning on ASR models, data augmentation (speed, pitch, noise), and self-attention to capture shared semantic features. Outperforms state-of-the-art on two benchmark corpora.

Voice Benchmarks

arXiv cs.CL·Jun 18

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

PEC-Home is a simulated home dataset for interpreting progressively elliptical commands in smart homes. Current assistants (including GPT-4o) fail to accurately execute these commands, which grow more abbreviated as shared context accumulates, even when equipped with dialogue history retrieval.

AI Agents Benchmarks RAG

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed lawyer qualification threshold (11%) but fall short for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Memorization suppression method in LLMs via output vector editing of MLP neurons. Tested on 4 models (360M-7B parameters), achieves 87.9% suppression on OLMo-7B with 6831 memorized sequences. Complementary approach to existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced with R-Score, a character-level metric, it shows 35 models (NER, SLM, frontier models) fail on contextual redactions: human consensus 89.4% for mandatory redactions, 47.7% for contextual ones.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Study on evaluating AI-generated radiology reports. Researchers show existing LLMs over-penalize harmless rephrasings while detecting clinical errors. They train lightweight metrics on Qwen3-8B and MedGemma-4B outperforming 32B medical models, with dataset and metric release planned.

Benchmarks Evals Papers

arXiv cs.LG·Jun 18

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

PROPEL is a framework that trains task generators via RL to create optimally difficult problems for agent learning. A lightweight probe predicts solver pass rate without repeated rollouts, reducing evaluation to a single forward pass. On code and SWE tasks, learnable-frontier generation increases from 10.1% to 20% (Qwen2.5-3B) and 9.8% to 19.6% (Qwen3.5-27B).

Reinforcement learning AI Agents Code generation

arXiv cs.CL·Jun 18

Approximate Structured Diffusion for Sequence Labelling

New approach combining diffusion and CRF for sequence labelling in NLP. Method conditions a CRF on the full label sequence (noisy), bypassing span limitations of standard CRFs. Results: 16.5% error reduction on POS-tagging.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 18

Enhanced Graph Neural Networks using K-Hop Gaussian Diffusion

New K-Hop Gaussian (KHG) diffusion method to enhance GNNs. KHG preprocesses graph data with Gaussian-weighted multi-hop diffusion, balancing local and global propagation. Outperforms standard message-passing, PPR, and Heat Kernel on benchmarks, especially on noisy graphs.

Benchmarks

arXiv cs.CL·Jun 18

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a stochastic prompt optimization framework using multi-agent guided exploration. Compares three strategies: error-informed random search, genetic algorithm, and SAGE with diagnostic code execution. Deployed on a mental-health chatbot: 8 cycles of noisy A/B tests compound into a statistically robust next-day retention gain.

Prompt engineering AI Agents Multi-agent

arXiv cs.LG·Jun 18

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Gaussian Mixture Attention (GMA) replaces standard attention with probabilistic routing through K learned Gaussian mixture components. Queries and keys map to responsibility vectors in a shared latent space. GMA avoids explicit N×N matrix materialization, reducing memory complexity to O(NK) instead of O(N²). Competitive on long-context classification, but behind SDPA and Mamba on WikiText-103.
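A back-of-the-envelope comparison makes the O(NK) versus O(N²) claim concrete. The sketch assumes one N×K responsibility matrix each for queries and keys, which is a reading of the summary rather than a confirmed detail, and the sizes `n` and `k` are hypothetical.

```python
def attention_score_elements(n_tokens):
    """Elements in an explicit N x N attention score matrix."""
    return n_tokens * n_tokens


def responsibility_elements(n_tokens, n_components):
    """Elements in two N x K responsibility matrices (queries and keys)."""
    return 2 * n_tokens * n_components


n, k = 8192, 64  # hypothetical sequence length and component count
dense = attention_score_elements(n)
routed = responsibility_elements(n, k)
ratio = dense / routed
```

Under these assumed sizes the dense score matrix holds 67,108,864 elements against 1,048,576 for the two responsibility matrices, a 64× gap that widens linearly with sequence length when K is fixed.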

Reasoning Benchmarks Papers

arXiv cs.LG·Jun 18

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Artemis is a causal framework for graph neural networks addressing demographic confounders (age, sex) in multimodal brain imaging (fMRI + DTI). The method applies causal interventions at each brain region independently to learn invariant representations. Tested on ADNI, OASIS, and HCP benchmarks, it improves disease diagnosis and classification tasks.

Papers Reasoning Alignment

arXiv cs.LG·Jun 18

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Structural pruning framework for Mixture-of-Experts models operating at channel level rather than expert level. Attribution-based method reformulates pruning as channel-score coverage maximization. Experiments on DeepSeek and Qwen models achieve 50% structured pruning with 4-bit quantization, 5.27× memory reduction on Qwen3-30B-A3B.

DeepSeek Qwen Benchmarks

arXiv cs.LG·Jun 18

Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds

New geometric complexity measure called Fisher width, a Fisher-geometric analogue of Gaussian width on statistical manifolds. Replaces Euclidean geometry with Fisher information metric to capture local statistical curvature. Develops foundational theory with generalization bounds and computable estimators, validated on MNIST.

Papers Benchmarks Evals

arXiv cs.LG·Jun 18

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Survey of AI-based models for soil moisture estimation and classification. Five categories compared: statistical time-series, geostatistical methods, classical ML, deep learning, and Bayesian approaches. Data-driven methods provide flexible alternatives to computationally expensive physics-based models.

Benchmarks Papers

arXiv cs.LG·Jun 18

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

SWave is a complex-valued recurrent language model (169M parameters) trained on FineWeb-Edu. The paper documents its evolution across three phases, identifying structural failures (cos-domination collapse) and validating critical components (ComplexNorm, Wave Propagation Scan). Final PPL: 22.0 at step 89,861.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 18

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

New LLM inference scheduler replacing explicit length prediction with lightweight statistical signals and dynamic priority boosting. Reduces P99 TTLT by 35-50% vs SRPT with perfect length knowledge, and TTFT by 34-47% across production and open-source traces.

Benchmarks Infrastructure Reasoning

arXiv cs.LG·Jun 18

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

RL framework inspired by neuroscience that disentangles dynamics-specific and reward-specific features using locally linear embeddings (LLE) and adaptively fuses representations via attention mechanism. Improves learning efficiency on benchmark tasks compared to conventional RL approaches.

Reinforcement learning Reasoning Benchmarks

arXiv cs.LG·Jun 18

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

QAQL framework couples quantum annealing with Q-learning for remaining useful life (RUL) prediction in predictive maintenance. Each Q-value update is encoded as a QUBO problem and solved on a D-Wave Advantage system. Validated on NASA C-MAPSS and fleet maintenance datasets: statistically significant improvements over classical and quantum baselines.

Reinforcement learning Benchmarks Papers

arXiv cs.LG·Jun 18

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

PSyGenTAB is a privacy-preserving framework for synthetic clinical tabular data generation formulated as constrained optimization solved via Augmented Lagrangian Method. It embeds configurable privacy constraints into training to preserve inter-feature clinical relationships and minority-class patterns while maintaining data utility for medical AI applications.

Benchmarks

arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 18

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines is a long-horizon embodied agent benchmark testing memory in dynamic household environments. The dataset includes temporally extended traces with dialogues, actions, and object/device state changes. ObsMem, an observer-grounded memory framework, maintains visibility-aware memories and action-native state trails for state-informed decisions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 18

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench is a safety evaluation benchmark for LLMs in AI4Science workflows. It covers 7 disciplines, 31 sub-disciplines, and 10 risk dimensions. The authors evaluate mainstream and science-oriented LLMs to diagnose safety gaps across risk categories.

Benchmarks AI safety Evals

arXiv cs.AI·Jun 18

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a benchmark to evaluate strategic reasoning in Vision-Language Models (VLMs) using real-time strategy games. Built on Beyond All Reason, it offers multi-scenario evaluations, diagnostic mini-games targeting specific competencies, and a self-evolving generation framework. Current state-of-the-art VLMs fail at multi-agent coordination and complex task scaling.

Vision Reasoning Multi-agent

arXiv cs.AI·Jun 18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP is a verified benchmark evaluating AI agents on small-molecule preclinical pharmacology. 100 evaluations span mechanism-of-action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves 59.3% success rate, GPT-5.5 55.3%. No system reliably masters these decisions.

AI Agents Benchmarks Claude

arXiv cs.AI·Jun 18

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

NeSyCat Torch unifies neurosymbolic semantics (classical, fuzzy, probabilistic, neural) under a single truth definition parametrized by monads. Implemented in PyTorch, JAX, and HaskTorch, the framework interprets computational symbols via neural networks. On MNIST addition, outperforms LTN and DeepProbLog in speed and accuracy.

Reasoning Reinforcement learning Papers

arXiv cs.CL·Jun 18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Activation steering improves synthetic data generation for low-resource languages. Two strategies tested: Language Steering (linguistic identity) and Quality Steering (well-formedness). Evaluation across 4 open-source LLMs, 11 languages, classification tasks. Early-layer steering increases diversity and downstream performance.

Prompt engineering Fine-tuning Benchmarks

arXiv cs.CL·Jun 18

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem introduces a memory architecture for personalized dialogue agents on edge devices (8 GB VRAM). Replaces cosine similarity with Fisher-Rao metric for retrieval and uses Fisher-guided token distillation for compression. Achieves +4.51 pp gains in open-domain reasoning and +4.17 pp in temporal reasoning on LOCOMO and LongMemEval-S benchmarks.

AI Agents RAG Embeddings

arXiv cs.CL·Jun 18

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Paper presents speech-driven approach for Chinese dialect discrimination. Combines MFCC features, HMM-DNN speech recognition model, attention mechanism and CNN. Evaluation on two benchmark Chinese dialect corpora shows improvement over state-of-the-art methods.

Voice Benchmarks Papers

Benchmarks — AI news · Signal IA