Topic

#Benchmarks

AI benchmarks are standardized test suites that measure model performance on well-defined tasks so that different models can be compared on the same footing. For example, MMLU evaluates how well language models answer questions across more than 50 academic disciplines.
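To make the idea concrete, here is a minimal sketch of how a multiple-choice benchmark in the style of MMLU could be scored: accuracy per subject, then an unweighted average across subjects. The `Question` type, the toy questions, and the `always_first` baseline are all hypothetical stand-ins for illustration; this is not the official MMLU harness or data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Question:
    """One multiple-choice item (hypothetical schema, not the MMLU format)."""
    subject: str
    prompt: str
    choices: Sequence[str]
    answer: int  # index of the correct choice


def accuracy_by_subject(
    questions: Sequence[Question],
    predict: Callable[[Question], int],
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions for each subject."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for q in questions:
        totals[q.subject] = totals.get(q.subject, 0) + 1
        if predict(q) == q.answer:
            correct[q.subject] = correct.get(q.subject, 0) + 1
    return {s: correct.get(s, 0) / n for s, n in totals.items()}


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over subjects, so small subjects count as much as large ones."""
    return sum(scores.values()) / len(scores)


def always_first(q: Question) -> int:
    """Trivial baseline standing in for a real model: always pick choice 0."""
    return 0


# Toy data invented for this example.
TOY_QUESTIONS = [
    Question("math", "2 + 2 = ?", ["4", "5", "3", "22"], 0),
    Question("math", "d/dx of x^2?", ["x", "2x", "x^2", "2"], 1),
    Question("biology", "Organelle that produces most ATP?",
             ["mitochondrion", "ribosome", "nucleus", "lysosome"], 0),
]

scores = accuracy_by_subject(TOY_QUESTIONS, always_first)
print(scores, macro_average(scores))
```

Because every model is scored on the same questions with the same metric, the resulting numbers are directly comparable. Whether to report a macro average (as here) or a per-question average is a design choice that varies between benchmarks.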

40 Articles

3 Sources

74 Average signal

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

CEO-Bench: Can Agents Play the Long Game?

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

VISUALSKILL: Multimodal Skills for Computer-Use Agents

LLM Parameters for Math Across Languages: Shared or Separate?

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Dual Dimensionality for Local and Global Attention

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

TW-LegalBench: Measuring Taiwanese Legal Understanding

Output Vector Editing for Memorization Mitigation in Large Language Models

RedactionBench

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

Approximate Structured Diffusion for Sequence Labelling

Enhanced Graph Neural Networks using K-Hop Gaussian Diffusion

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects