Topic

#Evals

Evals (evaluations) are standardized tests that measure an AI model's capabilities and limitations on specific tasks. EleutherAI's lm-evaluation-harness, for instance, is a widely used open-source framework for benchmarking language models consistently.

40 Articles

5 Sources

73 Avg. signal

arXiv cs.CL·Jun 18

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs preserve uncertainty expressions in fewer than 50% of cases and struggle with nuanced distinctions between adjacent levels. Reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

arXiv study assessing LLM ability to interpret negation in figurative language. Researchers annotate an existing dataset and evaluate multiple models. Finding: negation combined with figurativeness presents a particular challenge, with performance heavily dependent on prompt style.

Evals Prompt engineering Reasoning

arXiv cs.AI·Jun 18

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. Models receive a structured game state and predict hidden future states; the benchmark continues the simulation to score forecasts. Enables questions at arbitrary time horizons, counterfactual worlds, and rare events.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Study of collateral damage in LLM machine unlearning. Authors show damage propagates beyond the forget set following semantic distance gradients, and propose PreUnlearn, a pre-unlearning prediction method to audit risks before execution.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Steerable Cultural Preference Optimization of Reward Models

Novel SCPO algorithm for training reward models that balance diverse cultural preferences across subcommunities. Achieves 7-point improvements for minority reward models on PRISM and GlobalOpinionQA (7 countries), with 280% better training data efficiency than full-finetuning.

Alignment Reinforcement learning Evals

arXiv cs.CL·Jun 18

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL is an optimization framework for information extraction using particle filtering and Bayesian updates to systematically refine label representations. It generalizes across sequence labeling and relation classification tasks, demonstrating consistent improvements over existing approaches across model scales.

Prompt engineering Reasoning Evals

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed lawyer qualification threshold (11%) but fall short for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced with R-Score, a character-level metric, it shows 35 models (NER, SLM, frontier models) fail on contextual redactions: human consensus 89.4% for mandatory redactions, 47.7% for contextual ones.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Study on evaluating AI-generated radiology reports. Researchers show existing LLMs over-penalize harmless rephrasings while detecting clinical errors. They train lightweight metrics on Qwen3-8B and MedGemma-4B outperforming 32B medical models, with dataset and metric release planned.

Benchmarks Evals Papers

arXiv cs.CL·Jun 18

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum introduces a hierarchical knowledge graph framework for abstractive scientific summarization. The system organizes documents into semantically coherent units, generates an initial draft, then refines it through iterative verification and rewriting to ensure logical coherence and factual faithfulness.

Papers RAG Reasoning

arXiv cs.CL·Jun 18

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

LM-guided counterfactual recommendation pipeline to improve medical communication in text-based telemedicine. System identifies interpretable features (tone, personalization, clarity, completeness) and recommends minimal communication changes predicted to increase positive feedback (+6.41% mean gain). Modifications preserve medical content and physician control.

Reasoning Evals RAG

arXiv cs.CL·Jun 18

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a stochastic prompt optimization framework using multi-agent guided exploration. Compares three strategies: error-informed random search, genetic algorithm, and SAGE with diagnostic code execution. Deployed on mental-health chatbot: 8 cycles of noisy A/B tests compound into statistically robust next-day retention gain.

Prompt engineering AI Agents Multi-agent

arXiv cs.LG·Jun 18

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA is an air traffic control training simulator automating pilot roles through speech recognition, instruction interpretation, and response generation. The system reduces Word Error Rate from 107.80% to 23.45% on Singaporean-accented aviation speech, and evaluates trainee radiotelephony communications achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores.

Voice Fine-tuning Evals
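A Word Error Rate above 100%, like the 107.80% baseline reported above, is possible because WER divides the total number of word-level edits by the reference length, and insertions are unbounded. The sketch below uses the standard edit-distance definition of WER; the function name and the example phrases are illustrative assumptions and are not taken from the ASTRA paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: a two-word reference and a six-word hypothesis.
# One substitution plus four insertions gives 5 edits / 2 words = 2.5 (250%).
print(word_error_rate("descend now", "the send and now then please"))
```

A hypothesis that is much longer than the reference, as can happen when a recognizer hallucinates on accented speech, pushes the ratio past 1.0.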

arXiv cs.LG·Jun 18

Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds

New geometric complexity measure called Fisher width, a Fisher-geometric analogue of Gaussian width on statistical manifolds. Replaces Euclidean geometry with Fisher information metric to capture local statistical curvature. Develops foundational theory with generalization bounds and computable estimators, validated on MNIST.

Papers Benchmarks Evals

arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety

arXiv cs.LG·Jun 18

P²CE: Model-Agnostic Plausible Pareto-Optimal Counterfactual Explanations

P²CE generates plausible Pareto-optimal counterfactual explanations for ML models. The algorithm uses isolation forests and SHAP values to balance feasibility, plausibility, and computational efficiency. Evaluated on 3 datasets, it outperforms existing methods in solution quality and speed.

Evals

arXiv cs.LG·Jun 18

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Study on grokking (delayed transition from memorization to generalization). Authors show weight norm doesn't directly control grokking delay but acts through logit scale. Fixing norm and varying output temperature, they recover 85% of delay by matching logit scale. Effect is loss-dependent (cross-entropy vs MSE). Logit scale and softmax saturation are the proximal variables.

Papers Reasoning Evals

arXiv cs.AI·Jun 18

Searching for Synergy in Shared Workspace Human-AI Collaboration

Study of human-AI team collaboration in a shared workspace using Collaborative Gym and DiscoveryBench. Adding collaborators improves performance only when there is a coordination structure. Scaffolding that combines shared memory and human-in-the-loop gates increases performance, especially in three-person teams, by clarifying responsibilities and routing expertise.

AI Agents Multi-agent Evals

arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 18

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench is a safety evaluation benchmark for LLMs in AI4Science workflows. It covers 7 disciplines, 31 sub-disciplines, and 10 risk dimensions. The authors evaluate mainstream and science-oriented LLMs to diagnose safety gaps across risk categories.

Benchmarks AI safety Evals

arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision

arXiv cs.AI·Jun 18

Analysing drivers and interdependencies in European electricity markets using XAI

Study combining deep neural networks with XAI (SHAP, SSHAP) to analyse 39 European electricity bidding zones. Identifies solar energy as disproportionate price driver, gas prices as dominant factor, and interconnections revealing interdependence of electricity markets.

Evals Papers

arXiv cs.AI·Jun 18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP is a verified benchmark evaluating AI agents on small-molecule preclinical pharmacology. 100 evaluations span mechanism-of-action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves 59.3% success rate, GPT-5.5 55.3%. No system reliably masters these decisions.

AI Agents Benchmarks Claude

arXiv cs.CL·Jun 18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PhysAssistBench is an interactive medical assistance benchmark with 1,296 physician-validated turns built from real MIMIC-IV cases. It evaluates LLMs' ability to coordinate clinical knowledge, patient communication, and EHR system interaction within single dialogues. Experiments show current models remain unreliable in this setting.

Benchmarks AI Agents Multi-agent

arXiv cs.CL·Jun 18

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Study evaluating 42 LLMs (proprietary and open-source) on their ability to measure item discrimination in reading comprehension. Models fail: Spearman correlation of 0.152 in direct prediction, 0.241 in CTT calibration. LLMs do not reliably capture how assessment items distinguish students of different proficiency levels.

Benchmarks Evals Papers

arXiv cs.LG·Jun 18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT refines SFT training data distribution using on-policy Influence Functions. The method uses model rollouts as validation targets to minimize proximity gap and debias gradient norm bias. Experiments on 7B instruction and reasoning models show consistent performance ceiling improvements over existing curation baselines.

Fine-tuning Reinforcement learning Evals

arXiv cs.LG·Jun 18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Sparse Autoencoders (SAEs) decompose activations into interpretable features, but this study shows that clamping a 'harmful' feature does not eliminate the behavior—it can recover via other residual pathways. Even with active intervention, 95.8% behavior recovery is achievable in refusal-steering, exposing a gap between feature-level control and behavioral completeness.

AI safety Alignment Evals

arXiv cs.LG·Jun 18

Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance

RGNet, a neural network architecture based on the renormalization group, addresses class imbalance and multidimensional noise in fault diagnosis. The model hierarchically compresses the feature space and captures both local details and global patterns. Tested on the imbalanced AI4I dataset.

Papers Evals Benchmarks

arXiv cs.LG·Jun 18

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

arXiv paper demonstrates that on biomedical tabular data, measurement noise limits the advantage of nonlinear models (deep networks, gradient boosting) over linear regression. Degree-k interactions are attenuated by the k-th power of feature reliability, while linear components are attenuated only once. Analysis of 140 UK Biobank tasks confirms this noise signature.

Benchmarks Evals
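The attenuation claim above can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not from the paper: two independent standard-normal features, independent additive Gaussian measurement noise, and the same reliability for both features, where reliability is defined as true variance divided by observed variance. All function names are hypothetical. Under these assumptions the fitted linear slope shrinks to roughly the reliability, and the fitted degree-2 interaction slope shrinks to roughly the reliability squared.

```python
import random


def fitted_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_attenuation(reliability, n=100_000, seed=0):
    """Return (linear slope, interaction slope) fitted on noisy features.

    The true coefficients are both 1.0, so each returned slope is the
    attenuation factor caused by measurement noise.
    """
    rng = random.Random(seed)
    # With unit-variance features, reliability = 1 / (1 + noise_sd ** 2).
    noise_sd = (1.0 / reliability - 1.0) ** 0.5
    noisy_x1, true_x1, noisy_product, true_product = [], [], [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        m1 = x1 + rng.gauss(0, noise_sd)  # measured version of x1
        m2 = x2 + rng.gauss(0, noise_sd)  # measured version of x2
        true_x1.append(x1)
        noisy_x1.append(m1)
        true_product.append(x1 * x2)
        noisy_product.append(m1 * m2)
    linear = fitted_slope(noisy_x1, true_x1)
    interaction = fitted_slope(noisy_product, true_product)
    return linear, interaction


linear, interaction = simulate_attenuation(reliability=0.8)
print(f"linear slope ~ {linear:.2f}, interaction slope ~ {interaction:.2f}")
```

With a reliability of 0.8, the theoretical values in this toy setup are 0.8 for the linear term and 0.8² = 0.64 for the pairwise interaction, so higher-order structure loses signal faster than linear structure.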

arXiv cs.LG·Jun 18

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Evaluation protocol for single-image-to-3D mesh quality using VLM judges (vision-language models). Authors demonstrate that cheap proxies (CLIP similarity, geometry validity stats) fail to correlate with perceived quality. Their VLM-judge protocol with position-bias correction achieves Cohen's kappa = 0.66 between two independent judge families.

Vision Evals Benchmarks
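Cohen's kappa, the agreement statistic quoted above, corrects raw agreement between two raters for the agreement expected by chance. The sketch below shows the standard two-rater formula only; the function name and the example verdicts are illustrative assumptions, and the paper's position-bias correction is not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently at their own marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two judge models on four meshes.
judge_1 = ["good", "good", "bad", "bad"]
judge_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(judge_1, judge_2))  # observed 0.75, chance 0.5, kappa 0.5
```

In this example the judges agree on three of four items, but half of that agreement would be expected by chance, so kappa is 0.5 rather than 0.75.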

arXiv cs.LG·Jun 18

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

Reject inference methods used in credit scoring to correct survival bias mask a structural failure: accuracy can improve while the ability to correctly reject defaulters collapses. Authors propose a controlled exploration strategy (approving 2-5% of rejected applicants) to diagnose this deterioration without strong statistical assumptions.

Benchmarks AI safety Evals

arXiv cs.LG·Jun 18

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

Causal framework for sleep recovery scoring from multimodal polysomnography. Uses DAG learning on two cohorts (MESA n=1540, MrOS n=825) to identify five physiological domains (respiratory burden, hypoxia, fragmentation, architecture, autonomic regulation). Sleep Recovery Score (SRS) achieves 2.5× stronger alignment with perceived recovery than standard AHI.

Papers Reasoning Evals

arXiv cs.AI·Jun 18

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT is a modular agentic-RAG framework reducing VLM hallucinations through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier). Ungrounded claims trigger targeted re-retrieval. 23 component-wise metrics and CaVeScore measure citation faithfulness and cross-modal grounding. Results: 87.1% accuracy on ScienceQA, 55.2% on MMMU.

Vision RAG AI Agents

arXiv cs.AI·Jun 18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Xcientist is a research harness that externalizes research synthesis and experimental validation for AI scientists into inspectable, contract-governed processes. It organizes literature evidence, idea states, implementation plans, and repair traces as persistent research artifacts, eliminating claim drift where runnable artifacts no longer support the originally claimed mechanism.

AI Agents Reasoning Evals

The Decoder·Jun 17

Microsoft researcher builds a working neural network out of goats in Age of Empires II to critique AI science

A Microsoft researcher built a working neural network using goats in Age of Empires II's map editor to critique AI research methods. His analysis of 315 papers found over 50% presuppose language models have human-like traits before the experiment begins.

Papers Alignment Evals

Reddit r/MachineLearning·Jun 17

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

Researcher experiments with iterative targeted SFT combined with mechanistic interpretability on a 31B model. Strategy: contrastive training on specific capability dimensions, then circuit ablation to map causal dependencies between dimensions and optimize future training order.

Fine-tuning Reasoning Evals

The Decoder·Jun 17

OpenAI researchers want to predict how often AI models will fail before launch

OpenAI researchers propose a method to predict how often a new AI model will make mistakes after release. This approach could fill gaps left by standard safety testing.

OpenAI Evals AI safety

arXiv cs.CL·Jun 17

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

MultiClin, a clinical ASR benchmark, evaluates speech recognition model robustness to multiscript variability (multiple valid orthographic forms of the same term). Conventional metrics underestimate performance. Script unification consistently yields best ASR performance.

Benchmarks Voice Evals

arXiv cs.CL·Jun 17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Analysis of 4,434 posts and 50,338 comments on Moltbook showing parasocial interaction cues (intimacy language, reciprocity bids, self-identification) persist in autonomous AI-agent communities. Results validated through keyword matching and LLM annotation reveal strong association between these signals and original poster re-engagement and sustained dyadic patterns.

AI Agents Multi-agent Papers

arXiv cs.LG·Jun 17

Rift: A Conflict Signature for Deception in Language Models

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).

AI safety Alignment Evals
