Topic

#Evals

Evals (evaluations) are standardized tests that measure an AI model's capabilities and limitations on specific tasks. EleutherAI's lm-evaluation-harness, for instance, is a widely used open-source framework for benchmarking language models consistently.

40 Articles · 6 Sources · 74 Avg. signal
arXiv cs.CL

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic audit of the FOLIO and MALLS benchmarks finds errors in 39% and 36% of their FOL formalizations, respectively. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, which reaches 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of 9 to 22 percentage points.

Benchmarks · Evals · Reasoning
SIG 82 · HYP 00
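
The idea of guided relabeling in the FOLIO/MALLS item above can be illustrated with a toy simulation: review instances in order of a suspicion score and count how many reviews it takes to reach a target dataset accuracy. This is only a sketch under stated assumptions. The scores, the error flags, and the function name `reviews_needed` are hypothetical, and nothing here reproduces the authors' framework or their numbers.

```python
def reviews_needed(has_error, order, target_accuracy):
    """Count reviews needed to reach target_accuracy when reviewing in `order`.

    has_error[i] is True if instance i is mislabeled (toy ground truth).
    A reviewed instance is assumed to be fixed perfectly (an assumption).
    Returns len(order) if the target is never reached.
    """
    n = len(has_error)
    remaining = sum(has_error)  # mislabeled instances not yet fixed
    reviewed = 0
    for idx in order:
        if 1 - remaining / n >= target_accuracy:
            break
        reviewed += 1
        if has_error[idx]:
            remaining -= 1
    return reviewed


# Toy data: 10 instances, 4 of them mislabeled.
HAS_ERROR = [False, False, True, False, False, True, False, True, False, True]
# Hypothetical suspicion scores, e.g. produced by an LLM judge (higher = more suspect).
SCORES = [0.1, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.7, 0.4, 0.6]

guided_order = sorted(range(len(SCORES)), key=lambda i: -SCORES[i])
unguided_order = list(range(len(SCORES)))  # plain index order as a baseline

guided = reviews_needed(HAS_ERROR, guided_order, target_accuracy=0.85)
unguided = reviews_needed(HAS_ERROR, unguided_order, target_accuracy=0.85)
print(guided, unguided)
```

In this toy setup the score-guided order reaches the target after fewer reviews than index order, because the suspicion scores happen to rank the mislabeled instances first; with uninformative scores the advantage would disappear.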
arXiv cs.AI

ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

ChatHealthAI aligns structured EHR representations from a pretrained EHR foundation model with a frozen LLM's semantic space via a task-aware resampler. The multimodal framework integrates longitudinal patient representations with refined clinical event descriptions, improving interpretable clinical reasoning while maintaining competitive predictive performance on the EHRSHOT benchmark.

RAG · Reasoning · Evals
SIG 72 · HYP 00
Reddit r/MachineLearning

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

CVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. The maximum solve rate is 50% (60% under advisory). Agents produce patches that pass the tests but leave the vulnerability open. Cross-family gaps are significant (OpenAI vs Laguna, p<0.05); within-family differences are noise. Failure modes: wrong-search drift, hallucinations, context loss.

AI Agents · Benchmarks · AI safety
SIG 78 · HYP 00
arXiv cs.AI

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

Academic paper proposing product-aware autoencoders for anomaly detection in multi-product cyber-physical systems. Traditional global models create blind spots where attacks can evade detection. On the Tennessee Eastman Process benchmark, the product-aware model achieves 100% detection accuracy versus 22.2% for the global baseline in attack scenarios.

Benchmarks · AI safety · Evals
SIG 72 · HYP 00
arXiv cs.CL

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

arXiv study on the limits of LLM adaptation for annotation tasks. Toxicity detection experiments across diverse datasets show that roughly two-thirds of zero-shot errors resist correction via prompting (rescue rate 34.8%). Models follow misaligned definitions while maintaining confidence. The Definition-Specific Familiarity (DSF) metric correlates with performance (r=+0.41), outperforming memorization metrics.

Prompt engineering · Evals · Benchmarks
SIG 78 · HYP 00
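
One plausible reading of the "rescue rate" in the annotation item above is the share of zero-shot errors that become correct once the prompt is changed. The sketch below computes it that way; this definition is an assumption rather than the paper's stated formula, and the labels are toy data.

```python
def rescue_rate(gold, zero_shot, prompted):
    """Share of zero-shot errors that the prompted run gets right.

    Assumed definition (not confirmed by the source):
        rescued / zero_shot_errors
    where an item is "rescued" if zero-shot was wrong and prompted is right.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label lists must have the same length")
    error_idx = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not error_idx:
        raise ValueError("no zero-shot errors, rescue rate is undefined")
    rescued = sum(prompted[i] == gold[i] for i in error_idx)
    return rescued / len(error_idx)


# Toy labels (1 = toxic, 0 = non-toxic), invented for illustration.
GOLD = [1, 0, 1, 1, 0, 0]
ZERO_SHOT = [0, 0, 0, 1, 1, 1]  # wrong on items 0, 2, 4, 5
PROMPTED = [1, 0, 0, 1, 1, 0]   # fixes items 0 and 5 only

print(rescue_rate(GOLD, ZERO_SHOT, PROMPTED))
```

Under this definition the share of errors that resist correction is simply one minus the rescue rate.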
arXiv cs.LG

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

Novel approach for Major Depressive Disorder detection from EEG without data augmentation. SGC (Score-Guided Classification) uses an unsupervised generative network to model pathological anomalies as a prior, fused with deep feature representations. A Cross-Channel Spatial Adaptation module handles multi-center channel heterogeneity. Validated on the Mumtaz2016 and MODMA datasets.

Papers · Evals · Vision
SIG 72 · HYP 00
arXiv cs.LG

Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference

Online, distribution-free framework for controlling Conditional Value-at-Risk (CVaR) in non-stationary and adversarial environments. Combines conformal tail risk control, online learning, and the Rockafellar-Uryasev variational representation. Provable safety guarantees for nonlinear tail risk under arbitrary data-generating processes. Applications: portfolio risk management and LLM toxicity mitigation.

Papers · AI safety · Reasoning
SIG 72 · HYP 00
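
For readers unfamiliar with the Rockafellar-Uryasev representation named in the CVaR item above, the standard form is CVaR_alpha(L) = min over tau of { tau + E[(L - tau)_+] / (1 - alpha) }. The sketch below evaluates that formula on a fixed sample of losses. It illustrates the textbook representation only, not the paper's online or conformal procedure, and the function names are made up for this example.

```python
def ru_objective(losses, tau, alpha):
    """Rockafellar-Uryasev objective: tau + mean((loss - tau)_+) / (1 - alpha)."""
    n = len(losses)
    excess = sum(max(loss - tau, 0.0) for loss in losses)
    return tau + excess / ((1.0 - alpha) * n)


def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha via the Rockafellar-Uryasev minimum.

    The objective is convex and piecewise linear in tau with kinks at the
    sample values, so it is enough to search over the samples themselves.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return min(ru_objective(losses, tau, alpha) for tau in losses)


# Toy losses 1..10: at alpha = 0.8 the worst 20% are {9, 10}.
LOSSES = [float(x) for x in range(1, 11)]
print(empirical_cvar(LOSSES, alpha=0.8))
```

On this sample the result agrees, up to floating-point error, with the average of the two largest losses, which is the usual tail-average reading of CVaR when (1 - alpha) times the sample size is a whole number.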
arXiv cs.CL

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

AEyeDE introduces an attention-based attribution framework for detecting AI-generated text using attention matrices from a proxy Transformer model. A lightweight CNN learns discriminative representations from these attribution maps. The method outperforms text-only baselines, shows strong generator-specific detection, and demonstrates robustness under cross-dataset transfer and spelling perturbations.

Papers · AI safety · Evals
SIG 72 · HYP 00
arXiv cs.AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Modern LLMs systematically overestimate their competence and attempt unsolvable queries. Researchers propose Capability Self-Assessment (CSA), formulated as a policy-learning problem using reinforcement learning, to teach models to recognize their limits. RL significantly outperforms supervised fine-tuning, preserves original capabilities, and generalizes out-of-distribution.

Reinforcement learning · Alignment · Evals
SIG 78 · HYP 00