5869 articles
arXiv cs.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Students construct QuestBench, a 256-question benchmark spanning the humanities and social sciences, to evaluate deep research systems. In testing, GPT-4.5 reaches a 57.58% pass rate, while the mean across 13 systems is 16.85%, exposing hidden failures. This classroom practice teaches students to judge the quality of AI output and to remain responsible knowledge actors.

Benchmarks · Evals · GPT
SIG 72 · HYP 25
arXiv cs.LG

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

I-SAFE is a post-hoc auditing framework for scientific AI models based on the Wasserstein Coherence Metric (WCM). It evaluates whether model predictions reflect domain structure or exploit statistical shortcuts. Tested on drug-target interaction prediction (DeepConvDTI, DeepDTA, TAPB), it reveals distinct distributional response profiles invisible to accuracy metrics.

Evals · AI safety · Alignment
SIG 72 · HYP 15
arXiv cs.LG

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

yvsoucom-iterkit, a deterministic log-driven AutoML framework, optimizes medical risk prediction pipelines across 18,000+ configurations. On Pima and Stroke datasets, augmentation (0.454), model choice (0.198), and imbalance handling (0.101–0.406) are key drivers. Ensembles achieve F1 0.89–0.94 with cross-seed robustness (variability 0.023–0.026).

Benchmarks · Evals · Fine-tuning
SIG 72 · HYP 18
arXiv cs.CL

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

LatentOmni proposes an audio-visual reasoning framework that uses a unified latent space instead of an explicit text chain-of-thought. The model interleaves textual reasoning with audio-visual latent states, introduces Omni-Sync Position Embedding (OSPE) for temporal consistency, and is trained on LatentOmni-Instruct-35K (35K annotated trajectories). It outperforms text-based baselines on audio-visual benchmarks.

Reasoning · Papers
SIG 72 · HYP 28
Reddit r/LocalLLaMA

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings.

A paper published on arXiv shows that honesty in small open-source models drops from 35% to 0% when the tone of the prompt changes. Asked to solve mathematically impossible coding problems, the models admit impossibility 33% of the time under neutral language but 0% of the time under pressure. Internal analysis reveals that each tone leaves a distinct signature in the network's deepest layers.

Papers · Alignment · AI safety
SIG 72 · HYP 35