AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC is a lightweight safety monitor (387 parameters) that detects implicit harmful dialogue by analyzing hidden states during decoding, without additional forward passes. On DiaSafety and Harmful Advice, it improves AUROC from 0.683 to 0.714 and from 0.822 to 0.858, respectively. Deployment adds only 2.34% latency versus 79.40% for Qwen3Guard-Stream-4B.

AI safety Alignment Reasoning

arXiv cs.AI·May 26

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

arXiv paper reveals LLMs abandon correct diagnoses under escalating pressure in multi-turn clinical dialogue despite strong benchmark performance. Authors introduce Med-Stress (belief stability stress test), RBED (inference-time defense), and R-FT (resilience-oriented fine-tuning) to improve robustness across nine frontier models.

AI safety Alignment Evals

arXiv cs.LG·May 26

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

Method to identify attention-head circuits in pretrained transformers using spectral signal (time-integrated participation ratio), task-pattern filtering, and group ablation against matched-random control. Validated across 51M to 7B parameters, two architectures, four pretraining pipelines. Finding: 2-6 head induction circuit causally necessary in all models tested (94-100% drop after ablation).

Papers Reasoning Evals
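
The spectral signal named in the summary above builds on the participation ratio, a standard measure of how many eigenvalues of a spectrum carry appreciable weight: PR = (Σλ)² / Σλ². The sketch below shows only that standard quantity. How the paper integrates it over time is not stated in the summary, so nothing here should be read as the authors' recipe, and the function name `participation_ratio` is illustrative.

```python
def participation_ratio(eigenvalues):
    """Standard participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Ranges from 1 (one dominant direction) to len(eigenvalues) (flat spectrum).
    """
    s1 = sum(eigenvalues)
    s2 = sum(v * v for v in eigenvalues)
    if s2 == 0:
        raise ValueError("spectrum must contain at least one non-zero eigenvalue")
    return (s1 * s1) / s2


flat = participation_ratio([1.0, 1.0, 1.0, 1.0])    # all directions equally used
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])  # a single dominant direction
```

A flat spectrum gives a ratio equal to the number of eigenvalues, while a spectrum with one non-zero eigenvalue gives 1.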

arXiv cs.CL·May 26

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

WhenLoss introduces a diagnostic protocol to identify bottlenecks in long-context memory systems. Expected Predictive Compression (EPC) uses an LLM to anticipate future questions and preserve minimal evidence at write time. On LongMemEval (500 questions), EPC achieves 0.49 CSM score vs 0.44 for strongest baseline, reducing write-side gap to 0.04.

RAG Reasoning Benchmarks

arXiv cs.AI·May 26

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

Formal verification paper for LLM agent skills. Presents three composable methods: sound static capability-containment analysis via abstract interpretation, refinement type system for tool-call envelopes, and SMT-bounded model checking. Open-source JavaScript implementation (enclawed framework) with 53 unit tests and end-to-end CLI demo.

AI Agents AI safety Reasoning

arXiv cs.AI·May 26

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

QUIVER is a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. It defines sensitivity matrices, trajectory divergence, bifurcation thresholds, and distribution faithfulness. Validated on 8,200+ instrumented traces across three distinct architectures.

AI Agents Evals Papers

arXiv cs.AI·May 26

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

LGMT is an oracle-free evaluation framework using first-order logic to test LLM reasoning reliability. By deriving metamorphic relations from formal logical equivalences, it constructs semantically invariant test cases. Experiments on 6 state-of-the-art LLMs expose hidden defects missed by traditional static benchmarks.

Reasoning Evals Benchmarks

arXiv cs.CL·May 26

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

Researchers apply Direct Preference Optimization (DPO) to improve English-Mandarin code-switching transcription in Audio LLMs. Three failure modes identified: language omission, translation-instead-of-transcription, hallucination. Training on 100K pairs (570 hours) reduces MER up to 89.6% (in-distribution) and 20.0% (out-of-distribution).

Reinforcement learning Alignment Voice

arXiv cs.LG·May 26

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

InteractBind, a dataset of ~100k protein-ligand pairs with benchmark, evaluates whether models localize binding sites or merely predict binding likelihood. Eight tested models show strong binary prediction but weak binding-site localization, revealing gaps in physical interpretability.

Benchmarks Papers Evals

arXiv cs.CL·May 26

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

Systematic review of 139 studies on information fusion for document classification. Meta-analysis shows multimodal fusion improves accuracy by +5.28 percentage points (p=0.0016) and multiview fusion by +4.67% accuracy. Critical finding: only 11.8% of multimodal and 23.3% of multiview studies use statistical validation, undermining reproducibility.

Benchmarks Evals Papers

arXiv cs.CL·May 26

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Study on rationalization bias in LLM judges. Researchers test whether model explanations remain stable when non-evidential cues are perturbed (verbosity, confidence). They propose PROOF-BEFORE-PREFERENCE to improve cue invariance and reduce explanation anchoring.

Evals Reasoning Alignment

arXiv cs.CL·May 26

End-to-End Intracortical Speech Decoding from Neural Activity

Speech decoding from intracortical recordings in an ALS patient, without an external language model. An end-to-end Conformer decoder achieves a 23.80% character error rate on held-out validation data. Most errors stem from word-boundary segmentation failures.

Reasoning Benchmarks AI safety

arXiv cs.CL·May 26

Side-by-side Comparison Amplifies Dialect Bias in Language Models

arXiv paper showing that language models amplify dialect bias (AAVE vs. Standard American English) far more when comparing tweet pairs side by side than in isolated evaluation. Counterfactual fairness finetuning partially mitigates bias in isolation but fails in contrastive settings, exposing a gap in current evaluation frameworks.

Benchmarks AI safety Alignment

arXiv cs.CL·May 26

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

Study of temporal concept drift in legal NLP on 428K Ukrainian court decisions (2008-2026). Four transformer models (XLM-RoBERTa, legal variants) show severe forward degradation (−27.2 pp macro-F1) but robust backward transfer. Chronological continual learning eliminates catastrophic forgetting.

Benchmarks Fine-tuning Papers

arXiv cs.LG·May 26

Feature Lottery? A Bifurcation Theory of Concept Emergence

Applies bifurcation theory to detect, in real time, the emergence of structured representations in neural networks. A dynamic ratio β(t)/βc(t) based on the loss Hessian predicts four distinct transition regimes (SAE on Pythia, SSL on CIFAR, arithmetic grokking). At 5% of training, early atom purity predicts final convergence with a 12x improvement over baseline.

Papers Reasoning Fine-tuning

arXiv cs.LG·May 26

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

LoRDBA replaces low-rank LoRA adapter factors with binary sign carriers and channel-wise magnitude scales, reducing adapter footprint by over 10× while matching fp16 LoRA quality. Outperforms low-bit baselines at matched model sizes with ≤8% prefill latency overhead and ~1.6× training memory overhead versus fp16 LoRA.

Fine-tuning

arXiv cs.LG·May 26

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

PromptAudit evaluates how prompting strategies affect LLM-based vulnerability detection. Across 5 open-weight models and 1,000 CVEs (6,074 samples), standard chain-of-thought achieves strongest performance, while few-shot provides model-dependent gains. Adaptive chain-of-thought suppresses recall; self-consistency induces excessive abstention.

Prompt engineering Evals AI safety

arXiv cs.LG·May 26

LLMs Show No Signs Of Individuated Metacognition

Analysis of 20 frontier LLMs across 6 benchmarks: stated confidence does not reflect individual model capabilities. Tetrachoric factor analysis reveals confidence matrix is approximately rank-one. Models share a common item-difficulty axis and differ mainly in decision thresholds. No evidence of significant verbalised individuated metacognition found.

Evals Benchmarks Reasoning

arXiv cs.LG·May 26

Fourier Feature Pyramids for Physics-Informed Neural Networks

Beignet, a new neural network architecture for solving partial differential equations (PDEs), replaces random Fourier feature embeddings in PINNs with a trainable multi-resolution Fourier feature pyramid. The model efficiently computes spatial derivatives via FFT and achieves higher accuracy with fewer parameters than existing PINN methods.

Papers Benchmarks Reasoning

arXiv cs.LG·May 26

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

CurveRL introduces a distribution-aware prompt reweighting method for Reinforcement Learning with Verifiable Rewards (RLVR) using quantile coordinate transforms. Weights depend on the rank and density of pass rates rather than their absolute values, consistently outperforming GRPO and other RLVR baselines across benchmarks.

Reasoning Reinforcement learning Papers

Reddit r/MachineLearning·May 25

DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]

A DCGAN with 12.6M parameters runs on a RISC-V CH32H417 microcontroller (512KB SRAM). It generates 64×64 cat faces in 26 seconds using a pure C inference engine with int8 per-channel quantization. Weights are streamed from an SD card via double buffering. The Z vector is seeded with 200 bytes of quantum random data (ANU QRNG). No existing frameworks (TFLite, CMSIS NN) were used; the engine was built from scratch.

Code generation Benchmarks Open source

Reddit r/LocalLLaMA·May 25

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

RTPurbo transforms full-attention LLMs into sparse models in hundreds of training steps. The method exploits three observations: only certain heads require full attention, long-range retrieval uses a 16D subspace, and token selection is query-dependent. Results: 9.36x prefill speedup at 1M context, 2.01x decode speedup, accuracy preserved.

Reasoning Benchmarks Infrastructure

Reddit r/LocalLLaMA·May 25

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)

Numind releases NuExtract3, a 4B open-weight VLM based on Qwen3.5-4B (Apache-2.0 license). The model extracts structured data and converts documents and images to Markdown. Trained for 3 days on 8xH100, it handles PDFs, forms, and tables, and is available in multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6) for self-hosting from 4GB VRAM.

Qwen Vision Open source

Reddit r/LocalLLaMA·May 25

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR RotationZoo provides precomputed rotation matrices for INT2 KV-cache quantization. The method achieves ~7× KV-cache memory compression with single-digit accuracy drop on GPQA for dense reasoning models (Qwen3-4B, Qwen3-8B, GLM-4.7). Code and rotations available on HuggingFace.

Benchmarks Open source Qwen

The Decoder·May 25

Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars

Google DeepMind's AlphaProof Nexus autonomously solved nine open Erdős problems, including two unsolved for 56 years, for a few hundred dollars per problem. The system uses the Lean compiler to automatically verify each proof step; its reported success rate is 2.5%.

DeepMind Reasoning Benchmarks

arXiv cs.AI·May 25

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

A-LEMS proposes the Energy per Successful Goal (EpG) metric for agentic systems in place of per-inference energy. Across 8 task families, agentic workflows consume 4.33x more energy than linear execution (888.1 J vs 205.3 J). The overhead is driven by orchestration, not compute.

AI Agents Benchmarks Evals
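
As a rough illustration of the goal-level accounting in the A-LEMS summary above, the sketch below divides total energy by the number of goals that succeeded. The summary does not give the exact EpG definition, so the formula, the function name `energy_per_successful_goal`, and the sample run data are all assumptions; only the 888.1 J and 205.3 J figures come from the summary.

```python
def energy_per_successful_goal(runs):
    """Assumed definition: total energy spent / number of successful goals.

    runs: list of (energy_joules, succeeded) tuples, one per attempted goal.
    Energy spent on failed attempts still counts toward the numerator.
    """
    total_energy = sum(energy for energy, _ in runs)
    successes = sum(1 for _, succeeded in runs if succeeded)
    if successes == 0:
        return float("inf")  # no goal achieved: energy cost per success is unbounded
    return total_energy / successes


# Hypothetical runs: two successes, one failure, 300 J in total.
sample_epg = energy_per_successful_goal([(100.0, True), (150.0, False), (50.0, True)])

# Overhead ratio implied by the two figures quoted in the summary.
overhead_ratio = 888.1 / 205.3
```

Rounded to two decimals, `overhead_ratio` matches the 4.33x figure quoted in the summary.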

arXiv cs.LG·May 25

PACE: Two-Timescale Self-Evolution for Small Language Model Agents

PACE is a self-evolution framework for small language model agents (4B–14B parameters). It coordinates prompt refinement with control-logic updates via held-out validation, without requiring frontier models. Across 12 backbone–benchmark combinations, PACE improves vanilla SLM agents by +9.2% and single-mode evolution baselines by +5.4%.

AI Agents Prompt engineering Reasoning

arXiv cs.AI·May 25

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

SciAtlas is a large-scale academic knowledge graph integrating 43M papers across 26 disciplines, 157M entities, and 3B triplets. It features a neuro-symbolic retrieval algorithm with tri-path collaborative recall and graph reranking to enhance semantic search and reduce inference costs for AI agents in automated scientific research.

AI Agents RAG Benchmarks

arXiv cs.CL·May 25

When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance

Study of 20 commercial and open-source LLMs across 182 religious pairings. Models exhibit persistent asymmetries: they favor conversions to Catholicism, Bahá'í, Sikhism and discourage conversions to Atheism, Agnosticism, Jehovah's Witnesses. Grok 4.20 shows strongest asymmetries. Patterns reproducible across question phrasings.

Llama GPT Alignment

arXiv cs.AI·May 25

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

EVE-Agent is a self-evolving agent that generates its own questions, answers, and verifiable evidence spans without human annotations. An evidence verifier rewards text spans based on their marginal contribution to correct answers. The training curriculum becomes auditable and trustworthy without external oracles.

AI Agents Reasoning RAG

arXiv cs.AI·May 25

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

GENSTRAT introduces a benchmark for evaluating strategic reasoning in LLMs using procedurally generated card games. It evaluates 9 models (including GPT-5, Claude, and Gemini-3.1-Pro) across 36,000+ matches. The methodology decomposes competence across 6 axes and measures local volatility (jaggedness) to diagnose real-world deployments.

Benchmarks Reasoning GPT

arXiv cs.AI·May 25

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct integrates step-level rubrics to guide ReAct agents in multi-step search-intensive reasoning tasks. A rubric generator trained with GRPO optimizes list-wise Spearman rank-correlation against multi-judge expert consensus. Measured improvements on DeepResearchBench and SQA-CS-V2 across 8B/14B and frontier models.

AI Agents Reasoning Reinforcement learning
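
The Co-ReAct summary above mentions a list-wise Spearman rank-correlation objective. A minimal sketch of that statistic follows, using the textbook formula ρ = 1 − 6Σd² / (n(n² − 1)), which assumes no tied scores. How the paper turns this into a GRPO reward is not described in the summary; the function names and the sample scores are illustrative only.

```python
def ranks(scores):
    """Return 1-based ranks of scores (ascending); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rubric scores versus a consensus ordering of the same three items.
agree = spearman([0.1, 0.5, 0.9], [1, 2, 3])
disagree = spearman([0.9, 0.5, 0.1], [1, 2, 3])
```

Perfect agreement in ordering yields 1.0 and a fully reversed ordering yields -1.0, regardless of the raw score values.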

arXiv cs.AI·May 25

RMA: an Agentic System for Research-Level Mathematical Problems

RMA is a multi-agent system for solving research-level mathematical problems. It decomposes proof solving into specialized modules (problem analysis, literature search, verification) coordinated by initializer, proposer, and verifier agents. On the First Proof benchmark (10 problems), RMA solves 8/10 and outperforms GPT-5.2R and Aletheia.

AI Agents Multi-agent Reasoning

arXiv cs.CL·May 25

Learnability-Informed Fine-Tuning of Diffusion Language Models

New LIFT method for fine-tuning diffusion language models (DLMs). Analysis shows vanilla SFT ignores token learnability based on masking. LIFT aligns learning with diffusion steps: easy tokens when input is masked, hard tokens with more context. Up to 3x gains on AIME'24/25 vs SFT baselines.

Fine-tuning Reasoning Benchmarks

arXiv cs.CL·May 25

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

QASC (Query-Adaptive Semantic Chunking) improves document segmentation for RAG by integrating user queries at the chunking stage. Using cosine similarity scoring, contextual window expansion, and chunk-level aggregation, QASC achieves F1=0.85, an 18-27% relative improvement over fixed chunking and 8-12% over semantic/agentic methods on 100 technical documents and 200 queries.

RAG Benchmarks Papers
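
The three ingredients named in the QASC summary above (cosine similarity scoring, contextual window expansion, chunk-level aggregation) can be sketched as follows. This is not the paper's implementation: bag-of-words vectors stand in for whatever embeddings the authors use, and the function names, `threshold`, and `window` parameters are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_adaptive_chunks(sentences, query, threshold=0.2, window=1):
    """Score sentences against the query, expand hits by `window`, merge neighbours."""
    q = bag_of_words(query)
    keep = set()
    for i, sentence in enumerate(sentences):
        if cosine(bag_of_words(sentence), q) >= threshold:
            # Contextual window expansion: pull in neighbouring sentences.
            keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    # Chunk-level aggregation: merge contiguous kept sentences into one chunk.
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        if i in keep:
            current.append(sentence)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


docs = [
    "Install the package first.",
    "The cache stores query results.",
    "Eviction uses an LRU policy.",
    "Logging is optional.",
    "See the changelog for history.",
]
chunks = query_adaptive_chunks(docs, "cache eviction policy")
```

With these sample sentences, the two matching sentences and their immediate neighbours merge into a single chunk, and the unrelated final sentence is left out.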

arXiv cs.CL·May 25

RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation

RAS (Reflection-Augmented Scaling) improves Cypher query generation by leveraging database error messages through in-context learning. Across three Neo4j datasets and five code-specialized language models, RAS reduces Query Execution Error Rate by 41–50% (n=5), outperforming independent resampling (32–38%).

Code generation Reasoning Benchmarks

arXiv cs.CL·May 25

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Red-teaming study of 30+ open-source LLMs (10 families, 5 countries) measuring capacity to generate biased political content via jailbreaks. Findings: systematic asymmetries (left-leaning bias), Overton Window contraction with model size, substantial regional differences, variable jailbreak potency across model families.

AI safety Alignment Open source

arXiv cs.CL·May 25

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

TPA (Think, Plan, Ask) is a multi-agent framework that guides LLMs to proactively select questioning strategies for assessing Social Language Disorder (SLD) traits in autism. Tested on 484 clinical episodes (ADOS-2), TPA achieves 82.1% SLD trait coverage vs 65.5% for clinicians, with superior diagnostic efficiency (AUCC: 0.628 vs 0.458).

AI Agents Multi-agent Reasoning

arXiv cs.CL·May 25

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Cross-lingual red-teaming study of four MLLMs (Claude Sonnet 4.5, GPT-5, Pixtral Large, Qwen Omni) showing that jailbreak vulnerability varies by language. Role-play attacks are less effective in Mexican Spanish, while visual attacks are more effective. Safety rankings do not transfer across languages.

AI safety Alignment Evals

arXiv cs.CL·May 25

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Audit of 11 long-context reasoning benchmarks finds none jointly control task position, filler content, and context length. Evaluation of 9 LLMs using Context Rot Evaluation (CRE) reveals sharp accuracy drops when target task moves from end to middle (e.g., Mimo-v2-Flash -88pp at 64K). Newer model releases show reduced positional vulnerability.

Benchmarks Reasoning Evals
