Archives

May 2026

3148 articles

arXiv cs.LG

Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field

Flow-Direct introduces a training-free guidance framework for flow models using a persistent non-parametric guidance field. Analytically derived from the log-density ratio between base and reward-weighted target distributions, this field accumulates all evaluated samples to improve feedback efficiency and enable reusability without additional reward evaluations.

Papers · Reasoning · Reinforcement learning
SIG 72 · HYP 18

arXiv cs.CL

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs across 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via a Summarizer, SQL Retriever, and Python Coder. The evaluation covers 23 mainstream LLMs.

Benchmarks · Multi-agent · Code generation
SIG 78 · HYP 25

arXiv cs.CL

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

Study of prompt compression on LLaDA, an 8B-parameter diffusion LLM (DLLM), using LLMLingua-2. Evaluation on GSM8K, DUC2004, and ShareGPT at a 2× compression ratio shows that semantic preservation does not ensure stability in diffusion models: mathematical reasoning degrades substantially while summarization remains robust. Autoregressive compression methods do not transfer uniformly to DLLMs.

Prompt engineering · Benchmarks · Papers
SIG 72 · HYP 15

arXiv cs.CL

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

UA-StatuteRetrieval: 20-year benchmark on 396M Ukrainian court citations. Co-citation predictability declines 33-47% (Adamic-Adar MRR 0.43→0.29). Non-uniform decay: criminal law stable (~0.40), civil law collapses (0.35→0.15) post-2017 reform. Mid-frequency articles (1K-10K citations) lose 50% predictability. E5-large detects 4.3% semantic drift.
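
The two measures named in this summary, the Adamic-Adar score and mean reciprocal rank (MRR), can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the benchmark's own code; the toy graph, the article labels, and the helper names are hypothetical.

```python
import math

def adamic_adar(graph, u, v):
    """Adamic-Adar score: sum of 1/log(degree) over common neighbours of u and v."""
    common = graph[u] & graph[v]
    # In a symmetric graph every common neighbour has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(graph[z])) for z in common)

def rank_candidates(graph, query):
    """Rank articles not yet co-cited with `query` by descending Adamic-Adar score."""
    candidates = [n for n in graph if n != query and n not in graph[query]]
    return sorted(candidates, key=lambda n: adamic_adar(graph, query, n), reverse=True)

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct item; a query contributes 0 if it is missing."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)

# Hypothetical symmetric co-citation graph: article -> set of co-cited articles.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
```

A drop in MRR over time, as reported above, would mean the truly co-cited article appears further down lists ranked this way.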

Benchmarks · Embeddings · RAG
SIG 78 · HYP 15

arXiv cs.LG

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

Researchers identify preference instability in reward models under subtle input variations (paraphrasing, pattern injection, backdoors). They isolate unstable features using Sparse Autoencoders (SAEs) and propose two mitigation strategies: SAE Feature Steering and SAE Residual Correction, reducing incorrect preference assignments without retraining the model.

Alignment · AI safety · Evals
SIG 72 · HYP 18

arXiv cs.CL

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

Study reveals a visibility asymmetry in multilingual datasets: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets in the LRE Map and LDC. Using LLM-assisted citation-mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, of which 356 are openly accessible. Data scarcity is a documentation and discoverability issue, not just a production issue.

Benchmarks · Open source
SIG 78 · HYP 15

arXiv cs.LG

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

Theoretical paper proposing a unified framework for phase transitions in deep learning (grokking, emergent capabilities) and non-equilibrium chemistry. Introduces two gradient fields (entropy production rate and information quasi-potential) and two order parameters (adversarial breakdown threshold α†, self-referential coupling threshold κc) defining a candidate universality class.

Reasoning · Alignment · Papers
SIG 45 · HYP 25

arXiv cs.CL

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

New arXiv paper proposing HRC (Hybrid Reward-Cyclic), a reward model decomposing human preferences into transitive (scalar) and cyclic (vector) components via game theory. Introduces DSPPO (Dynamic Self-Play Preference Optimization) for dynamic alignment. Improves RewardBench 2 (+1.23% on Gemma-2B-it) and achieves 44.75% on AlpacaEval 2.0.

Reinforcement learning · Alignment · Papers
SIG 72 · HYP 25

arXiv cs.LG

Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories

Residual gap-aware transformer model for predicting Alzheimer's disease progression over 24 months. Trained on 2,600 ADNI samples, it reduces MSE by 13.1% and increases prediction-observation correlation by 26.4% versus linear mixed-effects baseline, combining statistical reference with residual learning from irregular clinical and biomarker histories.

Benchmarks · Papers · Reasoning
SIG 72 · HYP 15

arXiv cs.CL

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

Agentic AI Translate is an agentic translator prototype replacing the text-in/text-out paradigm with a four-stage cycle (Identify → Prompt → Generate → Verify). Users compose a structured translation brief through model-assisted dialogue grounded in skopos theory, register, and genre conventions. Verification uses the GEMBA-MQM error-span protocol for evidence-grounded scoring.

AI Agents · Prompt engineering · Papers
SIG 45 · HYP 35

arXiv cs.CL

PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

PaliBench is a benchmark for Pali-to-English translation containing 1,700 passages (345,000 tokens) aligned with three independent reference translations. The method combines LLM-assisted alignment, automated verification, and multi-metric evaluation. Evaluation of ten contemporary LLMs shows strong cross-metric concordance but substantial variation in reliability.

Benchmarks · Evals · Papers
SIG 72 · HYP 15

arXiv cs.AI

Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory "VaCoAl" as a Substrate for Vector-HaSH and TEM

VaCoAl, a hyperdimensional memory built from Galois-field linear-feedback shift registers, provides a unified algebraic substrate for Vector-HaSH and the Tolman-Eichenbaum Machine. The approach maps two VaCoAl regimes to hippocampal EC-CA3 and EC-DG-CA3 pathways, connects deterministic GF(2) binding to Pearl's causal algebra, and derives testable iEEG predictions.

Reasoning · Papers · Alignment
SIG 72 · HYP 18

arXiv cs.AI

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba, a Mamba2-based model, performs online surgical phase recognition with O(d) per-frame cost. Three components address domain-specific challenges: dual-path SSD separating long/short-term regimes, intensity-modulated stepping adapting effective rate, and state regramming enabling cross-channel mixing. SOTA results: 94.6%/82.7% on Cholec80, 89.5%/68.9% on AutoLaparo, 238.74 fps on single GPU.

Reasoning · Benchmarks · Vision
SIG 82 · HYP 15

arXiv cs.AI

An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

Paper evaluating energy efficiency of neural vs heuristic combinatorial solvers. Defines Amortized Efficiency Threshold (AET): deployment volume where neural network training cost breaks even. On CVRP (n=50), attention-based solver from Kool et al. (2019) reaches energy parity at ~4560 deployed instances. Per-instance neural-to-heuristic ratio: 2.29e-3.
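
The break-even idea can be illustrated with a short sketch. It assumes the threshold is the one-off training energy divided by the per-instance energy saved relative to the heuristic; that formula is an assumption based on the one-line definition above, not taken from the paper. Only the ratio 2.29e-3 comes from the summary; the energy values are hypothetical and in arbitrary units, so the result is not expected to match the reported ~4560 instances.

```python
import math

def amortized_efficiency_threshold(train_energy, heuristic_energy, neural_energy):
    """Smallest number of deployed instances at which total neural energy
    (training + inference) no longer exceeds total heuristic energy.

    Assumed break-even condition:
        train_energy + n * neural_energy <= n * heuristic_energy
    Returns None when the neural solver never breaks even.
    """
    saving_per_instance = heuristic_energy - neural_energy
    if saving_per_instance <= 0:
        return None
    return math.ceil(train_energy / saving_per_instance)

# Hypothetical energies in arbitrary units; only the ratio is from the summary.
HEURISTIC_ENERGY = 1.0
NEURAL_ENERGY = 2.29e-3 * HEURISTIC_ENERGY  # per-instance neural-to-heuristic ratio
TRAIN_ENERGY = 1000.0                       # hypothetical one-off training cost

threshold = amortized_efficiency_threshold(TRAIN_ENERGY, HEURISTIC_ENERGY, NEURAL_ENERGY)
```

Because the per-instance ratio is so small, the threshold under this assumption is dominated by the ratio of training energy to heuristic per-instance energy.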

Benchmarks · Reasoning · Open source
SIG 75 · HYP 15

arXiv cs.AI

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Softmax-attention transformers can implement preconditioned Richardson iteration for in-context Gaussian kernel regression. Authors construct a single-head transformer with O(log(1/ε)) blocks achieving ε-accurate prediction on prompts of length N, where softmax attention produces a Gaussian-kernel operator and ReLU MLP layers perform local scalar arithmetic.
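
For readers unfamiliar with the underlying numerical method, the sketch below runs preconditioned Richardson iteration, alpha ← alpha + P(y − A·alpha), on a Gaussian kernel ridge system A = K + ridge·I. It illustrates the classical iteration only and is not the paper's transformer construction. The inverse-row-sum preconditioner, the ridge term, the bandwidth, and all function names are assumptions chosen so that the iteration converges on this toy problem.

```python
import math

def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def build_system(xs, bandwidth=1.0, ridge=0.1):
    """Kernel ridge system matrix A = K + ridge * I."""
    n = len(xs)
    return [[gaussian_kernel(xs[i], xs[j], bandwidth) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def max_residual(A, ys, alpha):
    """Largest absolute entry of y - A @ alpha."""
    n = len(ys)
    return max(abs(ys[i] - sum(A[i][j] * alpha[j] for j in range(n))) for i in range(n))

def richardson_solve(A, ys, tol=1e-10, max_iters=5000):
    """Preconditioned Richardson iteration with a diagonal preconditioner.

    Assumed preconditioner: P = diag(1 / row_sum(A)). Since A is symmetric
    positive definite with non-negative entries, the eigenvalues of P @ A lie
    in (0, 1], so the error contracts geometrically.
    """
    n = len(ys)
    precond = [1.0 / sum(row) for row in A]
    alpha = [0.0] * n
    for _ in range(max_iters):
        residual = [ys[i] - sum(A[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        if max(abs(r) for r in residual) < tol:
            break
        alpha = [alpha[i] + precond[i] * residual[i] for i in range(n)]
    return alpha

def predict(x, xs, alpha, bandwidth=1.0):
    """Kernel regression prediction at a new point x."""
    return sum(a * gaussian_kernel(x, xi, bandwidth) for a, xi in zip(alpha, xs))

# Hypothetical in-context examples (inputs and targets).
toy_xs = [0.0, 1.0, 2.0, 3.0]
toy_ys = [0.0, 1.0, 0.5, -1.0]
toy_A = build_system(toy_xs)
toy_alpha = richardson_solve(toy_A, toy_ys)
```

Geometric contraction means the iteration count needed for accuracy ε grows like log(1/ε), which is the same scaling as the block count quoted in the summary.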

Reasoning · Papers · Benchmarks
SIG 78 · HYP 15

arXiv cs.AI

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

Behavioral evaluation methodology for agentic AI systems: scoring intermediate decisions via LLM judge ensemble across 6 dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery). Behavioral score correlates at rho=0.72 with Sharpe ratio. Closed-loop reinforcement (SAC) reduces MAPE from 0.61% to 0.54% on 2017-2025 test set.

AI Agents · Reinforcement learning · Evals
SIG 78 · HYP 15

arXiv cs.AI

One-Block Transformer (1BT) for EEG-Based Cognitive Workload Assessment

1BT (One-Block Transformer) is a compact model for EEG-based cognitive workload assessment. With <0.5M parameters and 0.02 GFLOPs, it uses a minimal latent bottleneck and single cross-attention module. Tested on 11 participants (abstract reasoning, numerical problem-solving, video game), it achieves high workload classification performance for real-time deployment in resource-constrained settings.

Reasoning · Benchmarks · Papers
SIG 72 · HYP 18

arXiv cs.AI

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

MedSynapse-V proposes a framework for latent diagnostic memory evolution in medical imaging. Through meta-query memorization and Causal Counterfactual Refinement (CCR), the model dynamically synthesizes implicit diagnostic memories aligned with clinical logic. Evaluations across multiple datasets show significant gains over state-of-the-art methods.

Vision · Reasoning · Reinforcement learning
SIG 72 · HYP 35

arXiv cs.CL

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

arXiv study on automating Motivational Interviewing (MI) session coding for alcohol use reduction. Uses audio-language models (ALMs) with 4 complementary analytic prompts and self-consistency (12 reasoning trajectories per utterance). On 5 sessions, achieves 52.56% accuracy, 54.03% precision, 47.45% recall, macro-F1 46.40%, exceeding baselines.
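
Self-consistency and macro-F1, both mentioned above, can be sketched in a few lines. The example assumes self-consistency is plain majority voting over the labels produced by the sampled reasoning trajectories; the study's exact aggregation rule is not given in this summary. The label names are hypothetical placeholders, not the study's coding scheme.

```python
from collections import Counter

def self_consistent_label(trajectory_labels):
    """Majority vote over labels from independent reasoning trajectories.
    Ties go to the label that appeared first (Counter keeps insertion order)."""
    return Counter(trajectory_labels).most_common(1)[0][0]

def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels from 12 trajectories for a single utterance.
votes = ["change_talk"] * 7 + ["sustain_talk"] * 3 + ["neutral"] * 2
utterance_label = self_consistent_label(votes)
```

Macro-F1 weights every class equally, so it can sit below accuracy when rare classes are predicted poorly.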

Reasoning · Voice · Papers
SIG 45 · HYP 25

arXiv cs.AI

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Study of the implicit bias of Sharpness-Aware Minimization (SAM) on linear diagonal networks for binary classification. At depth L=1, both ℓ∞-SAM and ℓ2-SAM recover the ℓ2 max-margin classifier, as gradient descent (GD) does. At L=2, ℓ2-SAM exhibits "sequential feature amplification": the predictor initially relies on minor coordinates and then shifts to major ones, in contrast to GD.

Reasoning · Papers
SIG 72 · HYP 15

arXiv cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Systematic evaluation of multimodal LLMs on video anomaly detection (VAD) using ShanghaiTech and CHAD benchmarks. Models exhibit conservative bias in zero-shot settings: high precision but recall collapse. Class-specific instructions improve F1-score from 0.09 to 0.64 on ShanghaiTech, yet recall remains a critical bottleneck for real-world surveillance.

Vision · Reasoning · Prompt engineering
SIG 72 · HYP 25