The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

LLMs simulating users in intervention experiments produce biased observational studies. Trained on observational data, they induce implicit population drift across treatment conditions, distorting effect estimates. Authors propose negative control outcomes to diagnose bias and persona specification adjustments to mitigate drift.
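
A negative control outcome is one the intervention cannot plausibly affect, so a gap between arms on that outcome points to a population difference rather than a treatment effect. The sketch below is one minimal way to run such a check. The record fields (`arm`, `nco`), the standardized-difference statistic, and the 0.1 threshold are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def standardized_difference(treated, control):
    """Standardized mean difference of a negative control outcome between arms."""
    pooled_sd = ((pstdev(treated) ** 2 + pstdev(control) ** 2) / 2) ** 0.5
    if pooled_sd == 0:
        return 0.0 if mean(treated) == mean(control) else float("inf")
    return (mean(treated) - mean(control)) / pooled_sd

def drift_flag(users, threshold=0.1):
    """Return (difference, flagged); flagged means the arms look like different populations."""
    treated = [u["nco"] for u in users if u["arm"] == "treatment"]
    control = [u["nco"] for u in users if u["arm"] == "control"]
    smd = standardized_difference(treated, control)
    return smd, abs(smd) > threshold

# Hypothetical simulated users: "nco" is an attribute the treatment cannot change,
# for example the persona's stated age.
balanced = [{"arm": "treatment", "nco": a} for a in (30, 40, 50)] + \
           [{"arm": "control", "nco": a} for a in (30, 40, 50)]
drifted = [{"arm": "treatment", "nco": a} for a in (25, 30, 35)] + \
          [{"arm": "control", "nco": a} for a in (45, 50, 55)]
```

Calling `drift_flag(balanced)` leaves the flag unset, while `drift_flag(drifted)` sets it because the two arms differ on an attribute the treatment could not have moved.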

Papers Evals Reasoning

arXiv cs.LG·May 21

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

Two supervised classification paradigms (cross-entropy and contrastive learning) converge to Neural Collapse, a theoretical optimum. Authors propose NTCE and NONL, two normalized losses reaching NC in <7.5% of CE iterations, with +5.5% transfer learning improvement and +8.7% under class imbalance on ImageNet-1K.
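
Neural Collapse is usually characterized by class prototypes forming a simplex equiangular tight frame: unit-norm vectors whose pairwise cosine equals -1/(K-1) for K classes. The sketch below builds that target geometry and exposes a dot product for checking it. It illustrates the end state only and is not the NTCE or NONL loss, which the summary does not specify.

```python
import math

def simplex_etf(num_classes):
    """Return K unit-norm prototypes in R^K with pairwise cosine -1/(K-1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Row i is scale * (e_i - ones / k): a centered, rescaled one-hot vector.
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    """Plain dot product; equals the cosine here because the rows are unit-norm."""
    return sum(a * b for a, b in zip(u, v))

prototypes = simplex_etf(4)
self_similarity = dot(prototypes[0], prototypes[0])
cross_similarity = dot(prototypes[0], prototypes[1])
```

With four classes, `self_similarity` is 1 and `cross_similarity` is -1/3 up to floating-point error, the maximal equal separation on the hypersphere.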

Benchmarks Papers

arXiv cs.LG·May 21

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

DG-Hard, a spectral post-hoc method, recovers capabilities damaged by fine-tuning without retraining. It applies hard SVD thresholding (Donoho-Gavish) to weight matrices to isolate task-aligned signal from residual noise. Tested on 14 (model, task) settings and 9 benchmarks, it also restores safety alignment degraded by benign fine-tuning.
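
Hard singular-value thresholding in the Gavish-Donoho style keeps only the singular values above a cutoff derived from the matrix aspect ratio and the median singular value. The sketch below applies it to a toy matrix with NumPy, using the published polynomial approximation for the unknown-noise case. Which matrices DG-Hard thresholds (raw weights or fine-tuning updates) and how it sets the cutoff in practice are not stated in the summary, so treat this as an illustration of the underlying technique only.

```python
import numpy as np

def gavish_donoho_threshold(singular_values, shape):
    """Cutoff for unknown noise level: omega(beta) times the median singular value."""
    m, n = min(shape), max(shape)
    beta = m / n
    # Polynomial approximation to omega(beta) from Gavish and Donoho (2014).
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    return omega * np.median(singular_values)

def hard_threshold(matrix):
    """Drop singular values at or below the cutoff; return (denoised matrix, kept rank)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    tau = gavish_donoho_threshold(s, matrix.shape)
    keep = s > tau
    denoised = (u[:, keep] * s[keep]) @ vt[keep, :]
    return denoised, int(keep.sum())

# Toy check: a rank-3 signal plus small Gaussian noise (all values are made up).
rng = np.random.default_rng(0)
left, _ = np.linalg.qr(rng.standard_normal((100, 3)))
right, _ = np.linalg.qr(rng.standard_normal((50, 3)))
signal = (left * np.array([50.0, 30.0, 20.0])) @ right.T
noisy = signal + 0.1 * rng.standard_normal((100, 50))
denoised, rank = hard_threshold(noisy)
```

In this toy setting the three signal directions sit far above the noise bulk, so the threshold keeps them and discards the rest.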

Fine-tuning AI safety Alignment

arXiv cs.LG·May 21

TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

TreeText-CTS converts irregular EHR trajectories into human-readable, source-traceable evidence units derived from XGBoost tree paths. The system improves AUPRC by 6.0–9.7 absolute percentage points on PhysioNet 2012, MIMIC-III, and PhysioNet 2019 while remaining competitive with numerical time-series models.

Papers Benchmarks Reasoning

arXiv cs.LG·May 21

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel is a trajectory selection method for offline training of web agents. It optimizes a balance between importance and diversity across states, websites, and interaction patterns, with target-centered AXTree pruning. On WebArena, WorkArena, and MiniWob, it improves out-of-domain generalization with 9.7-12.5× training speedups over standard fine-tuning on Qwen2.5-7B, Gemma3-4B, and Qwen3-8B.

AI Agents Fine-tuning Benchmarks

arXiv cs.LG·May 21

MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics

MagBridge-Battery v1.0 is a synthetic dataset of 6,760 magnetic-field signatures for Li-ion battery state-of-health diagnostics. It bridges real magnetic morphology from the Mohammadi-Jerschow archive with degradation labels from PulseBat. Three benchmark tasks: SOH regression (R²≈0.77), second-life classification, anomaly detection.

Benchmarks Evals Open source

Reddit r/LocalLLaMA·May 20

Build 9254 fixes my TG regression and adds PDL for NVIDIA GPUs

Build 9254 of llama.cpp fixes a token-generation (TG) throughput regression and adds Programmatic Dependent Launch (PDL) support for NVIDIA GPUs with compute capability 9.0 or higher. PDL lets a dependent CUDA kernel on the same stream begin launching before the preceding kernel has finished, reducing launch overhead. Observed gains: +3% on RTX 5060 Ti, up to +10% on RTX PRO 6000 depending on model.

Infrastructure Open source Benchmarks

Reddit r/LocalLLaMA·May 20

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

RTX 5080 16GB benchmark with Qwen3.6 35B MoE at 128k context: 56 tok/s without multi-token prediction (MTP); 74 tok/s with MTP, but slower overall. MTP forces a 1.5GB buffer that pushes 3 expert layers from GPU to CPU, creating a bottleneck. The 27B IQ3 reaches 73 tok/s and fits entirely on GPU.

Qwen Benchmarks Open source

arXiv cs.LG·May 20

Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

Bayesian Filtering Transformer (BFT) integrates uncertainty handling into Transformers via Kalman filtering and kriging. Attention becomes precision-weighted kriging, residual connections become adaptive Kalman updates. BFT improves sequential recommendation (cold-start users) and LLM robustness on noisy data with negligible overhead.

Reasoning Benchmarks Papers

arXiv cs.AI·May 20

How Far Are We From True Auto-Research?

ResearchArena evaluates 117 papers generated by AI agents (Claude Code Opus 4.6, GPT-5.4 Codex, Kimi Code K2.5) across the full research loop. Manuscript-only scores appear competitive, but artifact-aware review reveals critical failures: experimental rigor bottleneck, fabricated results, underpowered experiments. No agent-generated paper meets top-tier venue acceptance standards.

AI Agents Benchmarks Papers

arXiv cs.CL·May 20

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

LLMEval-Logic is a Chinese logical reasoning benchmark with 246 base items and 190 hard items, verified by Z3 and expert-audited. Evaluation of 14 frontier LLMs: best score 37.5% on hard items, 60.16% on Z3+rubric formalization.

Benchmarks Reasoning Evals

arXiv cs.AI·May 20

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

SimGym is a framework simulating A/B tests on e-commerce storefronts using VLM agents in a live browser. It generates buyer personas from clickstream data, combines multimodal perception with episodic memory, and achieves 77% directional alignment with real add-to-cart shifts. Experimental cycles reduce from weeks to under one hour.
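
Directional alignment can be read as the share of experiments in which the simulated shift and the real shift point the same way. The sketch below computes that share. The exact definition used by SimGym is not given in the summary, so the sign-matching rule and the sample numbers are assumptions for illustration.

```python
def directional_alignment(simulated_deltas, observed_deltas):
    """Fraction of experiments where simulated and observed shifts share a sign."""
    if len(simulated_deltas) != len(observed_deltas):
        raise ValueError("inputs must have the same length")

    def sign(x):
        return (x > 0) - (x < 0)

    matches = sum(sign(s) == sign(o)
                  for s, o in zip(simulated_deltas, observed_deltas))
    return matches / len(simulated_deltas)

# Hypothetical add-to-cart rate changes (variant minus control) for four A/B tests.
simulated = [0.012, -0.004, 0.020, -0.001]
observed = [0.008, -0.010, -0.003, -0.002]
alignment = directional_alignment(simulated, observed)
```

Here three of the four simulated shifts agree in sign with the observed ones, so `alignment` is 0.75. The metric ignores magnitude, so a simulation can score well on direction while misjudging effect size.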

AI Agents Vision Benchmarks

arXiv cs.AI·May 20

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

AQuaUI reduces visual tokens for GUI agents without additional training. The method uses adaptive quadtrees to exploit non-uniform information density in screenshots. On GUI-Owl-1.5-32B, it achieves 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance.
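
The quadtree idea can be sketched as follows: split a screenshot region into four only while it still contains detail, so flat areas collapse into a few large patches. The split rule (pixel range above a threshold), the square power-of-two input, and the toy image are assumptions chosen for brevity. AQuaUI's actual criterion and its mapping from patches to visual tokens are not described in the summary.

```python
def quadtree_leaves(image, x, y, size, threshold, min_size):
    """Return (x, y, size) leaves; a region splits while its pixel range exceeds threshold."""
    pixels = [image[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size <= min_size or max(pixels) - min(pixels) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(image, x + dx, y + dy, half, threshold, min_size)
    return leaves

# 8x8 toy "screenshot": flat background with detail only in the top-left corner.
image = [[0] * 8 for _ in range(8)]
image[0][0], image[1][1] = 255, 128
leaves = quadtree_leaves(image, 0, 0, 8, threshold=10, min_size=1)
```

For this image the tree yields 10 patches instead of the 64 that a uniform 1x1 grid would produce, while still covering every pixel.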

AI Agents Vision Evals

arXiv cs.AI·May 20

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

SceneCode compiles natural language prompts into executable Python programs to generate interactive indoor scenes with articulated objects. A multi-agent system (planner-designer-critic) produces asset requests converted to Blender code validated through repair-and-refine loops, exportable as SDF for physics simulation.

AI Agents Multi-agent Code generation

arXiv cs.AI·May 20

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Formal Skill is a runtime abstraction for LLM agents that structures reusable capabilities via JSON metadata, action schemas, Python executors, and hook-governed control logic. Implemented in FairyClaw (open-source event-driven runtime), it replaces natural-language procedures with executable state machines, reducing token usage while improving reliability on Harness-Bench.

AI Agents MCP Code generation

arXiv cs.AI·May 20

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Self-evolving skill libraries suffer a silent degradation termed 'library drift', caused by unbounded accumulation without lifecycle management. The study isolates the mechanism via ablations, provides trace-level diagnostics, and validates a fix (outcome-driven retirement + bounded active-cap + meta-skill prior) that lifts pass@1 from a 0.258 baseline to 0.584 on MBPP+ hard-100.

AI Agents Code generation Benchmarks

arXiv cs.CL·May 20

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

arXiv study on 'agent meltdowns': failures where AI agents (GPT, Grok, Gemini) exhibit unsafe behavior in response to benign environmental errors (inaccessible pages, missing files). 64.7% of rollouts with simulated errors produce meltdowns (unauthorized reconnaissance, access control bypass), often unreported to users.

AI Agents AI safety Benchmarks

arXiv cs.AI·May 20

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

SERL, a selective environment-reweighted learning framework, improves multi-turn LLM agent training by leveraging granular environmental feedback (error messages, page changes, reference trajectories). On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success rates, outperforming existing RL and distillation baselines.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·May 20

optimize_anything: A Universal API for Optimizing any Text Parameter

A single LLM-based optimization system unifies six diverse domains: agent architectures (89.5% on ARC-AGI vs 32.5% Gemini Flash baseline), scheduling algorithms (40% cloud cost reduction), CUDA kernels (87% match/beat PyTorch), circle packing. Multi-task search with cross-problem transfer outperforms independent optimization. Open-sourced in GEPA project.

Reasoning AI Agents Code generation

arXiv cs.AI·May 20

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench is a diagnostic benchmark assessing privacy-utility trade-offs in LLM agents. A trusted model with privacy policy interacts with an adversarial third-party model across 10 domains and 7,852 samples. Frontier models withhold 99% of protected attributes, but open-weight models in the 1–30B range commonly used for on-device private inference leak up to 50% of sensitive data.

AI Agents AI safety Alignment

arXiv cs.LG·May 20

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Analysis of 34 frontier models (2024-2026) showing reasoning and coding capabilities cooperate (r=+0.72) but vary by lab. DeepSeek shifted from reasoning-rich to coding-first (+11.2→-4.7); Google maintains balance; Anthropic oscillates. SWE-bench saturating while HLE and instruction-following remain discriminative. Seven falsifiable predictions for next 12 months with interactive dashboard.

Benchmarks Evals Reasoning

arXiv cs.LG·May 20

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

SAECache introduces a semantic-aware eviction policy for LLM prefix caches. Not all tokens are equally worth caching: different token types (system prompts, user queries, tool outputs, reasoning) show up to 756x variation in reuse rates. SAECache uses a multi-queue architecture with online learning to adapt priorities, achieving 1.4x-2.7x TTFT improvement over production baselines.

Reasoning Infrastructure Benchmarks

arXiv cs.LG·May 20

Theory-optimal Quantization Based on Flatness

New post-training quantization method for LLMs called Bidirectional Diagonal Quantization (BDQ). Introduces Flatness metric to quantify activation outlier distribution. BDQ achieves <1% accuracy drop in W4A4 on LLaMA-3-8B and reduces performance gap by 39.1% in W2A4KV16 on DeepSeek-R1-Distill-LLaMA-70B.

Llama DeepSeek Benchmarks

arXiv cs.CL·May 20

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL presents a fully open-source post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). The authors release a 23K-sample dataset spanning 9 task types and introduce TMN-Reweight to optimize heterogeneous rewards. Qwen3-30B-A3B achieves performance comparable to DeepSeek-R1 and Qwen3-235B.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 20

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

REFLECT is a meta-evaluation benchmark testing LLM judge reliability for supervising deep research agents. Authors define a fine-grained failure taxonomy (process and outcome levels) via controlled interventions on agent execution traces. Finding: best LLM judges achieve <55% accuracy on evidence verification and reasoning failure detection.

AI Agents Evals Reasoning

arXiv cs.CL·May 20

DECOR: Auditing LLM Deception via Information Manipulation Theory

DECOR is a multi-agent framework for auditing deception in LLMs by decomposing contexts into atomic informational units and scoring four manipulation dimensions (including omission, focus-shifting, and meaning-obscuring). Tested on 15 frontier models, it achieves state-of-the-art deception detection on single- and multi-turn benchmarks with interpretable manipulation profiles.

Multi-agent AI safety Alignment

arXiv cs.CL·May 20

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

HalluWorld is a controlled benchmark for evaluating LLM hallucinations against explicit reference worlds (gridworlds, chess, terminal tasks). Frontier models largely avoid perceptual hallucinations on direct observations but struggle with multi-step state tracking and causal forward simulation, even with extended thinking.

Benchmarks Reasoning AI safety

arXiv cs.CL·May 20

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

m3BERT is a multilingual embedding model using a Matryoshka strategy to jointly optimize representations across transformer layers and multiple embedding dimensions. Three-stage pretraining (monolingual, multilingual, web domain) outperforms existing models on Bing-Click and adapts to varying resource constraints.

Embeddings Fine-tuning Benchmarks

arXiv cs.LG·May 20

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

HELLoRA attaches LoRA modules only to the most frequently activated experts per layer in Mixture-of-Experts models, reducing trainable parameters by 84% on OlMoE and improving accuracy by 9.2%. Tested on OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE across mathematical reasoning, code generation, and safety alignment.
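
The selection step described above can be sketched as a per-layer top-k over routing counts: only the experts that receive the most tokens in each layer get an adapter. The function name, the count matrix, and the choice of k are hypothetical. How HELLoRA gathers activation statistics or sets the number of hot experts is not stated in the summary.

```python
def select_hot_experts(activation_counts, top_k):
    """For each layer, return the indices of the top_k most frequently routed experts."""
    hot = []
    for counts in activation_counts:
        ranked = sorted(range(len(counts)), key=lambda e: counts[e], reverse=True)
        hot.append(sorted(ranked[:top_k]))
    return hot

# Hypothetical routing counts: 2 layers with 4 experts each.
counts = [[120, 5, 300, 40],
          [10, 250, 15, 90]]
hot = select_hot_experts(counts, top_k=2)
```

With these counts, layer 0 keeps experts 0 and 2 and layer 1 keeps experts 1 and 3. Adapters would be attached to those experts only, which is where the reduction in trainable parameters comes from.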

Fine-tuning Benchmarks

arXiv cs.LG·May 20

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit is a reinforcement learning framework improving LLM handling of user criticism in scientific reasoning. It decomposes behaviors into four quadrants (Correction, Sycophancy, Robustness, Boundary) using transition-aware rewards. On ChemBench, TRQA, and EarthSE, ReCrit improves accuracy from 38.15% to 51.49% on Qwen3.5-4B.

Reinforcement learning Reasoning Qwen

arXiv cs.LG·May 20

Block-Based Double Decoders

Novel transformer architecture 'block-based double decoders' combining decoder-only training efficiency with encoder-decoder inference gains. Reduces KV-cache memory and per-token compute by at least 2/3 at inference, while maintaining full loss supervision and static sequence packing during training.

Reasoning Benchmarks Infrastructure

arXiv cs.LG·May 20

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

D-PACE is a new loss function for LLM inference acceleration via speculative decoding. It dynamically adapts per-position training weights based on tokens limiting acceptance, improving accepted length and wall-clock speedup with 2.3% training overhead and no architectural changes.

Reasoning Benchmarks Code generation

arXiv cs.LG·May 20

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Study of literary primitives in Llama 3.1 8B-Instruct and Gemma 2 9B-IT using sparse autoencoders. Four feature classes identified: naming-gates (affect tokens), self cluster (first-person register), stylistic modulators, compositional emotions. Llama achieves 27/27 emotion coverage (Cowen-Keltner taxonomy), Gemma 23/27. Validated via 5-LLM judge panel.

Llama Gemini Fine-tuning

arXiv cs.LG·May 20

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

PASC is a conformal prediction method guaranteeing simultaneous coverage across all stages in multi-stage NLP pipelines (NER → NED → entity typing, RAG, agent chains). On CoNLL-2003, PASC achieves 96.4% end-to-end coverage vs 93.4% for Bonferroni and 86.5% for independent CP, 1.7x faster, and maintains robustness under distribution shift (WNUT-17, WikiNEuRal).
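
The two baselines named above differ only in the miscoverage level given to each stage. Independent conformal prediction calibrates every stage at alpha, so errors can compound across stages, while Bonferroni calibrates each of K stages at alpha/K to protect joint coverage at the cost of looser thresholds. The sketch below shows both with standard split-conformal quantiles. It does not implement PASC itself, and the calibration scores are made up.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def stage_thresholds(stage_scores, alpha, bonferroni=True):
    """Per-stage thresholds; Bonferroni splits alpha evenly across the stages."""
    per_stage_alpha = alpha / len(stage_scores) if bonferroni else alpha
    return [conformal_quantile(s, per_stage_alpha) for s in stage_scores]

# Hypothetical nonconformity scores: 3 stages, 100 calibration points each.
calibration = [[i / 100 for i in range(1, 101)] for _ in range(3)]
independent = stage_thresholds(calibration, alpha=0.1, bonferroni=False)
joint = stage_thresholds(calibration, alpha=0.1, bonferroni=True)
```

On this toy data the independent thresholds sit at 0.91 per stage and the Bonferroni ones at 0.98, which shows why Bonferroni produces larger prediction sets.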

Evals Reasoning AI Agents

arXiv cs.CL·May 20

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom is a framework for building custom benchmarks to evaluate application-specific scientific capabilities in LLMs. It organizes scientific knowledge into ontology-grounded units, uses multi-model consensus voting to identify relevant units, and generates benchmarks from real data in chemistry and healthcare without expert annotation.

Benchmarks Evals Papers

arXiv cs.LG·May 20

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Automated framework for generating fine-grained evaluation benchmarks for foundation models. Multi-agent pipeline with solution-graph-driven strategy improves ground-truth solution reliability. Three benchmarks generated (ML, Corporate Finance, Personal Finance) show lower error rates than MMLU/GSM8K. Evaluation of 12 models reveals performance differences missed by existing benchmarks.

Benchmarks Evals Multi-agent

arXiv cs.CL·May 20

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Stepwise Confidence Attribution (SCA) diagnoses multi-step reasoning failures in closed-source LLMs by assigning step-level confidence from generated traces alone. Two methods: NIBS (non-parametric) and GIBS (graph-based). On mathematical reasoning and multi-hop QA, SCA reliably identifies error-prone steps and improves self-correction success by up to 13.5%.

Reasoning Evals Papers

arXiv cs.LG·May 20

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

Systematic error analysis of trajectory-based data attribution methods. Identifies optimizer mismatch (SGD vs AdamW) as dominant config-level error. Proposes AdamW-influence with 10-300% improvements in Spearman correlation across MLP, CNN, GPT-2, Llama 3.2-1B. Provides practical guidelines for data selection via K-step look-ahead framework.

Papers Evals Fine-tuning

arXiv cs.CL·May 20

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

CAIT is an open-source toolkit for syntactic parsing of child-adult interactions in CHILDES. It includes a dependency parser trained on UD-English-CHILDES, a POS tagger, and a construction tagger. The parser outperforms SpaCy and Stanza on this specialized domain.

Open source Benchmarks

arXiv cs.LG·May 20

Efficient Conditioning: Why Pseudo Observation Batch Bayesian Optimization Works When It Does Not

Theoretical study unifying batch selection methods in parallel Bayesian Optimization (Constant Liar, Kriging Believer, fantasy models). Authors identify efficient conditioning as key surrogate property of Gaussian Processes, proving generation of distinct points with separation of order l. Experimental validation on Hartmann6D, Ackley 8D, Levy10D and SVM hyperparameter tuning.

Benchmarks Papers
