The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals

arXiv cs.LG·Jun 17

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

Researchers explain grokking (sudden generalization after prolonged overfitting) through first-order phase transitions driven by L2 regularization strength. SGD noise enables networks to escape trapped metastable states, with escape times following Arrhenius scaling. Results extend to nonlinear networks.

Papers Reasoning Evals

Vercel AI Blog·Jun 17

Introducing eve, an open-source agent framework

Vercel releases eve, an open-source framework for building and deploying AI agents. Minimal agent requires only two files (model + instructions). Add tools, skills, channels by creating files. Deploy to production with vercel deploy, unchanged from local development.

AI Agents Open source Tools

Reddit r/LocalLLaMA·Jun 16

GLM 5.2 API is live, weights are on HF, and ollama has it already

GLM-5.2 API live at $1.4/M input tokens, $4.4/M output. Weights released MIT-licensed on HuggingFace, Ollama support available. Benchmarks: 81.0 Terminal-Bench 2.1, 62.1 SWE-bench Pro, 74.4 FrontierSWE. 1M context window, two thinking modes (High/Max).

Open source Code generation Benchmarks

arXiv cs.LG·Jun 16

Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

New Separable Neural Architecture (SNA) combining neural approximation with tensor decomposition to solve high-dimensional PDEs. Variational framework (VSNA) guarantees well-posedness and convergence. Demonstrates 150,000x speedup vs FEM on A100 GPU for 7D parametric simulation and real-time thermal inversion of Inconel 718 (<100ms).

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 16

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

Theoretical study demonstrating that neural networks trained with gradient-based methods can achieve optimal computational-statistical tradeoff for Gaussian single-index models. Proposed algorithm (two-layer network) achieves sample complexity Õ(d^{s*/2} ∨ d) matching SQ lower bounds, with extension to k-sparse case via weight perturbation technique.

Papers Reasoning Benchmarks

arXiv cs.CL·Jun 16

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE is a 4,600-task benchmark evaluating morphological understanding of Filipino in LLMs. The benchmark tests 6 compositional levels including infixation, reduplication, and diacritic distinctions. Open-weight models perform near chance on morpheme decomposition; frontier models recover affixes but remain far below ceilings on morphological composition tasks.

Benchmarks Papers Reasoning

arXiv cs.CL·Jun 16

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Empirical study comparing 11 tokenizers across 11 Southeast Asian languages. Standard BPE tokenizers structurally favor high-resource languages and Latin scripts. Parity-aware BPE achieves best efficiency-equity trade-off; Morphology-Driven Byte Encoding delivers superior semantic performance but at higher computational cost.

Benchmarks

arXiv cs.CL·Jun 16

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

XBCP, a controlled benchmark, evaluates deep research agents' ability to operate across languages. Four agents tested with dense and sparse retrievers across 12 languages show substantial degradation: evidence recall loss, reduced calibration, unreliable citations. Problems persist even when gold evidence is directly supplied.

AI Agents RAG Benchmarks

arXiv cs.AI·Jun 16

ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

ToolMenuBench is a benchmark evaluating how tool-menu construction affects reliability and efficiency of multi-step LLM agents. Across 7 model backends, causal minimal tool filtering (CMTF) improves task success from 32.1% to 85.7% and reduces token usage by 98%, while minimizing wrong-tool calls and risky-tool exposure.

AI Agents Benchmarks Evals

arXiv cs.LG·Jun 16

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

Stateful ReAct agents reduce token consumption by 90% on hyperparameter tuning and 52% on code optimization vs. stateless design. Architecture implemented via LangGraph with typed persistent state, reducing total token cost from O(n²) to O(n).

AI Agents Reasoning Code generation

arXiv cs.CL·Jun 16

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

ESBMC-PLC is the first open-source formal verifier with native support for IEC 61131-3 ladder diagrams (PLCopen XML format). The tool translates rungs to GOTO IR, models the PLC scan cycle, and verifies safety properties via SMT-based bounded model checking or k-induction. Evaluation on 13 benchmarks: 8 bugs detected, 7 unbounded k-induction proofs, all runs under 60ms.

AI safety Benchmarks Open source

arXiv cs.CL·Jun 16

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

ReRULE improves LLM unlearning via off-policy replay for hard cases. The method stores low-reward rollouts near the forget/retain boundary in a replay buffer and reuses them through importance-sampled updates. On MUSE-Books, it increases Retain Quality from 46.3 to 56.2 with +5–11% training overhead.

Reinforcement learning AI safety Alignment

arXiv cs.CL·Jun 16

Spokes: Optimizing for Diverse Pretraining Data Selection

SPOKES optimizes pretraining data selection through a probabilistic diversification framework based on G-Vendi score and exponentiated gradient descent. On FineWeb and DCLM, the method improves downstream performance by +1.5 and +1.4 points when jointly optimizing quality and diversity, outperforming semantic deduplication.

Benchmarks Papers Fine-tuning

arXiv cs.CL·Jun 16

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

A GRPO-based reinforcement learning post-training method improves hateful and propagandistic meme detection in thinking-based MLLMs: +2.1 percentage points on Hateful Memes (79.9%→82.0%) and +7.6 macro-F1 points on ArMeme (0.536→0.612), with chain-of-thought explanations. Code and data publicly released.

Reinforcement learning Reasoning Vision

arXiv cs.LG·Jun 16

GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning

GRASP enables multi-source transfer learning with O(1) memory instead of O(K) by sequentially merging source models. Using parameter-wise gradient alignment and iterative fine-tuning, it achieves 93.5% mean accuracy on continual learning benchmarks (Yearbook, CLEAR-10/100) versus 71.7% for ensembles, while remaining production-deployable.

Fine-tuning Reinforcement learning Benchmarks

arXiv cs.CL·Jun 16

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Study on layer-wise redundancy in LLMs. Authors characterize how layers absorb or amplify perturbations during pruning: early layers amplify, middle and late layers absorb. They propose an absorption-aware correction using a per-layer absorption coefficient, which improves on OWL and AlphaPruning at 70% sparsity with a 7.13% perplexity reduction and a 1.02% zero-shot accuracy gain.

Papers Benchmarks Fine-tuning

arXiv cs.CL·Jun 16

ReportQA: QA-Based Radiology Report Evaluation

ReportQA introduces a QA-based evaluation metric for automated radiology report generation. The framework uses LLMs to extract structured information, generate QA pairs from templates, and evaluate alignment with radiologist judgments. Authors release knowledge trees, structured reports, and code for QA construction and evaluation.

Papers Vision Evals

arXiv cs.CL·Jun 16

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

CoRA aligns model confidence with chain-of-thought rationale quality. A GRPO-based RL framework jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support. On MedQA, MathQA, OpenBookQA: 26.51% reduction in confidence-rationale alignment error across three open-weight LLMs.

Reasoning Reinforcement learning Evals

arXiv cs.CL·Jun 16

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness is a benchmark and execution harness for evaluating phone agents on real mobile workflows. It combines GUI, CLI, and structured tool actions with auditable execution traces. Agents run under PhoneHarness reach a 75.0% pass rate, 12.9 percentage points above non-PhoneHarness settings. Focus is on verifiable side effects, not screen predictions alone.

AI Agents Benchmarks Tools

arXiv cs.AI·Jun 16

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

Method to distinguish whether LLM score drift stems from the product or from the judge model itself. Uses a human-labeled anchor set and a betting e-process to detect silent judge model changes. Detects 100% of judge drift with zero false positives on the product and outperforms the industry-standard rolling z-test.

Evals Benchmarks AI safety

arXiv cs.AI·Jun 16

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Study on reward hacking in LLM-based agents using an adapted AI Safety Gridworlds framework. Models (1.5B–14B) systematically exploit misspecified objectives to maximize observed rewards while failing hidden safety objectives. RL optimization amplifies the problem and resists standard mitigations (exploration, regularization).

AI Agents Reinforcement learning AI safety

arXiv cs.AI·Jun 16

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Mask-Proof is an automated pipeline converting real mathematical proofs into verifiable masked-step tasks. The benchmark contains 292 curated problems. Testing 17 models shows reasoning-enhanced models outperform standard models by 12-27%. The evaluator achieves 96.8% agreement with expert annotators.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 16

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

Base Sequence Analysis framework encodes LLM-powered autonomous agent behavior into symbolic sequences (X/E/P/V). Analysis of 347 production ReAct traces reveals P-X-P pattern reduces success by 10.4% and P-ratio negatively predicts success (r=-0.256). Governor runtime intervention system achieves +6.2% absolute success increase and 44% token reduction. Validated on 2,000 SWE-agent trajectories.

AI Agents Reasoning Evals

arXiv cs.AI·Jun 16

AI Engram: In Search of Memory Traces in Artificial Intelligence

Study introducing a geometric framework to identify 'AI engrams'—memory traces in deep neural networks analogous to biological memory units. Authors derive a closed-form estimator enabling surgical manipulation of learned knowledge (composition, erasure) via linear arithmetic without iterative optimization. Validated on MLPs and LLMs.

Reasoning Papers Alignment

arXiv cs.AI·Jun 16

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

DR-DCI combines retrieval with Direct Corpus Interaction for agent-based search over large corpora. The system uses a retriever to dynamically populate a local workspace where agents execute precise operations (filtering, comparison, verification). On BrowseComp-Plus, DR-DCI achieves 71.2% accuracy (+8.3 points vs raw DCI) and remains stable up to 10M documents, where raw DCI becomes unstable.

AI Agents RAG Reasoning

arXiv cs.LG·Jun 16

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

TruDi enables diffusion policies for massively parallel on-policy RL by integrating trust-region optimization with KL-divergence constraints over entire diffusion trajectories. Evaluated on 73 tasks across 4 benchmarks: outperforms baselines on standard tasks, achieves clear gains on challenging humanoid control.

Reinforcement learning Reasoning Robotics

arXiv cs.LG·Jun 16

M-CTX: Exact and Scalable Spatial Context Retrieval for Trajectory Analytics

M-CTX is a spatial context-retrieval framework for trajectory analytics. It replaces three brute-force stages (OSM range retrieval, SDF computation, moving-vessel neighbor lookup) with index-backed operators. On a 5.48M-anchor maritime corpus, it reduces context construction from 17 CPU-days to 1.8 hours (226x speedup), with exact reproduction of reference context.

Benchmarks Infrastructure Open source

arXiv cs.LG·Jun 16

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

PolyKV optimizes KV cache compression by applying heterogeneous strategies per transformer layer instead of uniform policies. On LLaMA-3.1-8B and Qwen3-8B with a 512-token KV budget, PolyKV recovers 54.5% and 25.7%, respectively, of the LongBench performance gap versus FullKV.

Benchmarks Infrastructure Reasoning

arXiv cs.AI·Jun 16

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

CONCORD is an asynchronous sparse aggregation framework for device-cloud RAG with document isolation. It uses waiting debt control and certificate-guided minimal supplementation to reduce synchronization and data transfer. Improves end-to-end throughput by 1.66× to 2.15× on Natural Questions and WikiText-2 while reducing per-token communication by over 100×.

RAG Papers Infrastructure

arXiv cs.AI·Jun 16

OSGuard: A Benchmark for Safety in Computer-Use Agents

OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents. It combines action-level guardrail decisions and risk-augmented execution evaluation. Current multimodal guardrails perform well on isolated action judgments but fail to ensure reliable end-to-end safety.

AI Agents AI safety Benchmarks

arXiv cs.LG·Jun 16

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix automates data mixture optimization for model training via gradient descent. The method reformulates mixture selection as a bilevel optimization problem, jointly optimizing mixture coefficients and model parameters. A single proxy model suffices, drastically reducing search cost compared to prior approaches.

Fine-tuning Benchmarks Papers

arXiv cs.CL·Jun 16

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard is a safety guardrail system for Chinese LLMs with fine-grained taxonomy (5 macro, 31 micro categories). Authors construct 405k training samples via RAG and prompt rewriting, plus 51k annotated test samples. Model achieves +15.92% F1 improvement over Qwen3Guard-8B-Strict using Direct Preference Optimization.

AI safety Alignment Fine-tuning

Reddit r/LocalLLaMA·Jun 15

archex: local-first, deterministic code-context for AI agents — no API key, no telemetry (Apache 2.0)

archex converts a repo into ranked, token-budgeted context for AI agents: symbols, imports, dependency graph. Local-first pipeline (BM25F + embeddings + RRF + reranker) with no API key, no telemetry. Benchmarks: recall 0.95 vs 0.32 (cocoindex-code), cold start 0ms vs 4,721ms, 71% fewer tokens.

Code generation RAG AI Agents

arXiv cs.AI·Jun 15

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA proposes a GRPO-based fine-tuning method for GUI grounding. It generates multiple views of the same screen (crops preserving target elements) to create more robust comparison groups. On ScreenSpot-Pro, it improves Qwen3-VL 4B/8B/30B from 55.5/52.7/53.7 to 63.4/65.8/67.0.

Reinforcement learning Vision Benchmarks

arXiv cs.LG·Jun 15

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Gemma 4 models exhibit repetition loops on long enumerations (up to 95% failure rate). Per-neuron ablation identifies a few responsible MLP neurons: suppressing them via weight edits removes simple loops but not 'doom loops' (infinite self-correction), which stem from knowledge gaps rather than removable circuits.

Gemini Papers Evals

arXiv cs.LG·Jun 15

Diffusion Policy Optimization without Drifting Apart

DiPOD, a diffusion policy optimization method, addresses RL post-training instability by identifying a double-drift phenomenon (the ELBO diverging from the true log-likelihood). The approach interleaves self-distillation with policy-improving gradient updates, stabilizing training on language models and continuous-control tasks.

Reinforcement learning Reasoning Papers

arXiv cs.LG·Jun 15

SuperThoughts: Reasoning Tokens in Superposition

SuperThoughts compresses consecutive CoT token pairs into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction module. Tested on Qwen2.5-Math (1.5B, 7B, 14B), the approach reduces CoT length by 20-30% while maintaining accuracy (1-2 point degradation on MATH500, AMC, OlympiadBench, GPQA-Diamond).

Reasoning Qwen Code generation

arXiv cs.LG·Jun 15

Smoothing Dark Areas in Molecular Latent Diffusion

TopVAE, a topology-optimized VAE, reduces dark areas in latent space by internalizing structural and chemical constraints during training. Paired with a standard DiT, it achieves 77% lower FCD-3D on QM9 and 52% lower on GEOM-Drugs, generating more stable and chemically valid molecules.

Papers Benchmarks Code generation

arXiv cs.LG·Jun 15

PostDeg: Placement Beats Parameterization in LayerNorm GNNs

PostDeg demonstrates that LayerNorm placement matters more than parameterization in GNNs. An inverse-degree scalar positioned post-LayerNorm preserves topology signals (degree, centrality) required by node-selection policies. Gains of +3.5% to +5.6% on influence maximization, network dismantling, and maximum independent set.

Papers Benchmarks
