MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

MIRAGE is a framework for mobile agents that learns continuous latent reasoning representations from visible textual reasoning traces. It transfers explicit reasoning into compact hidden states, reducing token generation by 3-5x on AndroidWorld while matching performance and improving the baseline by 10.2 points.

AI Agents Reasoning Code generation

arXiv cs.AI·Jun 4

SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

SCI-PRM is a process reward model trained on SCIPRM70K, a 70K-trajectory dataset of scientific reasoning interleaved with tool execution. It supervises tool selection, execution accuracy, and result interpretation. Tested on biology, chemistry, and physics, it improves test-time scaling and provides a dense reward signal for RL.

Reasoning Evals Reinforcement learning

arXiv cs.LG·Jun 4

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

dMX is a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment across LLM layers. Tested on Llama, Qwen3, and SmolLM2 using the MXFP standard (Open Compute Project), it optimizes layer formats continuously then discretizes via annealing, outperforming KL-divergence heuristics on WikiText-2 and zero-shot reasoning benchmarks.

Llama Qwen Benchmarks
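
The "optimize continuously, then discretize via annealing" step in the dMX summary can be pictured with a temperature-annealed softmax over candidate formats. This is only a generic sketch of that idea, not the paper's procedure: the per-layer scores in `layer_logits` are hypothetical, and the candidate set is assumed to be the MXFP4/MXFP6/MXFP8 element formats.

```python
import math

# Assumed candidate formats and their element bit-widths.
FORMAT_BITS = {"MXFP4": 4, "MXFP6": 6, "MXFP8": 8}


def soft_assignment(logits, temperature):
    """Softmax over format scores; a lower temperature gives a sharper choice."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_bits(logits, temperature):
    """Expected bit-width of one layer under the soft assignment."""
    probs = soft_assignment(logits, temperature)
    return sum(p * bits for p, bits in zip(probs, FORMAT_BITS.values()))


# Hypothetical learned scores for one layer (illustration only).
layer_logits = [0.2, 1.0, 0.5]

# Annealing: as the temperature drops, the mixture collapses onto one format.
for temperature in (1.0, 0.1, 0.01):
    print(temperature, expected_bits(layer_logits, temperature))
```

At high temperature the layer behaves like a blend of formats, which keeps the bit-width differentiable. At low temperature the blend approaches the single highest-scoring format, here the 6-bit one.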

arXiv cs.CL·Jun 4

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

MM-BizRAG improves multimodal retrieval-augmented generation for complex enterprise documents. The system explicitly extracts document structure via orientation-specific ingestion pipelines (layout-aware parsing for vertical reports, holistic page representations for horizontal slide decks), then assembles multimodal contexts at inference. Up to 32% gains on SlideVQA and FinRAGBench-V without fine-tuning.

RAG Vision Benchmarks

arXiv cs.CL·Jun 4

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Systematic study of positional bias in multi-video summarization with MLLMs. Benchmark on ActivityNet and News videos (2-4 inputs). Evaluation of 9 models (open-source and proprietary) using three metrics: Coverage, Directional Positional Bias, Middle-Edge Gap. Finding: positional effects are domain- and model-dependent; increasing visual budget does not uniformly remove imbalance.

Vision Benchmarks Evals

arXiv cs.CL·Jun 4

Parameter-Efficient Fine-Tuning with Learnable Rank

LR-LoRA introduces learnable adapter rank during training, replacing LoRA's fixed-rank constraint. Rank varies per layer: attention and MLP layers show systematically different rank preferences. Outperforms LoRA and PEFT baselines on language understanding and commonsense reasoning benchmarks.

Fine-tuning Papers Benchmarks

arXiv cs.CL·Jun 4

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

Cartridges at Scale (CAS) trains modular, reusable KV caches for large document collections, eliminating costly prefilling. The framework dynamically manages hundreds of per-document cartridges with GPU/storage rotation, scaling to over one million tokens. Performance: +10-31 points vs monolithic cartridge, 3-4x fewer tokens than conventional RAG.

RAG Reasoning Papers

arXiv cs.LG·Jun 4

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Systematic study of QKV variants in transformers. Authors test three projection sharing constraints (Q-K=V, Q=K-V, Q=K=V) on synthetic tasks, vision, and language models (300M-1.2B parameters). Q-K=V reduces KV cache by 50% with 3.1% perplexity degradation. Combined with GQA/MQA, achieves 87.5-96.9% cache reduction for edge inference.

Reasoning Benchmarks Open source
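
The cache figures in the QKV summary are consistent with simple tensor counting. The sketch below assumes the 50% saving comes from storing one tensor per KV head instead of separate K and V, and that GQA/MQA shrinks the number of KV heads. The head counts are hypothetical, chosen because they reproduce the quoted range; the summary does not give the paper's actual configurations.

```python
def kv_cache_fraction(num_query_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of a standard multi-head KV cache that remains.

    Baseline: one K and one V tensor per query head (2 * num_query_heads).
    """
    tensors_per_kv_head = 1 if share_kv else 2
    return (tensors_per_kv_head * num_kv_heads) / (2 * num_query_heads)


def reduction_percent(num_query_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Cache reduction relative to the multi-head baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(num_query_heads, num_kv_heads, share_kv))


print(reduction_percent(32, 32, True))  # shared K=V only: 50.0
print(reduction_percent(32, 8, True))   # plus GQA with 4 query heads per KV head: 87.5
print(reduction_percent(16, 1, True))   # plus MQA over 16 query heads: 96.875
```

Under these assumptions the reduction is 1 - 1/(2g), where g is the number of query heads per KV head, so g = 4 gives 87.5% and g = 16 gives about 96.9%.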

arXiv cs.AI·Jun 4

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

CHARM detects and mitigates cascading hallucinations in multi-step RAG pipelines, where early errors propagate and amplify across reasoning stages. Four-component framework: stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, cascade resolution. 89.4% detection rate on HotpotQA/MuSiQue/2WikiMultiHopQA, 82.1% error propagation reduction.

RAG AI Agents Reasoning

arXiv cs.LG·Jun 4

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

LiftQuant introduces continuous bit-width quantization for LLMs via a "lift-then-project" mechanism that projects a 1-bit lattice from a higher-dimensional lifted space. Effective bit-width is controlled by the dimension ratio, enabling quasi-continuous tuning. A 70B model compressed to 2.4 bits on 24GB GPU outperforms existing 2-bit models.

Benchmarks Open source
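
A back-of-the-envelope check shows why 2.4 bits per weight matters for a 24GB card in the LiftQuant summary. The sketch counts weight storage only and assumes decimal gigabytes; activations, KV cache, and any quantization metadata are ignored, so real headroom would be smaller.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# 70B parameters at several bit-widths; only the 2.4-bit case is from the summary.
for bits in (16, 4, 2.4, 2):
    print(f"{bits} bits -> {weight_memory_gb(70e9, bits):.1f} GB")
```

At 2.4 bits the weights come to roughly 21 GB, which fits under 24 GB, whereas 4 bits would need about 35 GB.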

arXiv cs.CL·Jun 4

GENEB: Why Genomic Models Are Hard to Compare

GENEB is a large-scale diagnostic benchmark evaluating 40 genomic foundation models across 100 tasks in 13 functional categories under a unified probing protocol. Analysis shows aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides modest and inconsistent gains, and architectural alignment frequently outweighs parameter count.

Benchmarks Papers Evals

arXiv cs.LG·Jun 4

LLM Compression with Jointly Optimizing Architectural and Quantization choices

Differentiable NAS framework for LLM compression jointly optimizing architecture and mixed-precision quantization of linear layers. Results: 1.4x faster inference than sequential NAS-then-quantization baseline, or 6% higher average accuracy across seven reasoning tasks at equivalent latency.

Reasoning Benchmarks Fine-tuning

arXiv cs.LG·Jun 4

Edge of Stability Selectively Shapes Learning Across the Data Distribution

The study shows that edge of stability (EoS) selectively redistributes learning across data subgroups. Two conditions enable a group to benefit: alignment of its aggregate gradient with the top Hessian eigenvector, and sustained non-vanishing gradient magnitude. Under cross-entropy loss, gradient saturation favors output-outliers while suppressing progress on confidently classified groups.

Papers Reinforcement learning Evals

arXiv cs.CL·Jun 4

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

Fine-tuned RoBERTa outperforms zero-shot LLMs (Claude Haiku, Gemini Flash) on Reddit misinformation response classification: 0.62 macro-F1 vs 0.50. Model scaling (Llama-3-70B vs 8B) provides no benefit. Safety alignment in commercial models degrades belief detection (0.17 for Claude Sonnet).

Fine-tuning Benchmarks AI safety

arXiv cs.CL·Jun 4

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

LazyAttention is an attention mechanism optimizing key-value caching for RAG and in-context learning by deferring positional encoding. It enables zero-copy KV reuse at arbitrary positions, reducing time-to-first-token by 1.37× and increasing throughput by 1.40× versus Block-Attention.

RAG Reasoning Infrastructure
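
Deferring positional encoding can be illustrated with standard rotary embeddings (RoPE): if keys are cached without rotation, the same cached key can be placed at any position when it is read. The sketch below is a plain-Python illustration of that general property, not the LazyAttention kernel. The vectors and positions are made up, and the use of RoPE itself is an assumption, since the summary does not name the encoding.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # frequency for this pair
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def score(query, query_pos, raw_key, key_pos):
    """Attention logit with the key rotated at read time, not at cache time."""
    q = rotate(query, query_pos)
    k = rotate(raw_key, key_pos)
    return sum(a * b for a, b in zip(q, k))


raw_key = [0.3, -1.2, 0.8, 0.5]  # cached once, position-free (hypothetical values)
query = [1.0, 0.4, -0.7, 0.2]

# The same cached key reused at two different absolute positions.
near = score(query, 10, raw_key, 7)
far = score(query, 510, raw_key, 507)
print(near, far)  # equal up to floating-point error: only the offset of 3 matters
```

Because the logit depends only on the relative offset, a position-free cache entry does not have to be recomputed or copied when a retrieved chunk lands at a new place in the prompt.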

arXiv cs.LG·Jun 4

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Recover-LoRA extends a data-free accuracy recovery method to 2-bit quantized LLMs. A mixed-precision strategy selectively quantizes MLP gate/up layers to W2 while keeping others at W4, achieving 7.5–23.3% throughput gains. Low-rank adapters trained via logit distillation on synthetic data recover 80–95% accuracy on Qwen3-4B using only 10k samples.

Fine-tuning Benchmarks

arXiv cs.AI·Jun 4

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

SGDR, an online skill learning method, enables web agents to reuse sub-procedures at each execution step. Unlike static approaches, SGDR dynamically retrieves skills based on current webpage state and task goal. On WebArena, it achieves 37.5% success with GPT-4.1 and 24.3% with Qwen3-4B, outperforming strongest baselines by 10.6% and 10.0% respectively.

AI Agents Benchmarks Code generation

arXiv cs.AI·Jun 4

Can Generalist Agents Automate Data Curation?

Curation-Bench evaluates whether generalist AI agents can automate training data curation. Agents reach published baselines within ten iterations but tend toward local policy variants. With scaffolding requiring method citation and adaptation, an agent autonomously composes a data-selection policy outperforming strong baselines at one-tenth their data budget.

AI Agents Benchmarks Code generation

arXiv cs.LG·Jun 4

Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

Pseudospectral theory for coupled gradient descent (bilevel optimization, two-time-scale stochastic approximation, adversarial training). Kreiss constant bound K(J) ≤ 2/(1-γ) + ‖C‖/(4(1-γ)) for block-triangular Jacobians. Finite-horizon iteration complexity O(K(J)² log(1/δ)). Experiments on linear-quadratic problems and neural network training.

Reinforcement learning Benchmarks Papers
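
To get a feel for the Kreiss constant bound quoted in the pseudospectral summary, it can be evaluated directly. The sketch takes the formula exactly as the summary states it. The summary does not define γ, so the restriction to [0, 1) is an assumption, and the input values are hypothetical.

```python
import math


def kreiss_bound(gamma: float, c_norm: float) -> float:
    """Evaluate 2/(1-gamma) + ||C|| / (4(1-gamma)) as quoted in the summary."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is assumed to lie in [0, 1)")
    if c_norm < 0.0:
        raise ValueError("a norm cannot be negative")
    gap = 1.0 - gamma
    return 2.0 / gap + c_norm / (4.0 * gap)


def iteration_scale(gamma: float, c_norm: float, delta: float) -> float:
    """K(J)^2 * log(1/delta), with the big-O constant omitted."""
    return kreiss_bound(gamma, c_norm) ** 2 * math.log(1.0 / delta)


# Hypothetical inputs: gamma = 0.5 and coupling norm ||C|| = 4.
print(kreiss_bound(0.5, 4.0))           # 2/0.5 + 4/(4*0.5) = 6.0
print(iteration_scale(0.5, 4.0, 1e-3))  # grows as gamma approaches 1
```

Both terms scale with 1/(1-γ), so the bound, and with it the K(J)² factor in the iteration count, blows up as γ approaches 1.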

arXiv cs.CL·Jun 4

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

LDARNet, a 120M-parameter genomic foundation model, introduces learnable adaptive tokenization without supervision to replace fixed schemes (k-mers, BPE). Combining BiMamba-2 with local attention and bidirectional routing, it achieves state-of-the-art on 5 histone modification tasks and outperforms models 20× larger at matched compute.

Papers Benchmarks Fine-tuning

arXiv cs.LG·Jun 4

When Autoregressive Consistency Hurts Safety Alignment

Researchers show that LLM safety alignment is fragile because it is concentrated on early tokens. The autoregressive consistency mechanism allows attacks to insert harmful sequences at any position and sustain them. They propose adversarial safety alignment with random worst-insertion training to break this consistency.

AI safety Alignment Reasoning

arXiv cs.CL·Jun 4

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

Large-scale study across 5 models (7B-72B parameters), 10 biomedical QA datasets, and 4 retrieval methods. RAG yields only +1-2 point improvements over no-retrieval baseline. Backbone model choice has larger impact than retriever or corpus selection. Bottleneck: model's limited ability to effectively use retrieved evidence.

RAG Benchmarks Reasoning

arXiv cs.CL·Jun 4

POLARIS: Guiding Small Models to Write Long Stories

POLARIS is a reinforcement optimization method (GRPO) to improve long-form text generation in small models. Applied to Qwen3.5-9B with 1.4K prompt-story pairs and 4 A100 GPUs, it uses a frontier LLM judge and human-reference injection. POLARIS-9B rivals models 3× larger and generalizes to stories 3× longer than training data.

Qwen Reinforcement learning Fine-tuning

arXiv cs.AI·Jun 4

AIP: A Graph Representation for Learning and Governing Agent Skills

AIP (Agent Instruction Protocol) models agent skills as directed execution graphs with deterministic nodes and typed edges validated by YAML schema. On 27 SkillsBench tasks, Claude Sonnet improves from 0.60 to 0.71 mean reward and 53% to 67% pass rate. The graph structure enables precise failure diagnosis and iterative skill improvement.

AI Agents Claude Anthropic

arXiv cs.AI·Jun 4

Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

VRPCoder, an 8B model, translates natural-language vehicle routing scenarios into Gurobi code. Authors propose constraint injection to verify constraints are neither silently omitted nor spuriously added. With GRPO, VRPCoder reaches 93% Pass@1, outperforming Claude-Sonnet-4.5 by 28 points on VRP benchmarks.

Code generation Reinforcement learning Benchmarks

arXiv cs.AI·Jun 4

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Researchers demonstrate that aligned LLMs remain vulnerable to token injections at any generation step, not just early tokens. Alignment with internal refusal directions does not predict robustness. Training directly on perturbed generation trajectories improves resistance to mid-sequence attacks.

AI safety Alignment Reasoning

arXiv cs.AI·Jun 4

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

FALSIFYBENCH evaluates inductive reasoning in 12 LLMs through rule discovery games inspired by the Wason 2-4-6 task. Reasoning models outperform instruction-tuned models, but none approach optimal performance. Success depends primarily on capacity for negative testing and hypothesis falsification.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 4

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

CAPR is a reinforcement learning algorithm for diffusion language models that leverages the denoising trace to generate fine-grained supervision signals without full tree expansion cost. The approach reduces rollout cost to 0.75x flat methods and 0.6x tree methods, achieving SOTA on 4x4 Sudoku, Countdown, GSM8K, and Math500.

Reinforcement learning Reasoning Benchmarks

The Decoder·Jun 3

Google Deepmind's Gemma 4 12B squeezes multimodal AI onto a laptop with just 16 GB of RAM

Google DeepMind releases Gemma 4 12B, an open-source multimodal model (text, images, audio) that runs on laptops with 16 GB of RAM. Performance nearly matches the 26B model, and the Apache 2.0 license permits commercial use.

DeepMind Open source Vision

arXiv cs.LG·Jun 3

Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

Theoretical study of neural network loss landscape geometry. Authors prove a Spectral Alignment Decomposition explaining why curvature exponent α varies across layer types (α≈2 convolutions, α≈1 transformer attention, α<1 MLP). Empirical validation on 93 layers, 5 architectures, 3 datasets with ~2% median error.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 3

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows reward hacking stems from finite-particle plug-in estimation of the Doob h-function in practical implementations. Authors propose a closed-form reward damping schedule and validate on Gaussian targets, 2D checkerboard, and FLUX.1 text-to-image generation.

Reinforcement learning Reasoning Papers

arXiv cs.CL·Jun 3

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Factorial study of 4 open-source LLMs rating clinical decisions in type 2 diabetes pharmacotherapy. LLMs as AI raters score 74–78 points under rubric-free protocol vs 7.69–49.64 points under anchored Gold Rubric. Rubric amplifies discrimination between CDSS models (1.76–5.10×) and reveals behavioral variation suppressed without rubric.

Evals Benchmarks AI safety

arXiv cs.CL·Jun 3

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

LEDE, an offline reinforcement learning framework, optimizes LLM inference by dynamically selecting exit layer and speculation length based on local sequence context. On Llama-2 and Llama-3, it achieves 2.0×–2.7× speedup over autoregressive decoding, +17% over static speculative baselines.

Llama Reinforcement learning Code generation

arXiv cs.CL·Jun 3

The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation

Multi-agent LLM systems lose up to 72% of issue-critical facts during deliberation, creating a 'deliberative illusion'. DelibTrace measures factual attrition and stance homogenization. Agents converge toward consensus while forgetting essential elements needed to interpret the problem.

Multi-agent AI Agents Evals

arXiv cs.CL·Jun 3

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

Regret Pre-training introduces a self-supervised framework based on LUPI using dual-view architecture generating Student (causal) and Teacher (future-conditioned) distributions. On OLMoE-1B-7B after 4B tokens, GlobalRegret and LocalRegret achieve 33.9% and 32.2% average accuracy vs 30.2% baseline, with 18.1pp gain on BoolQ. No additional parameters.

Papers Reasoning Fine-tuning

arXiv cs.LG·Jun 3

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

ReLoRA is an efficient re-adaptation framework for continuously evolving LLM services. It uses Bayesian optimization to initialize LoRA adapters compatible with base-model updates, then fine-tunes with scheduled regularization. Results: up to 8.9× reduction in time-to-readiness and up to 4.6% accuracy improvement.

Fine-tuning Reasoning Benchmarks

arXiv cs.AI·Jun 3

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

SkillDAG models inter-skill relationships as a typed directed graph for dynamic LLM agent skill selection at inference time. On ALFWorld and SkillsBench with MiniMax-M2.7, it achieves 67.1% success and 27.3% reward, exceeding Graph-of-Skills baselines by +12.8 and +8.6 points. The graph self-evolves during execution via a propose-then-commit protocol, accumulating structure across episodes.

AI Agents Reasoning Benchmarks

arXiv cs.AI·Jun 3

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

CLEAR is an optimal budget allocation method for LLM inference grounded in economic theory. Using a shifted-surge utility function and global shadow pricing, CLEAR performs rational abandonment and reallocates resources from insolvent to solvable queries. Results: 3x improvement in global accuracy vs uniform allocation under resource scarcity.

Reasoning Benchmarks Infrastructure

arXiv cs.CL·Jun 3

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

SEA-Embedding is an open and reproducible text-embedding pipeline for Southeast Asian languages trained exclusively on public data. The study examines three core factors: data composition, training objective, and base encoder initialization. Achieves state-of-the-art results on SEA-BED.

Embeddings Open source Papers

arXiv cs.AI·Jun 3

Decomposing how prompting steers behavior

Study of representational geometry to understand how prompts reshape behavior in LLMs and VLMs. Nested decomposition framework testing translation, rigid transformation, scaling, affine and nonlinear maps on 3 LLMs, 3 VLMs and 6 datasets. Finding: cross-dimensional linear mixing (affine transformation) is the key mechanism for representational reorganization toward task structure.

Prompt engineering Reasoning Papers
