A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

Method for expanding LLMs to new languages without a costly alignment phase. It converts a dense model into a Mixture-of-Experts architecture with language-specific experts, then transfers alignment capabilities via post-training delta fusion. Improves performance on new languages while preserving original abilities.

Fine-tuning

arXiv cs.AI·May 19

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Theoretical investigation of loss of plasticity (LoP) in deep learning under non-stationary environments. Authors identify two primary mechanisms: activation saturation and representational redundancy creating traps in parameter space. Paradox: properties promoting static generalization (low-rank representations) worsen LoP in continual learning.

Reinforcement learning Papers Alignment

arXiv cs.AI·May 19

Are Sparse Autoencoder Benchmarks Reliable?

Critical audit of SAEBench, the de-facto standard evaluation suite for sparse autoencoders (SAEs). TPP and SCR metrics fail multiple reliability tests and should not be used. Other metrics show higher reseed noise and lower discriminability than assumed. Only sae-probes demonstrates acceptable reliability, but struggles to distinguish architecture variants.

Evals Benchmarks Papers

arXiv cs.AI·May 19

OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

OPERA is a retrieval-augmented generation (RAG) architecture coupling planning and execution via reinforcement learning. A Goal Planning Module decomposes complex questions into sub-goals, executed by a Reason-Execute Module with specialized components for reasoning and retrieval. Training uses MAPGRPO, a GRPO variant. Superior results on complex multi-hop benchmarks.

RAG Reinforcement learning Reasoning

arXiv cs.AI·May 19

UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

UniER is a unified benchmark for personalized exercise recommendation, comparing two paradigms: ILER (item-level) and PLER (path-level). The framework introduces Weighted Cognitive Gain (WCG) metric and evaluates 18 methods across 9 datasets. Results show systematic dominance of PLER and reveal ILER's pedagogical failures under extreme sparsity and noise.

Benchmarks Evals Papers

arXiv cs.AI·May 19

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

A randomized controlled experiment shows LLM access significantly increases average performance, but gains are unevenly distributed. AI Interaction Competence (the ability to elicit, filter, and verify outputs), rather than GPA, predicts who benefits. A scaffolding intervention (conceptual maps) reduces outcome variance.

Reinforcement learning Evals Alignment

arXiv cs.AI·May 19

Global Automation Atlas

Study of 124 countries covering 99% of global GDP. A task-based measure of automation exposure ranges from 3.3% in South Sudan to 61.6% in China. The study distinguishes labor-substituting from labor-augmenting automation: AI exposure skews toward substitution in low-income countries and toward augmentation in high-income countries. Women are disproportionately exposed to substitution.

Benchmarks Papers Regulation

arXiv cs.AI·May 19

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

New credit assignment method for reinforcement learning with LLMs. IBPO (Implicit Behavior Policy Optimization) uses counterfactual trajectories to convert sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.

Reinforcement learning Reasoning Code generation

arXiv cs.AI·May 19

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

TinySAM 2 compresses SAM 2 for efficient video segmentation. It combines a memory quality management mechanism with joint spatial-temporal token compression. Achieves 90% of SAM 2.1 performance with 7% of the memory tokens and 3% of the training data. Reduces parameters, computational load, and deployment costs.

Vision Video generation Benchmarks

arXiv cs.CL·May 19

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D²Evo is an RL framework to enhance LLM reasoning. It addresses scarcity of medium-difficulty samples by mining anchors matched to model capability and training a Questioner to generate diverse questions at appropriate difficulty. Results: outperforms existing methods on math benchmarks with <2K real samples.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

LLM-Safety Evaluations Lack Robustness

arXiv paper argues current LLM safety evaluations lack robustness due to small datasets, methodological inconsistencies, and unreliable setups. Systematically analyzes the evaluation pipeline—dataset curation, automated red-teaming, response generation, LLM judges—and proposes guidelines to reduce noise and improve comparability of attack/defense research.

AI safety Alignment Evals

arXiv cs.AI·May 19

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

ProRL is a programmatic reinforcement learning framework for combinatorial optimization (job shop scheduling). It generates interpretable policies as human-readable programs via a domain-specific language (DSL-S), exploring the program space through local search and Bayesian optimization. Outperforms classical heuristics and DRL baselines with minimal training episodes.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·May 19

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Novel data selection strategy for LLM alignment based on DPO implicit reward gap. Method selects harder examples (smaller reward gaps) and achieves superior performance with only 10% of original data across multiple benchmarks.

Reinforcement learning Alignment Evals
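
As a rough illustration of the selection rule summarized in this entry, the sketch below ranks preference pairs by DPO implicit reward gap and keeps the smallest-gap fraction. It relies on the standard DPO implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)). The `PreferencePair` container, the `beta` default, and the use of the signed (not absolute) gap are assumptions made for illustration; the paper's exact procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Summed token log-probabilities of each response under the policy
    # and under the frozen reference model (hypothetical inputs).
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """DPO implicit reward of the chosen response minus that of the rejected one."""
    r_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    r_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return r_chosen - r_rejected


def select_hardest(pairs, fraction=0.1, beta=0.1):
    """Keep roughly `fraction` of the pairs (at least one), smallest gap first."""
    keep = max(1, round(fraction * len(pairs)))
    return sorted(pairs, key=lambda p: implicit_reward_gap(p, beta))[:keep]
```

Calling `select_hardest(pairs, fraction=0.1)` mirrors the 10% budget mentioned in the summary; a small gap means the model barely separates the chosen response from the rejected one.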

arXiv cs.AI·May 19

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench is a benchmark evaluating LLMs' ability to reason with qualitative spatial and temporal reasoning (QSTR). It covers 9 calculi (Point Algebra, Allen's Interval Algebra, RCC-5/8/22, etc.) with composition tables, converse relations, and conceptual neighbourhoods. Tested models outperform guessing but none answer all questions correctly. RCC-22 proves most difficult.

Benchmarks Reasoning Evals

arXiv cs.CL·May 19

The Unlearnability Phenomenon in RLVR for Language Models

Study reveals an 'unlearnability' phenomenon in Reinforcement Learning with Verifiable Reward (RLVR) for LLMs. Some hard examples remain unlearnable even with correct rollouts. Cross-example gradient analysis shows fundamental representation flaws: low gradient similarity and ungeneralizable reasoning patterns. Data augmentation fails to improve gradient similarity.

Reinforcement learning Reasoning Papers

arXiv cs.AI·May 19

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind decodes affective captions directly from brain fMRI signals. The system first retrieves a neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector extracted from the same fMRI recording. Evaluated on two independent emotion fMRI datasets, EmoMind outperforms GPT-4 with discrete emotion labels across all validation axes.

Vision Reasoning Evals

arXiv cs.AI·May 19

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

OCCAM is a framework for explaining black-box image classifier decisions through causal visual concepts. It discovers concepts in open-set manner, localizes them via text-guided segmentation, and measures causal contribution through object-level interventions. OCCAM aggregates interventional evidence to induce a structured ontology revealing concept dependencies and systematic model biases.

Vision Evals Reasoning

arXiv cs.AI·May 19

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

LLM cascade systems, designed to balance efficiency and performance by routing complex queries to powerful models, are vulnerable to targeted adversarial attacks. A novel attack exploits lightweight models and internal decision mechanisms to simultaneously degrade accuracy and cost-efficiency.

AI safety AI Agents Benchmarks

arXiv cs.AI·May 19

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD proposes targeted self-distillation for training long-horizon LLM agents. The method uses full-trajectory hindsight to identify failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. On BFCL v3 and AppWorld, it improves over dense per-turn feedback baselines by up to 18.80% while achieving 2.26× lower time per training step.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

Universal Dynamics of Punctuated Progress

Analysis of 6.8M solutions across 6.7K tasks in 9 domains (materials, structural biology, AI, computational biomedicine, data science, theoretical CS, F1, wheel building). Three universal patterns: heavy-tailed waiting times, sublinear record accumulation, temporal correlation of breakthroughs. Minimal model unifies radical innovation and incremental refinement.

Papers Benchmarks Reasoning

arXiv cs.CL·May 19

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv paper proposing a formal framework for combining LLM and human evaluations. Uses a doubly robust estimator from the missing-data literature to determine the optimal number of human reviews needed. Shifts the LLM's role from substitutive to auxiliary in a two-stage sampling design.

Evals Benchmarks AI safety

arXiv cs.CL·May 19

Machine Unlearning for Masked Diffusion Language Models

First machine unlearning framework for masked diffusion language models (LLaDA, Dream). MDU minimizes KL divergence from prompt-conditional to prompt-masked unconditional distribution at each masked position, with temperature scaling for privacy-utility trade-off. Code released.

Papers AI safety Fine-tuning

arXiv cs.CL·May 19

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

EvoSynth, an autonomous multi-agent framework, optimizes jailbreak attacks in executable code space rather than prompt space. The system iteratively evolves and self-corrects code-based attack algorithms. Results: 85.5% Attack Success Rate against Claude-Sonnet-4.5, 95.9% average ASR across evaluated targets.

AI Agents Multi-agent Claude

arXiv cs.AI·May 19

AI for Auto-Research: Roadmap & User Guide

Comprehensive study of AI-assisted research systems through April 2026. LLMs excel at structured, retrieval-grounded, and tool-mediated tasks but remain fragile for genuinely novel ideas and scientific judgment. End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. Human-governed collaboration is the most credible deployment paradigm.

AI Agents Reasoning Evals

arXiv cs.AI·May 19

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F$^3$A

F³A is a training-free router for visual token pruning in vision-language models. It selects relevant visual tokens via question-conditioned cues without extra LLM forward passes, reducing inference costs while maintaining performance across model scales.

Vision Reasoning Infrastructure

arXiv cs.CL·May 19

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Infini-News indexes 1.35B CC-News articles (August 2016–present) with metadata extraction, language detection (GlotLID, lingua, CommonLingua), and geographic attribution (83.4% coverage). Infini-gram suffix-array indexes enable sub-second full-text pattern search across the entire archive.

RAG Vector search Benchmarks

arXiv cs.AI·May 19

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

TTG (The Token Games) is an evaluation framework where language models challenge each other by creating programming puzzles. The system uses pairwise duels and Elo ratings to compare 10 frontier models. Results match existing benchmarks (Humanity's Last Exam) for under $200 without human puzzle curation.

Benchmarks Reasoning Evals
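
For readers unfamiliar with how pairwise duels become a ranking, here is a minimal sketch of the classic Elo update. The starting rating of 1500, the K-factor of 32, and the model names are conventional or hypothetical choices, not values taken from the paper, and the summary does not say how a duel is scored, so `score_a` is simply 1.0 for a win, 0.5 for a draw, and 0.0 for a loss.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one duel; score_a is A's result in [0, 1]."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's result and expectation are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate_duels(duels, initial: float = 1500.0, k: float = 32.0):
    """Process (model_a, model_b, score_a) tuples in order and return ratings."""
    ratings = {}
    for a, b, score_a in duels:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a, k)
    return ratings
```

Because the update is sequential, the final ratings depend on duel order; evaluation suites often average over shuffled orders or fit a Bradley-Terry model instead.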

arXiv cs.AI·May 19

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025 is an open-source benchmark of high-temperature superconducting materials discovered in 2023–2025 (X₂YH₆ systems, MXH₃ perovskites, M₃XH₈, BCN-doped cage structures, 2D honeycomb). Addresses the lack of standardized datasets for fair comparison of AI algorithms predicting critical transition temperatures.

Benchmarks Papers Open source

arXiv cs.AI·May 19

Training Infinitely Deep and Wide Transformers

Theoretical paper on transformer training in mean-field regime (infinite depth and width). Authors model training as controlling a neural PDE (vs ODE for ResNets), establish well-posedness of forward pass, derive explicit formulas for Wasserstein gradients, and prove gradient flow convergence to global minima under NTK injectivity conditions.

Reasoning Papers Benchmarks

arXiv cs.AI·May 19

A More Word-like Image Tokenization for MLLMs

DiVT (Disentangled Visual Tokenization) clusters patch embeddings into coherent semantic units for MLLMs, creating discrete meaningful visual tokens instead of continuous streams. Adapts token budget to image complexity, reducing memory and latency while improving LLM compatibility.

Vision Code generation

arXiv cs.AI·May 19

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

DISCA, a training-free inference-time method, culturally aligns LLMs using World-Values-Survey-grounded persona agents. Tested across 20 countries and 7 backbones (2B–70B), it reduces cultural misalignment by 10–24% on MultiTP without weight modification.

Alignment Reasoning Evals

arXiv cs.AI·May 19

When Marginals Match but Structure Fails: Covariance Fidelity in Generative Models

Theoretical paper on generative model evaluation. Authors show standard criteria (marginal matching) don't guarantee covariance structure preservation. They introduce $D_\Sigma = \|\Sigma_P - \Sigma_Q\|_F$ to measure dependence fidelity, with formal proofs and validation on Fashion-MNIST VAE, RNA-seq (TCGA-BRCA, n=1111), and Alzheimer's data (n=113).

Evals Papers Benchmarks
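
To make the "marginals match but structure fails" point concrete, the sketch below computes the Frobenius norm of the difference between two covariance matrices on a toy example. The helper names and the four-point datasets are hypothetical, and the population (divide-by-n) covariance is an assumption; the paper may use a different estimator.

```python
import math


def covariance_matrix(samples):
    """Population covariance matrix of a list of equal-length rows."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / n
         for j in range(d)]
        for i in range(d)
    ]


def covariance_fidelity(samples_p, samples_q):
    """Frobenius norm of the difference between the two covariance matrices."""
    cov_p = covariance_matrix(samples_p)
    cov_q = covariance_matrix(samples_q)
    return math.sqrt(sum(
        (a - b) ** 2
        for row_p, row_q in zip(cov_p, cov_q)
        for a, b in zip(row_p, row_q)
    ))


# Each coordinate takes the values -1 and +1 equally often in both datasets,
# so every marginal matches. The coordinates move together in P and in
# opposition in Q, which only the off-diagonal covariance entries reveal.
samples_p = [(-1.0, -1.0), (1.0, 1.0), (-1.0, -1.0), (1.0, 1.0)]
samples_q = [(-1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, -1.0)]
d_sigma = covariance_fidelity(samples_p, samples_q)
```

Here the covariance matrices differ only in their off-diagonal entries (+1 versus -1), so `d_sigma` equals the square root of 8 even though a per-coordinate comparison would report a perfect match.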

arXiv cs.AI·May 19

An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

Paper evaluating energy efficiency of neural vs heuristic combinatorial solvers. Defines Amortized Efficiency Threshold (AET): deployment volume where neural network training cost breaks even. On CVRP (n=50), attention-based solver from Kool et al. (2019) reaches energy parity at ~4560 deployed instances. Per-instance neural-to-heuristic ratio: 2.29e-3.

Benchmarks Reasoning Open source
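
The break-even idea can be sketched as follows, under the assumption that the threshold is the one-off training energy divided by the energy saved per deployed instance. The summary does not give the paper's exact formula, so this definition, the function name, and the example figures are all hypothetical.

```python
def amortized_efficiency_threshold(
    training_energy: float,
    neural_energy_per_instance: float,
    heuristic_energy_per_instance: float,
) -> float:
    """Number of deployed instances at which total neural energy
    (training plus inference) equals total heuristic energy.

    Assumed definition: training_energy / (heuristic - neural) per instance.
    """
    saving = heuristic_energy_per_instance - neural_energy_per_instance
    if saving <= 0:
        # The neural solver never recoups its training cost.
        raise ValueError("neural solver must be cheaper per instance to break even")
    return training_energy / saving


# Hypothetical energy figures in arbitrary units, not taken from the paper.
break_even = amortized_efficiency_threshold(
    training_energy=1000.0,
    neural_energy_per_instance=0.5,
    heuristic_energy_per_instance=1.0,
)
```

With these made-up figures the neural solver pays back its training cost after 2000 instances; below that volume the heuristic uses less energy overall.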

arXiv cs.AI·May 19

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

StreamPro introduces StreamPro-Bench, a benchmark evaluating proactive video streaming understanding across three dimensions: perception, temporal reasoning, and proactive agency. The framework proposes CB-Stream Loss to address supervision imbalance and applies GRPO with multi-grained rewards. Results: 41.5 on StreamPro-Bench vs 10.4 previously, 78.9 on StreamingBench-RTVU.

Vision Reasoning Reinforcement learning

arXiv cs.AI·May 19

Sustainability via LLM Right-sizing

Comparative study of 11 LLMs (GPT-4o, Gemma-3, Phi-4, etc.) across 10 common workplace tasks. GPT-4o delivers superior performance but at higher cost and environmental footprint; smaller models (Gemma-3, Phi-4) achieve strong results with better efficiency. Advocates task-aware sufficiency assessments over performance-maximizing benchmarks.

Benchmarks Evals Open source

arXiv cs.AI·May 19

Harnessing LLM Agents with Skill Programs

HASP converts textual skills for LLM agents into executable Program Functions that actively intervene in the agent loop at failure-prone states. The framework achieves 25% improvement on web-search vs ReAct and 30.4% gain on math/coding vs Search-R1 through inference-time intervention, post-training, or self-improvement mechanisms.

AI Agents Reasoning Code generation

arXiv cs.CL·May 19

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

New IH-GRPO algorithm decouples tool invocation from execution to enhance LLM mathematical reasoning. Achieves 1.87–2.53% improvements on mathematical benchmarks with Qwen3 (1.7B–8B). Code released.

Reasoning AI Agents Reinforcement learning

arXiv cs.CL·May 19

Algorithmic Cultivation: How Social Media Feeds Shape User Language

Longitudinal study of 235M posts from 4M Bluesky users showing algorithmic feed exposure (News, Science, Blacksky) measurably shapes user language: stylistic accommodation, semantic alignment, register formalization. Effects vary by feed; reposting is the strongest predictor of linguistic convergence.

Papers Evals

arXiv cs.AI·May 19

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS introduces persistent evaluation memory to improve rubrics in LLM RL fine-tuning. The system accumulates evaluation diagnostics over time, uses static and dynamic retrieval to contextualize rubric modifications, and adds ~5% time overhead. Experiments show consistent gains across closed and open-ended domains.

Reinforcement learning Fine-tuning Evals

arXiv cs.AI·May 19

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

Focused Forcing optimizes KV caches in autoregressive video diffusion generation by selecting relevant historical frames per-frame and per-head. The method combines attention scores with diversity scores, achieving 1.48× end-to-end acceleration without training while improving visual quality and text alignment.

Video generation Reasoning Evals
