Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv paper proposing a formal framework for combining LLM and human evaluations. Uses a doubly robust estimator (missing data approach) to determine optimal sample sizes of human ratings needed for benchmark validation, shifting LLMs from a substitutive to an auxiliary role.

Evals Benchmarks AI safety

arXiv cs.AI·May 19

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Sherpa.ai introduces a multi-party protocol for privacy-preserving entity alignment in vertical Federated Learning. The method hides intersection membership while enabling exact and typo-tolerant matching, without revealing which samples are shared across parties.

Alignment AI safety

arXiv cs.AI·May 19

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Six modern tabular foundation models form a highly redundant ensemble (mean Q-statistic 0.961). On 153 OpenML classification tasks, the best ensemble (two-level cascade stacking) gains +0.18% accuracy at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best single model in the same equivalence group. Greedy selection is recommended as a practical default.

Benchmarks Papers

arXiv cs.AI·May 19

Spiker-LL: An Energy-Efficient FPGA Accelerator Enabling Adaptive Local Learning in Spiking Neural Networks

Spiker-LL is an FPGA accelerator for spiking neural networks (SNNs) enabling adaptive on-device learning. Built on the Spiker+ architecture, it implements the STSF local learning rule with minimal overhead. On MNIST/F-MNIST/DIGITS: 93% accuracy, sub-millisecond latency, <0.1 mJ per inference, DSP-free.

Reasoning Infrastructure Open source

arXiv cs.AI·May 19

Algorithmic Cultivation: How Social Media Feeds Shape User Language

Longitudinal study of 235M posts from 4M Bluesky users showing algorithmic feed exposure (News, Science, Blacksky) measurably shapes user language: semantic alignment, register formalization, psycholinguistic restructuring. Reposting is the strongest predictor of linguistic convergence across feeds.

Papers Evals AI safety

arXiv cs.AI·May 19

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D²Evo is an RL framework to enhance LLM reasoning through self-evolution. The method generates medium-difficulty training samples by mining anchors matched to model capability, then jointly optimizes a Questioner and Solver. Results: outperforms existing methods on mathematical reasoning benchmarks with <2K real examples.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 19

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

Researchers train KinGPT (25M parameters) on chess data and demonstrate that high benchmark scores of chess-trained LLMs stem primarily from pattern-matching rather than genuine rule understanding. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% best-move accuracy. Training code, datasets, and model checkpoints open-sourced.

Benchmarks Evals Fine-tuning

arXiv cs.AI·May 19

A Theory of Training Profit-Optimal LLMs

Economic model combining scaling laws and microeconomic theory to characterize rational behavior of LLM training firms. Analyzes profit maximization under compute-bound and data-bound regimes: in the compute-bound regime, optimal model size tracks hardware efficiency (FLOPs/$) at a near-linear rate; in the data-bound regime, optimal training expenditure scales as D²/E.

Benchmarks Papers Business

arXiv cs.AI·May 19

Multi-agent AI systems outperform human teams in creativity

Multi-agent LLM teams outperform human teams in creativity (Cohen's d=1.50) across 4,541 AI ideas versus 341 human ideas on six tasks. Advantage driven by novelty while maintaining usefulness. LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while humans benefit from smooth conversational flow (high local coherence, frequent pivots).

Multi-agent Reasoning Benchmarks

arXiv cs.AI·May 19

Self-supervised Hierarchical Visual Reasoning with World Model

ResDreamer, a hierarchical world model, reconstructs residuals at each layer to progressively abstract visual dynamics in a self-supervised manner. Without domain-specific knowledge, it achieves state-of-the-art sample and parameter efficiency for RL in 3D adversarial environments. Code released.

Reinforcement learning Reasoning Vision

arXiv cs.LG·May 19

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv paper proposing a formal statistical framework to combine LLM and human evaluations. Uses a doubly robust estimator (missing data approach) to determine optimal sample sizes of human ratings needed for benchmark validation, based on LLM judgment predictability.

Evals Papers AI safety

arXiv cs.CL·May 19

Sustainability via LLM Right-sizing

Empirical study comparing 11 LLMs (GPT-4o, Gemma-3, Phi-4, etc.) across 10 everyday occupational tasks. GPT-4o delivers superior performance but at higher cost; smaller models achieve strong results with better efficiency. Proposes task-aware sufficiency assessments over performance-maximizing benchmarks.

Benchmarks Evals Open source

arXiv cs.AI·May 19

Semantic Generative Tuning for Unified Multimodal Models

Semantic Generative Tuning (SGT) aligns visual understanding and generation in unified multimodal models by using image segmentation as a generative proxy. High-level semantic tasks improve feature linear separability and visual-textual attention allocation, outperforming decoupled training approaches.

Vision Image generation Fine-tuning

arXiv cs.AI·May 19

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

MARS is a multimodal system for the CASTLE 2026 challenge that reasons over 4 days of activity, 15 synchronized perspectives, transcripts, and auxiliary modalities (photos, videos, gaze, thermal imagery, heart rate). The approach uses DeepSeek for video summaries and a GPT-5.4 agent to select evidence sources. The system achieved second place on the final leaderboard.

AI Agents Multi-agent Vision

arXiv cs.AI·May 19

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Comparative study of tabular foundation models (TFMs) vs classical models on credit default prediction. On Home Credit and Lending Club datasets, context construction strategy (balanced vs uniform sampling) explains more AUC-ROC variance than model choice: +3-4 AUC points. With 5K-10K balanced examples, TFMs match classical GBDTs while improving default-class recall.

Benchmarks

arXiv cs.AI·May 19

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO improves generative recommendation by aligning reinforcement learning optimization to individual reasoning steps. Instead of assigning a single advantage to the entire response, SAPO computes separate group-relative advantages for each reasoning step and SID token, stabilizing training and outperforming baselines across three real-world datasets.

Reinforcement learning Reasoning Code generation

arXiv cs.AI·May 19

Learning How to Cube

A neuro-symbolic post-training framework trains a 4B-parameter model to generate cubing heuristics for SAT via SFT+DPO. The model achieves pass@5=53 on 100 SAT competition benchmarks, matching the best symbolic heuristic and surpassing Claude-Sonnet-4 (50). Data comes from an MCTS pipeline exploring splitting decisions over SAT competition formulas.

Reasoning Reinforcement learning Benchmarks

arXiv cs.CL·May 19

General Preference Reinforcement Learning

GPRL (General Preference Reinforcement Learning) replaces scalar reward models with a General Preference Model (GPM) that uses k skew-symmetric subspaces. Tested on Llama-3-8B-Instruct, it reaches a 56.51% win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by preventing single-axis reward hacking.

Reinforcement learning Llama Alignment

arXiv cs.AI·May 19

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Comparative study on surgical AI: multi-billion parameter Vision Language Models fail at neurosurgical tool detection despite extensive training. Scaling experiments show diminishing improvements. Obstacles persist across architectures, suggesting data and compute alone are insufficient.

Vision Benchmarks Papers

arXiv cs.AI·May 19

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

MirrorBench is a benchmarking framework to evaluate user-proxy agents in conversational systems. It combines 6 metrics (MATTR, Yule's K, HD-D, GTEval, Pairwise Indistinguishability, Rubric-and-Reason) to measure realism of LLM-generated user utterances across 4 public datasets. Open-source code released.

AI Agents Evals Benchmarks

arXiv cs.CL·May 19

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS enhances rubric-based RL by integrating persistent evaluation memory. The system accumulates evaluation diagnostics over time, retrieves them via static and semantic search, and continuously adapts reward rubrics. Experiments show performance gains with ~5% time overhead.

Reinforcement learning Fine-tuning Evals

arXiv cs.AI·May 19

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Theoretical paper formulating multi-step reasoning as s-t connectivity on knowledge graphs. Shows a phase transition: if pre-training knowledge is fragmented into small components, augmentation requires Ω(√n) queries; once a density threshold is crossed and a giant component forms, a constant expected number of queries suffices.

RAG Reasoning Papers

arXiv cs.AI·May 19

EXG: Self-Evolving Agents with Experience Graphs

EXG is an experience graph framework for self-evolving LLM-based agents. It organizes successes and failures into structured, relational representations, enabling real-time cross-task experience reuse and offline reuse as external memory. Tested on code generation and reasoning benchmarks, EXG outperforms reflection and memory-based baselines.

AI Agents Reasoning Code generation

arXiv cs.AI·May 19

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR is a framework enabling LLM agents to learn from their own experiences through a closed-loop lifecycle. It combines offline self-distillation (synthesizing interaction trajectories into reusable strategic principles) and online interaction (actively retrieving distilled principles to guide decisions). Tested on complex multi-hop QA benchmarks, it outperforms existing agentic baselines.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

CheckSupport is an open-source system using locally-deployed LLMs to automate reporting checklist recommendation and completion for scientific manuscripts. Evaluated on peer-reviewed manuscripts, it achieves 90% accuracy for checklist recommendations and 88% for item-level completion, processing each manuscript in 12.5 seconds on CPU-only hardware.

Llama Prompt engineering Evals

arXiv cs.AI·May 19

Detecting Verbatim LLM Copy-Paste in Homework

SteganoPrompt, an open-source web tool, detects verbatim copies of assignment prompts submitted to LLMs. It encodes an invisible instruction in the prompt via the Unicode Tags block (U+E0000–U+E007F), creating a detectable signature in the model's response. Tested across 7 LLM families, the approach bypasses limitations of post-hoc detectors and requires no cooperation from model providers.

Evals AI safety Prompt engineering

arXiv cs.AI·May 19

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

arXiv paper on supervised fine-tuning (SFT) effectiveness for LLMs. Authors show SFT primarily removes noise-like token interactions but rarely acquires reliable new ones. The denoising phase is extremely brief; continued fine-tuning introduces overfitted interactions. Implications for early stopping and LLM training.

Fine-tuning Reasoning Papers

arXiv cs.AI·May 19

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

StyleText is a dataset of 28,518 image-mask-prompt triplets for scene text inpainting with style preservation. Automated pipeline combines LLM templating, Flux with KV-cache injection, OCR, polygon mask extraction, and FluxFill augmentation. FluxFill+LoRA baseline substantially improves OCR accuracy while maintaining scene style consistency.

Benchmarks Image generation Vision

arXiv cs.AI·May 19

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

PAIR is an internal reward model for multi-step LLM training via GRPO. It combines a hidden-state probe (belief consistency) with a lightweight attention head to generate dense step-level reward signals without external model calls or ground-truth dependencies.

Reinforcement learning Reasoning AI Agents

arXiv cs.AI·May 19

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2 is an LLM-based Interactive Voice Response system deployed at Baidu Maps for large-scale POI attribute acquisition. Using FSM-guided data augmentation, selective generation, and Chain-of-Thought mechanisms, the system processes 0.4 million calls daily with 83.9% Task Success Rate and 130ms latency.

AI Agents Reasoning Voice

arXiv cs.CL·May 19

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression

Context Codec introduces a formal framework for compressing LLM context while preserving semantic commitments (goals, constraints, decisions, evidence). It defines metrics (Critical Atom Recall, Commitment Density) and CCL, an ASCII-first compact rendering language, to make context compression verifiable and auditable.

Prompt engineering Reasoning Papers

arXiv cs.CL·May 19

SAME: A Semantically-Aligned Music Autoencoder

SAME is an autoencoder for stereo music and general audio achieving 4096× temporal compression while maintaining reconstruction quality. The architecture combines a transformer backbone, semantic regularization, phase-aware reconstruction losses, and improved discriminators. Two variants (SAME-L and SAME-S) are released as open weights.

Open source Papers

arXiv cs.LG·May 19

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

PropGuard is a security framework for LLM-based multi-agent systems. It constructs a dual spatio-temporal graph to trace malicious instruction propagation across agents and rounds, then applies source-guided remediation. Tested across four communication architectures and five attack scenarios.

Multi-agent AI safety AI Agents

arXiv cs.AI·May 19

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon breaks down long-horizon manipulation tasks into action sequences (actor-verb-object) and canonicalizes objects by functional affordances using VLM cues. FuncDiffuser, an object-centric and action-centric diffusion policy, learns on aligned data to generalize across object categories and enable cross-task behavior reuse.

Robotics Vision AI Agents

arXiv cs.AI·May 19

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA introduces a federated LoRA fine-tuning framework for vision-language models (VLLMs) addressing imbalanced LoRA ranks from heterogeneous resources and missing modalities from user errors or device failures. The method combines simple averaging with structured editing, validated on general-domain and medical-domain benchmarks.

Fine-tuning Vision Papers

arXiv cs.AI·May 19

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling is a hardware-aware analytical framework modeling carbon emissions during frontier LLM training. It integrates neural scaling laws, distributed training strategies, accelerator modeling, and operational/embodied carbon accounting. Source code available on GitHub.

Benchmarks Papers Infrastructure

arXiv cs.AI·May 19

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Researchers demonstrate typographic attacks against CLIP-based household manipulation robots. By placing adversarial stickers, they achieve a 67.8% attack success rate on the HomeRobot benchmark in Habitat simulation, causing the robot to grasp and transport the wrong objects.

Vision Robotics AI safety

arXiv cs.AI·May 19

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

AgroCoT is a VQA benchmark with 4,759 Chain-of-Thought samples designed to evaluate reasoning capabilities of Vision-Language Models in agriculture. Evaluation of 30 VLMs (proprietary and open-source) reveals significant gaps in zero-shot reasoning, highlighting the importance of CoT for precision farming applications.

Vision Benchmarks Reasoning

arXiv cs.AI·May 19

MoleCode unlocks structural intelligence in large language models

MoleCode is an LLM-native molecular language representing molecules as explicit graphs with typed entities and direct relations, replacing implicit SMILES strings. Training-free, it improves frontier LLMs on molecular reasoning, editing and generation tasks, especially for unfamiliar molecules, topology-sensitive operations and larger structures.

Reasoning Code generation Papers