June 2026

Finsler Geometry, Graph Neural Networks, and You

Researchers propose graph neural networks based on Finsler geometry to overcome limitations of graph Laplacian-based architectures (isotropic operators). They prove discrete convergence to the true operator on manifolds and express this operator as a GNN layer, validating recovery of nonlinear geometries.

Papers Benchmarks

arXiv cs.CL·Jun 17

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Empirical study of cross-lingual transfer in in-context learning (ICL) spanning 7 tasks, 6 models, and typologically diverse languages. Results show that expectations derived from fine-tuning do not consistently hold in the ICL regime; the authors propose alternative heuristics for effective source language selection.

Benchmarks

arXiv cs.CL·Jun 17

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

VoidPadding introduces a dedicated [VOID] token for padding in masked diffusion language models (MDLMs), freeing [EOS] for semantic termination. On Dream-7B-Instruct, it improves mathematical reasoning and code generation benchmarks by +17.84 points over baseline and +6.95 over RainbowPadding, reducing NFE by 55.7%.

Code generation Reasoning Benchmarks

arXiv cs.LG·Jun 17

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE is an expert-level mixed-precision quantization framework for MoE multimodal LLMs. It decomposes expert selection frequency by modality (vision/text) and filters redundant vision tokens to correct estimation biases. Results: <2.9% performance loss at W3A16.

Vision Benchmarks Papers

arXiv cs.CL·Jun 17

Learning task-specific subspaces via interventional post-training of speech foundation models

Post-training refinement method for speech foundation models using interventional contrastive learning. Transforms entangled representations into separate content and speaker subspaces via interventional dataset and multi-part contrastive loss. Improves out-of-domain speaker verification and keyword spotting performance.

Voice Fine-tuning Papers

arXiv cs.LG·Jun 17

Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations

GNN surrogate for CO₂ migration forecasting in complex geological formations. Model trained on SPE11A benchmark with anisotropic message-passing mechanism capturing directional transport. Produces competitive forecasts of gas saturation and liquid-phase density over extended forecasting horizons.

Benchmarks Papers

arXiv cs.CL·Jun 17

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

Fine-tuning Qwen3.5-27B to predict PHQ-9 depression scores directly from transcripts of conversations with an AI mental health application. 6,283 users (3,111 ground-truth labels + Claude Opus pseudolabels). Performance: MAE=2.6, RMSE=4.0, r=0.80, AUC=0.91 at PHQ-9≥10 clinical threshold.

Fine-tuning Reasoning Qwen

arXiv cs.CL·Jun 17

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Automated prompt optimization framework for LLM agents in interactive environments. Decomposes observation-to-action pipeline into descriptor and action-selection agents, iteratively refines via LLM-driven evolutionary loop guided by environment returns. On BabyAI/BALROG: improves from 0% to 72.5% success on PutNext without fine-tuning.

AI Agents Prompt engineering Reinforcement learning

arXiv cs.CL·Jun 17

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench evaluates coding agents' ability to generate playable games end-to-end in Godot. The benchmark comprises 140 tasks across 15 game families. Top agents achieve only 41.46% success, revealing struggles to produce complete games with sufficient content and coherent visual feedback.

Code generation AI Agents Benchmarks

arXiv cs.CL·Jun 17

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

RL-trained reasoning models often generate unnecessary reasoning after finding the correct answer (overthinking). This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention during GRPO that edits successful trajectories continuing after answer emergence, preserving the verified prefix and weakening preference signals for unnecessary thinking.

Reinforcement learning Reasoning

arXiv cs.AI·Jun 17

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench evaluates how language models handle off-procedure inputs in industrial diagnostic dialogue. A dataset of 1,676 multi-turn conversations derived from 50 diagnostic flowcharts reveals models often select a real but contextually inadequate step rather than hallucinate, exposing a vulnerability: plausible but wrong advice grounded in documentation.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 17

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

ChLogic is an English-Chinese aligned benchmark evaluating the robustness of logical reasoning in LLMs. Built from formal logical templates, it contains 100 aligned propositions and 15 Chinese-specific phenomena. Experiments on Qwen3, Ministral, and GLM reveal a persistent English-Chinese performance gap, with back-translation producing mixed effects.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 17

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Study on bilingual fine-tuning for low-resource ASR across 9 language pairs. Uses language identification tokens prepended to input text. Results: bilingual fine-tuning improves performance when language ID accuracy is high; providing the token at inference mitigates low language ID performance.

Voice Fine-tuning Benchmarks

arXiv cs.CL·Jun 17

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Comparative study of LLM abilities to predict next speaker, turn changes, and addressee in multi-party conversations. On the AMI corpus, LLMs outperform supervised models and humans in next speaker prediction without audio-visual access. MM-LLMs exceed text-based LLMs but remain below human performance for addressee and turn-change prediction.

Benchmarks Evals Vision

arXiv cs.CL·Jun 17

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

Pruned models pass multiple-choice benchmarks but fail in open generation. Multilingual study shows that under high-sparsity pruning (Wanda), correct answers are demoted rather than erased: they reappear with beam search or sampling. Multiple-choice benchmarks overstate the usability of compressed LLMs.

Benchmarks Evals Fine-tuning

arXiv cs.CL·Jun 17

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver is a slow-fast co-evolution framework that cultivates self-evolving agents through on-policy self-distillation. The system manages a four-level memory hierarchy to read, use, write, and maintain experience. Across multi-domain benchmarks, OPD-Evolver outperforms ReasoningBank (+11.5%) and Skill0 (+5.8%), with OPD-Evolver-9B rivaling Qwen3.5-397B and Step-3.5-Flash.

AI Agents Reasoning Reinforcement learning

arXiv cs.CL·Jun 17

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Study on agent routing at scale: with 110 agents and 584 tools, F1 drops 16–23 percentage points on under-specified requests. Analysis decomposes degradation into retrieval gap and confusion gap (10pp oracle ceiling loss). Embedding-based shortlisting recovers +10–11pp F1 at full scale across models.

AI Agents Multi-agent Evals

arXiv cs.LG·Jun 17

ProCUA-SFT Technical Report

ProCUA-SFT is a dataset of 3.1M step-level SFT samples generated automatically from 93K synthetic trajectories across 2,484 application combinations. Fine-tuning UI-TARS 7B on ProCUA-SFT achieves 45.0% on OSWorld, a +18.7 percentage-point improvement over the base model and +35% above AgentNet. The pipeline uses Kimi-K2.5 as task generator, precondition judge, and trajectory executor.

AI Agents Benchmarks Fine-tuning

arXiv cs.CL·Jun 17

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

STATEWITNESS, an activation explainer, detects deception in reasoning LLMs by reading the target model's hidden states and answering natural-language queries. Achieves 0.916 mean AUROC, 11.6% relative gain over best black-box text monitor, 25.0% over best activation-probe baseline. Provides token- and sentence-level evidence traces for human inspection.

Reasoning AI safety Alignment

arXiv cs.LG·Jun 17

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

InferBERT combines transformers with Do-calculus to detect causal adverse drug events in pharmacovigilance. Comparative study on AILF and TRAM benchmarks: BioBERT outperforms XGBoost, ALBERT, and Med-LLaMA. Finding: domain-specific pre-training outweighs model size.

Benchmarks Fine-tuning AI safety

arXiv cs.CL·Jun 17

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

Theoretical analysis of deep transformer expressiveness through bounded-depth, non-recursive context-free grammars. Authors explicitly construct transformers with positional attention whose depth scales linearly with grammar depth, demonstrating these architectures can encode abstract grammatical states into linearly separable subspaces within the residual stream.

Papers Reasoning Benchmarks

arXiv cs.CL·Jun 17

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

LLM-as-Environment-Engineer framework: the policy model analyzes failure trajectories and proposes modifications to the next-stage RL training environment configuration. MAPF-FrozenLake testbed with multi-dimensional configurations. Qwen3-4B outperforms GPT and Gemini on proposed benchmarks.

Reinforcement learning Multi-agent Reasoning

arXiv cs.CL·Jun 17

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

Study of second-order bias in LLMs: how models judge biased content, beyond generation. Grounded in entitlement epistemology, the method evaluates whether LLMs infer demographics without sufficient support. Findings: systematic bias across target groups, evasion of safety guardrails, persistence of demographic triggers.

Evals AI safety Alignment

arXiv cs.LG·Jun 17

Rethinking Groups in Critic-Free RLVR

arXiv paper on critic-free reinforcement learning for LLMs. Authors challenge the role of rollout groups in existing methods and propose negative token filtering to enable stable single-rollout training, improving performance on agentic tasks compared to group-based RL techniques.

Reinforcement learning Reasoning AI Agents

arXiv cs.LG·Jun 17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD stabilizes on-policy distillation for LLMs by replacing unbounded log-ratio rewards with Box-Cox power transformation. On 6 mathematical reasoning benchmarks with Qwen3, achieves +6.37 Avg@8/+5.71 Pass@8 gains vs vanilla OPD, reduces wall-clock time by 59.2% and peak GPU memory by 23.1%.

Fine-tuning Reinforcement learning Benchmarks

arXiv cs.LG·Jun 17

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

Generative model based on GPT architecture for inverse design of heterogeneous catalysts. Pretrained on 133 million structures, fine-tuned on ~460,000 optimized structures. Achieves 98% structural validity, 95% optimization validity, and improves screening efficiency 1.5–4× for reaction-targeted catalyst discovery.

Papers Benchmarks Fine-tuning

arXiv cs.LG·Jun 17

ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors

Hardware-aware finetuning method for DNN deployment on ReRAM crossbar arrays. Uses a range-shrunk sinh transformation to mitigate I-V non-linearity and incorporates retention errors into a regularization loss. Results: no degradation for ResNet18 and DeiT-Tiny, less than 2% degradation for MobileNetV3 on ImageNet, and a 1-point F1 drop on SQuAD v2.

Fine-tuning Benchmarks

arXiv cs.CL·Jun 17

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals

arXiv cs.CL·Jun 17

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena evaluates LLMs in multi-turn clinical consultation across 8 competence dimensions using EHR-grounded knowledge graphs. On 437 patients, models excel in questioning (4.43-4.99/5) and ethical conduct (4.38-4.93/5), but fail in diagnostic accuracy (2.63-3.55/5) and information coverage (2.08-3.02/5). Weaknesses include repetitive questioning, omitted medical history, inadequate uncertainty handling.

Evals Reasoning AI safety

arXiv cs.AI·Jun 17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

PreAct compiles successful runs of computer-using agents into small state-machine programs, replayed 8.5-13x faster with no per-step LLM calls. An independent evaluator validates each program before storage. Across three benchmarks (mobile, desktop, web), this verification prevents faulty program accumulation (+1.75-2.6 tasks).

AI Agents Code generation Benchmarks

Vercel AI Blog·Jun 17

Introducing Vercel Connect

Vercel Connect, now in Public Beta, replaces long-lived stored tokens with runtime credential exchange. Agents receive short-lived, task-scoped credentials through reusable connectors (Slack, GitHub, etc.), eliminating risks from permanent token leaks.

AI Agents Tools Infrastructure

arXiv cs.LG·Jun 17

Rift: A Conflict Signature for Deception in Language Models

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).

AI safety Alignment Evals

arXiv cs.LG·Jun 17

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

MM++ is an unsupervised, post-hoc method for out-of-distribution detection. It fuses intermediate layers selected by entropy density with the final representation using Ledoit-Wolf regularized covariance, requiring no auxiliary OOD data, fine-tuning, or architectural changes.

Evals AI safety

arXiv cs.CL·Jun 17

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

Study of LLM verbalized confidence reliability in machine translation. Five methods for extracting per-token confidence without internal signal access are compared against predicted probabilities. Results: similar performance for error detection and calibration, but little correlation between internal and verbalized methods.

Evals Reasoning

arXiv cs.CL·Jun 17

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

MLLP-VRAIN group participates in IWSLT 2026 simultaneous speech translation using Parakeet and Qwen 3.5 models. Cascaded system with adaptive policies and RAG mechanism for domain-specific context. +5.82 XCOMET-XL improvement on En→De test set versus previous year.

Qwen RAG Code generation

arXiv cs.CL·Jun 17

Are you speaking my languages? On spoken language adherence in multimodal LLMs

LLM-based ASR systems often misidentify output languages in multilingual contexts. Authors propose three mitigation strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning to improve language adherence while preserving code-switching flexibility and ASR performance.

Voice Prompt engineering Fine-tuning

arXiv cs.LG·Jun 17

Discrete Autoregressive Transformer for Generative Mechanism Synthesis

Discrete autoregressive transformer for mechanism synthesis. Conditional sequence model with VAE latent and quantized joint coordinates. Trained on >1M mechanisms with Chamfer distance and DTW metrics. Mean Chamfer distance 0.0132, DTW 0.153 on held-out tests.

Code generation Benchmarks Papers

arXiv cs.LG·Jun 17

Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

Deep learning framework for retrieving atmospheric CO2 from NASA's OCO-2 satellite spectra. Uses Laplace approximations and normalizing flows for uncertainty quantification. Inference orders of magnitude faster than operational algorithms, with better-calibrated non-Gaussian posterior estimates.

Benchmarks Papers

arXiv cs.LG·Jun 17

Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

Theoretical paper on Sum-of-Squares degree barriers for robust halfspace learning under malicious noise. The Christoffel function exactly characterizes corruption hidden from bounded-degree certificates. Proves a margin-degree tradeoff and a degree-2t algorithm achieving the frontier η^(1-1/2t).

Papers Reasoning AI safety

arXiv cs.LG·Jun 17

Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning

Adaptive sequential sampling method for polynomial chaos expansion surrogate models, generalized for multiple quantities of interest. The approach balances input space exploration with exploitation of aggregated variance across outputs, improving surrogate accuracy and stability compared to Latin Hypercube Sampling.

Benchmarks Evals

arXiv cs.AI·Jun 17

Nothing from Something: Can a Language Model Discover 0?

Study on language models' ability to discover the mathematical concept of zero. GPT-2-sized models fail without additional training, but improve substantially after exposure to tens or hundreds of examples. Language pretraining reduces required examples by ~50%.

Reasoning Papers Benchmarks

arXiv cs.CL·Jun 17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Analysis of 4,434 posts and 50,338 comments on Moltbook showing parasocial interaction cues (intimacy language, reciprocity bids, self-identification) persist in autonomous AI-agent communities. Results validated through keyword matching and LLM annotation reveal strong association between these signals and original poster re-engagement and sustained dyadic patterns.

AI Agents Multi-agent Papers

arXiv cs.AI·Jun 17

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a multi-task benchmark for clinical speech AI covering 12 datasets and 27 tasks across diverse health conditions. Tasks are structured by speech production stages (conceptualization, formulation, articulation). Evaluation of 12 audio encoders shows large-scale speech models outperform domain-specific ones, but none generalize reliably across clinical speech.

Benchmarks Voice Evals

arXiv cs.LG·Jun 17

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

CheckMIABench introduces a benchmark for principled evaluation of membership inference attacks (MIAs) on language models. Leveraging intermediate checkpoints from open-source models (Pythia, OLMo, 70M–7B parameters), the authors construct reliable testbeds where training data before and after a fixed point share the same distribution. They evaluate six published attacks and release a modular library (pandora_llm).

Papers Benchmarks AI safety

arXiv cs.CL·Jun 17

Self-Generated Error Training for Token Editing in Diffusion Language Models

Training method to improve token editing in diffusion language models (LLaDA2.1). Addresses training-inference mismatch between random corruptions and model's own errors. Uses no-gradient draft pass followed by supervision on self-generated corruptions via LoRA. Reduces edit intensity and transcription errors.

Code generation Fine-tuning Reasoning

arXiv cs.LG·Jun 17

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

7B model fine-tuned to predict next step in concurrent Go programs by learning event distributions rather than single labels. On 798 predictions from real bugs (CockroachDB, Kubernetes, gRPC, etcd), achieves 36.2% accuracy with <1000 traces, outperforming Gemini 3.5 Flash zero-shot (34.8%). Dataset, adapters, and tooling released.

Code generation Benchmarks Fine-tuning

arXiv cs.AI·Jun 17

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench is a benchmark for evaluating LLM agents integrated into map services. It measures their ability to identify and satisfy implicit user needs (unspoken decision factors) from real-world behavioral data. Experiments show current agents perform well on explicit task completion but struggle to proactively address implicit factors.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 17

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, STEP representation, point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.

Code generation Benchmarks Vision

arXiv cs.LG·Jun 17

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

REEF-GP, a post-hoc uncertainty quantification framework for neural operators, adapts the operator's intrinsic representations to construct geometry-aware uncertainties. Tested on 5 PDE benchmarks, it preserves predictive accuracy while providing calibrated uncertainty estimates, more efficient than deep ensembles.

Papers Reasoning Evals

arXiv cs.CL·Jun 17

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Study of LLM adaptation for 3D CT report generation in medical imaging. RAD3D-Prefix, a lightweight diagnostic-prior framework, integrates image embeddings and multi-label classification logits. Across LLMs from 96.1M to 1.6B parameters, freezing the model and training only projection layers outperforms full fine-tuning, reducing clinical hallucination and overfitting.

Fine-tuning Vision

arXiv cs.CL·Jun 17

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG is a multi-agent system driven by Variational Free Energy to reduce hallucinations in Multimodal Retrieval-Augmented Generation. It uses Monte Carlo Tree Search, logit perturbations, and specialized agents to route high-risk queries and perform post-hoc factual verification. Authors introduce ModeVent, a challenging subset of MultiVent dataset, to evaluate M-RAG robustness.

RAG Multi-agent Vision

arXiv cs.LG·Jun 17

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

Quantized integer-only transformer implementation for jet tagging on AMD Versal AI Engine (AIE). Reusable software framework automatically converts Python model descriptions to Vitis graph code for low-latency, resource-constrained deployment. Open-source release.

Vision Benchmarks Open source

arXiv cs.LG·Jun 17

Online LLM Selection via Constrained Bandits with Time-Varying Demand

Online learning algorithm for dynamic LLM selection in edge-cloud systems under budget constraints (cost, latency). Formulated as constrained stochastic bandit with time-varying demand. Theoretical guarantees: sublinear regret and sublinear constraint violations.

AI Agents Reinforcement learning Benchmarks

arXiv cs.LG·Jun 17

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

Multi-Adapter PPO framework for wavelength selection in LIBS quantitative analysis. Uses RL with cross-attention mechanisms and specialized adapters. Outperforms PSO by 28.4% in comprehensive score and 45.2% in prediction accuracy on steel and coal datasets. Code and dataset released.

Reinforcement learning Benchmarks

arXiv cs.CL·Jun 17

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

MultiClin, a clinical ASR benchmark, evaluates speech recognition model robustness to multiscript variability (multiple valid orthographic forms of the same term). Conventional metrics underestimate performance. Script unification consistently yields best ASR performance.

Benchmarks Voice Evals

arXiv cs.AI·Jun 17

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

CEO-Bench, a multi-agent benchmark, evaluates LLMs' ability to make strategic resource reallocation decisions. Five frontier models tested on 13 scenarios show high structural validity but diverge on strategic calibration. Failure modes include single-advisor capture and historical amnesia.

AI Agents Multi-agent Reasoning

arXiv cs.CL·Jun 17

PromptMN: Pseudo Prompting Language

PromptMN is a domain-specific language that structures natural prompts with %-prefixed typed directives (roles, goals, constraints, outputs). Tested on Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 without fine-tuning, it reduces context ambiguities in agent and software development workflows.

Prompt engineering AI Agents Tools

arXiv cs.CL·Jun 17

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

MemSlides introduces a hierarchical memory framework for personalized presentation agents. It separates long-term memory (user profiles, tool experience) from working memory (active preferences), enabling multi-turn local revisions without full deck regeneration.

AI Agents Prompt engineering Tools

arXiv cs.AI·Jun 17

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen is a financial multimodal reasoning agent that accumulates experience from prior trajectories in persistent memory. The system improves a frozen 8B vision-language model across four financial benchmarks using selective experience activation and a deterministic tool environment for numerical computation and verification.

AI Agents Multi-agent Vision

arXiv cs.AI·Jun 17

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

MathVis-Fine introduces a framework for fine-grained visual dependency modeling in mathematical reasoning. A new dataset augments visual annotations with visual dependency ratings. Two-stage progressive training balances answer correctness and visual grounding rewards according to each sample's intrinsic visual necessity, reducing reward bias.

Reasoning Vision Benchmarks

arXiv cs.AI·Jun 17

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Researchers propose Equation-to-Behavior Prompting to guide LLMs to simulate diverse cognitive models (Bayesian, motivated reasoning, Grether's α-β model). Large models approximate these specifications via prompting, but small models fail. RL training reduces belief error by 26.5% and improves performance by 2.5–12% on legal persuasion games.

Reasoning Reinforcement learning Evals

arXiv cs.CL·Jun 17

Do Large Language Models Always Tell The Same Stories?

Comparative study of narrative diversity across 10 LLMs versus human authors using r/WritingPrompts dataset. Models generate stories significantly more similar to each other than human-written texts, converging toward a generic mean narrative. Temperature scaling and negative prompting fail to address this homogeneity.

Evals Benchmarks Reasoning

arXiv cs.AI·Jun 17

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

SEAGym is an evaluation environment for measuring self-evolving LLM agent harness updates (prompts, memory, tools, interaction loop). The study compares ACE, TF-GRPO, and AHE on Terminal-Bench 2.0 and HLE, showing frequent updates don't guarantee held-out performance gains and source diversity affects harness reliability.

AI Agents Reinforcement learning Evals

arXiv cs.AI·Jun 17

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

E³RL, a reinforcement learning method, addresses error propagation in long-horizon reasoning of LLMs. Using autoregressive cross-entropy as an epistemic uncertainty signal, the model can locally correct logical defects and reuse KV cache. On AIME, 4B and 8B models outperform SOTA by 5.349% and 6.514%.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·Jun 17

A Machine-Learned Comorbidity Index

Machine-Learned Comorbidity Index (MLCI) maps diagnosis codes to a single scalar by maximizing normalized Hilbert-Schmidt Independence Criterion across multiple clinical outcomes. Unlike traditional indices (Charlson, Elixhauser), MLCI captures nonlinear risk-outcome relationships and outperforms baselines on multiple EHR datasets.

Benchmarks Papers

arXiv cs.AI·Jun 17

Dissecting model behavior through agent trajectories

Study of harness-model alignment via 138k agent trajectories. Authors introduce Simple Strands Agent (SSA), a generic harness tested on Claude, Gemini, GPT, Grok, Qwen across SWE-Pro, SWE-Verified, and Terminal-Bench-2. Beyond pass@1 scores, analysis reveals fine-grained behavioral differences: edit frequency, testing activity, phase transitions.

AI Agents Benchmarks Code generation

arXiv cs.CL·Jun 17

Examining the Limits of Word2Vec with Toki Pona

Word2Vec study on Toki Pona, a constructed language with ~130 words. Trained on 1.4M sentences (7.95M tokens). Compares two models: with and without non-Toki Pona tokens (named entities, loanwords). Finding: sparse tokens bring similar words closer; Word2Vec works even with an extremely reduced vocabulary, relying on distributional patterns rather than lexicon size.

Embeddings Papers Benchmarks

arXiv cs.AI·Jun 17

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

SkillMigrator is an LLM agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Induced skills are stored as transferable interaction patterns (TIPs). On WebArena and Mind2Web, SkillMigrator reduces average LLM-action count by 8-10% at matched success rate.

AI Agents Code generation Benchmarks

arXiv cs.AI·Jun 17

A homotopy-type-theoretic generalization of neurosymbolic inference

Theoretical paper generalizing neurosymbolic systems using homotopy type theory. The framework preserves symmetry and multiple-proof information, converting classical functionals into belief-weighted homotopy cardinalities. Validated on MNIST reasoning-shortcut benchmarks with better calibration than diversity-trained ensembles.

Reasoning Papers

arXiv cs.AI·Jun 17

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Curriculum-grounded automated marking pipeline using LLMs to assess exam responses. Grounds model outputs in official curriculum artefacts (syllabus, performance descriptors, marking guidelines). Delivers marking outcomes comparable to human tutors with improved traceability to authorised standards.

Evals Prompt engineering Reasoning

arXiv cs.LG·Jun 17

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

Researchers explain grokking (sudden generalization after prolonged overfitting) through first-order phase transitions driven by L2 regularization strength. SGD noise enables networks to escape trapped metastable states, with escape times following Arrhenius scaling. Results extend to nonlinear networks.

Papers Reasoning Evals
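
As a reminder of what Arrhenius scaling means in general (the paper's exact expression is not given in this summary), the mean escape time from a metastable state grows exponentially with the barrier height relative to the noise level. The symbols are assumptions for illustration: $\tau$ is the escape time, $\Delta E$ an effective barrier, and $T_{\mathrm{eff}}$ an effective temperature set by the SGD noise.

```latex
\tau \;\propto\; \exp\!\left(\frac{\Delta E}{T_{\mathrm{eff}}}\right)
```

Under this form, a small reduction in noise can lengthen the wait before escape considerably, which is consistent with generalization arriving long after the training loss has flattened.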

arXiv cs.LG·Jun 17

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

Sequential decision optimization framework for geosteering under geological uncertainty. Integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning. Compares three decision policies (Approximate Dynamic Programming, Deep Q-learning, and Dual DRL with dueling decomposition), validated on an industrial simulator with realistic noise and drilling constraints.

Reinforcement learning Reasoning Evals
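
For context, the dueling decomposition mentioned in this summary usually refers to splitting the action-value function into a state value and a mean-centered advantage. The form below is the commonly used one; whether the paper uses this exact variant is an assumption, since the summary does not say.

```latex
Q(s, a) \;=\; V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a') \right)
```

Here $V(s)$ is the value of the state, $A(s, a)$ the advantage of action $a$, and $\mathcal{A}$ the action set. Subtracting the mean advantage makes the split between $V$ and $A$ identifiable.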

arXiv cs.CL·Jun 17

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Two recent studies reach contradictory conclusions about LVLMs' ability to coordinate efficient referring expressions. This research controls for task differences and directly compares prompting styles. Models coordinate efficiently with explicit prompting but fail to infer communicative efficiency needs from implicit prompts.

Prompt engineering Vision Evals

arXiv cs.CL·Jun 17

LLMs Infer Cultural Context but Fail to Apply It When Responding

LLMs can infer cultural context but fail to apply it in responses. A new CAPRI dataset shows models recognize cultural conventions (measurement units, time interpretation) but don't spontaneously use them unless explicitly instructed. Biases remain aligned with the model's country of origin.

Benchmarks Alignment AI safety

arXiv cs.AI·Jun 17

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

LongWebBench is a benchmark evaluating long-horizon webpage generation by vision-language models. It contains 490 real-world pages for structural evaluation and 507 goal-oriented interaction tasks over 129 pages. Experiments show structural fidelity degrades with webpage length, and visually plausible generations often fail to support multi-step executable interactions.

Vision Benchmarks AI Agents

arXiv cs.AI·Jun 17

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

DivInit improves test-time scaling for agentic search by diversifying initial queries. Instead of sampling k independent queries in parallel, the method generates n candidates then selects k diverse seeds. Gains of 5-7 points on multi-hop QA at matched compute, validated across 5 open-weight models and 8 benchmarks.

AI Agents Reasoning Benchmarks
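
The "generate n candidates, then select k diverse seeds" step can be illustrated with a minimal sketch. The summary does not say how DivInit measures diversity, so the word-overlap (Jaccard) distance and the greedy farthest-point rule below are assumptions, and the function names and example queries are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| over the lowercase word sets of two queries."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)

def select_diverse(candidates, k):
    """Greedy farthest-point selection of k queries from n candidates.

    Starts from the first candidate, then repeatedly adds the candidate whose
    minimum distance to the already chosen queries is largest.
    """
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda c: min(jaccard_distance(c, s) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical candidate queries sampled for one question (n = 4, k = 2).
candidates = [
    "capital of france",
    "capital of france today",
    "population of tokyo",
    "capital city of france",
]
seeds = select_diverse(candidates, 2)
```

With these inputs the second seed is the Tokyo query, since the other two candidates are near-duplicates of the first. A real system would more likely compare embeddings than word sets.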

arXiv cs.LG·Jun 17

Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

Theoretical analysis of generalization guarantees for multi-input neural operators with error measured in Sobolev norms. Framework handles multiple input functions on different domains with varying dimensions and regularities. Approximation and generalization rates explicitly quantify each input space's contribution to the final error bound.

Papers Reasoning Benchmarks

arXiv cs.AI·Jun 17

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

DeepInsight is a unified evaluation infrastructure for Physical AI stacks, spanning three orders of magnitude from foundation-model decoding to full-body physics simulation. It uses three invariant abstractions (task, resource, result) to preserve regime heterogeneity while enabling cross-layer regression diagnostics impossible with federated per-segment harnesses.

Reasoning Evals Robotics

arXiv cs.LG·Jun 17

Informative Missingness to Generate Irregular Clinical Time Series

Diffusion-based approach to generating irregular clinical time series by jointly modeling laboratory values and observation patterns. Uses the DACMI benchmark from MIMIC-III and extends the TimeDiff framework to capture dependencies between patient physiology and clinician testing behavior under MNAR-like missingness.

Papers Benchmarks Reinforcement learning

arXiv cs.AI·Jun 17

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

MemTrace is a benchmark evaluating long-term memory in LLM agents across three dimensions: memory age, question type (current state, earlier state, trajectory), and evidence conditions. Testing 13 configurations, the study finds that the primary bottleneck is using the evidence, not retrieving it (evidence is 10× more often retrievable than missing).

AI Agents Evals Benchmarks

Vercel AI Blog·Jun 17

Introducing eve

Vercel introduces eve, an open-source agent framework for building and deploying agents in production. eve provides built-in infrastructure (model management, fallbacks, logging); developers define only behavior through files (agent.ts, instructions.md, tools). Inspired by Next.js for the web, eve standardizes agent building as Next.js did for web applications.

AI Agents Open source Tools

arXiv cs.AI·Jun 17

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Brick-DICL introduces a two-stage dynamic in-context learning framework for automated Brick schema classification of BMS points (936 classes). Combines metadata-RAG and class-RAG to enhance LLM domain knowledge, with multi-LLM filtering to reduce manual verification effort.

RAG Prompt engineering Reasoning

arXiv cs.AI·Jun 17

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

arXiv paper proposing an architecture for distributed peer-to-peer autonomous agent networks. The authors identify three core mechanisms: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation (MG-EigenTrust), and mechanism design for open task execution. Prototypes and simulations are presented.

AI Agents Multi-agent Papers
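
The summary does not describe how MG-EigenTrust works, but its name points to the classic EigenTrust algorithm, which a minimal sketch can illustrate. Everything below is the basic single-topic version under assumed parameters: the damping factor `alpha`, the uniform pre-trust vector, and the toy trust matrix are illustrative, and the multi-topic extension is not modeled.

```python
def eigentrust(local_trust, alpha=0.15, iterations=50):
    """Basic EigenTrust power iteration.

    local_trust[i][j] is how much peer i trusts peer j (non-negative).
    Returns a global trust score per peer; the scores sum to 1.
    """
    n = len(local_trust)
    pretrust = [1.0 / n] * n

    # Row-normalize local trust; a peer with no opinions falls back to pre-trust.
    normalized = []
    for row in local_trust:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0 else pretrust[:])

    trust = pretrust[:]
    for _ in range(iterations):
        # t_next[j] = (1 - alpha) * sum_i C[i][j] * t[i] + alpha * p[j]
        trust = [
            (1 - alpha) * sum(normalized[i][j] * trust[i] for i in range(n))
            + alpha * pretrust[j]
            for j in range(n)
        ]
    return trust

# Hypothetical network of three agents: agents 1 and 2 both rate agent 0 highly.
ratings = [
    [0, 1, 1],
    [4, 0, 1],
    [4, 1, 0],
]
scores = eigentrust(ratings)
```

In this toy network agent 0 ends up with the highest global score, because both other agents place most of their trust in it.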

arXiv cs.AI·Jun 17

StepGuard: Guarding Web Navigation via Single-Step Calibration

StepGuard improves web navigation for AI agents via Dynamic Dual-Policy Optimization (DDPO) to handle reward conflicts and Confidence-Guided Adaptive Navigation Reflection (CANR) to calibrate per-step errors. The framework achieves state-of-the-art performance on standard web navigation benchmarks.

AI Agents Reinforcement learning Vision

arXiv cs.AI·Jun 17

How Inference Compute Shapes Frontier LLM Evaluation

Study of how inference compute affects the evaluation of 12 frontier models across seven benchmarks. Three interventions are tested: larger token budgets, context compaction, and repeated submission attempts. Increased budgets substantially improve performance on FrontierMath, Humanity's Last Exam, and TerminalBench, and fixed-budget evaluations increasingly understate newer models' capabilities.

Benchmarks Evals Reasoning

Simon Willison·Jun 17

<click-to-play> — a still that plays

A <click-to-play> Web Component that turns a static image into a click target that loads the GIF on demand. Improves performance by preventing large files from loading automatically.

Tools Code generation

Le Big Data·Jun 17

Snap's AR glasses are here… but who will really dare to wear them?

Snap launches its consumer AR glasses. The article questions actual product adoption amid competition and social acceptance challenges for users.

Vision

Reddit r/LocalLLaMA·Jun 17

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

Analysis of low narrative diversity in LLM-generated stories. The author examines why models produce repetitive tales with similar characters and structures despite varied prompts.

Llama Prompt engineering Evals

Hacker News (AI)·Jun 17

Leaked OpenAI financials show $38.5B loss and compute burn

Leaked OpenAI financial documents show a $38.5B loss and significant compute burn. The figures raise questions about the economic viability of large-scale model training.

OpenAI Business

Reddit r/LocalLLaMA·Jun 17

Benchmarks from the latest eBay special: W6800 (modded V620)

Benchmarks of a modded AMD Radeon Pro W6800 (a V620 with W6800 firmware) tested with Qwen 3.6 27B Q6_K on llama.cpp. Vulkan performance: 297.94 t/s (pp1024), 20.35 t/s (tg256). The firmware enables the Mini DisplayPort output but disables some compute cores.

Benchmarks Open source Infrastructure

Hacker News (AI)·Jun 17

France to ditch Palantir's AI data tools in favour of domestic provider

France is dropping Palantir's AI data tools in favor of a domestic provider, a political decision meant to assert technological sovereignty over US solutions.

Regulation Business

Vercel AI Blog·Jun 17

Introducing eve, an open-source agent framework

Vercel releases eve, an open-source framework for building and deploying AI agents. A minimal agent requires only two files (model + instructions). Add tools, skills, and channels by creating files. Deploy to production with vercel deploy, unchanged from local development.

AI Agents Open source Tools

OpenAI Blog·Jun 17

Introducing LifeSciBench

OpenAI introduces LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating AI systems on real-world life science research tasks and decisions.

Benchmarks OpenAI Evals

Hugging Face Blog·Jun 17

Agentic Resource Discovery: Let agents search

Hugging Face introduces agentic resource discovery, enabling AI agents to autonomously search and access models, datasets, and tools available on the platform. This capability enhances agent autonomy in executing complex tasks.

AI Agents Tools Open source

Vercel AI Blog·Jun 17

CLI deployment limits removed

Vercel removes CLI-specific deployment limits, enabling faster deployments from local machines and external CI/CD pipelines. Teams and AI agents can now deploy at the pace their workflows demand.

AI Agents Infrastructure Tools

Vercel AI Blog·Jun 17

Vercel Passport is now in Public Beta

Vercel Passport, an access control tool for deployments, enters public beta. Centralizes authentication via Okta, Auth0, or OIDC providers. Pricing: $100/project/month, unlimited external users.

Tools Infrastructure

Reddit r/LocalLLaMA·Jun 16

VibeThinker-3B: what is this witchcraft? Killing it at MathQA like it has ~30B parameters

VibeThinker-3B, a 3B model, achieves exceptional MathQA results comparable to ~30B models. Reddit users report abnormally high performance for its size.

Benchmarks Open source

Vercel AI Blog·Jun 16

Vercel for Enterprise Apps and Agents

Vercel launches its Enterprise Apps and Agents platform for safely deploying internal AI agents. Vercel Passport authenticates access via identity providers (Okta, Entra, Auth0), while a credential management solution consolidates OAuth, OIDC, and secret injection.

AI Agents Infrastructure AI safety

Reddit r/LocalLLaMA·Jun 16

I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..

User compiles llama.cpp with CUDA and Vulkan enabled simultaneously on a W7800 and reports a 10% tokens/sec improvement in decoding with MiniMax-M3-UD-IQ2_M. The post tests a dual-GPU accelerator combination for performance optimization.

Open source Infrastructure

Simon Willison·Jun 16

datasette 1.0a34

Datasette 1.0a34 adds tools to insert, edit and delete rows directly in the web interface. These long-overdue features are available on table and row pages, inspired by Datasette Agent which now supports SQL write operations.

Tools Open source
