June 2026

2731 articles

Quick thoughts on GLM-5.2 (Bonus: Censorship question answers)

GLM-5.2 shows excellent coherence over extremely long context and adaptive reasoning without excessive verbosity. A user reports performance close to GPT-4.5 on heavy analysis and deep research, with faster inference than GLM-5.1. The model has its own distinct conversational signature.

Qwen Reasoning Open source

Vercel AI Blog·Jun 18

The Agent Stack

Vercel introduces 'The Agent Stack', a complete framework for building production-grade AI agents. It combines the AI SDK (a unified multi-model interface) with AI Gateway (centralized routing and billing), so developers can call Claude, GPT, and other models without vendor lock-in.

AI Agents Claude GPT

Latent Space·Jun 18

[AINews] Midjourney Medical: scan your organs like you step on a scale

Midjourney announces its second product: a medical application enabling organ scanning via smartphone without specialized medical equipment. The AI model analyzes captured images to provide preliminary diagnostics.

Image generation Vision Business

Le Big Data·Jun 18

ChatGPT brings order to your scheduled tasks with this new interface

OpenAI rolls out a new interface for ChatGPT scheduled tasks, improving discovery and organization of user reminders.

GPT Tools

Le Big Data·Jun 18

Noam Shazeer: the brain behind Gemini leaves Google for OpenAI

Noam Shazeer, key researcher in Gemini's development at Google, is leaving the company to join OpenAI. This departure marks a significant shift in competition between the two AI giants.

Gemini OpenAI Business

arXiv cs.CL·Jun 18

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Study on evaluating AI-generated radiology reports. Researchers show existing LLMs over-penalize harmless rephrasings while detecting clinical errors. They train lightweight metrics on Qwen3-8B and MedGemma-4B outperforming 32B medical models, with dataset and metric release planned.

Benchmarks Evals Papers

arXiv cs.AI·Jun 18

Skill-Guided Continuation Distillation for GUI Agents

SGCD, an iterative self-improvement framework, addresses off-trajectory states in GUI agents. The system first runs a plain policy, then uses a skill-guided policy to generate successful continuations. On OSWorld-Verified, SGCD improves success rates of three base models from ~30% to over 50%.

AI Agents Reinforcement learning Papers

arXiv cs.AI·Jun 18

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Decoupled Search Grounding (DSG) separates search from reasoning via an MCP-compatible gateway. On SimpleQA, FreshQA, and HotpotQA, DSG achieves 86.1% accuracy (vs 87.7% native) with 91% lower search cost and 68% lower latency. In a production e-commerce workload, DSG cuts search cost by 98% while maintaining accuracy.

AI Agents MCP RAG

arXiv cs.LG·Jun 18

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Mathematical link established between shock-wave theory and symmetry-quotiented stochastic gradient descent dynamics for neural networks. After quotienting parameter symmetries and entropy coarse-graining, effective dynamics satisfy a viscous Hamilton-Jacobi equation. Applied to MLPs, CNNs, Transformers, and mean-field networks.

Papers Reasoning Reinforcement learning

arXiv cs.CL·Jun 18

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

ImpSH, a triplet-based framework, improves implicit hate speech detection by aligning posts with implied statements and using context-bounded semi-hard negatives. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, it enhances cross-domain performance and provides more stable representations than standard supervised contrastive approaches.

Benchmarks AI safety Papers

arXiv cs.LG·Jun 18

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

Causal framework for sleep recovery scoring from multimodal polysomnography. Uses DAG learning on two cohorts (MESA n=1540, MrOS n=825) to identify five physiological domains (respiratory burden, hypoxia, fragmentation, architecture, autonomic regulation). Sleep Recovery Score (SRS) achieves 2.5× stronger alignment with perceived recovery than standard AHI.

Papers Reasoning Evals

arXiv cs.AI·Jun 18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Xcientist is a research harness that externalizes research synthesis and experimental validation for AI scientists into inspectable, contract-governed processes. It organizes literature evidence, idea states, implementation plans, and repair traces as persistent research artifacts, eliminating claim drift where runnable artifacts no longer support the originally claimed mechanism.

AI Agents Reasoning Evals

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced with R-Score, a character-level metric, it shows 35 models (NER, SLM, frontier models) fail on contextual redactions: human consensus 89.4% for mandatory redactions, 47.7% for contextual ones.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Memorization suppression method in LLMs via output vector editing of MLP neurons. Tested on 4 models (360M-7B parameters), achieves 87.9% suppression on OLMo-7B with 6831 memorized sequences. Complementary approach to existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.LG·Jun 18

Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance

RGNet, a neural network architecture based on the renormalization group, addresses class imbalance and multidimensional noise for fault diagnosis. The model hierarchically compresses feature space and captures both local details and global patterns. Tested on imbalanced AI4I dataset.

Papers Evals Benchmarks

arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 18

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

A POMDP framework optimizes lithium production decisions by incorporating geological, pricing, and demand uncertainties. POMDP solvers outperform human-inspired heuristics by dynamically adapting to price regimes (static, linear, exponential, stochastic) and optimally sequencing exploration, production, and technology choices.

Reasoning Reinforcement learning

arXiv cs.LG·Jun 18

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero uses LLM agents with tree search to discover adaptive RL training strategies. The system identifies that capacity parameters accumulate monotonically while regularization parameters oscillate. Across 4 GRPO tasks, discovered strategies outperform the base model by 9-140% and grid search by 6-15%.

Reinforcement learning AI Agents Reasoning

arXiv cs.AI·Jun 18

Towards an Agent-First Web: Redesigning the Web for AI Agents

Paper proposing web redesign to integrate AI agents as first-class citizens across three layers: access (HTTP headers, dual human/agent content), economics (token-based model, intent-based tiers), content (ATML, cryptographic provenance chain against epistemic recursion). Ten design principles for an agent-first internet.

AI Agents Infrastructure Regulation

arXiv cs.AI·Jun 18

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

NeSyCat Torch unifies neurosymbolic semantics (classical, fuzzy, probabilistic, neural) under a single truth definition parametrized by monads. Implemented in PyTorch, JAX, and HaskTorch, the framework interprets computational symbols via neural networks. On MNIST addition, outperforms LTN and DeepProbLog in speed and accuracy.

Reasoning Reinforcement learning Papers

arXiv cs.CL·Jun 18

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv paper on improving long-context reasoning via data-centric approach rather than reward engineering. Data recipe targeting retrieval, multi-evidence synthesis, reasoning (~14K examples). Tests on Qwen3 (4B/8B/30B): +7.2/+3.2/+6.4 points across 7 long-context benchmarks, transfer to agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

Reinforcement learning Reasoning AI Agents

arXiv cs.CL·Jun 18

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Study evaluating 42 LLMs (proprietary and open-source) on their ability to measure item discrimination in reading comprehension. Models fail: Spearman correlation of 0.152 in direct prediction, 0.241 in CTT calibration. LLMs do not reliably capture how assessment items distinguish students of different proficiency levels.

Benchmarks Evals Papers

arXiv cs.LG·Jun 18

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

Reject inference methods used in credit scoring to correct survival bias mask a structural failure: accuracy can improve while the ability to correctly reject defaulters collapses. Authors propose a controlled exploration strategy (approving 2-5% of rejected applicants) to diagnose this deterioration without strong statistical assumptions.

Benchmarks AI safety Evals

arXiv cs.LG·Jun 18

Task-Restricted Symmetries in Recurrent Weight Space

Study of functional redundancy in single-layer tanh RNNs using ordered real Schur coordinates. Authors identify nonnormal couplings removable with minimal loss on specific tasks (copy, flip-flop, sine generation), revealing task-dependent approximate functional invariances rather than universal weight-space symmetries.

Papers Reasoning

arXiv cs.AI·Jun 18

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT is a modular agentic-RAG framework reducing VLM hallucinations through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier). Ungrounded claims trigger targeted re-retrieval. 23 component-wise metrics and CaVeScore measure citation faithfulness and cross-modal grounding. Results: 87.1% accuracy on ScienceQA, 55.2% on MMMU.

Vision RAG AI Agents

arXiv cs.CL·Jun 18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Local de-identification framework for educational dialogues. Two-stage cascade: union proposer (lightweight encoders + deterministic rules) generates PII candidates, then binary Redact/Keep reviewer uses dialogue context and speaker role. Achieves 0.958 macro F1 on math tutoring transcripts, outperforms commercial API (0.706) and local LLM baseline (0.767), runs on single laptop.

RAG AI safety Papers

arXiv cs.CL·Jun 18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Study of collateral damage in LLM machine unlearning. Authors show damage propagates beyond the forget set following semantic distance gradients, and propose PreUnlearn, a pre-unlearning prediction method to audit risks before execution.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Dual Dimensionality for Local and Global Attention

Researchers propose Distance-Adaptive Representation (DAR): reduce key/value dimensionality beyond a local window in decoder-only Transformers. Nearby tokens require full representations for next-token prediction, while distant tokens can use 1/4 original dimensionality without performance loss. Tested on 70M–410M models and 1B fine-tuning.

Reasoning Infrastructure Benchmarks
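
To get a feel for what a quarter-dimension representation for distant tokens could save, here is a back-of-the-envelope sketch. It only counts stored floats for one attention head; the function name, the window size of 512, and the head dimension of 128 are illustrative assumptions, not values from the paper.

```python
def kv_cache_entries(seq_len, local_window, head_dim, far_ratio=0.25):
    """Count stored key (or value) floats for one head.

    Tokens inside the local window keep the full head_dim; tokens beyond it
    keep head_dim * far_ratio dimensions (1/4 by default, as in the summary).
    """
    if not 0 < far_ratio <= 1:
        raise ValueError("far_ratio must be in (0, 1]")
    near_tokens = min(seq_len, local_window)
    far_tokens = seq_len - near_tokens
    far_dim = int(head_dim * far_ratio)
    return near_tokens * head_dim + far_tokens * far_dim


# Hypothetical setting: 4096-token context, 512-token local window, head_dim 128.
full = kv_cache_entries(4096, local_window=4096, head_dim=128)
reduced = kv_cache_entries(4096, local_window=512, head_dim=128)
print(full, reduced, reduced / full)
```

Under these assumed numbers the reduced cache holds roughly a third of the floats of the full one; the saving grows as the context gets longer relative to the window.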

arXiv cs.CL·Jun 18

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL is an optimization framework for information extraction using particle filtering and Bayesian updates to systematically refine label representations. It generalizes across sequence labeling and relation classification tasks, demonstrating consistent improvements over existing approaches across model scales.

Prompt engineering Reasoning Evals

arXiv cs.CL·Jun 18

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus is a morphology-aware neural tokenizer for agglutinative Turkish. The model uses differentiable Poisson-binomial dynamic programming to segment morphemes with 1.425 bits-per-character compression and MorphScore macro-F1 of 0.61 (vs ~0.32 for subword tokenizers). Lossless by construction: decode(encode(w)) = w.

Embeddings Papers Open source

arXiv cs.LG·Jun 18

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Artemis is a causal framework for graph neural networks addressing demographic confounders (age, sex) in multimodal brain imaging (fMRI + DTI). The method applies causal interventions at each brain region independently to learn invariant representations. Tested on ADNI, OASIS, and HCP benchmarks, it improves disease diagnosis and classification tasks.

Papers Reasoning Alignment

arXiv cs.LG·Jun 18

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

SWave is a complex-valued recurrent language model (169M parameters) trained on FineWeb-Edu. The paper documents its evolution across three phases, identifying structural failures (cos-domination collapse) and validating critical components (ComplexNorm, Wave Propagation Scan). Final PPL: 22.0 at step 89,861.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 18

SCOPE-FL: A Strategy-proof Chain-based Optimal pareto efficient Federated Learning System

SCOPE-FL introduces a hierarchical Federated Learning system using the Top Trading Cycle algorithm for client selection. The mechanism guarantees Pareto efficiency and strategy-proofness, with reward distribution via Shapley value approximation and blockchain execution. Evaluation on MNIST, Fashion-MNIST, CIFAR-10 shows improvement over DA, IAS.
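
For readers unfamiliar with Top Trading Cycle, the following is a minimal sketch of the textbook algorithm with generic agents and items. How SCOPE-FL maps clients and servers onto agents and items is not stated in the summary, so the names and the example below are purely hypothetical.

```python
def top_trading_cycle(preferences, endowment):
    """Textbook Top Trading Cycle.

    preferences: agent -> list of ALL items, most preferred first (strict).
    endowment:   agent -> the single item that agent initially owns.
    Returns agent -> item after trading.
    """
    owner = {item: agent for agent, item in endowment.items()}
    remaining = set(endowment)
    allocation = {}
    while remaining:
        # Each remaining agent points at the owner of its favourite remaining item.
        points_to = {}
        for agent in remaining:
            favourite = next(i for i in preferences[agent] if owner[i] in remaining)
            points_to[agent] = (owner[favourite], favourite)
        # Follow pointers from one agent until some agent repeats: that closes a cycle.
        path = []
        current = sorted(remaining)[0]
        while current not in path:
            path.append(current)
            current = points_to[current][0]
        cycle = path[path.index(current):]
        # Everyone in the cycle receives the item they pointed at and leaves the market.
        for agent in cycle:
            allocation[agent] = points_to[agent][1]
        remaining -= set(cycle)
    return allocation


# Hypothetical example: A and B each prefer the other's item, C keeps its own.
prefs = {"A": ["b", "a", "c"], "B": ["a", "b", "c"], "C": ["a", "b", "c"]}
owned = {"A": "a", "B": "b", "C": "c"}
print(top_trading_cycle(prefs, owned))
```

With strict preferences the resulting allocation does not depend on which cycle is removed first, which is part of why the mechanism is strategy-proof.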

arXiv cs.LG·Jun 18

Enhanced Graph Neural Networks using K-Hop Gaussian Diffusion

New K-Hop Gaussian (KHG) diffusion method to enhance GNNs. KHG preprocesses graph data with multi-hop diffusion weighted by Gaussian, balancing local and global propagation. Outperforms standard message-passing, PPR, and Heat Kernel on benchmarks, especially on noisy graphs.

Benchmarks

arXiv cs.LG·Jun 18

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Study on grokking (delayed transition from memorization to generalization). Authors show weight norm doesn't directly control grokking delay but acts through logit scale. Fixing norm and varying output temperature, they recover 85% of delay by matching logit scale. Effect is loss-dependent (cross-entropy vs MSE). Logit scale and softmax saturation are the proximal variables.

Papers Reasoning Evals
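
A small sketch of why weight norm and output temperature can be traded against each other. It assumes a single linear output layer, where multiplying the weights by a factor multiplies the logits by the same factor; the function names and numbers are illustrative and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, features, norm_scale=1.0):
    """Logits of a linear layer whose weights are multiplied by norm_scale."""
    return [norm_scale * sum(w * x for w, x in zip(row, features)) for row in weights]


weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = [2.0, 1.0]

base = softmax(linear_logits(weights, features))
# A 4x larger weight norm saturates the softmax (the top probability grows)...
large_norm = softmax(linear_logits(weights, features, norm_scale=4.0))
# ...but a matching temperature of 4 restores the original logit scale.
matched = softmax(linear_logits(weights, features, norm_scale=4.0), temperature=4.0)
print(max(base), max(large_norm), max(matched))
```

In this toy setting the cross-entropy loss only sees the ratio of norm scale to temperature, which is the sense in which logit scale, rather than weight norm itself, is the variable closest to the loss.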

arXiv cs.LG·Jun 18

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

QAQL framework couples quantum annealing with Q-learning for remaining useful life (RUL) prediction in predictive maintenance. Each Q-value update encoded as QUBO solved on D-Wave Advantage system. Validated on NASA C-MAPSS and fleet maintenance datasets: statistically significant improvements over classical and quantum baselines.

Reinforcement learning Benchmarks Papers

arXiv cs.LG·Jun 18

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

PSyGenTAB is a privacy-preserving framework for synthetic clinical tabular data generation formulated as constrained optimization solved via Augmented Lagrangian Method. It embeds configurable privacy constraints into training to preserve inter-feature clinical relationships and minority-class patterns while maintaining data utility for medical AI applications.

Benchmarks

arXiv cs.AI·Jun 18

Searching for Synergy in Shared Workspace Human-AI Collaboration

Study of human-AI team collaboration in shared workspace using Collaborative Gym and DiscoveryBench. Adding collaborators improves performance only with coordination structure. Scaffolding combining shared memory and human-in-the-loop gates increases performance, especially in three-person teams, by clarifying responsibilities and routing expertise.

AI Agents Multi-agent Evals

arXiv cs.AI·Jun 18

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a benchmark to evaluate strategic reasoning in Vision-Language Models (VLMs) using real-time strategy games. Built on Beyond All Reason, it offers multi-scenario evaluations, diagnostic mini-games targeting specific competencies, and a self-evolving generation framework. Current state-of-the-art VLMs fail at multi-agent coordination and complex task scaling.

Vision Reasoning Multi-agent

arXiv cs.AI·Jun 18

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines is a long-horizon embodied agent benchmark testing memory in dynamic household environments. The dataset includes temporally extended traces with dialogues, actions, and object/device state changes. ObsMem, an observer-grounded memory framework, maintains visibility-aware memories and action-native state trails for state-informed decisions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning

arXiv cs.AI·Jun 18

Analysing drivers and interdependencies in European electricity markets using XAI

Study combining deep neural networks with XAI (SHAP, SSHAP) to analyse 39 European electricity bidding zones. Identifies solar energy as disproportionate price driver, gas prices as dominant factor, and interconnections revealing interdependence of electricity markets.

Evals Papers

arXiv cs.CL·Jun 18

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Paper presents speech-driven approach for Chinese dialect discrimination. Combines MFCC features, HMM-DNN speech recognition model, attention mechanism and CNN. Evaluation on two benchmark Chinese dialect corpora shows improvement over state-of-the-art methods.

Voice Benchmarks Papers

arXiv cs.CL·Jun 18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

DICE improves long-document retrieval by splitting documents into chunks, encoding each independently, then aggregating vectors into a single representation. On LongEmbed, gains reach 90.0 for Dream Passkey >4k (vs 30.0) and 74.0 for Needle >4k (vs 23.3). The approach reduces Evidence Dilution Index (EDI) in 92.8% of cases.

RAG Embeddings Vector search
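
The chunk-then-aggregate idea can be sketched in a few lines. The summary does not say how DICE encodes or aggregates chunks, so everything here is an assumption for illustration: `toy_encode` is a hashed bag-of-words stand-in for a real embedding model, and mean pooling stands in for whatever aggregation the paper actually uses.

```python
import hashlib
import math


def toy_encode(text, dim=16):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def split_into_chunks(text, chunk_words=8):
    """Split a document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def embed_long_document(text, chunk_words=8, dim=16):
    """Encode each chunk independently, then mean-pool into one unit vector."""
    vectors = [toy_encode(chunk, dim) for chunk in split_into_chunks(text, chunk_words)]
    if not vectors:
        return [0.0] * dim
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean


doc = "the passkey is hidden deep inside a very long report " * 5
print(len(embed_long_document(doc)))
```

Because each chunk is normalized before pooling, a short passage carrying the key evidence contributes as much to the final vector as any other chunk, instead of being diluted by the length of the full document.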

arXiv cs.CL·Jun 18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework improving LLM pragmatic reasoning through counterfactual reasoning traces. Without human-labeled data, it combines supervised fine-tuning and reinforcement learning. On 4 benchmarks (PragMega, Ludwig, MetoQA, AltPrag), it gains +5.37% and +5.50% absolute for Qwen3-8B and Qwen3-14B.

Reasoning Reinforcement learning Fine-tuning

arXiv cs.CL·Jun 18

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by leveraging full loss trajectories from proxy runs, not just endpoint losses, to predict optimal data mixtures at multiple training stages. Tested on 25B tokens of Pile with a 1B model, RegMix-D outperforms RegMix and DoReMi across 13 downstream tasks while using 75% less proxy compute.

Benchmarks Papers

arXiv cs.CL·Jun 18

Efficient Financial Language Understanding via Distillation with Synthetic Data

Distillation framework with synthetic data for financial sentiment analysis. Knowledge transfer from large instruction-tuned teacher to compact student models. Clustering-based seed selection generates synthetic examples via few-shot prompting. Compact model outperforms teacher on complex/noisy text with minimal supervision.

Fine-tuning RAG Prompt engineering

arXiv cs.LG·Jun 18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

ThousandWorlds is an ML benchmark for climate emulation of potentially habitable exoplanets. The dataset contains ~1800 simulations from 5 global climate models mapping 8 planetary parameters to 3D atmospheric fields. Three nested subsets and two evaluation protocols test 7 baselines; GP-based methods outperform standard deep learning.

Benchmarks Papers Reasoning

arXiv cs.LG·Jun 18

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Evaluation protocol for single-image-to-3D mesh quality using VLM judges (vision-language models). Authors demonstrate that cheap proxies (CLIP similarity, geometry validity stats) fail to correlate with perceived quality. Their VLM-judge protocol with position-bias correction achieves Cohen's kappa = 0.66 between two independent judge families.

Vision Evals Benchmarks
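
For context on the agreement figure, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard formula; the judge labels are made-up data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labelled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two judge families on ten meshes.
judge_a = ["good", "good", "bad", "bad", "good", "bad", "good", "good", "bad", "bad"]
judge_b = ["good", "good", "bad", "good", "good", "bad", "good", "bad", "bad", "bad"]
print(cohens_kappa(judge_a, judge_b))
```

Here the judges agree on 8 of 10 items while chance agreement is 0.5, giving a kappa of about 0.6, in the same range as the 0.66 reported above.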

arXiv cs.LG·Jun 18

Measurement noise limits the advantage of nonlinear models over linear models in biomedical prediction

arXiv paper demonstrates that on biomedical tabular data, measurement noise limits the advantage of nonlinear models (deep networks, gradient boosting) over linear regression. Degree-k interactions are attenuated by the k-th power of feature reliability, while linear components are attenuated only once. Analysis of 140 UK Biobank tasks confirms this noise signature.

Benchmarks Evals
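
The attenuation claim is easy to see numerically. The sketch below assumes, for simplicity, that every feature in an interaction has the same reliability; the function name and the reliability value of 0.8 are illustrative choices, not figures from the paper.

```python
def attenuation(reliability, degree):
    """Fraction of signal surviving measurement noise for a degree-k term.

    Assumes all features involved share the same reliability in [0, 1].
    A linear term has degree 1; a k-way interaction has degree k.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return reliability ** degree


# Hypothetical reliability of 0.8 for every feature.
for k in (1, 2, 3):
    print(k, round(attenuation(0.8, k), 3))
```

With a reliability of 0.8, a linear effect keeps 80% of its strength while a three-way interaction keeps only about half, so the extra structure a nonlinear model could exploit is the part noise erodes fastest.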

arXiv cs.AI·Jun 18

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench evaluates agents' ability to handle complex long-horizon tasks by simulating a 500-day startup operation. The agent manages pricing, marketing, and budgeting through a Python interface. Only Claude Opus 4.8 and GPT-5.5 exceed the $1M starting balance, and neither is consistently profitable.

AI Agents Benchmarks Reasoning

arXiv cs.CL·Jun 18

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow improves speculative decoding by combining parallel drafting efficiency with branch-wise causal conditioning. On H100 GPUs, it achieves 9.64x speedup on MATH-500 and 4.58x on open-ended conversations, outperforming existing tree-based methods on dense and MoE Qwen3 models.

Benchmarks Code generation Open source

arXiv cs.LG·Jun 18

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

TMR-GGNN, a time-aware multi-relational graph neural network, detects credit card fraud by modeling heterogeneous interactions between customers, merchants, devices, and IPs. The model combines temporal relational attention, contrastive learning, and a composite loss function (InfoNCE + Focal Loss) to handle imbalanced data and reduce false negatives.

Reinforcement learning

arXiv cs.AI·Jun 18

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL bridges RoboCup 2D Soccer Simulator (RCSS2D) to Python MARL workflows via shared-memory communication. The environment supports full-field and scenario-based training with discrete/hybrid action spaces, action masks, EPV-based reward shaping, and parallel execution. Includes 11-vs-11 full-field benchmarks and baseline results.

Multi-agent Reinforcement learning Benchmarks

arXiv cs.CL·Jun 18

LLM Parameters for Math Across Languages: Shared or Separate?

Mechanistic analysis of mathematical reasoning in multilingual LLMs. Math-associated parameters exhibit partial cross-lingual overlap, concentrated in intermediate layers. English produces the largest set of math-relevant parameters, while lower-resource languages reveal smaller parameter sets.

Reasoning Papers Benchmarks

arXiv cs.CL·Jun 18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 3.0, the reference tool since 2016 for forced speech-to-text alignment, achieves state-of-the-art performance on English, Japanese, and Korean with boundary errors <15ms. New capabilities: model adaptation, cross-language phone remapping, expanded language/dialect coverage, harmonized IPA dictionaries.

Voice Benchmarks Open source

arXiv cs.CL·Jun 18

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Framework for customization and efficient deployment of LLM-based multi-agent systems in enterprise settings. Combines continual pretraining, supervised fine-tuning, and preference optimization to adapt compact models to specialized domains. Integrates speculative decoding and FP8 quantization to reduce latency and costs. Achieves 4.48x throughput speedup while maintaining performance.

Multi-agent Fine-tuning Business

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

arXiv cs.CL·Jun 18

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

CDDTLDA framework for Chinese dialect discrimination with scarce annotation resources. Uses transfer learning on ASR models, data augmentation (speed, pitch, noise), and self-attention to capture shared semantic features. Outperforms state-of-the-art on two benchmark corpora.

Voice Benchmarks

arXiv cs.CL·Jun 18

Steerable Cultural Preference Optimization of Reward Models

Novel SCPO algorithm for training reward models that balance diverse cultural preferences across subcommunities. Achieves 7-point improvements for minority reward models on PRISM and GlobalOpinionQA (7 countries), with 280% better training data efficiency than full-finetuning.

Alignment Reinforcement learning Evals

arXiv cs.CL·Jun 18

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

PEC-Home is a simulated home dataset for interpreting progressively elliptical commands in smart homes, where commands become more abbreviated as shared context accumulates. Current assistants (including GPT-4o) fail to execute these commands accurately, even when equipped with dialogue history retrieval.

AI Agents Benchmarks RAG

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed lawyer qualification threshold (11%) but fall short for judges/prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

LM-guided counterfactual recommendation pipeline to improve medical communication in text-based telemedicine. System identifies interpretable features (tone, personalization, clarity, completeness) and recommends minimal communication changes predicted to increase positive feedback (+6.41% mean gain). Modifications preserve medical content and physician control.

Reasoning Evals RAG

arXiv cs.CL·Jun 18

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs poorly preserve uncertainty expressions (less than 50% of cases) and struggle with nuanced distinctions between adjacent levels. Reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.LG·Jun 18

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA is an air traffic control training simulator automating pilot roles through speech recognition, instruction interpretation, and response generation. The system reduces Word Error Rate from 107.80% to 23.45% on Singaporean-accented aviation speech, and evaluates trainee radiotelephony communications achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores.

Voice Fine-tuning Evals

arXiv cs.CL·Jun 18

Approximate Structured Diffusion for Sequence Labelling

New approach combining diffusion and CRF for sequence labelling in NLP. Method conditions a CRF on the full label sequence (noisy), bypassing span limitations of standard CRFs. Results: 16.5% error reduction on POS-tagging.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Sparse Autoencoders (SAEs) decompose activations into interpretable features, but this study shows that clamping a 'harmful' feature does not eliminate the behavior—it can recover via other residual pathways. Even with active intervention, 95.8% behavior recovery is achievable in refusal-steering, exposing a gap between feature-level control and behavioral completeness.

AI safety Alignment Evals

arXiv cs.LG·Jun 18

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

PROPEL is a framework training task generators via RL to create optimally difficult problems for agent learning. A lightweight probe predicts solver pass rate without repeated rollouts, reducing evaluation to a single forward pass. On code and SWE tasks, learnable-frontier generation increases from 10.1% to 20% (Qwen2.5-3B) and 9.8% to 19.6% (Qwen3.5-27B).

Reinforcement learning AI Agents Code generation

arXiv cs.LG·Jun 18

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Gaussian Mixture Attention (GMA) replaces standard attention with probabilistic routing through K learned Gaussian mixture components. Queries and keys map to responsibility vectors in a shared latent space. GMA avoids explicit N×N matrix materialization, reducing memory complexity to O(NK) instead of O(N²). Competitive on long-context classification, but behind SDPA and Mamba on WikiText-103.

Reasoning Benchmarks Papers
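
A rough sketch of how routing through K components can avoid an N×N matrix. The summary does not give GMA's exact formulation, so this is an assumption-heavy illustration: responsibilities are a softmax over negative squared distances to hypothetical component means (unit variance), and each component summarizes the values with a responsibility-weighted average.

```python
import math


def responsibilities(points, means):
    """Per-point softmax over negative squared distances to K means (unit variance assumed)."""
    rows = []
    for p in points:
        scores = [-0.5 * sum((x - m) ** 2 for x, m in zip(p, mean)) for mean in means]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def gaussian_mixture_mix(queries, keys, values, means):
    """Mix values via K components; the largest intermediate is N x K, never N x N."""
    n, k, d = len(keys), len(means), len(values[0])
    r_q = responsibilities(queries, means)  # N x K
    r_k = responsibilities(keys, means)     # N x K
    # K x d table: each component's responsibility-weighted average of the values.
    summaries = []
    for j in range(k):
        mass = sum(r_k[i][j] for i in range(n))
        summaries.append([sum(r_k[i][j] * values[i][c] for i in range(n)) / mass
                          for c in range(d)])
    # Each query reads the K summaries, weighted by its own responsibilities.
    return [[sum(r_q[i][j] * summaries[j][c] for j in range(k)) for c in range(d)]
            for i in range(len(queries))]


# Hypothetical toy inputs: 3 tokens, 2 components, 2-dimensional values.
means = [[0.0, 0.0], [1.0, 1.0]]
queries = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
keys = [[0.0, 0.2], [1.0, 0.8], [0.4, 0.6]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(gaussian_mixture_mix(queries, keys, values, means))
```

This toy version is non-causal and written for readability; the point is only that queries and keys never meet directly, so cost scales with N·K rather than N².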

arXiv cs.CL·Jun 18

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a stochastic prompt optimization framework using multi-agent guided exploration. Compares three strategies: error-informed random search, genetic algorithm, and SAGE with diagnostic code execution. Deployed on mental-health chatbot: 8 cycles of noisy A/B tests compound into statistically robust next-day retention gain.

Prompt engineering AI Agents Multi-agent

SIG

HYP

arXiv cs.LG·Jun 18

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

Survey of AI-based models for soil moisture estimation and classification. Five categories compared: statistical time-series, geostatistical methods, classical ML, deep learning, and Bayesian approaches. Data-driven methods provide flexible alternatives to computationally expensive physics-based models.

Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 18

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

A new LLM inference scheduler that replaces explicit length prediction with lightweight statistical signals and dynamic priority boosting. It reduces P99 TTLT by 35-50% vs SRPT with perfect length knowledge, and TTFT by 34-47% across production and open-source traces.

Benchmarks Infrastructure Reasoning

SIG

HYP

arXiv cs.LG·Jun 18

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Ghost Attractor Networks introduce an efficient dynamical decoder for sequential generation in robotics. With 2.3M parameters, it matches the offline accuracy of a 1.07B-parameter Diffusion Transformer (462× fewer parameters, 32× lower latency). On LIBERO-10, phase conditioning improves success rate by 13.5 percentage points over an MLP baseline.

Code generation Robotics Reasoning

SIG

HYP

arXiv cs.LG·Jun 18

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

A neuroscience-inspired RL framework that disentangles dynamics-specific and reward-specific features using locally linear embeddings (LLE) and adaptively fuses the representations via an attention mechanism. It improves learning efficiency on benchmark tasks compared to conventional RL approaches.

Reinforcement learning Reasoning Benchmarks

SIG

HYP

arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety

SIG

HYP

arXiv cs.LG·Jun 18

Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds

Introduces Fisher width, a geometric complexity measure that serves as a Fisher-geometric analogue of Gaussian width on statistical manifolds. It replaces Euclidean geometry with the Fisher information metric to capture local statistical curvature. The paper develops foundational theory with generalization bounds and computable estimators, validated on MNIST.

Papers Benchmarks Evals

SIG

HYP

arXiv cs.LG·Jun 18

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

A structural pruning framework for Mixture-of-Experts models that operates at the channel level rather than the expert level. The attribution-based method reformulates pruning as channel-score coverage maximization. Experiments on DeepSeek and Qwen models achieve 50% structured pruning with 4-bit quantization, with a 5.27× memory reduction on Qwen3-30B-A3B.

DeepSeek Qwen Benchmarks

SIG

HYP

arXiv cs.LG·Jun 18

P²CE: Model-Agnostic Plausible Pareto-Optimal Counterfactual Explanations

P²CE generates plausible Pareto-optimal counterfactual explanations for ML models. The algorithm uses isolation forests and SHAP values to balance feasibility, plausibility, and computational efficiency. Evaluated on 3 datasets, it outperforms existing methods in solution quality and speed.

Evals

SIG

HYP

arXiv cs.LG·Jun 18

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE is a post-hoc method to improve selective unlearning in LLMs. It corrects final update vectors by suppressing the components that damage retention, without rerunning the original unlearning pipeline. Tested across multiple methods and scales, SAGE reduces the forget-retain trade-off.

Alignment Papers

SIG

HYP

arXiv cs.AI·Jun 18

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench is a safety evaluation benchmark for LLMs in AI4Science workflows. It covers 7 disciplines, 31 sub-disciplines, and 10 risk dimensions. The authors evaluate mainstream and science-oriented LLMs to diagnose safety gaps across risk categories.

Benchmarks AI safety Evals

SIG

HYP

arXiv cs.AI·Jun 18

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM is an agentic LLM pipeline deployed at DiDi to extract semantic user profiles from massive behavioral logs. The system uses 27 analytical tools to mine platform-scale data and generates utility-aligned profiles, achieving +6.14% AUC improvement and +0.47% GMV gain in A/B testing.

AI Agents Llama RAG

SIG

HYP

arXiv cs.AI·Jun 18

What Must Generalist Agents Remember?

Theoretical paper on memory requirements for generalist agents. Proves that agents performing near-optimally across multiple domains must maintain distinct memory distributions at observational bottlenecks. Memory enables domain disambiguation, transition-model reconstruction, and planning.

AI Agents Reasoning Papers

SIG

HYP

arXiv cs.AI·Jun 18

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

BeliefDiffusion combines diffusion models and Model Predictive Control for navigation in partially observable environments. The framework generates multimodal belief distributions and plans efficient navigation strategies. In experiments on synthetic maps, it outperforms RL and other generative approaches in success rate and path efficiency.

Reasoning Reinforcement learning Papers

SIG

HYP

arXiv cs.AI·Jun 18

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. Models receive a structured game state and predict hidden future states; the benchmark continues the simulation to score forecasts. Enables questions at arbitrary time horizons, counterfactual worlds, and rare events.

Benchmarks Reasoning Evals

SIG

HYP

arXiv cs.LG·Jun 18

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

The study shows that SFT overtraining can invert model rankings during RLVR fine-tuning. On Qwen2.5-Coder-3B, increasing SFT depth raises pre-RL pass@1 but reduces GRPO pass@10 from 0.806 to 0.481. Pre-RL entropy positively correlates with RLVR outcomes (ρ=+0.69). A two-stage entropy-based diagnostic identifies high-risk checkpoints.
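As a rough illustration of the quantity involved, the sketch below computes the mean Shannon entropy of a checkpoint's next-token distributions. It is not the paper's two-stage diagnostic, whose details are not given here. The function names and the toy distributions are hypothetical, and how the paper aggregates or thresholds entropy is not assumed.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    # Average entropy over the sampled positions of a checkpoint.
    return sum(token_entropy(p) for p in distributions) / len(distributions)

# Toy comparison: a peaked (collapsed) checkpoint vs. a more spread-out one.
collapsed = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
diverse = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
collapsed_score = mean_entropy(collapsed)
diverse_score = mean_entropy(diverse)
```

In this toy case the collapsed checkpoint scores lower, matching the direction of the reported correlation: less pre-RL entropy leaves less room for exploration during RLVR.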

Reinforcement learning Fine-tuning Reasoning

SIG

HYP

arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision

SIG

HYP

arXiv cs.CL·Jun 18

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework for multimodal emotion-cause pair extraction, improves pair-confidence robustness. Using margin constraints and contextual corruption, it increases Pair F1 by 2.58–2.83 points on ECF/MECAD/MEC4 without changing inference.

Papers Benchmarks Vision

SIG

HYP

arXiv cs.CL·Jun 18

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum introduces a hierarchical knowledge graph framework for abstractive scientific summarization. The system organizes documents into semantically coherent units, generates an initial draft, then refines it through iterative verification and rewriting to ensure logical coherence and factual faithfulness.

Papers RAG Reasoning

SIG

HYP

arXiv cs.CL·Jun 18

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Study assessing the ability of LLMs to interpret negation in figurative language. The researchers annotate an existing dataset and evaluate multiple models. Finding: negation combined with figurativeness presents a particular challenge, with performance heavily dependent on prompt style.

Evals Prompt engineering Reasoning

SIG

HYP

arXiv cs.AI·Jun 18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP is a verified benchmark evaluating AI agents on small-molecule preclinical pharmacology. 100 evaluations span mechanism-of-action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves 59.3% success rate, GPT-5.5 55.3%. No system reliably masters these decisions.

AI Agents Benchmarks Claude

SIG

HYP

arXiv cs.CL·Jun 18

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem introduces a memory architecture for personalized dialogue agents on edge devices (8 GB VRAM). Replaces cosine similarity with Fisher-Rao metric for retrieval and uses Fisher-guided token distillation for compression. Achieves +4.51 pp gains in open-domain reasoning and +4.17 pp in temporal reasoning on LOCOMO and LongMemEval-S benchmarks.

AI Agents RAG Embeddings

SIG

HYP

arXiv cs.CL·Jun 18

Continuous Audio Thinking for Large Audio Language Models

Continuous Audio Thinking (CoAT) adds a continuous latent workspace to large audio language models to preserve acoustic information (phonetics, prosody, affect, pitch) before text generation. Tested on Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo, CoAT improves performance on audio reasoning, music classification, and transcription with no additional decoding cost.

Reasoning Voice Qwen

SIG

HYP

arXiv cs.CL·Jun 18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Activation steering improves synthetic data generation for low-resource languages. Two strategies are tested: Language Steering (linguistic identity) and Quality Steering (well-formedness). The evaluation covers 4 open-source LLMs and 11 languages on classification tasks. Early-layer steering increases diversity and downstream performance.

Prompt engineering Fine-tuning Benchmarks

SIG

HYP

arXiv cs.CL·Jun 18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PhysAssistBench is an interactive medical assistance benchmark with 1,296 physician-validated turns built from real MIMIC-IV cases. It evaluates LLMs' ability to coordinate clinical knowledge, patient communication, and EHR system interaction within single dialogues. Experiments show current models remain unreliable in this setting.

Benchmarks AI Agents Multi-agent

SIG

HYP

arXiv cs.AI·Jun 18

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

ARIADNE is a training-free framework for dynamic adapter selection at inference time. It represents each adapter through centroids computed from embeddings of its training set. Tested on Llama 3.2 1B across 23 NLP tasks, it recovers 97.44% of upper-bound performance and achieves 89.7% average selection accuracy on 44 tasks.
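The centroid idea can be sketched as follows. This is a simplification under stated assumptions, not the paper's procedure: it uses a single centroid per adapter, cosine similarity as the routing score, and hand-written 3-D vectors in place of real embeddings. The adapter names and the helper functions are hypothetical.

```python
import math

def centroid(vectors):
    # Component-wise mean of an adapter's training-set embeddings.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_adapter(query_embedding, centroids):
    # Training-free routing: pick the adapter whose centroid is closest.
    return max(centroids, key=lambda name: cosine(query_embedding, centroids[name]))

# Toy embeddings standing in for each adapter's training set (assumed data).
train_embeddings = {
    "sentiment": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "translation": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {name: centroid(vecs) for name, vecs in train_embeddings.items()}
chosen = select_adapter([0.85, 0.15, 0.05], centroids)
```

Because routing only needs stored centroids and one similarity computation per adapter, adding a new adapter requires no retraining of a router.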

Fine-tuning Llama Benchmarks

SIG

HYP

arXiv cs.AI·Jun 18

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

A new formal theory (HACD-H) modeling the emergence of social intelligence in long-term human-AI interaction. The unified framework integrates emotional adaptation, social memory, and personality consistency. A study of 14,700 conversation turns reveals a negative correlation between social intelligence and social cognitive energy (r=-0.391, p<0.001), with developmental phase-transition patterns.

Reasoning AI Agents Papers

SIG

HYP

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by 15.3 points on CUA-World and OSExpert-Eval (0.456 vs 0.303 baseline). Visual figures outperform text-only descriptions (+8.3 points).

Claude AI Agents MCP

SIG

HYP

arXiv cs.LG·Jun 18

CODEBLOCK: Learning to Supervise Code at the Right Granularity

CodeBlock is a structure-aware sparse supervision framework for code LLM fine-tuning. It selects syntactically coherent code blocks rather than isolated tokens, estimating utility via generalized cross-entropy and data-flow signals. On 6 code-generation benchmarks, CodeBlock outperforms full-token SFT while using only 1.9% of supervised response tokens.

Code generation Fine-tuning Papers

SIG

HYP

arXiv cs.LG·Jun 18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT refines the SFT training data distribution using on-policy Influence Functions. The method uses model rollouts as validation targets to minimize the proximity gap and correct gradient-norm bias. Experiments on 7B instruction and reasoning models show consistent performance ceiling improvements over existing curation baselines.

Fine-tuning Reinforcement learning Evals

SIG

HYP

arXiv cs.AI·Jun 18

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

A novel LLM personalization approach that stores user facts as surgical edits in a hash-keyed memory table (Engram) instead of a global LoRA. It reduces memory footprint by 33,000×, improves indirect-reasoning accuracy by 5.6× on average, and enables stacking multiple users without cross-contamination.

Fine-tuning Reasoning Papers

SIG

HYP