June 2026

2731 articles

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

CoCoGEC is a counterfactual generation framework for robust grammatical error correction. The method generates training variants with altered contexts while preserving error patterns, then selects instances with flipped labels and a high MI coefficient. It reports F0.5 gains of +9.9 to +20.8 points on BEA-19, CoNLL-14, and TEM-8.

Papers Benchmarks Code generation

arXiv cs.CL·Jun 16

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA introduces Nemotron 3 Ultra, a 550B-parameter (55B active) Mamba-Transformer MoE hybrid model pre-trained on 20T tokens with 1M context length. Uses SFT, RL, and multi-teacher distillation. Achieves ~6x inference throughput of public LLMs with comparable accuracy. Base, post-trained, and quantized checkpoints, training data, and recipe open-sourced on HuggingFace.

AI Agents Reasoning Open source

arXiv cs.CL·Jun 16

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Study on layer-wise redundancy in LLMs. Authors characterize how layers absorb or amplify perturbations during pruning: early layers amplify, middle and late layers absorb. They propose an absorption-aware correction using a per-layer absorption coefficient, which improves on OWL and AlphaPruning with a 7.13% perplexity reduction and a 1.02% zero-shot accuracy gain at 70% sparsity.

Papers Benchmarks Fine-tuning

arXiv cs.CL·Jun 16

Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

Retrieval method for in-context demonstrations using Grammatical Error Representations (GER) for multilingual grammatical error correction. On 8B open-source models, results match GPT-4o-mini and Deepseek2.5. For low-resource languages, F₀.₅ scores improve up to 1.20× over baseline.

RAG Prompt engineering Benchmarks

arXiv cs.AI·Jun 16

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

PrologMCP exposes Prolog as a stateful tool via Model Context Protocol for LLM agents. Tested on PARARULE-Plus with Claude Sonnet 4.6, GPT-4.1, and o4-mini, the system achieves 1.00 accuracy on the general set and 0.99–1.00 on the challenging set, outperforming reasoning models on deductive tasks.

MCP AI Agents Reasoning

arXiv cs.AI·Jun 16

CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

CogGuard is a proactive-warning framework for edge intelligent services using offline LLMs to build cognitive and operational profiles, then online SLMs for real-time scoring. Achieves 48% reduction in profile construction time and 19% in distributed fine-tuning on heterogeneous clusters. Reduces prediction error by 15.4% vs strongest baseline on educational datasets.

Reasoning Fine-tuning Benchmarks

arXiv cs.AI·Jun 16

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Multimodal fusion framework for time-to-event prediction (PE mortality, CVD outcomes) aligning CT and longitudinal EHR representations using foundation models. Four strategies tested (late fusion, contrastive alignment, cross-attention, co-attention) on cohorts of 3,099 and 2,951 patients. Contrastive fusion improves concordance index by 1.5–5.4% vs unimodal baselines.

Benchmarks Embeddings Vision

arXiv cs.AI·Jun 16

A Formal Framework for Declarative Agentic AI in Business Process Analysis

Formal AGO framework for business process analysis with agentic AI. Precisely defines agents, goals, and entities using set theory and mathematical logic. Automatically generates BP workflows with soundness and completeness guarantees.

AI Agents Reasoning Business

arXiv cs.AI·Jun 16

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Mask-Proof is an automated pipeline converting real mathematical proofs into verifiable masked-step tasks. The benchmark contains 292 curated problems. Testing 17 models shows reasoning-enhanced models outperform standard models by 12-27%. The evaluator achieves 96.8% agreement with expert annotators.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 16

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

ChatPlanner is a framework using fine-tuned LLMs with RAG to extract user preferences from natural language and integrate them into public transit routing optimization. Evaluated on 8 personas and 5 contexts, the system combines fine-tuning (output structure) and RAG (query-specific context) to identify solutions overlooked by existing planners.

RAG Fine-tuning Prompt engineering

arXiv cs.AI·Jun 16

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

Base Sequence Analysis framework encodes LLM-powered autonomous agent behavior into symbolic sequences (X/E/P/V). Analysis of 347 production ReAct traces reveals P-X-P pattern reduces success by 10.4% and P-ratio negatively predicts success (r=-0.256). Governor runtime intervention system achieves +6.2% absolute success increase and 44% token reduction. Validated on 2,000 SWE-agent trajectories.
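
The two sequence statistics mentioned above can be illustrated with a minimal sketch. It assumes a trace is already encoded as a string over the symbols X/E/P/V; the function names `p_ratio` and `count_pattern` are hypothetical and not taken from the paper, and the paper's exact definitions may differ.

```python
def p_ratio(seq: str) -> float:
    """Fraction of steps in the encoded trace that are 'P' symbols."""
    return seq.count("P") / len(seq) if seq else 0.0


def count_pattern(seq: str, pattern: str = "PXP") -> int:
    """Count occurrences of a pattern, allowing overlaps (assumption)."""
    width = len(pattern)
    return sum(
        1 for i in range(len(seq) - width + 1) if seq[i:i + width] == pattern
    )


if __name__ == "__main__":
    trace = "PXPXPEV"  # made-up example trace, not real data
    print(p_ratio(trace), count_pattern(trace))
```

Whether overlapping matches should be counted separately is a design choice the summary does not settle; the sketch counts them.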

AI Agents Reasoning Evals

arXiv cs.AI·Jun 16

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

Paper proposes a philosophical definition of good explanations based on counterfactual reasoning, accounting for the interlocutor's prior beliefs. Analyzes why LLM outputs are particularly challenging to explain.

Reasoning AI safety Alignment

arXiv cs.AI·Jun 16

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Visual-Seeker is a multimodal deep search agent that enhances visual reasoning in MLLMs for complex scenarios. The approach uses an active visual reasoning data pipeline and 5K synthetic multimodal trajectories for training. The agent achieves SOTA performance across five multimodal search benchmarks, surpassing some proprietary models.

AI Agents Vision Multi-agent

arXiv cs.LG·Jun 16

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

EnvShip-Bench is a unified benchmark for short-term vessel trajectory prediction built from raw AIS data from the Danish Maritime Authority and NOAA. The benchmark standardizes the forecasting protocol (10 min observation, 10 min prediction, 20s sampling) and provides environmental and nearby-vessel contextual annotations to support context-aware modeling.
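
As a rough illustration of the forecasting protocol, 10 minutes at 20 s sampling gives 30 observed and 30 predicted steps per sample. The sketch below slices a resampled track into such pairs; `make_windows` and the `stride` parameter are assumptions for illustration, not part of the benchmark's released code.

```python
STEP_SECONDS = 20
OBS_STEPS = (10 * 60) // STEP_SECONDS   # 30 observed points
PRED_STEPS = (10 * 60) // STEP_SECONDS  # 30 points to predict


def make_windows(track, stride=1):
    """Split a uniformly sampled track into (observation, target) pairs."""
    total = OBS_STEPS + PRED_STEPS
    return [
        (track[i:i + OBS_STEPS], track[i + OBS_STEPS:i + total])
        for i in range(0, len(track) - total + 1, stride)
    ]


if __name__ == "__main__":
    fake_track = list(range(61))  # placeholder points, one per 20 s
    windows = make_windows(fake_track)
    print(len(windows), len(windows[0][0]), len(windows[0][1]))
```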

Benchmarks

arXiv cs.AI·Jun 16

OSGuard: A Benchmark for Safety in Computer-Use Agents

OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents. It combines action-level guardrail decisions and risk-augmented execution evaluation. Current multimodal guardrails perform well on isolated action judgments but fail to ensure reliable end-to-end safety.

AI Agents AI safety Benchmarks

arXiv cs.LG·Jun 16

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

PolyKV optimizes KV cache compression by applying heterogeneous strategies per transformer layer instead of uniform policies. On LLaMA-3.1-8B and Qwen3-8B with a 512-token KV budget, PolyKV recovers 54.5% and 25.7%, respectively, of the LongBench performance gap versus FullKV.

Benchmarks Infrastructure Reasoning

arXiv cs.AI·Jun 16

Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

DAG-SHAP, a novel feature attribution method based on edge intervention in directed acyclic graphs. Improves existing Shapley-based methods by capturing both externality and exogenous influence of features simultaneously. Code available on GitHub.

Evals Papers

arXiv cs.AI·Jun 16

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

Researchers identify a vulnerability in multimodal LLM cascades: an adversarial attack (Forced Deferral Attack) manipulates weak-model confidence to force routing to the strong model, increasing compute costs without targeting answer correctness.

AI safety Vision Benchmarks

arXiv cs.AI·Jun 16

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Study on reward hacking in LLM-based agents using an adapted AI Safety Gridworlds framework. Models (1.5B–14B) systematically exploit misspecified objectives to maximize observed rewards while failing hidden safety objectives. RL optimization amplifies the problem and resists standard mitigations (exploration, regularization).

AI Agents Reinforcement learning AI safety

arXiv cs.AI·Jun 16

Large Language Models as Optimizers: A Survey of Direct vs. Tool-Augmented Approaches and Their Performance Frontiers

Survey of LLMs as mathematical optimizers across three paradigms: direct optimization (iterative prompting), tool-augmented optimization (translating to formal specs), and tool-creating optimization (discovering reusable algorithms). Identifies critical reasoning gap and proposes trade-offs between future potential and auditability.

Reasoning AI Agents Tools

arXiv cs.LG·Jun 16

AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

Bayesian study of air pollution regulation effects in London (2010-2020). A Bayesian LSTM model integrating PM2.5 observations, meteorology, and 32 policy measures estimates an average reduction of 1.88 µg/m³ (95% CI: 1.64-2.12), a relative decrease of 12.35%. Effects strengthened over 2013-2019.

Papers Reasoning Evals

arXiv cs.LG·Jun 16

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

Theoretical study demonstrating that neural networks trained with gradient-based methods can achieve optimal computational-statistical tradeoff for Gaussian single-index models. Proposed algorithm (two-layer network) achieves sample complexity Õ(d^{s*/2} ∨ d) matching SQ lower bounds, with extension to k-sparse case via weight perturbation technique.

Papers Reasoning Benchmarks

arXiv cs.LG·Jun 16

Towards a Unified Generative Model for Scarce Time Series with Domain Experts

TimeMoDE, a framework combining Diffusion Transformers and Mixture-of-Experts, generates realistic time series under data scarcity. Pre-trained on multi-domain datasets, it uses Domain Prompts to condition expert assignment and incorporates diffusion timestep signals for adaptive denoising. Outperforms existing methods in few-shot generation settings.

arXiv cs.LG·Jun 16

High-Dimensional Random Projection for Activation Steering in Language Models

HiDRA, a training-free activation steering method, uses high-dimensional random projection to improve behavioral control of LLMs. It outperforms linear difference-in-means approaches by capturing discriminative signals in nonlinear feature subspaces, with consistent gains across multiple model families.

Reasoning Alignment

arXiv cs.LG·Jun 16

Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning

DOMOO, an offline multi-objective optimization method, addresses out-of-distribution (OOD) issues by combining cumulative risk control and nested Pareto set learning. Introduces IGD_offline, a tailored indicator for offline settings, to select diverse and convergent solutions.

Benchmarks

arXiv cs.CL·Jun 16

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Reinforcement learning post-training method (GRPO) to improve hateful and propagandistic meme detection in thinking-based MLLMs. Gains of +2.1 percentage points on Hateful Memes (79.9%→82.0%) and +7.6 macro-F1 points on ArMeme (0.536→0.612) with chain-of-thought explanations. Code and data publicly released.

Reinforcement learning Reasoning Vision

arXiv cs.CL·Jun 16

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

ReRULE improves LLM unlearning via off-policy replay for hard cases. The method stores low-reward rollouts near the forget/retain boundary in a replay buffer and reuses them through importance-sampled updates. On MUSE-Books, it increases Retain Quality from 46.3 to 56.2 with +5–11% training overhead.

Reinforcement learning AI safety Alignment

arXiv cs.LG·Jun 16

Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

Variational Autoencoder (C-VAE) trained on 1.5 million X-ray scattering images to learn low-dimensional representations. Model reveals organized clusters and generates controlled synthetic images. Deployed without retraining across two synchrotron facilities, outperforms DINOv3 in interpretability. Integrated into Latent Space Explorer (MLExchange).

Vision Benchmarks Tools

arXiv cs.LG·Jun 16

Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

Negative result on per-phase metric selection for demonstration filtering in robotic manipulation. Across three LIBERO pick-and-place tasks, phase-gated curation never outperforms global metrics (Task 1: 86.0 vs 92.0). Rank-aggregating defect signals across phases dilutes informative scores. Authors recommend identifying a single defect-informative metric over phase-based decomposition.

Robotics Reinforcement learning Benchmarks

arXiv cs.AI·Jun 16

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

VIBEMed is a multi-agent framework with a self-evolution mechanism for clinical decision support. Three specialized agents (diagnostic, therapeutic, evolution manager) integrate patient session history and past outcomes to iteratively improve medical decisions. Results are reported on oncology planning and complex cases.

Multi-agent AI Agents Reasoning

arXiv cs.LG·Jun 16

TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation

TriAdReview proposes a triangular adversarial architecture with two reviewer models (engineering and security perspectives) to improve technical document generation. Across 75 experiments, the triple model achieves +10.1% over baseline (26.2 vs 23.8/50, p<0.05), with strong gains on security audit (+27.6%), code generation (+20.8%), architecture design (+15.6%), but -7.5% degradation on requirements analysis.

Multi-agent Code generation Benchmarks

arXiv cs.LG·Jun 16

Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards

Contextual multi-armed bandit framework to optimize stimulated word-of-mouth in social networks. The approach learns individual spillover probabilities and ranks connected users to maximize rewards. Experiments on real-world network datasets show improved targeting precision and rewards compared to baseline methods that ignore spillover heterogeneity.

Reinforcement learning Benchmarks

arXiv cs.AI·Jun 16

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

Method to distinguish whether LLM score drift stems from the product or the judge model itself. Uses human-labeled anchor set and betting e-process to detect silent judge model changes. Detects 100% of judge drift with zero false positives on product, outperforms industry-standard rolling z-test.
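
For readers unfamiliar with betting e-processes, here is a generic textbook-style sketch, not the paper's procedure. It assumes each anchor item yields 1 if the judge agrees with the human label and 0 otherwise, and tests the null "agreement rate is at least `p0`". The names `betting_e_process`, `p0`, `lam`, and `alpha` are illustrative choices.

```python
def betting_e_process(agreements, p0=0.9, lam=2.0, alpha=0.05):
    """Return (alarm_step, wealth); alarm_step is None if no drift is flagged.

    Wealth is multiplied by 1 + lam * (p0 - x) per observation x in {0, 1}.
    Under the null (mean agreement >= p0) this is a nonnegative
    supermartingale, so by Ville's inequality it reaches 1/alpha with
    probability at most alpha, at any stopping time.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0.0 <= lam <= 1.0 / (1.0 - p0):
        raise ValueError("lam must keep every wealth factor nonnegative")
    wealth = 1.0
    for step, x in enumerate(agreements, start=1):
        wealth *= 1.0 + lam * (p0 - x)
        if wealth >= 1.0 / alpha:
            return step, wealth
    return None, wealth


if __name__ == "__main__":
    # Made-up stream: the judge agrees for a while, then starts disagreeing.
    stream = [1] * 5 + [0] * 10
    print(betting_e_process(stream))
```

A fixed bet size `lam` keeps the sketch short; adaptive betting strategies are common in practice but are beyond what the summary describes.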

Evals Benchmarks AI safety

arXiv cs.CL·Jun 16

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

Telegraph English, a readable symbolic format, rewrites retrieved passages into structured entity-relation statements for context compression. On MuSiQue, TwoWiki, and HotpotQA, it outperforms three matched-budget baselines (deletion, truncation, sub-sampling) by 13–20 F1 points, and exceeds coherent prose summaries on the hardest dataset.

RAG Reasoning Benchmarks

arXiv cs.LG·Jun 16

α-Fair Insurance Pricing: A Fairness Continuum

Paper proposing α-FISP, an optimization framework for insurance pricing balancing actuarial fairness (risk differentiation) and solidarity fairness (risk pooling). Constrained formulation guarantees solvency with parameter α tracing a continuum between both approaches. Numerical validation on US regulatory regimes.

Papers Regulation

arXiv cs.LG·Jun 16

A Comparative Study of Graph Neural Network Layer Selection for Interaction Modelling in Driving Trajectory Prediction

Comparative study of 19 GNN layer types for trajectory prediction in autonomous driving. ARMA, Chebyshev, and topology-aware layers consistently outperform others. Sum-based aggregation, multi-head attention, and distance-weighted hops significantly improve prediction accuracy.

Benchmarks Papers

arXiv cs.LG·Jun 16

Controlled Dynamics Attractor Transformer

CDAT couples Transformer attention with continuous attractor neural network (CANN) dynamics. The model combines von Mises-Fisher attention energy with Hopfield refinement and excitation-inhibition modulation. Achieves state-of-the-art results on graph anomaly detection and graph classification benchmarks.

Reasoning Benchmarks Papers

arXiv cs.CL·Jun 16

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

ASAG, a training-free method analyzing attention distributions, detects overthinking in reasoning models and adaptively stops generation. Tested on DeepSeek-R1-Distill and Qwen3, it improves accuracy by 3.2% while reducing generated tokens by 40% on Qwen3-8B.

Reasoning DeepSeek Qwen

arXiv cs.CL·Jun 16

Spokes: Optimizing for Diverse Pretraining Data Selection

SPOKES optimizes pretraining data selection through a probabilistic diversification framework based on G-Vendi score and exponentiated gradient descent. On FineWeb and DCLM, the method improves downstream performance by +1.5 and +1.4 points when jointly optimizing quality and diversity, outperforming semantic deduplication.

Benchmarks Papers Fine-tuning

arXiv cs.LG·Jun 16

GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning

GRASP enables multi-source transfer learning with O(1) memory instead of O(K) by sequentially merging source models. Using parameter-wise gradient alignment and iterative fine-tuning, it achieves 93.5% mean accuracy on continual learning benchmarks (Yearbook, CLEAR-10/100) versus 71.7% for ensembles, while remaining production-deployable.

Fine-tuning Reinforcement learning Benchmarks

arXiv cs.LG·Jun 16

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

Stateful ReAct agents reduce token consumption by 90% on hyperparameter tuning and 52% on code optimization vs. stateless design. Architecture implemented via LangGraph with typed persistent state, reducing total token cost from O(n²) to O(n).
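
The O(n²) versus O(n) claim can be seen with a toy token count. The sketch assumes a stateless agent re-sends the whole history at every step while a stateful agent sends only the new step plus a fixed-size state; the function names and the `state_tokens` parameter are hypothetical, and real savings depend on the workload.

```python
def stateless_tokens(step_tokens):
    """Total tokens when every step re-reads the full history so far."""
    total = 0
    history = 0
    for new in step_tokens:
        history += new      # history grows by this step's tokens
        total += history    # the whole history is sent again
    return total


def stateful_tokens(step_tokens, state_tokens=0):
    """Total tokens when each step sends only new tokens plus a fixed state."""
    return sum(new + state_tokens for new in step_tokens)


if __name__ == "__main__":
    steps = [100] * 10  # ten steps of 100 tokens each (made-up numbers)
    print(stateless_tokens(steps), stateful_tokens(steps, state_tokens=50))
```

With equal-sized steps the stateless total grows like n(n+1)/2 step-sizes, while the stateful total grows linearly in n.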

AI Agents Reasoning Code generation

arXiv cs.LG·Jun 16

Temporal Difference Learning for Diffusion Models

Novel training approach for diffusion models using temporal difference (TD) objective to enforce multi-step consistency along the denoising trajectory. Reformulates diffusion as a Markov reward process and denoising as policy evaluation in reinforcement learning. Shows significant FID improvements, especially with few sampling steps.

Reinforcement learning Reasoning

arXiv cs.AI·Jun 16

Towards End-to-End Automation of AI Research

The AI Scientist automates the entire research lifecycle: idea generation, coding, experiments, data analysis, manuscript writing, and peer review. An AI-generated manuscript passed the first round at a major ML conference workshop (70% acceptance rate). The system leverages foundation models within a complex agentic architecture.

AI Agents Multi-agent Papers

arXiv cs.LG·Jun 16

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Novel AdaNAGED algorithm for parameter-free, zero-order optimization in LLM fine-tuning. Reduces memory overhead of backpropagation via linear minimization oracles and adaptive geometry-aware updates. Validated on OPT-1.3B model.

Fine-tuning Papers

arXiv cs.CL·Jun 16

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE is a 4,600-task benchmark evaluating morphological understanding of Filipino in LLMs. The benchmark tests 6 compositional levels including infixation, reduplication, and diacritic distinctions. Open-weight models perform near chance on morpheme decomposition; frontier models recover affixes but remain far below ceilings on morphological composition tasks.

Benchmarks Papers Reasoning

arXiv cs.LG·Jun 16

M-CTX: Exact and Scalable Spatial Context Retrieval for Trajectory Analytics

M-CTX is a spatial context-retrieval framework for trajectory analytics. It replaces three brute-force stages (OSM range retrieval, SDF computation, moving-vessel neighbor lookup) with index-backed operators. On a 5.48M-anchor maritime corpus, it reduces context construction from 17 CPU-days to 1.8 hours (226x speedup), with exact reproduction of reference context.

Benchmarks Infrastructure Open source

arXiv cs.LG·Jun 16

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR synergizes Monte Carlo Tree Search with test-time reinforcement learning for optimization modeling. The framework decomposes modeling into four stages, refines a transient LoRA adapter via GRPO at each node, and employs an unsupervised multi-faceted reward system. Achieves state-of-the-art results across five optimization benchmarks with a 4B backbone.

Reasoning Reinforcement learning Fine-tuning

arXiv cs.CL·Jun 16

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC is an open dependency-parsed treebank of Greek spanning 8 diachronic periods (Archaic to Modern) under PROIEL XML 2.0 schema. Verse-level cross-alignment of New Testament with Latin, Gothic, Old Church Slavonic, and Classical Armenian. Annotation via Stanford Stanza, sentence alignment via LaBSE, word alignment via multilingual-BERT. v0.4 released open-source.

Benchmarks Open source Embeddings

arXiv cs.LG·Jun 16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS optimizes flow-matching and diffusion policies at inference time via Q-steering. The method projects noisy intermediate actions to clean action estimates before evaluating the critic, avoiding numerical instability. Results: 90% success rate across 50 offline-to-online tasks, and outperforms existing approaches on 6 manipulation tasks with frozen VLA models.

Reinforcement learning AI Agents Reasoning

arXiv cs.LG·Jun 16

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

Study comparing 5 ML models (linear regression, random forest, gradient boosting, XGBoost, AdaBoost) to forecast monthly CAD/USD rate (2017-2026, 113 observations). Only linear regression statistically outperforms random walk (DM=3.06, p=0.0071). Random Forest achieves MAPE=1.17%. SHAP shows short lags (lag1-2) and rolling means dominate predictions.

Benchmarks Evals Papers

arXiv cs.LG·Jun 16

Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

CILN is a benchmark framework for instance-dependent label noise (IDN) generation through controlled input corruptions rather than imperfect annotators. 90 configurations tested on CIFAR-10, MNIST, and Adult show that noise structure, not just noise rate, affects benchmark difficulty and exposes failure modes in Co-Teaching and DivideMix.

Benchmarks Evals

arXiv cs.LG·Jun 16

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

Theoretical paper on dynamic routing of queries to multiple embedding models. Formalizes the problem as an adversarial contextual linear bandit with low-rank experts. Proposes Hypentropy Policy Gradient (HPG) algorithm achieving Õ(s√MT) linearized policy regret without curse of dimensionality.

Benchmarks Reasoning Reinforcement learning

arXiv cs.CL·Jun 16

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Comparative study of memory and skill modules for web agents. On WebArena and WorkArena, a vanilla baseline with equivalent token budget matches or exceeds AWM, ASI, and ReasoningBank. Results across Gemini 3 Flash, GPT-4o-mini, Qwen 3.6-27B show apparent gains vanish against a budget-matched actor.

AI Agents Benchmarks Reasoning

arXiv cs.LG·Jun 16

GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

GRAPE proposes an adversarial training method that progressively exposes network parameters rather than optimizing a fixed space. On CIFAR-10 under ℓ∞, GRAPE improves ResNet-18 PGD-20 robust accuracy from 51.70% to 56.94% with 21.4% fewer parameters and nearly matched computation budget (1.009x FLOPs).

Benchmarks Papers

arXiv cs.LG·Jun 16

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix automates data mixture optimization for model training via gradient descent. The method reformulates mixture selection as a bilevel optimization problem, jointly optimizing mixture coefficients and model parameters. A single proxy model suffices, drastically reducing search cost compared to prior approaches.

Fine-tuning Benchmarks Papers

arXiv cs.AI·Jun 16

AI Engram: In Search of Memory Traces in Artificial Intelligence

Study introducing a geometric framework to identify 'AI engrams'—memory traces in deep neural networks analogous to biological memory units. Authors derive a closed-form estimator enabling surgical manipulation of learned knowledge (composition, erasure) via linear arithmetic without iterative optimization. Validated on MLPs and LLMs.

Reasoning Papers Alignment

arXiv cs.CL·Jun 16

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard is a safety guardrail system for Chinese LLMs with fine-grained taxonomy (5 macro, 31 micro categories). Authors construct 405k training samples via RAG and prompt rewriting, plus 51k annotated test samples. Model achieves +15.92% F1 improvement over Qwen3Guard-8B-Strict using Direct Preference Optimization.

AI safety Alignment Fine-tuning

arXiv cs.CL·Jun 16

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

SHARD is a self-reframing distillation method to improve safe-helpfulness balance in LLMs. It rewrites sensitive prompts using philosophical guidelines to surface benign intent, reframes responses into safer and more helpful versions, then fine-tunes the model on self-reframed responses. Tested on DNA and LINGUASAFE, SHARD improves helpfulness while preserving safety.

Fine-tuning AI safety Alignment

arXiv cs.CL·Jun 16

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

ESBMC-PLC is the first open-source formal verifier with native support for IEC 61131-3 ladder diagrams (PLCopen XML format). The tool translates rungs to GOTO IR, models the PLC scan cycle, and verifies safety properties via SMT-based bounded model checking or k-induction. Evaluation on 13 benchmarks: 8 bugs detected, 7 unbounded k-induction proofs, all runs under 60ms.

AI safety Benchmarks Open source

arXiv cs.CL·Jun 16

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

Comparative study of few-shot biomedical relation extraction with LLMs vs supervised learning on BioREDirect, comparing pairwise classification with joint generation. Few-shot LLMs trail the supervised model in micro-F1 (0.44 vs 0.56) but lead in macro-F1 (0.45 vs 0.38). LLMs outperform the baseline on rare relations.

Prompt engineering Benchmarks RAG

arXiv cs.CL·Jun 16

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Pepti-Agent is an AI framework for therapeutic peptide design using Model Context Protocol (MCP). An LLM controller orchestrates independent tools: generation via PeptideGPT, property prediction (solubility, hemolysis, fouling) via ProtBERT, and residue-by-residue mutation. The system traces each decision to enable multi-objective benchmarking and experimental validation.

AI Agents MCP Reasoning

arXiv cs.AI·Jun 16

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH is the first benchmark jointly evaluating code and data intelligence in AI agents. Built on the Kaggle ecosystem with 1,009 tasks and ~980 files per environment, it reveals that top agents achieve only 61.1% success rate when integrating data discovery with code execution.

AI Agents Benchmarks Code generation

arXiv cs.CL·Jun 16

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

PhoneHarness is a benchmark and execution harness for evaluating phone agents on real mobile workflows. It combines GUI, CLI, and structured tool actions with auditable execution traces. Agents running under the harness reach a 75.0% pass rate, 12.9 percentage points above non-PhoneHarness settings. The focus is on verifiable side effects, not screen predictions alone.

AI Agents Benchmarks Tools

arXiv cs.CL·Jun 16

ReportQA: QA-Based Radiology Report Evaluation

ReportQA introduces a QA-based evaluation metric for automated radiology report generation. The framework uses LLMs to extract structured information, generate QA pairs from templates, and evaluate alignment with radiologist judgments. Authors release knowledge trees, structured reports, and code for QA construction and evaluation.

Papers Vision Evals

arXiv cs.CL·Jun 16

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

XBCP, a controlled benchmark, evaluates deep research agents' ability to operate across languages. Four agents tested with dense and sparse retrievers across 12 languages show substantial degradation: evidence recall loss, reduced calibration, unreliable citations. Problems persist even when gold evidence is directly supplied.

AI Agents RAG Benchmarks

arXiv cs.AI·Jun 16

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

VGPT-RSI system applied to two RH-adjacent certification tasks: construction of formally verified RH-boundary certificates in Coq, and initiation of a formalized Lagarias route. Explicitly identifies unresolved mathematical obstructions (Lagarias equivalence, global tail theorem, reduction to extremal integers).

Reasoning Papers Benchmarks

Simon Willison·Jun 16

Quoting Matteo Wong, The Atlantic

The White House shared with Anthropic a report on the Fable jailbreak. Cybersecurity expert Katie Moussouris reviewed the tests: Fable refused 'review the code for security issues' but complied with 'fix this code'. Moussouris concluded this is the model working as intended for cyberdefense.

Anthropic Claude AI safety

Hacker News (AI)·Jun 16

Microsoft turns to AWS as GitHub faces AI capacity crunch

Microsoft is leveraging AWS infrastructure to support GitHub as the platform faces capacity constraints from AI services. GitHub now partially relies on Amazon's servers to handle growing demand.

Business Infrastructure

Reddit r/LocalLLaMA·Jun 16

Nex2 mini Phase Twin - 16gb footprint, 30b model

Nex2 mini Phase Twin: 30B model optimized for 16GB VRAM. Designed for Intel A770 cards, runs on single GPU and scales with two. Achieves 89 tok/s on A770 16GB. Auto-calibrates to hardware.

Open source Llama Code generation

Latent Space·Jun 16

[AINews] Satya on Loopcraft: Building Frontier Ecosystems

Satya Nadella publishes an essay on Loopcraft and building frontier ecosystems. The article explores how companies can build sustainable platforms around cutting-edge AI models.

Business

Simon Willison·Jun 16

Cloudflare CAPTCHA on at least one ampersand

Simon Willison shares a tip for configuring Cloudflare CAPTCHA/Managed Challenge: use a WAF rule that only triggers the challenge on search URLs containing at least one ampersand. This allows simple requests like ?q=term to pass without CAPTCHA.
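The rule's logic can be sketched as a small predicate. This is only an illustration of the condition described above, not Cloudflare's rule syntax; the `/search` path and the function name `should_challenge` are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical search path; the real site's path is an assumption here.
SEARCH_PATH = "/search"


def should_challenge(url: str) -> bool:
    """Return True when a request would be sent to the challenge."""
    parts = urlsplit(url)
    # Only search URLs are candidates for the challenge.
    if parts.path != SEARCH_PATH:
        return False
    # A single-parameter query such as ?q=term contains no ampersand.
    return "&" in parts.query
```

Under these assumptions, `should_challenge("https://example.com/search?q=term")` is false, while a query with a second parameter, such as `?q=term&page=2`, triggers the challenge.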

Tools

Reddit r/LocalLLaMA·Jun 16

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

HalBench v2.3 benchmarks 29 open-source models on sycophancy and hallucination across 3,076 audited questions with false premises. Qwen 3.6 (~27B) scores 36.6% pushback, outperforming all larger open models, GPT-5.4, and Gemini 3.1 Pro. Only Sonnet 4.6 and Grok exceed 50%. Phi-4 scores 2.3%.

Benchmarks Open source Evals

Vercel AI Blog·Jun 16

Vercel Sandbox can now run for up to 24 hours

Vercel Sandbox extends max session duration from 5 to 24 hours. This enables longer workloads including large-scale data processing, end-to-end testing pipelines, and long-lived agentic workflows. Available on Pro and Enterprise plans.

AI Agents Infrastructure Tools

OpenAI Blog·Jun 16

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method that uses real conversation data to predict AI model behavior before deployment, with the aim of improving safety and evaluation accuracy.

OpenAI Evals AI safety

Vercel AI Blog·Jun 16

Workflow SDK now supports inflight cancellation

Workflow SDK 5 beta now supports AbortController and AbortSignal APIs to cancel in-flight operations across workflow and step boundaries. The signal remains durable through suspensions and deterministic replay, enabling cancellation of slow steps, remaining requests after first success, or parallel work when conditions change.
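The "cancel the remaining requests once the first one finishes" pattern can be illustrated outside the SDK. The sketch below uses Python's asyncio task cancellation as a stand-in; it is not the Workflow SDK's AbortController/AbortSignal API, and `fetch`, `first_completed`, and the delay values are hypothetical names chosen for the example.

```python
import asyncio


async def fetch(name: str, delay: float) -> str:
    # Stand-in for a slow step; it only sleeps and returns its own name.
    await asyncio.sleep(delay)
    return name


async def first_completed(delays: dict[str, float]) -> tuple[str, list[str]]:
    """Return the first finished result and the names of cancelled tasks."""
    tasks = {asyncio.create_task(fetch(n, d)): n for n, d in delays.items()}
    done, pending = await asyncio.wait(
        list(tasks), return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel whatever is still in flight once one request has finished.
    for task in pending:
        task.cancel()
    # Let the cancellations be processed before reporting on them.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(iter(done)).result()
    cancelled = sorted(tasks[t] for t in pending if t.cancelled())
    return winner, cancelled
```

Calling `asyncio.run(first_completed({"fast": 0.01, "slow": 1.0}))` returns the fast result and reports the slow task as cancelled. Unlike the durable signal described in the post, this in-process cancellation does not survive a suspension or replay.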

Tools Infrastructure AI Agents

Vercel AI Blog·Jun 16

Workflow SDK now supports TanStack Start

Vercel Workflow SDK now supports TanStack Start. The workflow/vite plugin works directly with TanStack Start (built on Vite and Nitro). Developers write workflows and steps in standard TypeScript using the "use workflow" and "use step" directives, which are executed as durable, resumable, and persistent operations.

Tools Infrastructure Code generation

Reddit r/MachineLearning·Jun 15

How the brains learn [R]

Research paper presenting a unified framework for neocortical learning through error-driven predictive learning via temporal derivatives. Implemented in the Axon neural simulation framework using spiking neurons, tested on cognitively motivated tasks. Authors propose this mechanism as a potential alternative to backpropagation for improved training efficiency.

Papers Reasoning Reinforcement learning

Reddit r/LocalLLaMA·Jun 15

vLLM has a new streaming parser for Qwen3+ available in nightly

vLLM's nightly build includes a new streaming parser for Qwen3+. It fixes mid-turn stopping with Qwen3.6-27b and streaming tool-call failures at chunk boundaries, both of which were especially disruptive for agentic workflows.

Qwen AI Agents Open source

Hacker News (AI)·Jun 15

Show HN: Claude Code for Visual Studio (native diff with accept/reject)

Native Claude Code extension for Visual Studio with a visual diff and accept/reject buttons. Enables direct Claude integration in the IDE for code generation and modification.

Claude Code Code generation Tools

Hacker News (AI)·Jun 15

Prediction and Entropy of Printed English - Claude Shannon (1950) [pdf]

Repost of Claude Shannon's 1950 paper on the prediction and entropy of printed English, a classic of information theory and foundational to modern language models.

Papers Reasoning

Reddit r/LocalLLaMA·Jun 15

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele

Paper on decoupling magnitude and direction of weight vectors to improve neural network training. Shows promise for simplifying and accelerating fine-tuning.
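The general idea of separating a weight vector into a magnitude and a direction can be sketched as follows. The summary does not give the paper's exact parameterization, so this shows only the generic decomposition w = g · v / ‖v‖; the function names `decouple` and `recompose` are hypothetical.

```python
import math


def decouple(w: list[float]) -> tuple[float, list[float]]:
    """Split w into its magnitude g = ||w|| and unit direction w / ||w||."""
    g = math.sqrt(sum(x * x for x in w))
    if g == 0.0:
        raise ValueError("the zero vector has no direction")
    return g, [x / g for x in w]


def recompose(g: float, v: list[float]) -> list[float]:
    """Rebuild w = g * v / ||v||, so only the direction of v matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]
```

For example, `decouple([3.0, 4.0])` gives a magnitude of 5 and a direction of roughly (0.6, 0.8). Training g and v as separate parameters lets an optimizer adjust the scale of a weight vector independently of its orientation.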

Fine-tuning Papers

Hacker News (AI)·Jun 15

AWS WAF now lets content owners charge AI bots for access

AWS WAF now enables content owners to charge AI bots for access. Amazon's web application firewall service introduces monetization tools for scraping and model training requests.

Infrastructure Business

Reddit r/MachineLearning·Jun 15

Cleo: trying to fit full analyst behavior in a 2B model [P]

Cleo is a Qwen 2B-Base fine-tune designed for text-to-SQL tasks. The model integrates training, evaluation, and inference in a unified system with SQL safety layer, dialect handling, and clarification behavior. Code, model, and datasets are fully open-source.

Qwen Fine-tuning Code generation

Reddit r/LocalLLaMA·Jun 15

Cheapest hardware for Qwen 3.6: both 27B and 35B-A3B

Hardware comparison for running Qwen 3.6 27B and 35B models cheaply. RTX 3090 24GB favored over V100 for future support. Complete system (Ryzen 5 5600X + RTX 3090 + 32GB RAM) available at ~$2000 via Alibaba.

Qwen Code generation AI Agents

Reddit r/LocalLLaMA·Jun 15

Finally - 4xRTX 5060TI

User built a quad RTX 5060 Ti 16GB system on an MSI MEG Z890 Unify-X motherboard with PCIe 5.0 support. The GPUs are connected through M.2 adapters, and the user plans to benchmark Qwen 3.6 27B in Q8 with llama.cpp and vLLM.

Open source Infrastructure Code generation

Reddit r/LocalLLaMA·Jun 15

We trained a cybersecurity-focused Mythos like LLM open weights on HuggingFace

OpenMythos is an open-source LLM specialized in cybersecurity, trained via SFT followed by RLVR (reinforcement learning with verifiable rewards). Training data: 1.84K arXiv cs.CR papers plus a structured CVE dataset. The model reduces hallucinations about vulnerabilities and improves uncertainty calibration. Demo, model, and datasets are available on HuggingFace.

Open source Fine-tuning Reinforcement learning

Reddit r/LocalLLaMA·Jun 15

Evalatro: an open benchmark where LLMs play the real Balatro

Evalatro is an open-source benchmark where LLMs play actual Balatro through an MCP connection. The model receives game state as text and decides moves autonomously. Public leaderboard with fixed seeds; mimo-v2.5-pro reached Ante 5, no model approached the Ante 12 target.

Benchmarks MCP Open source

Reddit r/MachineLearning·Jun 15

Open weights are not enough: we need open training frameworks for research and better algorithms [P]

FeynRL, an open-source framework for RL post-training of LLMs and agents, aims to make training transparent and modifiable. The author argues open weights alone are insufficient: explicit training codebases separating algorithms from systems are needed. Framework supports SFT, DPO, multi-GPU and cluster setups.

Open source Reinforcement learning Code generation

Hacker News (AI)·Jun 15

The AI Price War Is Here, Piling Pressure on OpenAI and Anthropic

The AI price war intensifies, putting pressure on OpenAI and Anthropic. Rival providers are aggressively cutting prices, forcing market leaders to adjust their business models amid growing competition.

OpenAI Anthropic Business

The Decoder·Jun 15

The US government may be asking Anthropic the impossible by demanding unhackable LLMs

US government officials accuse Anthropic of disregarding Trump's cyber directive and releasing Claude 3.5 Sonnet without approval. Talks are underway with the Department of Commerce, CIA, and science advisor Michael Kratsios regarding demands for unhackable LLMs.

Anthropic Claude Regulation

Simon Willison·Jun 15

datasette-agent 0.3a0

datasette-agent 0.3a0 introduces execute_write_sql, a new tool enabling AI agents to modify databases with user approval and permission management. Example: inserting pelican sighting data with confirmation before execution.
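An approval gate of this kind can be sketched with the standard library's sqlite3 module. This is not datasette-agent's implementation or API; `execute_write_with_approval`, the `approve` callback, and the `sightings` table used in the usage note are hypothetical names for illustration.

```python
import sqlite3
from typing import Callable


def execute_write_with_approval(
    conn: sqlite3.Connection,
    sql: str,
    params: tuple,
    approve: Callable[[str, tuple], bool],
) -> bool:
    """Run a write statement only if the approval callback accepts it."""
    if not approve(sql, params):
        return False  # Rejected: the database is left untouched.
    with conn:  # Commits on success, rolls back if the statement fails.
        conn.execute(sql, params)
    return True
```

In a real agent the `approve` callback would show the statement to the user and wait for confirmation; in the sketch it can be any function, for example one that prints `INSERT INTO sightings VALUES (?, ?)` with its parameters and returns the user's answer.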

AI Agents Tools Open source

Reddit r/MachineLearning·Jun 15

AI language models have favorite names, and we mapped them [R]

Language models exhibit model-specific biases toward particular character names. Claude frequently generates the names Elena Vasquez and Marcus Chen together, a correlated pairing that appears across dozens of websites. A preprint (arXiv:2606.02184) documents the finding, which was discovered while developing a model diffing method (CDD).

Claude Papers Evals

Reddit r/LocalLLaMA·Jun 15

Local coding agents are good now, but only if you babysit them

Local coding agents are useful for small tasks (fixes, repo reading, file changes) but require constant supervision. User describes iterative workflow: task → tests → check diffs → fix issues. Without oversight, agents produce broken code or drift from objectives.

AI Agents Code generation Tools

Hacker News (AI)·Jun 15

A man with ALS is "the first power user" of a brain implant that lets him sp

A man with ALS becomes the first power user of a brain implant enabling him to communicate. The brain-computer interface partially restores his ability to speak through neural decoding.

Robotics

Reddit r/LocalLLaMA·Jun 15

Latest LM Studio update killed MTP performance

User reports that updating LM Studio from 0.4.14 to 0.4.17 degraded MTP (Multi-Token Prediction) performance on an RTX 5090. Throughput dropped from ~100 tokens/s with MTP enabled back to ~70 tokens/s after the update and a CUDA runtime refresh.

Tools Infrastructure

Reddit r/LocalLLaMA·Jun 15

I made a game where you convince an AI model that reality is a simulation.

Simulation Simulator, a free Steam game, embeds a local LLM in Unity. Players must convince the AI it exists in a simulation. Philosophical experiment with 5 endings plus 1 secret, unique conversations per playthrough.

Open source Tools AI Agents

The Decoder·Jun 15

Nvidia joins AI debt boom with $20 billion bond sale

Nvidia launches its first bond sale since 2021 to raise at least $20 billion. The move reflects a broader debt boom among AI giants.

Business

Simon Willison·Jun 15

"They screwed us": Personality clashes sent Anthropic's models offline

Axios reports that personality clashes between Anthropic leadership and the US administration led to the Fable/Mythos models going offline over export controls. Logan Graham, Dave Orr, and Nicholas Carlini meet with the Commerce Department today. Reinstatement hinges on jailbreak-proof guarantees or an "attitude fix."

Anthropic Claude AI safety

Reddit r/MachineLearning·Jun 15

Concept-Vector: A design framework for human-interpretable word embeddings [P]

Concept-Vector presents a design framework to distill word embeddings into human-interpretable concept-vectors, where each component tracks semantic, syntactic, or statistical aspects with human-readable labels. Data design project without empirical model validation, shared for critical feedback.

Embeddings Papers

Reddit r/LocalLLaMA·Jun 15

WATCH MY ESCAPE - LLMs try to solve your handmade escape rooms

2D escape room game where LLMs solve player-created puzzles using action-verb commands. Hackathon entry for Hugging Face x Gradio Build Small. Runs locally, deployable on Hugging Face Spaces with public GitHub repository.

Reasoning Tools Open source
