May 2026

3149 articles

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

Comparative study of 5 classifiers (logistic regression, random forest, XGBoost, SVM, naive Bayes) for chronic kidney disease risk prediction. All achieve AUROC 1.00 internally (UCI, 400 patients) but collapse on external MIMIC-IV data (AUROC 0.48-0.58). Calibration and conformal coverage also degrade severely on the external data. No model meets clinical deployment criteria.

Evals AI safety
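
The conformal coverage mentioned in this summary can be checked with a very small computation. The following is a minimal sketch, assuming each patient gets a prediction set of candidate labels; the function name `empirical_coverage`, the 90% nominal level, and the example sets are hypothetical and not taken from the study.

```python
from typing import List, Sequence, Set


def empirical_coverage(pred_sets: Sequence[Set[int]], labels: Sequence[int]) -> float:
    """Fraction of cases whose prediction set contains the true label."""
    if len(pred_sets) != len(labels):
        raise ValueError("need one prediction set per label")
    if not labels:
        raise ValueError("need at least one case")
    hits = sum(1 for s, y in zip(pred_sets, labels) if y in s)
    return hits / len(labels)


# Hypothetical prediction sets for five patients (0 = no CKD, 1 = CKD).
pred_sets: List[Set[int]] = [{0, 1}, {1}, {0}, {0, 1}, {1}]
labels = [1, 0, 0, 1, 1]

nominal = 0.90  # assumed target coverage (1 - alpha)
coverage = empirical_coverage(pred_sets, labels)
gap = coverage - nominal  # negative gap means under-coverage
```

A clearly negative gap on external data is one way the degradation described above would show up.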

arXiv cs.CL·May 22

Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

Psy-Chronicle is a data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. Authors create CPCD, a Chinese dataset of 90,000 dialogues across 100 student profiles spanning a semester, with a benchmark evaluating long-horizon memory and causal reasoning. Code and data open-sourced.

Papers Benchmarks Open source

arXiv cs.CL·May 22

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

Self-supervised framework to align multilingual cultural knowledge in LLMs. Uses multilingual self-consistency to identify reliable cultural responses and transfer them to weaker languages. Improves BLEnD benchmark performance by average 5.03% using only self-generated data.

Prompt engineering Reasoning Benchmarks

arXiv cs.CL·May 22

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Study of factual recall mechanisms in multimodal language models (text + speech). Using causal mediation analysis on SpiritLM, researchers show that knowledge storage and retrieval mechanisms only partially transfer from text to speech modality.

Papers Reasoning Voice

arXiv cs.LG·May 22

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

Extension of Equilibrium Propagation framework to skew-gradient systems with demonstrated equivalence between deep Energy-Based Models and Hamiltonian neural networks. Applied to diffusively coupled Fitzhugh-Nagumo neuron networks, showing stationary solutions admit spatial Hamiltonian structure enabling Hamiltonian Echo Backpropagation methods.

Papers Reasoning Reinforcement learning

arXiv cs.CL·May 22

Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

Corpus of 252,487 Arabic Facebook posts (2013-2024) collected from 51,660 pages across 77 countries covering women's empowerment and social wellbeing. 267 million user interactions analyzed with engagement metrics (shares, comments, reactions). Automated pipeline for language identification, normalization, and metadata cleaning.

Benchmarks Papers

arXiv cs.CL·May 22

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

ArabDiscrim is a corpus of 293K Arabic Facebook posts (2014-2024) on racism and discrimination. It includes 200 curated terms with morphological families (13+ inflections), 20 discrimination axes, and native engagement signals (reactions, shares, comments). Released under restricted research-use license for ethical compliance.

Benchmarks AI safety Alignment

arXiv cs.LG·May 22

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Self-Paced Curriculum Learning (SPCL) framework for multimodal emotion recognition in conversations. Dual-level Difficulty Measurer (utterance and conversation level) guides training from easier to harder instances. IEMOCAP tests show +1.2% to +6.6% F1 improvement, MELD reaches +10.4%, addressing modality imbalance.

Reasoning Benchmarks

arXiv cs.CL·May 22

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

Comparative study of sentiment classification models on IMDb: Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, RoBERTa, and DistilBERT. RoBERTa achieves 93.02% accuracy. Soft voting ensemble improves performance.

Benchmarks
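
The soft-voting step mentioned in this summary can be illustrated as follows. This is a minimal sketch, assuming every model outputs one probability per class; the function names `soft_vote` and `predict` and the example probabilities are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence


def soft_vote(
    probas: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-class probabilities across models."""
    if not probas:
        raise ValueError("need probabilities from at least one model")
    n_classes = len(probas[0])
    if any(len(p) != n_classes for p in probas):
        raise ValueError("all models must score the same classes")
    if weights is None:
        weights = [1.0] * len(probas)  # plain (unweighted) soft voting
    total = sum(weights)
    return [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]


def predict(probas: Sequence[Sequence[float]]) -> int:
    """Index of the class with the highest averaged probability."""
    avg = soft_vote(probas)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical [negative, positive] probabilities from three models for one review.
model_probas = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
averaged = soft_vote(model_probas)
label = predict(model_probas)  # 1 means "positive" in this toy setup
```

Unlike hard voting, a single confident model can outweigh two lukewarm ones, which is why the second model's dissent does not flip the label here.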

arXiv cs.CL·May 22

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

LatentOmni proposes an audio-visual reasoning framework using unified latent space instead of explicit text chain-of-thought. The model interleaves textual reasoning with audio-visual latent states, introduces Omni-Sync Position Embedding (OSPE) for temporal consistency, and leverages LatentOmni-Instruct-35K (35K annotated trajectories). Outperforms text-based baselines on audio-visual benchmarks.

Reasoning Papers

arXiv cs.CL·May 22

Token-weighted Direct Preference Optimization with Attention

Token-weighted DPO (TwDPO) and AttentionPO introduce preference optimization that weights tokens by importance. AttentionPO uses the model's own attention to estimate weights without a separate reward model. Results: improvements on AlpacaEval, MT-Bench, ArenaHard.

Reinforcement learning Alignment Benchmarks

arXiv cs.CL·May 22

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

Claim-selective certification system for high-risk medical RAG. Each response decomposed into verifiable claims, scored against retrieved evidence, mapped to {full, partial, conflict, abstain}. On weak-label certificate protocol, UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, action accuracy=0.9204 (dev, n=314) and 0.8997 (test, n=319).

RAG Evals AI safety

arXiv cs.CL·May 22

ACC: Compiling Agent Trajectories for Long-Context Training

ACC converts agent trajectories (search, software engineering, database querying) into long-context QA pairs for supervised fine-tuning (SFT). Removes tool response masking and creates explicit supervision over distant dependencies. Qwen3-30B-A3B achieves +18.1 on MRCR and +7.6 on GraphWalks, comparable to Qwen3-235B.

AI Agents Reasoning Fine-tuning

arXiv cs.CL·May 22

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

Hy-MT2 is a family of multilingual translation models (1.8B, 7B, 30B-MoE) supporting 33 languages. The 1.8B model quantized at 1.25-bit weighs 440 MB and improves inference speed by 1.5x. The 7B and 30B models outperform DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode; the 1.8B surpasses commercial APIs from Microsoft and Doubao.

Benchmarks Code generation DeepSeek

arXiv cs.CL·May 22

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

OGCaReBench is a retrieval-focused benchmark evaluating LLMs on off-guideline clinical questions extracted from published medical case reports. GPT-5.2 achieves 56% without retrieval, 82% with retrieved medical articles. Specialized models reach only 42%.

Benchmarks RAG Reasoning

arXiv cs.CL·May 22

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

Study across 6,620 runs showing Claude Haiku compresses 10 English intensity modifiers into 5 distinct outputs. System state context dominates lexical effect (explained variance: 0.782 vs 0.079). Near operational boundaries, model exhibits three modes: small adjustments for weak words, abstention for strong words, ceiling-pushing for 'drastically'.

Claude Evals Reasoning

arXiv cs.CL·May 22

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge is a benchmark generator for evaluating LLMs-as-judges on multi-turn conversations grounded in reference documents. The system creates conversation pairs with a single flaw injected into one turn, enabling unambiguous labeling. Evaluation of 21 frontier LLM judges ranked via Bradley-Terry model across machine learning, biomedicine, and finance domains.

Evals Benchmarks Multi-agent
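
The Bradley-Terry ranking mentioned in this summary can be sketched with the standard minorization-maximization update. This is a minimal illustration, assuming pairwise win counts between judges are available; how RankJudge derives those outcomes is not stated here, and the function name `bradley_terry` and the win matrix are hypothetical.

```python
from typing import List


def bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Estimate Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] is the number of times item i beat item j (diagonal is 0).
    Uses the classic MM update and renormalizes strengths to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # Keep the old value if item i never played anyone.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical results for three judges, ten comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Under the model, the probability that judge i beats judge j is `strengths[i] / (strengths[i] + strengths[j])`, so the ranking follows directly from the fitted strengths.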

arXiv cs.CL·May 22

Probabilistic Attribution For Large Language Models

Novel probabilistic token attribution method for LLMs using conditional probabilities and Bayes rule to invert next-token log-probabilities. Captures model's internal representation of token sequence distribution independent of computational structure. Evaluates 8 models across 7 prompts to analyze token sensitivity, response stability, and training convergence.

Papers Reasoning Evals

arXiv cs.CL·May 22

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

PromptNCE estimates pointwise mutual information via LLM without training, using only prompts and elicited probabilities. The method frames conditional probability estimation as a contrastive task with explicit OTHER category. Spearman correlation up to 0.82 on three datasets with human ground-truth.

Prompt engineering Papers Benchmarks
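
The quantity being estimated can be written as PMI(x; y) = log p(y | x) / p(y). The sketch below shows only that final arithmetic, assuming probabilities over a candidate set plus an explicit OTHER bucket have already been elicited from an LLM; the prompting itself is not shown, and the helper names and numbers are hypothetical rather than taken from the paper.

```python
import math
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale elicited scores so they sum to 1 across all categories."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {k: v / total for k, v in scores.items()}


def pmi(p_y_given_x: float, p_y: float) -> float:
    """Pointwise mutual information in nats: log p(y|x) / p(y)."""
    if p_y_given_x <= 0 or p_y <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_y_given_x / p_y)


# Hypothetical elicited probabilities; OTHER absorbs mass outside the candidates.
conditional = normalize({"rain": 0.6, "sun": 0.2, "OTHER": 0.2})  # p(y | x)
marginal = normalize({"rain": 0.3, "sun": 0.5, "OTHER": 0.2})     # p(y)

pmi_rain = pmi(conditional["rain"], marginal["rain"])  # positive: x makes "rain" likelier
pmi_sun = pmi(conditional["sun"], marginal["sun"])     # negative: x makes "sun" less likely
```

Keeping an OTHER bucket in both distributions means the candidate probabilities are not forced to sum to 1 on their own.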

arXiv cs.CL·May 22

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

CR4T is a safety framework for adolescent-facing LLMs. Instead of refusing problematic requests, it rewrites unsafe responses into developmentally appropriate guidance. Combining lightweight risk detection with domain-conditioned rewriting, CR4T reduces unnecessary refusals while preserving benign intent.

AI safety Alignment Papers

arXiv cs.CL·May 22

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Sem-Detect detects AI-generated peer reviews by combining textual features with claim-level semantic analysis. The method compares a target review against multiple AI-generated reviews of the same paper, exploiting the tendency of AI models to converge while human reviewers diverge. On 20,000+ ICLR/NeurIPS reviews, Sem-Detect improves on the strongest baseline by 25.5% in TPR@0.1% FPR.

Evals AI safety Papers

arXiv cs.AI·May 22

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

FlowLM converts pre-trained diffusion language models into flow matching models via efficient fine-tuning. By realigning curved diffusion trajectories into straight-line flows, FlowLM achieves high-quality few-step text generation rivaling 2,000-step diffusion sampling. Performance saturation reached with half the training epochs compared to training from scratch.

Code generation Reasoning Papers

arXiv cs.AI·May 22

Evaluating multimodal emotion recognition in proactive conversational agents: A user study

Empirical study (20 users) of a multimodal conversational agent with emotion recognition. Computer vision and linguistic analysis integrated. Key finding: text analysis outperforms facial recognition (poker face effect). Miscalibrated proactivity reduces user engagement.

AI Agents Vision Evals

arXiv cs.AI·May 22

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

arXiv paper on data scaling laws: progressive coverage of a latent predictive contribution spectrum (via suffix-automaton representation) strongly correlates with empirical scaling exponent. Across 12 real corpora, log K(N) shows near-linear relationship with log N (R²≈0.96), suggesting training advances an effective frontier through a predictive state spectrum.

Benchmarks Papers Reasoning

arXiv cs.CL·May 22

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Faithful-MR1 is a training framework for MLLMs improving multimodal reasoning via reinforcement learning. It anchors visual attention directly on image regions (not via textual descriptions) and reinforces faithful use through counterfactual image intervention. Results on Qwen2.5-VL-Instruct 3B/7B with substantially less training data.

Reinforcement learning Vision Reasoning

arXiv cs.AI·May 22

Mind the Sim-to-Real Gap & Think Like a Scientist

Theoretical work on balancing pre-trained simulators with real experiments in sequential decision-making. Decomposes simulator error into calibration-deployment shift and parametric residual. Proposes Fisher-SEP, an experimental policy minimizing posterior predictive variance. Case studies: vending-machine supply chain and HIV mobile testing.

Reinforcement learning Reasoning Papers

arXiv cs.AI·May 22

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a deep research benchmark evaluating 9 frontier models on tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. Errors stem primarily from derivation and calibration (>70%), not retrieval (12-14%). Strong and weak models fail differently: incomplete derivation vs hallucinated precision.

Benchmarks Reasoning AI Agents

arXiv cs.AI·May 22

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot generates critical scenarios for autonomous driving testing via multi-objective reinforcement learning. The framework combines RSS-derived physical feasibility with an AV-risk predictor to target boundary-band scenarios: physically solvable yet causing failures. Results: +6.2 percentage points collision rate on SafeBench while preserving physical validity.

Reinforcement learning AI safety Evals

arXiv cs.AI·May 22

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

DIP (Diverge-to-Induce Prompting) generates multiple diverse rationales per question, elaborates them into detailed step-by-step plans, then induces them into a final plan. Improves zero-shot reasoning accuracy over single-strategy prompting without resource-intensive sampling.

Prompt engineering Reasoning Papers

arXiv cs.AI·May 22

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

BlueSky vision for native AI integration in 6G: shift from "Network for AI" to "AI for Network" paradigm. Proposes unified foundation model orchestrated by multi-agent systems to manage networks as multi-modal multi-task optimization problem, with knowledge distillation for edge deployments and autonomous network diagnosis/maintenance.

Multi-agent AI Agents Reasoning

arXiv cs.AI·May 22

Governance by Construction for Generalist Agents

CUGA introduces a modular governance system for generalist LLM agents in enterprise settings. Through five enforcement checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter), the platform enforces policies without model fine-tuning, ensuring compliance and auditability across compound workflows.

AI Agents AI safety Alignment

arXiv cs.AI·May 22

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench is a framework for generating scalable and verifiable planning data. It abstracts 30+ task types and difficulty factors from real scenarios, then synthesizes problems with adaptive control and automatic verification. RL training on verified data improves performance on unseen benchmarks.

Benchmarks Reasoning Reinforcement learning

arXiv cs.AI·May 22

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

RL study on fighting games (Street Fighter II). Agents learn to predict both action and execution duration instead of deciding at fixed intervals. FightLadder experiments: learned timing matches fixed frame-skip performance but encourages repeatable, exploitable action patterns.

Reinforcement learning AI Agents Papers

arXiv cs.AI·May 22

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Students construct QuestBench, a 256-question benchmark across humanities and social sciences, to evaluate deep research systems. Testing reveals GPT-4.5 reaches 57.58% pass rate while mean performance is 16.85% across 13 systems, exposing hidden failures. This classroom practice teaches students to judge AI output quality and remain responsible knowledge actors.

Benchmarks Evals GPT

arXiv cs.AI·May 22

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

DDS (Declarative Data Services) is an architecture for structured agentic discovery of data-system compositions. Addressing unbounded agentic discovery failures, the framework decomposes search into typed sub-searches via four contracts (intent, operator DAG, skills, runtime attribution). Tested on a trading-backend workload, DDS converges where unbounded approaches fail.

AI Agents Multi-agent Papers

arXiv cs.AI·May 22

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Hierarchical multi-agent autonomous network architecture (HANA) shifting from static automation to agent-native intelligence in 5G environments. Dual-Driven Orchestrator coordinates specialized Executive Agents with shared Public Memory and agent self-awareness. 5G Core validation: 86% MTTR reduction, sustained throughput under congestion.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 22

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

Researchers introduce "personality engineering," a methodology using AI agents to rigorously test negotiation theories. AI agents precisely parameterize negotiator personalities along two dimensions (warmth and dominance) from the interpersonal circumplex, enabling controlled experiments impossible with humans.

AI Agents Papers Reasoning

arXiv cs.AI·May 22

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

New inference-time method for flow models: Conflict-Aware Additive Guidance (CAR) corrects off-manifold drift when composing multiple constraints. Dynamically detects and resolves gradient conflicts. Validated on image editing, planning, and control tasks.

Reasoning Evals Code generation

arXiv cs.AI·May 22

Open-World Evaluations for Measuring Frontier AI Capabilities

New evaluation approach for frontier AI: 'open-world evaluations' complement benchmarks by testing complex real-world tasks over long horizons. CRUX project demonstrates an AI agent developing and publishing an iOS app to Apple App Store with only one avoidable manual intervention, revealing emerging capabilities.

Evals AI Agents Benchmarks

arXiv cs.AI·May 22

High Quality Embeddings for Horn Logic Reasoning

Method for creating high-quality embeddings for Horn logic reasoning. Authors use triplet loss with three innovations: anchors with repeated terms, balanced easy/medium/hard examples, and periodic emphasis of hardest cases. Evaluation across multiple knowledge bases.

Embeddings Reasoning Papers
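
For readers unfamiliar with the triplet loss that this summary builds on, here is a minimal sketch of the standard margin formulation. It does not implement the paper's three innovations (repeated-term anchors, difficulty balancing, periodic hard-case emphasis); the Euclidean distance, the margin of 1.0, and the toy vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]

# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss(anchor, positive, [3.0, 4.0])

# Hard triplet: the negative is as close as the positive, so the full margin is paid.
hard = triplet_loss(anchor, positive, [1.0, 0.0])
```

The easy/hard contrast above is the kind of difficulty signal that a balanced sampling scheme would act on.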

arXiv cs.AI·May 22

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

GraphDiffMed presents a medication recommendation framework using dual-scale Differential Attention v2 with pharmacological constraints. Tested on MIMIC-III, the model filters noise at intra-visit and inter-visit levels while integrating drug-drug interactions, outperforming baselines on quality and safety metrics.

Benchmarks Papers AI safety

arXiv cs.LG·May 22

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

I-SAFE is a post-hoc auditing framework for scientific AI models based on the Wasserstein Coherence Metric (WCM). It evaluates whether model predictions reflect domain structure or exploit statistical shortcuts. Tested on drug-target interaction prediction (DeepConvDTI, DeepDTA, TAPB), it reveals distinct distributional response profiles invisible to accuracy metrics.

Evals AI safety Alignment

arXiv cs.LG·May 22

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token introduces cross-tokenizer knowledge distillation via two complementary loss formulations (P-KL and H-KL) using a projection matrix W. On Llama-3.2-1B, the method outperforms GOLD by +3.82 points with Qwen3-4B and +0.5 with Phi-4-Mini; two-teacher setup (Phi-4-mini + Llama-3B) gains +1.3 points.

Fine-tuning Benchmarks Llama

arXiv cs.LG·May 22

Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

Prior-data fitted networks (PFNs) excel at tabular classification but suffer from class imbalance, which hurts performance on rare classes. This study adapts classical mitigation techniques (thresholding, downsampling) to PFNs, finding that thresholding performs best thanks to PFN calibration properties, while downsampling gives comparable results at reduced inference cost.

Benchmarks Evals
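
Thresholding, as referred to in this summary, generally means moving the decision cutoff away from 0.5 on a held-out set. The sketch below is a generic illustration for a binary task, assuming F1 on the rare class as the selection criterion; the paper's actual criterion is not stated here, and the function names and validation data are hypothetical.

```python
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (rare) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    y_true: Sequence[int],
    probs: Sequence[float],
    candidates: Sequence[float],
) -> float:
    """Pick the candidate cutoff that maximizes F1 on validation data."""
    def score(th: float) -> float:
        preds = [1 if p >= th else 0 for p in probs]
        return f1_score(y_true, preds)
    return max(candidates, key=score)


# Hypothetical validation set: two rare positives among eight rows.
y_val: List[int] = [0, 0, 0, 0, 0, 0, 1, 1]
p_val: List[float] = [0.05, 0.10, 0.20, 0.30, 0.15, 0.45, 0.35, 0.40]

threshold = best_threshold(y_val, p_val, [0.1, 0.2, 0.3, 0.35, 0.4, 0.5])
default_f1 = f1_score(y_val, [1 if p >= 0.5 else 0 for p in p_val])
tuned_f1 = f1_score(y_val, [1 if p >= threshold else 0 for p in p_val])
```

With the default 0.5 cutoff neither positive is predicted, while the tuned cutoff recovers both at the cost of one false positive. Well-calibrated probabilities make such a cutoff more likely to transfer to new data.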

arXiv cs.AI·May 22

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

SOLAR is an autonomous agent using parameter-level meta-learning to continuously adapt to non-stationary data streams. It combines multi-level reinforcement learning and episodic memory to balance plasticity and stability, outperforming baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks.

AI Agents Reinforcement learning Reasoning

arXiv cs.LG·May 22

Hierarchical Variational Policies for Reward-Guided Diffusion

Hierarchical variational framework for adapting pretrained diffusion models to reward-aligned objectives. Formulates test-time adaptation as a lightweight stochastic policy that amortizes per-step control. On 4x super-resolution: better perceptual quality with 5x faster inference than best baseline.

Reinforcement learning Image generation

arXiv cs.LG·May 22

EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV introduces a sparse decoding framework for entmax attention, exploiting exact zeros produced by entmax versus softmax's dense tails. Combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. Achieves 3.36× speedup (softmax) and 5.43× (entmax) on 1M context using reduced KV cache fraction.

Reasoning Benchmarks Infrastructure

arXiv cs.LG·May 22

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

AI text detectors amplify a pretrained typicality axis rather than construct an AI-vs-human boundary. On RoBERTa-base, raw projection onto centroid(AI)-centroid(HC3) achieves AUROC 0.806-0.944, matching or exceeding fine-tuning. A closed-form Jacobian predictor transfers to 16/16 third-party detectors with oracle-equivalence, reducing FPR by 57% on the OpenAI detector.

Evals Benchmarks AI safety

arXiv cs.AI·May 22

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

OSCToM combines RL and surrogate models to generate observer-agent conflicts in Theory of Mind tasks. On FANToM (information-asymmetric benchmark), OSCToM-8B reaches 76% accuracy vs 0.2% for ExploreToM. Data synthesis is 6x more efficient.

Reasoning Reinforcement learning Benchmarks

arXiv cs.LG·May 22

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

New explainable prototype framework integrating feature importance at two levels: "alike parts" for local explanations (highlights shared feature subsets between instance and prototype) and augmented global selection to promote diversity in prototype feature attributions. Experiments on 6 benchmarks show maintained or improved surrogate model fidelity.

Evals Papers

arXiv cs.LG·May 22

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

Authors show teacher-token reliability in reasoning self-distillation depends on position within trajectory, not local entropy. They propose Position-Weighted OPSD (PW-OPSD), applying increasing position weights to token supervision. On Qwen3-4B, AIME 2024/2025 improve by +1.0/+1.1 points; validation on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think confirms gains.

Reasoning Fine-tuning Benchmarks

arXiv cs.LG·May 22

AgForce Enables Antigen-conditioned Generative Antibody Design

AgForce, an encoder-decoder architecture with GNN, addresses three failure modes in antibody design: antigen blindness, vocabulary collapse, and inability to generate antigen-specific sequences. Uses framework dropout, gated bottlenecks, hyperbolic attention, and Mixture Density Network. Improves amino acid recovery by 8% on CHIMERA-Bench.

Papers Benchmarks Code generation

arXiv cs.AI·May 22

Interaction Locality in Hierarchical Recursive Reasoning

Framework to measure whether information flow stays localized or crosses semantic boundaries in spatial reasoning. Applied to HRM and TRM (hierarchical recursive models) on Maze-Hard, Sudoku Extreme, and ARC-AGI. Activation patching reveals high-level recurrent states write locally, progressively accumulating global structure.

Reasoning Evals Papers

arXiv cs.LG·May 22

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

Joint optimization framework for multi-slot guaranteed display advertising allocation. Formulates problem as offline bipartite matching with contract roulette mechanism and Page View constraints. Online A/B tests on Meituan platform: +28.99% ARPU at 70% traffic, improved contract stability.

Business Benchmarks

arXiv cs.AI·May 22

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

Study proposing a parallel chunk-level processing framework for analyzing long documents with LLMs. Text is divided into semantically coherent segments processed independently, then consolidated with explicit evidence anchoring. Results: 84% reduction in omission error, 130% increase in evidence traceability, 91% reduction in unsupported claims.

Reasoning Evals Prompt engineering

arXiv cs.LG·May 22

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

P2D, an LLM alignment framework, couples data selection with parameter-efficient fine-tuning by identifying task-critical attention heads. It mines high-affinity data and prunes 90% of parameters using these heads as a functional filter. Result: +8.3pp performance gain and 7.0x end-to-end speedup using only 10% of data and 10% of heads.

Fine-tuning Reasoning Alignment

arXiv cs.LG·May 22

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

DualOptim+ is an optimization framework for machine unlearning in LLMs. It uses shared base states and decoupled delta states to balance forgetting and retention objectives. An 8bit variant reduces memory overhead. Tested on fictitious/real unlearning, safety alignment, and multi-task learning.

Fine-tuning AI safety Alignment

arXiv cs.LG·May 22

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

yvsoucom-iterkit, a deterministic log-driven AutoML framework, optimizes medical risk prediction pipelines across 18,000+ configurations. On Pima and Stroke datasets, augmentation (0.454), model choice (0.198), and imbalance handling (0.101–0.406) are key drivers. Ensembles achieve F1 0.89–0.94 with cross-seed robustness (variability 0.023–0.026).

Benchmarks Evals Fine-tuning

arXiv cs.CL·May 22

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

Study of 25 embedding models on 5 MTEB tasks showing that nearest-neighbor overlap and magnitude differences in ICA strongly correlate (up to 0.97) with performance. Embedding tasks display varying degrees of linearity and reliance on local information retention.

Embeddings Benchmarks Evals

arXiv cs.LG·May 22

Double descent for least-squares interpolation on contaminated data: A simulation study

Simulation study on the double descent phenomenon in linear regression with contaminated data. Authors compare least-squares interpolation (non-robust) against robust alternatives. Finding: overparametrization enables double descent, and the least-squares estimator outperforms robust methods despite outliers.

Benchmarks Papers

arXiv cs.LG·May 22

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

Impossibility theorem: no feature ranking can be simultaneously faithful, stable, and complete under collinearity. Authors quantify the result for 4 model classes, propose DASH (Diversified Aggregation of SHAP) as resolution, and formally verify 305 Lean 4 theorems. Consequence: 68% of public datasets exhibit attribution instability.

Evals Papers AI safety

arXiv cs.LG·May 22

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

AC-GATE, a neural model with adaptive gating, discovers how different entities (countries) respond to historical signals across varying time horizons in panel time series. The framework separates predictive calibration from lag discovery, validated on synthetic data with known ground-truth lags and two real country-level panels.

Benchmarks Papers

arXiv cs.LG·May 22

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Benchmark of TabPFN (tabular foundation model) on 66 NIR datasets (54 regression, 12 classification tasks). Optimized TabPFN outperforms PLS, CatBoost, and CNN-1D in regression; matches Ridge in classification. Advantage diminishes on spectral outliers and extrapolation.

Benchmarks Papers Tools

OpenAI Blog·May 22

How Virgin Atlantic ships faster with Codex

Virgin Atlantic used Codex to ship its revamped mobile app on a fixed holiday travel deadline, achieving near-total unit test coverage and zero P1 defects.

Code generation OpenAI

OpenAI Blog·May 22

OpenAI named a Leader in enterprise coding agents by Gartner

OpenAI named leader in Gartner's 2026 Magic Quadrant for Enterprise AI Coding Agents. Codex recognized for innovation and enterprise-scale deployment.

OpenAI Code generation AI Agents

Hacker News (AI)·May 21

Google API keys can remain usable for up to 23 minutes after deletion

Google API keys remain active for up to 23 minutes after deletion. This propagation delay creates a vulnerability window where deleted keys can still be exploited.

AI safety Infrastructure

Reddit r/LocalLLaMA·May 21

Comparison of Qwen 3.6 and Gemma4 (MoE and Dense models, Q4_K_M), generating a moderately complex MySQL query, only one produced acceptable results

Comparison of Qwen 3.6 (35B MoE, 27B Dense) and Gemma4 (26B MoE, 31B Dense) in Q4_K_M for generating a complex MySQL query. Only Gemma4 31B Dense produced a working, exact query. Gemma4 31B was also considerably faster than Qwen 3.6 27B, even in Q6_K.

Qwen Gemini Code generation

Reddit r/LocalLLaMA·May 21

Latest b9274 Addresses MTP VRAM leak

Commit b9274 fixes a VRAM leak in MTP (Multi-Token Prediction) models. The destroy() function failed to free speculative decoder, draft context, and draft model resources, causing memory accumulation on each sleep/resume cycle. Fix explicitly resets these components before llama_init.

Llama Code generation Infrastructure

Latent Space·May 21

Giving Agents Computers — Ivan Burazin, Daytona

Daytona, an agent execution platform, reports 74% MoM growth and 850K daily runs. The startup offers bare metal sandboxes and reinforcement learning evals for autonomous agents.

AI Agents Reinforcement learning Evals

Reddit r/LocalLLaMA·May 21

Your repo is a preference dataset: extracting taste from merge history

A technique to extract preferences from a repository's merge history. Assuming accepted revisions incrementally improve code quality, preference signals can be distilled to align AI agents with institutional practices, avoiding expensive expert annotation.

AI Agents Fine-tuning Reinforcement learning

Reddit r/LocalLLaMA·May 21

Qwen3.6 35Ba3 has changed my workflows and even how I use my computer

A Qwen 3.6 35B user describes how this local model transformed his workflow: automating DevOps tasks, generating code, and interacting with the OS in natural language. He built a complete website from WhatsApp audio transcripts, using local agents to execute modification tickets in parallel.

Qwen AI Agents Code generation

SIG

HYP

Hacker News (AI)·May 21

Show HN: ANML – A machine-first markup language for the agentic web (IETF Draft)

ANML is a markup language designed for AI agents, proposed as an IETF draft. It aims to structure web content in machine-readable format to enable autonomous agents to interact with web pages more effectively.

AI Agents Tools Infrastructure

SIG

HYP

ActuIA·May 21

Anthropic leases Colossus 1 for $1.25B/month on an xAI fleet that tops out at 11% of capacity

Anthropic leases Colossus 1, xAI's supercomputer, for $1.25B/month through May 2029 ($40B+ total). The contract caps Anthropic's access at 11% of cluster capacity, restricting the company to a fraction of available resources.

Anthropic Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·May 21

Waiting for Qwen 3.7 open weight... The new King has arrived...

Qwen 3.7 open-weight model announced. Reddit post generates hype around the release, lacking specific technical details in the excerpt.

Qwen Open source

SIG

HYP

Simon Willison·May 21

Datasette Agent

Datasette Agent, a conversational AI assistant for Datasette, has been released. It enables asking questions about data stored in Datasette and generates charts via the datasette-agent-charts plugin. The demo runs on Gemini 3.1 Flash-Lite.

AI Agents RAG Gemini

SIG

HYP

Google DeepMind·May 21

We’re launching the Google DeepMind Accelerator program in Asia Pacific to tackle environmental risks

Google DeepMind launches an Accelerator program in Asia Pacific to develop AI solutions addressing environmental risks. The initiative provides funding, technical expertise, and Google Cloud access to local startups and researchers.

DeepMind Business AI safety

SIG

HYP

Reddit r/LocalLLaMA·May 21

Interesting paper advocates for quantized prefilling and precise decoding

Paper proposes Mix-Quant: use W4A4 quantization for prefilling (theoretical 4x speedup) but keep full precision for decoding. Prefilling tolerates quantization errors since they don't accumulate, unlike autoregressive decoding where each token affects subsequent generation.

Benchmarks

SIG

HYP

Hacker News (AI)·May 21

Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O

New paper on parallelizing LLM operations by separating prompt, reasoning, and I/O streams. Enables simultaneous processing of multiple operations to optimize resource utilization.

Papers Reasoning Infrastructure

SIG

HYP

Le Big Data·May 21

Honor Magic V6: how agentic AI and breakthrough engineering are reinventing the foldable smartphone

Honor unveils Magic V6 at MWC 2026 with agentic AI integration. The manufacturer positions the foldable smartphone as a breakthrough innovation rather than a gadget.

AI Agents Business

SIG

HYP

Reddit r/MachineLearning·May 21

Can liveness detection models generalise to synthetic media generation techniques they were never trained on? [D]

Production liveness detection systems rely on outdated threat models (static images, basic replays). Current synthetic media quality far exceeds historical training data. Critical question: can models trained on legacy deepfakes generalize to generation techniques that didn't exist when training data was collected?

AI safety Evals Benchmarks

SIG

HYP

The Decoder·May 21

Google checks websites for llms.txt in new agentic browsing audit

Google tests how websites handle AI agents through a new experimental 'Agentic Browsing' category in its Lighthouse analysis tool. The test includes checking for llms.txt file presence.

AI Agents DeepMind Tools

SIG

HYP

Hacker News (AI)·May 21

Starbucks scraps AI inventory tool across North America

Starbucks scraps its AI-powered inventory management tool across North America. The system, deployed to optimize stock management, failed to deliver expected results and has been withdrawn from operations.

Business Tools

SIG

HYP

Le Big Data·May 21

Warp: how the open-source terminal is reinventing coding in the era of agentic AI

Warp, an open-source terminal, positions itself as a development tool reinvented for the era of AI agents. Developers are adopting assistants that act autonomously and execute complex tasks, going beyond simple code completion.

AI Agents Code generation Open source

SIG

HYP

Reddit r/LocalLLaMA·May 21

LatitudeGames/Equinox-31B · Hugging Face

LatitudeGames releases Equinox-31B, a Gemma 31B fine-tune optimized for interactive storytelling. The model blends dark adventure and slice-of-life narrative data. It is available on Hugging Face in GGUF format and accessible via aidungeon.com with a subscription.

Fine-tuning Open source Tools

SIG

HYP

GitHub Trending·May 21

ChromeDevTools / chrome-devtools-mcp

Chrome DevTools MCP: a Model Context Protocol (MCP) server that lets AI agents interact directly with Chrome DevTools for real-time debugging and inspection of web applications.

AI Agents MCP Tools

SIG

HYP

Simon Willison·May 21

datasette-agent-sprites 0.1a0

Release of datasette-agent-sprites 0.1a0, a Datasette Agent plugin for running commands in a Fly Sprites sandbox.

AI Agents Tools Open source

SIG

HYP

Le Big Data·May 21

iPhone users, you can now pre-order the Google AI Studio app

Google AI Studio is now available for pre-order on the App Store for iPhone. The app gives iOS users access to Google's AI tools.

DeepMind Tools

SIG

HYP

Reddit r/MachineLearning·May 21

I created an LLM post-training method called RPS. Preliminary results show that it improved Qwen3-8b's program synthesis reliability. [R]

RPS is a two-stage post-training method inspired by neuroplasticity: easy data at a high learning rate, then hard data at a learning rate reduced by 90%. On Qwen3-8b, RPS achieves 4% on ARC-AGI 1 and 1145/1200 error-free program executions, versus 2.4% and 870/1200 for EPS (equal rate).

Qwen Fine-tuning Code generation

SIG

HYP
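
The two-stage schedule described in the RPS item above can be sketched as a step-dependent learning rate. This is a minimal illustration, not the author's implementation: the function name `two_stage_lr`, the `easy_steps` boundary, and the base rate are hypothetical. Only the 90% reduction for the hard-data stage comes from the summary.

```python
def two_stage_lr(step: int, easy_steps: int, base_lr: float,
                 reduction: float = 0.9) -> float:
    """Return the learning rate for a training step (hypothetical helper).

    Steps in [0, easy_steps) are assumed to train on easy data at base_lr.
    Later steps are assumed to train on hard data at base_lr * (1 - reduction).
    """
    if step < 0 or easy_steps < 0:
        raise ValueError("step and easy_steps must be non-negative")
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    if step < easy_steps:
        return base_lr  # stage 1: easy data, full learning rate
    return base_lr * (1.0 - reduction)  # stage 2: hard data, reduced rate
```

Setting `reduction=0.0` would give an equal-rate baseline in which both stages share the same learning rate.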

The Decoder·May 21

OpenAI shifts the boundary of automated reasoning with a "milestone in AI mathematics" that experts are now unpacking

OpenAI's reasoning model disproved a 1946 Erdős conjecture in unit-distance geometry using unexpected algebraic number theory tools. Fields Medalist Tim Gowers calls it "a milestone in AI mathematics."

OpenAI Reasoning Benchmarks

SIG

HYP

Hacker News (AI)·May 21

Launch HN: Runtime (YC P26) – Sandboxed coding agents for everyone on a team

Runtime (YC P26) launches a platform for sandboxed coding agents accessible to entire teams. Enables secure collaborative code execution without complex infrastructure setup.

AI Agents Code generation Tools

SIG

HYP

Le Big Data·May 21

After laying off 8,000 people, Meta promises to stop (for now)

Meta cut approximately 8,000 jobs (10% of workforce) and announced a temporary halt to layoffs. The company continues restructuring while promising near-term workforce stability.

Business

SIG

HYP

Reddit r/LocalLLaMA·May 21

For everyone that uses OpenCode / Pi - Here's your prompt processing fix!

A PR on llama.cpp fixes repeated prompt processing when using OpenCode or Pi. The fix addresses a performance issue identified in integration with these tools.

Open source Code generation Infrastructure

SIG

HYP

Reddit r/MachineLearning·May 21

Does this idea sound fun? [R]

Researcher proposes a PoC of inference-time learning that inserts specialized experts to update sibling expert weights in an MoE architecture. It reuses existing components, and preliminary results show promise.

AI Agents Fine-tuning

SIG

HYP

The Decoder·May 21

Cohere open-sources its strongest model yet

Cohere releases Command A+, its most powerful language model to date, as open source under the Apache 2.0 license.

Open source

SIG

HYP

Simon Willison·May 21

datasette-agent-charts 0.1a2

Release of datasette-agent-charts 0.1a2. Adds "View SQL query" buttons below rendered charts to inspect generated SQL queries.

AI Agents Tools Open source

SIG

HYP

The Decoder·May 21

Anthropic is about to become the first profitable AI lab

Anthropic is approaching its first profitable quarter with projected operating profit of $559 million on $10.9 billion Q2 revenue. Profitability is accelerated by coding tools and agentic Claude usage, which at times exceeded available compute capacity.

Claude AI Agents Code generation

SIG

HYP

The Decoder·May 21

OpenAI could file confidential IPO paperwork within days

OpenAI is preparing an IPO and could file confidential paperwork with the SEC within days, according to the Wall Street Journal.

OpenAI Business

SIG

HYP

The Decoder·May 21

SpaceX IPO filing shows billions in AI losses, a $2 trillion valuation target, and turbine spending that signals more data center conflicts ahead

SpaceX's IPO filing reveals xAI losses of $6.36 billion in 2025 and a $15 billion/year Anthropic compute deal, and targets a $2 trillion valuation. Musk retains 85.1% voting power through dual-class shares.

Anthropic Business Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·May 21

Agent Execution Tax: new procurement metric for browser agent benchmarks?

WebVoyager benchmark on 720 browser agent tasks: MiniMax M2.5 costs 2.3× less per successful task than Gemini 2.5 Flash. GLM-5 achieves 57.1% accuracy, Kimi K2.5 shows 0% parse retry rate. Open-weight models outperform Gemini not through intelligence but reliability. True cost exceeds per-token pricing once retries compound.

AI Agents Benchmarks Open source

SIG

HYP
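
The claim in the Agent Execution Tax item above, that true cost exceeds per-token pricing once retries compound, can be illustrated with a small calculation. This is a sketch under stated assumptions, not the post's actual metric: the function name, the parameters, and the geometric retry model (each attempt is retried independently with probability `retry_rate`) are all hypothetical.

```python
def cost_per_successful_task(cost_per_attempt: float,
                             retry_rate: float,
                             success_rate: float) -> float:
    """Expected spend per successfully completed task (hypothetical model).

    cost_per_attempt: average token cost of one attempt.
    retry_rate: probability an attempt must be retried (e.g. a parse failure).
    success_rate: fraction of tasks that end in success.
    """
    if not 0.0 <= retry_rate < 1.0:
        raise ValueError("retry_rate must be in [0, 1)")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    # Assumed geometric model: expected attempts per task = 1 / (1 - retry_rate).
    expected_attempts = 1.0 / (1.0 - retry_rate)
    # Spend on failed tasks is amortized over the tasks that succeed.
    return cost_per_attempt * expected_attempts / success_rate
```

Under this model, a cheaper-per-token model with a high retry rate or a low success rate can end up costing more per successful task than a pricier but more reliable one.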

Simon Willison·May 21

datasette-agent 0.1a3

Release of datasette-agent 0.1a3, an extensible AI assistant for Datasette. Alpha version enabling AI-powered interaction with databases through agents.

AI Agents Tools Open source

SIG

HYP