May 2026

3149 articles

Rover: Context-aware Conflict Resolution with LLM

Rover is a code merge conflict resolution system combining program analysis with LLMs. It introduces a Multi-layer Code Property Graph (MtCPG) to capture inter-file dependencies and uses graph connectivity algorithms to build meaningful contexts. Evaluation: Rover outperforms standalone LLMs, MergeGen, and WizardMerge at character, lexical, and semantic levels.

Code generation Reasoning Tools

arXiv cs.CL·May 19

ANVIL: Analogies and Videos for Lecturers

ANVIL is a multimodal generative system automating production of analogy-based instructional animations for computer science. Given a concept definition, it generates textual analogies, compiles them into structured visual screenplays, and produces executable manim code. Evaluation combines teacher judgments and LLM-based automated screening.

Code generation Vision Evals

arXiv cs.CL·May 19

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

Implementation study of Google NotebookLM generating videos, podcasts, and infographics in an English for Academic Purposes course (106 students, Hong Kong). Students reported high perceived usefulness and ease of use, with a preference for visual/multimodal content. Video preference correlated positively with academic performance, but higher cognitive load was negatively associated with grades.

RAG Tools Evals

arXiv cs.AI·May 19

Integration of AI in Cybersecurity: Current Trends with a Focused Look at Intrusion Detection Applications

Review of AI trends in cybersecurity focused on intrusion detection. Comparative analysis of approaches using generative AI, NLP, federated learning, and explainable AI to improve interpretability and trust in detection systems.

AI safety Evals Papers

arXiv cs.AI·May 19

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Novel mechanistic interpretability approach for Vision-Language-Action (VLA) robot policies. Authors propose sparse autoencoders (SAE) grounded in behavioral events rather than text contexts. Evaluation on OpenVLA and π₀.₅ across simulation and real-robot experiments, with code released.

Vision Robotics AI Agents

arXiv cs.AI·May 19

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule is a multimodal, multilingual benchmark for moderating pluralistic communities on social media. It covers 13,371 rule violations across 1,989 Reddit communities and 2,885 rules in 9 languages. State-of-the-art vision-language models, including GPT-4.5 with advanced reasoning, only marginally outperform a trivial baseline, revealing that pluralistic moderation remains a fundamental challenge.

Benchmarks Vision AI safety

arXiv cs.AI·May 19

Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

Empirical study of RL post-training for diffusion-based code generation. Authors propose execution-free rewards (static checking) and AST-hint-conditioned sampling to overcome the "capability cliff". Static checking improves DiffuCoder from 53.9 to 67.1 on HumanEval and reduces rollout time by 9.4%.

Code generation Reinforcement learning Benchmarks

arXiv cs.CL·May 19

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

DashAttention introduces a differentiable hierarchical attention method using adaptive α-entmax transformation to select variable numbers of KV blocks. Unlike NSA and InfLLMv2, it maintains full differentiability and achieves 75% sparsity with accuracy comparable to full attention. GPU-aware Triton implementation outperforms FlashAttention-3.

Reasoning Infrastructure Benchmarks
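
To illustrate how an α-entmax-style transformation can select a variable number of KV blocks, here is a minimal sketch of sparsemax, the α = 2 member of the α-entmax family. It is not DashAttention's implementation: the block scores are made-up values and the function name `sparsemax` is chosen for this example.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; low scores get exactly zero."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support is the longest prefix of sorted scores passing this check.
        if 1 + j * zj > cumsum:
            tau = (cumsum - 1) / j
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical per-block relevance scores for two different queries.
peaked = sparsemax([2.0, 1.0, 0.1])   # one block dominates
close = sparsemax([1.0, 0.9, 0.1])    # two blocks are nearly tied

selected_peaked = [i for i, p in enumerate(peaked) if p > 0]
selected_close = [i for i, p in enumerate(close) if p > 0]
print(selected_peaked, selected_close)
```

Unlike a fixed top-k rule, the number of blocks with nonzero weight depends on the scores themselves, and the mapping stays differentiable almost everywhere.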

arXiv cs.CL·May 19

Code as Agent Harness

New perspective on agentic AI systems: code as central infrastructure. This research paper organizes a unified framework around three layers: harness interface (code connecting reasoning and action), mechanisms (planning, memory, feedback), and multi-agent scaling. Applications: coding assistants, GUI/OS automation, embodied agents, scientific discovery.

AI Agents Multi-agent Code generation

arXiv cs.AI·May 19

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA is a clinical VQA benchmark for brain tumor MRI interpretation. Dataset of 2,387 QA pairs from 473 glioma studies. Evaluation of 6 VLMs: all fail on multi-sequence 3D MRI, suffer modality collapse and over-reliance on language priors.

Vision Benchmarks AI safety

arXiv cs.AI·May 19

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

CAM-VFD detects video deepfakes by analyzing cross-modal contradictions (appearance, motion, depth) using cross-attention fusion. Achieves 95.31% on GenVidBench and 93.43% on GenVideo with robustness to compression and adversarial perturbations.

Vision Benchmarks AI safety

arXiv cs.AI·May 19

A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

Systematic survey of deep learning architectures for 3D point cloud classification and segmentation. Addresses challenges (unordered nature, sensor noise, occlusions) and strategies (format conversion, local geometry extraction, permutation-invariant processing, self-attention). Evaluates models on standard benchmarks.

Vision Benchmarks Papers

arXiv cs.AI·May 19

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

arXiv paper identifies a new jailbreak attack class: "wide-net-casting" where adversaries query multiple large models simultaneously to bypass safeguards. Researchers develop a tailored jailbreak method achieving 100% success rate on unprotected models in some experiments, exposing significant safety risks.

AI safety Alignment Benchmarks

arXiv cs.AI·May 19

DynMuon: A Dynamic Spectral Shaping View of Muon

DynMuon extends Muon by replacing update M with U·Σ^p·V† using dynamic parameter p. Theory shows positive p accelerates signal contraction early by emphasizing high-curvature directions, while mildly negative p reallocates update strength to low-curvature directions late in training. DynMuon reduces steps to target loss by 10.6-26.5% versus Muon.

Reasoning Benchmarks
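
As a worked reading of the update rule in the DynMuon summary, assume M = UΣV† is the singular value decomposition of the update matrix. The summary does not define its symbols, so this notation is an assumption.

```latex
M = U \Sigma V^{\dagger}, \qquad
\Delta_p = U \Sigma^{p} V^{\dagger}, \qquad
\Delta_0 = U V^{\dagger}, \qquad
\Delta_1 = M .
```

With p = 0 every singular direction gets equal weight, which is the orthogonalized update that Muon approximates. Positive p weights directions with large singular values more heavily, and negative p shifts weight toward directions with small singular values.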

arXiv cs.AI·May 19

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG is a multi-agent framework for retrieval-augmented generation applied to medical reasoning. It decomposes the process into three specialist agents: clinical interpretation, iterative document exploration, and evidence adjudication. Tested on 5 benchmarks and 5 LLM backbones, it improves baselines by +6.46 accuracy points on average.

Multi-agent RAG Reasoning

arXiv cs.AI·May 19

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

Approach to process body-worn camera (BWC) video into 10-second windows labeled by operational context and motion intensity. Models trained with CLIP and optical flow: 78.75% accuracy for context, 88.33% for activity. Privacy-conscious protocol to speed up incident review and officer training workflows.

Vision Benchmarks AI safety

arXiv cs.AI·May 19

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

A PPE framework uses one-class density estimators with fused text embeddings to detect contextual data leakage in RAG systems. The T3+OCSVM detector achieves 0.93+ AUROC, reduces false positives by 44-55 percentage points, and maintains millisecond latency, outperforming supervised MLP classifiers and 14B-parameter LLM judges.

RAG AI safety Embeddings

arXiv cs.AI·May 19

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Behavior Foundation Models (BFMs) enable scalable imitation learning but fail under dynamics shifts (friction, actuation, noise). This paper formulates BFM task-inference as robust minimax optimization, enabling adaptation to worst-case dynamics perturbations without retraining. The framework outperforms standard BFM and robust offline IL baselines under dynamics shifts.

Reinforcement learning Papers Evals

arXiv cs.AI·May 19

The IsalProgram Programming Language

IsalProgram is a regular assembly-like language where every finite string is a valid program. Executed on a virtual machine with circular doubly linked list, it eliminates memory addresses and variable names. Authors prove its regularity and explore its potential for neural program synthesis.

Code generation Papers Reasoning

arXiv cs.AI·May 19

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

Systematic audit of two critical vulnerabilities in clinical AI: adversarial fragility and cross-lingual drift. On CheXNet (DenseNet121), accuracy collapses from 89.3% to 62.0% under imperceptible FGM perturbation (epsilon=0.021). Llama3.1:8b and NatLAS show major degradation on Nigerian Pidgin and Yoruba (80%→65%, 85%→55%). Standard defenses fail.

AI safety Alignment Evals

arXiv cs.CL·May 19

Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

Psycholinguistic study of reading dynamics for noisy-channel garden-path sentences. Readers make targeted eye-movement regressions toward regions likely containing errors, confirming a noisy-channel processing model with reanalysis.

Reasoning

arXiv cs.AI·May 19

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

SkillTTA synthesizes task-specific textual skills by retrieving relevant training trajectories, with adaptation through context only and no parameter updates. Evaluated on SpreadsheetBench, ALFWorld, and BigCodeBench: Pass@1 improves from 0.397 to 0.505 on SpreadsheetBench, from 0.517 to 0.651 on BigCodeBench.

AI Agents Prompt engineering Benchmarks

arXiv cs.AI·May 19

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

Method to extend ECG foundation models (pretrained on 10-second segments) to longer and variable-length recordings. A lightweight plug-in module adds long-sequence processing and temporal modeling without retraining the backbone. Results on multiple long-horizon ECG tasks outperform sliding-window and pooling baselines.

Papers Fine-tuning Vision

arXiv cs.AI·May 19

Latent Action Control for Reasoning-Guided Unified Image Generation

LAC (Latent Action Control) makes reasoning actionable in unified generative models by representing planning and diagnosis as continuous hidden actions. Integrated into BAGEL-7B-MoT, LAC improves compositional and knowledge-grounded generation via variational alignment and GRPO, with major gains on spatial relations and attribute binding.

Image generation Reasoning Code generation

arXiv cs.CL·May 19

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

STT-Arena is a benchmark of 227 interactive tasks measuring LLM ability to replan under spatio-temporal dynamics. Claude-4.6-Opus achieves under 40% accuracy. Authors identify three recurring failure modes and propose STT-Agent-4B combining iterative trajectory refinement with online RL.

AI Agents Benchmarks Reinforcement learning

arXiv cs.AI·May 19

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

RTPurbo transforms LLMs into sparse models in ~100 training steps. The approach exploits three observations: only certain heads require full attention, long-range retrieval uses a 16D subspace, and dynamic top-p selection outperforms fixed top-k. Results: 9.36× prefill speedup at 1M tokens, 2.01× decode speedup, accuracy preserved.

Reasoning Benchmarks
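
The contrast between dynamic top-p and fixed top-k selection in the RTPurbo summary can be sketched as follows. This is an illustration of the general idea only, not the paper's method: the attention weights are made-up numbers and both function names are hypothetical.

```python
def top_p_indices(weights, p):
    """Smallest set of highest-weight indices whose normalized mass reaches p."""
    total = sum(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i] / total
        if mass >= p:
            break
    return sorted(kept)


def top_k_indices(weights, k):
    """Always keep exactly k highest-weight indices."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])


peaked = [0.90, 0.05, 0.03, 0.02]  # attention concentrated on one key
flat = [0.25, 0.25, 0.25, 0.25]    # attention spread evenly

peaked_p = top_p_indices(peaked, p=0.8)
flat_p = top_p_indices(flat, p=0.8)
peaked_k = top_k_indices(peaked, k=2)
flat_k = top_k_indices(flat, k=2)
print(peaked_p, flat_p, peaked_k, flat_k)
```

Top-p keeps one key when attention is peaked and all four when it is flat, while top-k keeps two in both cases, which wastes budget in the first case and drops mass in the second.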

arXiv cs.AI·May 19

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

Metric-guided fusion approach combining complementary features from visual foundation models (SAM2, DINOv3) for dense prediction tasks. Two label-free metrics (Structural Coherence, Edge Fidelity) assess encoders and select complementary pairs. Consistent performance gains across multiple tasks without complex architectural changes.

Vision Benchmarks Open source

arXiv cs.AI·May 19

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

XDiffuser combines state-space graph planning with diffusion to improve long-horizon planning. The model first computes a classical plan serving as a lightweight connectivity oracle, then uses it to guide denoising of a single trajectory. Outperforms diffusion baselines on long-horizon tasks, multi-agent coordination, and TSP-style reasoning.

Reasoning

arXiv cs.AI·May 19

When Fireflies Cluster: Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

Firefly Algorithm variant for automatic clustering. Introduces centroid movement strategy and multi-objective fitness function (compactness, separation, TSP-based penalty). Automatically estimates optimal cluster count. Outperforms K-Means on robotic sensor networks.

Robotics

arXiv cs.AI·May 19

PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes

PhysioSeq2Seq combines patient-specific physiological digital twin modeling with Seq2Seq LSTM for 240-minute glucose forecasting in type 1 diabetes. Trained on 348 participants (T1DEXI), evaluated on 74: MAE 39.28 mg/dL at 240-min horizon, reducing bias by 13.89 mg/dL vs recursive LSTM.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

VLMs struggle with planning from complex visual inputs. This paper proposes Pattern Induction, an online inductive learning strategy that discovers and optimizes reusable visual patterns as composite experts. Pattern Inference enables VLMs to recognize these patterns and directly infer world model structures. Evaluated on FrozenLake, Crafter, and CubeBench.

Vision Reasoning Papers

arXiv cs.CL·May 19

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

New AG-MG parallel corpus with 132,481 sentence pairs for Ancient-to-Modern Greek translation. Creation pipeline combines web-scraping, VecAlign alignment with fine-tuned LaBSE embeddings, and Gemini 2.5 Flash LLM-based correction. Benchmark of NMT models (NLLB, M2M100) and Greek LLM (Llama-Krikri-8B): full fine-tuning achieves 13.16 BLEU, gains up to +10.3 points.

Benchmarks Fine-tuning Embeddings

arXiv cs.AI·May 19

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Unified study of LLM distillation showing that SFT, DAgger, offline RL, and OPD can be decoupled along two orthogonal axes: prefix source and token-level KL direction. Authors propose KL mixing and entropy-gated length curriculum, improving Pass@k by 5.8 points and reducing average response length by 3x on math reasoning.

Fine-tuning Reinforcement learning Reasoning

arXiv cs.CL·May 19

From BERT to T5: A Study of Named Entity Recognition

Comparative study of BERT (encoder-only) and T5 (seq2seq) for named entity recognition (NER). BERT uses a classification head with weighted cross-entropy; T5 is fine-tuned with few-shot prompts. Evaluation on 7-class and simplified 3-class tag schemes, including ablation study and error analysis.

Fine-tuning Benchmarks

arXiv cs.AI·May 19

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. It contains 196 tasks (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) and tests generalization to unseen configurations. Cursor Agent, Claude Code, and Codex Agent achieve speedups up to 6.89x, but PyTorch-to-HIP optimizations show correctness drops on unseen configurations.

AI Agents Code generation Benchmarks

arXiv cs.AI·May 19

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

AAMLA, a multimodal learning analytics framework, predicts student collaboration satisfaction in game-based educational environments. The CAMA module aligns modalities (gaze, action units, pose) via affinity matrices and contrastive learning, adaptively suppressing uninformative modalities. Tests on 50 middle school students in EcoJourneys show improvement over unimodal baselines.

Vision Multi-agent Evals

arXiv cs.AI·May 19

Causely: A Causal Intelligence Layer for Enterprise AI: A Benchmark Study on SRE and Reliability Workflows

Causely is a causal intelligence layer for SRE workflows that structures environment topology and causal dependencies. Benchmark across 4 agent configurations (Claude Code, OpenAI Codex, HolmesGPT): with Causely, mean time-to-diagnosis reduced 63%, token consumption -60%, tool calls -78%, API cost per run -57%, root-cause accuracy 75%→100%.

AI Agents Benchmarks Claude Code

arXiv cs.CL·May 19

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote is a lifecycle-governance framework for agent skills from collection to evolution. It profiles a million-scale open-source corpus for quality and verifiability, then decomposes trajectories into skill-linked subtasks. Results show +7.9pp improvement on Terminal-Bench 2.0 (GPT-5.2) and +2.6pp on SWE-Bench Pro.

AI Agents Multi-agent Reasoning

arXiv cs.AI·May 19

Encoding Robust Topological Signatures for Hyperdimensional Computing

Hyperdimensional computing method using topological signatures (holes, RTS-invariant Zernike moments) to improve robustness against rotations, noise, and occlusions. Experiments on MNIST/EMNIST show substantial robustness gains over naive HD baseline and competitive accuracy with compact CNNs.

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

Novel method for fitting superquadrics to noisy point clouds with outliers. Reformulates the problem as unified unsupervised clustering, enabling fitting of both rigid and deformable superquadrics. Provides closed-form analytical solutions and convergence guarantees.

Papers Benchmarks

arXiv cs.AI·May 19

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF: new dataset of ~7.3k annotated images (expanded to ~57k via augmentation) for detecting and tracking surface-level debris (aluminum cans). YOLOv11 trained on CANSURF outperforms generic datasets by 12x. YOLOv11+ByteTrack provides best tracking stability; YOLOv11+SAHI improves far-field recall.

Benchmarks Vision Code generation

arXiv cs.AI·May 19

Exploring Lightweight Large Language Models for Court View Generation

Systematic study of lightweight LLMs (<2B parameters) for Criminal Court View Generation and charge prediction. Development of CVGEvalKit, an evaluation framework with 3 public datasets. Comparison of architectures, model sizes, and direct vs. indirect prediction approaches.

Benchmarks Code generation

arXiv cs.AI·May 19

UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

UniER is a unified benchmark for personalized exercise recommendation, comparing two paradigms: ILER (item-level) and PLER (path-level). The framework introduces Weighted Cognitive Gain (WCG) metric and evaluates 18 methods across 9 datasets. Results show systematic dominance of PLER and reveal ILER's pedagogical failures under extreme sparsity and noise.

Benchmarks Evals Papers

arXiv cs.AI·May 19

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Genflow is a Compound AI System for brand-aligned video generation. It combines a retrieval-based 'Brand DNA' extraction module with an Adversarial Multi-Agent Quality Control loop. The system iterates between generator and evaluator agents until deterministic consensus, improving brand-compliant yield from 42% to 89%.

Multi-agent Video generation AI Agents

arXiv cs.AI·May 19

Task Abstention for Large Language Models in Code Generation

Method enabling LLMs to abstain from code generation tasks prone to hallucination. Uses calibrated abstention rule grounded in multiple hypothesis testing, assesses consistency through code execution outcomes. Provides distribution-free theoretical guarantee. Evaluated on open-source code LLMs.

Code generation AI safety Evals
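
The abstention idea in the summary above (judge consistency by execution outcomes, abstain when it is low) can be sketched roughly as below. The agreement threshold here is a fixed, hand-picked value; the paper's calibrated rule based on multiple hypothesis testing is not reproduced, and the function name `answer_or_abstain` is hypothetical.

```python
from collections import Counter


def answer_or_abstain(outputs, min_agreement=0.6):
    """Return the majority execution output, or None to abstain.

    outputs: results of running several sampled programs on the same test input.
    min_agreement: assumed threshold, not a calibrated value.
    """
    counts = Counter(outputs)
    top_output, top_count = counts.most_common(1)[0]
    agreement = top_count / len(outputs)
    if agreement >= min_agreement:
        return top_output, agreement
    return None, agreement


# Five sampled programs mostly agree: answer.
consistent = answer_or_abstain(["42", "42", "42", "41", "42"])
# Five sampled programs disagree: abstain.
inconsistent = answer_or_abstain(["1", "2", "3", "1", "4"])
print(consistent, inconsistent)
```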

arXiv cs.AI·May 19

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN is a multi-agent prompt refinement framework improving cultural fidelity in text-to-video generation. It decomposes prompts into person, action, and location dimensions handled by specialized agents. Benchmark of 243 culturally grounded prompts and 972 videos (Chinese, American, Romanian) with CLIP and VLM-as-judge evaluation.

Multi-agent Video generation Benchmarks

arXiv cs.AI·May 19

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

GeoWorld-VLM enhances spatial reasoning in Vision-Language Models by distilling geometric structure from frozen camera-conditioned video world models. The method fine-tunes only the image encoder and multimodal projector, aligning post-projector features with world-model representations. Achieves ~4% improvements on What'sUp and VSR benchmarks.

Vision Reasoning Fine-tuning

arXiv cs.AI·May 19

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

EfficientTDMPC improves sample efficiency for continuous control in model-based reinforcement learning. The method uses an ensemble of dynamics models, averages return estimates across multiple rollout depths, and adds an uncertainty penalty to the planner objective. It achieves SOTA on HumanoidBench-Hard and DMC hard in low-data regimes.

Reinforcement learning Benchmarks Papers
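
One way to read the three ingredients in the EfficientTDMPC summary (model ensemble, averaging over rollout depths, uncertainty penalty) is the scoring function sketched below. The exact objective is not given in the summary, so using ensemble standard deviation as the uncertainty term, the `penalty` weight, and the name `planner_score` are all assumptions.

```python
from statistics import mean, pstdev


def planner_score(returns_by_model, penalty=1.0):
    """Score one candidate action sequence.

    returns_by_model[m][d] is the return estimate from dynamics model m
    at rollout depth d (hypothetical layout).
    """
    # Average each model's estimates across rollout depths.
    per_model = [mean(depth_estimates) for depth_estimates in returns_by_model]
    # Penalize disagreement between ensemble members.
    return mean(per_model) - penalty * pstdev(per_model)


agree = planner_score([[10, 12], [10, 12]])     # models agree
disagree = planner_score([[5, 7], [15, 17]])    # same mean, high disagreement
print(agree, disagree)
```

Both candidates have the same mean estimate, but the planner prefers the one where the ensemble agrees.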

arXiv cs.CL·May 19

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

K2V extends reinforcement learning with verifiable rewards (RLVR) to knowledge-intensive domains through automated verifiable data synthesis and verification of LLM reasoning processes. Experiments demonstrate improved reasoning in these domains without significant degradation of general capabilities.

Reinforcement learning Reasoning Papers

arXiv cs.AI·May 19

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench evaluates AI agents' ability to automate complex healthcare workflows (prior authorization, utilization management, care management) across 87 MCP tools and 20 applications. Best agent resolves only 28% of tasks; none exceed 20% on strict pass. Performance drops to 3.8% in single-session mode.

AI Agents MCP Benchmarks

arXiv cs.AI·May 19

GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance

GraViti is a transformer-based variational autoencoder for entire graphs, producing a true graph-level latent space. On molecular benchmarks, the model learns to decode valid samples respecting chemical constraints. The work shows that enforcing permutation invariance can be detrimental for consistent reconstruction when a reliable canonical node ordering exists.

Papers Benchmarks Code generation

arXiv cs.AI·May 19

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

Computational tool for classifying manner and result verbs at scale. Uses linguistically informed prompts with LLMs to generate annotations over MASC and InterCorp (436 VerbNet classes), then trains a RoBERTa-based classifier. Performance: 89.6% accuracy across three gold-standard datasets.

Benchmarks Fine-tuning

arXiv cs.CL·May 19

Context Memorization for Efficient Long Context Generation

Training-free method to optimize long-context inference: attention-state memory externalizes prefix into lightweight lookup-based memory of precomputed attention states. On LLaMA-3.1-8B, improves in-context learning at 1K-8K tokens, reduces attention latency by 1.36x at 8K, outperforms full-attention RAG with 20% less memory.

Llama RAG Reasoning

arXiv cs.CL·May 19

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

GA-S2S combines a T5-small encoder-decoder with a Relational Graph Attention Network for knowledge graph link prediction. The model jointly encodes textual features and full k-hop subgraph topology around the query entity, capturing multi-hop relational patterns. On CoDEx, GA-S2S outperforms Seq2Seq baselines with up to 19% relative accuracy gain.

Benchmarks Papers

arXiv cs.AI·May 19

Learning How to Cube

A neuro-symbolic post-training framework trains a 4B-parameter model to generate cubing heuristics for SAT via SFT+DPO. The model achieves pass@5=53 on 100 SAT competition benchmarks, matching the best symbolic heuristic and surpassing Claude-Sonnet-4 (50). Data comes from an MCTS pipeline exploring splitting decisions over SAT competition formulas.

Reasoning Reinforcement learning Benchmarks

arXiv cs.AI·May 19

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems

PrivScope is an on-device payload governor enforcing task-scoped disclosure at the local-cloud boundary for hybrid agentic systems. On 100 medical-booking workflows, it eliminates profile leakage (0.0% vs 17.7%), halves attacker re-identification (23.1% vs 64.3%), and preserves task success without cloud-side changes across GPT-4o-mini and Gemini 2.5 Flash.

AI Agents AI safety Benchmarks

arXiv cs.CL·May 19

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

BanglaMedVQA: new benchmark for medical visual question answering in Bangla with clinically validated image-question-answer pairs. Evaluation of foundation models (Gemini, GPT-4.1 mini, Gemma-3) reveals substantially lower performance than English, severe limitations in fine-grained medical reasoning and specialized diagnostics.

Benchmarks Vision Gemini

arXiv cs.AI·May 19

To Trust or Not to Trust: Authors' Response to AI-based Reviews

Study of 56 authors from 40 papers: 83.9% find AI reviews useful, 80.4% report AI identifies issues missed by humans, 82.1% incorporate AI feedback into final version. However, authors trust AI less than humans (51.8% report minor inaccuracies, 16.1% report serious errors). 96.4% would accept AI as internal review tool before submission.

Evals AI safety Regulation

arXiv cs.AI·May 19

Why Modeling Human Haptic Material Perception with AI Is Difficult

Position paper on challenges of modeling human haptic material perception with AI. Identifies three bottlenecks: scarcity of large, diverse haptic datasets; lack of standardized evaluation platforms and perceptual benchmarks; limitations in model capacity and interpretability. Calls for cross-disciplinary efforts to advance.

Papers Benchmarks Evals

arXiv cs.AI·May 19

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

MATE is a memory architecture for solving Contextual Markov Decision Processes (CMDPs). It replaces the intractable posterior belief with sum-aggregated memory, avoiding growing computational costs of Transformers and gradient issues of RNNs. Evaluations demonstrate computational advantages while achieving performance comparable to standard sequence-model baselines.

Reasoning Reinforcement learning Papers
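
A sum-aggregated memory of the kind the MATE summary describes can be sketched as below. In MATE the transition embedding is presumably learned; the fixed `embed` feature map and the `SumMemory` class here are hypothetical stand-ins to show why the per-step cost stays constant.

```python
def embed(transition):
    """Hypothetical fixed feature map for (state, action, reward, next_state)."""
    s, a, r, s_next = transition
    return [s, a, r, s_next - s]


class SumMemory:
    """Fixed-size memory: the running sum of transition embeddings."""

    def __init__(self, dim=4):
        self.total = [0.0] * dim
        self.count = 0

    def update(self, transition):
        # Constant work per step, regardless of how long the history is.
        for i, value in enumerate(embed(transition)):
            self.total[i] += value
        self.count += 1


history = [(0.0, 1.0, 0.5, 1.0), (1.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 2.0)]

forward, backward = SumMemory(), SumMemory()
for t in history:
    forward.update(t)
for t in reversed(history):
    backward.update(t)
print(forward.total, backward.total)
```

The memory does not grow with the history and does not depend on transition order, in contrast to attention over a growing sequence or a recurrent state.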

arXiv cs.AI·May 19

How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning

Mechanistic study of in-context learning (ICL): n-shot function vectors decompose linearly into individual example contributions. Models adaptively reweight demonstrations via attention, favoring informative and unambiguous examples. Query-Key alignment dominates function vector quality.

Reasoning Evals Papers

arXiv cs.AI·May 19

Voice "Cloning" is Style Transfer

Researchers demonstrate that voice cloning is not faithful reproduction but style transfer: cloned voices are perceived as more authoritative, warm, and human-like than originals. Cloning also homogenizes speaker characteristics (accent, speaking rate). These findings reveal behavioral and ethical risks of the technology.

Voice AI safety Alignment

arXiv cs.AI·May 19

Wavelet Flow Matching for Multi-Scale Physics Emulation

Wavelet Flow Matching (WFM) is a generative emulator for multi-scale physical systems governed by PDEs. It performs optimal transport directly in the hierarchical wavelet space of a U-Net, without pre-trained autoencoders. On three chaotic fluid dynamics systems, WFM outperforms SOTA models in long-horizon stability, accuracy, and spectral coherence.

Papers Benchmarks Reasoning

arXiv cs.AI·May 19

Automatic Unsupervised Ensemble Outlier Model Selection (Extended Version)

MetaEns is an automated framework for selecting ensembles of unsupervised outlier detection models. It uses labeled meta-datasets to predict marginal ensemble gains and applies greedy selection with adaptive early stopping. Tested on 39 real-world datasets: outperforms baselines in average precision while using fewer models.

Evals Benchmarks Papers

arXiv cs.AI·May 19

RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification

RAPT is a retrieval-augmented post-hoc thresholding wrapper improving label set selection in multi-label classification without retraining. Applied to metric learners and fine-tuned transformers, RAPT achieves 0.87 Macro-F1 on industrial data, outperforming static baselines and few-shot LLMs (K=5) with 115x less inference time.

RAG Benchmarks Fine-tuning

arXiv cs.AI·May 19

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Study on controlled removal of safety alignment in language models to evaluate cybersecurity capabilities. Compares authorized-context prompting, refusal-direction projection, and LoRA-based de-alignment. On 60 tasks (Security-AR), task-only LoRA reaches 0.87 security score with 0.83 general capability, but increases out-of-scope unsafe compliance.

AI safety Alignment Fine-tuning

Hacker News (AI)·May 19

LLMCap – A proxy that hard-stops LLM API calls when you hit a dollar cap

LLMCap is a proxy that automatically stops LLM API calls when a dollar spending cap is reached. Cost control tool to prevent budget overruns when using LLM APIs.

Tools Infrastructure
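
A hard spending cap of the kind LLMCap is described as enforcing can be sketched as a small guard in front of each API call. This is not LLMCap's code: the class names, the per-call cost estimate, and the `send` callback are assumptions made for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised instead of forwarding a call that would break the cap."""


class SpendCap:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def call(self, estimated_cost_usd, send):
        # Refuse before sending, so the cap is a hard stop rather than an alert.
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap ${self.cap_usd:.2f} reached (spent ${self.spent_usd:.2f})"
            )
        result = send()
        self.spent_usd += estimated_cost_usd
        return result


cap = SpendCap(cap_usd=1.00)
first = cap.call(0.60, lambda: "response 1")  # forwarded

blocked = False
try:
    cap.call(0.60, lambda: "response 2")      # would exceed the cap
except BudgetExceeded:
    blocked = True
print(first, blocked)
```

A real proxy would also need to price requests from token counts and handle concurrent callers, which this sketch leaves out.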

Reddit r/MachineLearning·May 19

How to get rejected by IEEE T-PAMI with 'Excellent' scores? [D]

Junior researcher reports IEEE T-PAMI rejection despite three positive reviews (2 EXCELLENT, 1 GOOD). Editor cited a 4th reviewer whose positive review was allegedly withdrawn from the system before final decision. Six months after filing complaint with IEEE Ethics, no direct response received.

Papers Regulation

Hacker News (AI)·May 19

Google, Blackstone to Create AI Cloud Firm with In-House Chips

Google and Blackstone form a joint venture for an AI cloud platform with in-house chips. The initiative aims to reduce vendor dependency and provide optimized AI infrastructure to institutional clients.

Infrastructure Business

SIG

HYP

Le Big Data·May 19

Ouch! Gemini Intelligence will be limited to a few smartphones. Will yours be compatible?

Google rolls out Gemini Intelligence on Android with hardware restrictions. Only compatible smartphones will access the new AI features, limiting initial adoption.

Gemini

SIG

HYP

Hacker News (AI)·May 19

Sieve – scans Cursor/Claude chat history for leaked API keys

Sieve scans Cursor and Claude chat history to detect leaked API keys. Useful tool for identifying exposed secrets in conversations with AI assistants.

Claude Tools AI safety

SIG

HYP

Hacker News (AI)·May 19

We built a runtime activation layer for autonomous AI agents

A team built a runtime activation layer for autonomous AI agents, enabling real-time control and oversight of agent behaviors without modifying the underlying model.

AI Agents AI safety Infrastructure

SIG

HYP

Hacker News (AI)·May 19

Research shows a clear and communicated AI stance acts as a powerful amplifier

Research demonstrates that a clear and communicated AI stance significantly amplifies organizational impact. Companies with explicit AI strategy outperform those without defined positioning.

Business

SIG

HYP

Hacker News (AI)·May 19

People who use ChatGPT for writing are accurate detectors of AI text (2025)

A 2025 study finds that regular ChatGPT users are more accurate at detecting AI-generated text than non-users. Results suggest increased familiarity with language model writing patterns.

GPT Evals

SIG

HYP

Hacker News (AI)·May 19

Can AI just replace me already? – A comparative AI-writing ID experiment

Comparative experiment testing AI's ability to automatically identify AI-generated versus human-written text. Results on detector effectiveness and implications for content authentication.

Evals

SIG

HYP

Hacker News (AI)·May 19

Google's Own AI Researchers Jockey for Access to Its Computing

Google's internal AI researchers compete for access to the company's computing resources. Demand for compute exceeds available supply, creating bottlenecks for research projects.

DeepMind Infrastructure

SIG

HYP

Reddit r/MachineLearning·May 19

We built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]

swm is an open-source tool automating framework installation (ComfyUI, Ollama, OpenWebUI, vLLM) on cloud GPUs in one command. It aggregates pricing across 10+ providers (RunPod, Vast.ai, Lambda), syncs workspaces via S3, and auto-terminates idle instances after 30 min to cut costs.

Tools Open source Infrastructure

SIG

HYP

Hacker News (AI)·May 19

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

SuperInfer introduces rotary scheduling and memory management for LLM inference, designed to meet Service Level Objectives (SLOs). System-level approach to reducing latency and memory consumption.

Infrastructure Benchmarks

SIG

HYP

Simon Willison·May 19

The last six months in LLMs in five minutes

Simon Willison summarizes six months of LLM developments in a five-minute lightning talk at PyCon US 2026. He identifies November 2025 as an inflection point, especially for coding. The best model changed hands 5 times between Anthropic, OpenAI, and Google.

Claude GPT Gemini

SIG

HYP

Hacker News (AI)·May 19

Google, Blackstone plan AI cloud venture with $5B backing, WSJ reports

Google and Blackstone plan a joint AI cloud venture backed by $5 billion in funding, according to WSJ reporting. The project aims to provide specialized cloud infrastructure for enterprise AI applications.

Infrastructure Business

SIG

HYP

Hacker News (AI)·May 19

Melbourne psychiatrist refuses new patients who don't consent to AI note-taking

A Melbourne psychiatrist refuses new patients who don't consent to AI note-taking. The practice raises ethical questions about consent, privacy, and automation of medical records.

Regulation AI safety Alignment

SIG

HYP

Reddit r/LocalLLaMA·May 19

club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes

Updated club-5060ti project: structured benchmark and recipe repo for local LLMs on RTX 5060 Ti. Includes static results explorer, schema-validated JSON, single/dual-card recipes, llama.cpp/vLLM support. Baseline: RTX 5060 Ti 16GB. Recommends llama.cpp/GGUF for mixed GPUs and treats vLLM NVFP4/MTP as Blackwell-specific.

Open source Benchmarks Infrastructure

SIG

HYP

Hugging Face Blog·May 19

Introducing the Ettin Reranker Family

Hugging Face introduces the Ettin Reranker family, models designed to improve search relevance and RAG result ranking. These rerankers optimize document ranking after initial retrieval.

RAG Vector search Tools

SIG

HYP

Vercel AI Blog·May 19

Flat Rate CDN in Limited Beta

Vercel launches Flat Rate CDN in Limited Beta for Pro teams. This service replaces usage-based pricing with a fixed monthly fee, covering traffic spikes without overage charges.

Infrastructure Business

SIG

HYP

Reddit r/MachineLearning·May 18

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

Free multilingual corpus of 9.8M documents across 11 Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, English). 8.4B tokens, CC0 license, available on HuggingFace.

Open source Embeddings

SIG

HYP

Vercel AI Blog·May 18

Run Claude Managed Agents with Vercel Sandbox

Vercel integrates Claude Managed Agents with Vercel Sandbox. Agents run in isolated Firecracker microVMs with access to private APIs and customer data. Credential brokering and deny-by-default egress secure execution.

Claude AI Agents Infrastructure

SIG

HYP

Hacker News (AI)·May 18

Tech bros say AI can be your best friend. Experts explain why it can't

Tech entrepreneurs claim AI can become your best friend. Experts debunk this, highlighting fundamental limitations: lack of consciousness, genuine empathy, and emotional reciprocity in current systems.

AI safety Alignment

SIG

HYP

Hacker News (AI)·May 18

AI-Governed EV Charging Could Extend Battery Life Nearly 23%

AI-governed EV charging system could extend battery lifespan by nearly 23%. Research demonstrates optimization of charging cycles through machine learning algorithms.

Reinforcement learning

SIG

HYP

Hacker News (AI)·May 18

Show HN: Clawputer – A personal AI assistant with a real computer and memory

Clawputer is a personal AI assistant with access to a real computer and persistent memory. The project, shown on Hacker News, provides an interface for AI to interact directly with the operating system and retain context across sessions.

AI Agents Tools

SIG

HYP

Reddit r/LocalLLaMA·May 18

Memory expert suspects RAM price drop in 2027'H2 due to china heavy investments

Former Samsung executive predicts RAM price drop in H2 2027 driven by aggressive Chinese memory chip investments. ChangXin Memory Technologies (CXMT) is expanding capacity from 280k to 300k+ wafers/month via a $4.2B Shanghai IPO, focusing on HBM and advanced DDR5.

Infrastructure Business

SIG

HYP

Reddit r/MachineLearning·May 18

MLRC 2026 is open for submissions - an official track at NeurIPS 2026 [N]

Machine Learning Reproducibility Challenge 2026 opens submissions as an official track at NeurIPS 2026, held in Sydney in December. Papers accepted via TMLR are eligible for conference presentation.

Papers Benchmarks Evals

SIG

HYP

Reddit r/LocalLLaMA·May 18

21 GPU's benchmarked running a small TTS model (vram peak: 5GB)

Benchmark of 21 GPUs (mostly consumer) on OmniVoice TTS model (5GB VRAM peak). Tested via vast.ai, measures xRT (speed relative to real-time). RTX 3090 as baseline. 3 runs per GPU on small paragraph with voice cloning.

Voice Benchmarks Tools

SIG

HYP

Hacker News (AI)·May 18

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

Theoretical piece arguing that public AI alignment discourse creates self-fulfilling prophecies. The author contends that dominant narratives about alignment risk shape actual model development, potentially generating the very problems the field aims to prevent.

Alignment AI safety

SIG

HYP

Reddit r/LocalLLaMA·May 18

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro

MTP (Multi-Token Prediction) accelerates LLM inference by 2x, especially for coding agents. Performance demonstration on Qwen 3.6 with AMD Strix Halo and Radeon 9700 AI Pro.

Qwen Code generation AI Agents

SIG

HYP

Reddit r/LocalLLaMA·May 18

Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo

Lemonade v10.5.1 released with MTP and ROCm 7.13 support for Strix Halo. Enables loading Qwen3.6-27B-MTP-GGUF with auto-applied MTP arguments. Also fixes Fedora 43 support.

Qwen Open source Infrastructure

SIG

HYP

Vercel AI Blog·May 18

Consolidated Commit Status now available on GitHub

Vercel enables monorepos to consolidate GitHub commit statuses into a single status per PR instead of one per project. Teams configure GitHub branch protection once and manage which Vercel projects are required for merge in each project's settings.

Tools Infrastructure

SIG

HYP

Vercel AI Blog·May 18

Firewall‑mitigated traffic is free on Vercel

Vercel waives CDN Requests and Fast Data Transfer charges for traffic denied, challenged, or rate-limited by its Web Application Firewall (WAF). The change applies automatically to all projects using Vercel Firewall with no configuration needed.

Infrastructure Tools

SIG

HYP

Hacker News (AI)·May 18

NHS to close-source GitHub repos over AI, security concerns

NHS closes public GitHub repositories due to AI and security concerns. The decision aims to prevent AI model training on sensitive medical code and reduce security risks.

AI safety Regulation Open source

SIG

HYP

Hacker News (AI)·May 18

Show HN: Enhanced Copy – copy buttons that include the site's AI prompt

Enhanced Copy adds copy buttons that include the site's AI prompt. Enables easy capture of system instructions used by web services.

Tools Prompt engineering

SIG

HYP

Reddit r/MachineLearning·May 18

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

Custom CUDA runtime for small-batch inference (robotics, VLA, world models). The bottleneck is not GEMM alone but also runtime overhead: kernel fragmentation, layout transitions, precision conversions (FP8/FP4), Python scheduling. Results: Pi0.5 on RTX 5090 ~17.6 ms, GR00T N1.6 ~12.5-13.1 ms, Qwen 27B ~129 tok/s.

Code generation Infrastructure Robotics

SIG

HYP