The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models

Theoretical paper on sequence models' insufficiency when facing unobserved latent states. Authors formalize a mixed-regime process where a perfect predictor becomes overconfident if the observed context matches the wrong latent regime. They show the sufficiency gap can only be closed by perfect revelation of the latent state or an equivalent verification mechanism.

Reasoning Alignment AI safety

arXiv cs.CL·May 27

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

The Daily Dose (TDD) is an LLM-driven system integrated into routine radiation oncology practice for automated clinical summarization and trial identification. Evaluation of 55 clinicians: 83.6% use TDD daily, mean satisfaction 3.89/5, 27% report ≥10 minutes saved per day.

Code generation RAG Business

arXiv cs.CL·May 27

Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation

slidesqaqa is a Flask system generating pedagogical questions from PDF presentations. A 4-stage LLM pipeline (window planning, deck synthesis, slide annotation, reconciliation) processes text and images to produce coherent, non-redundant questions with evaluation scores in structured JSON output.

Code generation RAG Vision

arXiv cs.CL·May 27

Model Unlearning Objectives Vary for Distinct Language Functions

arXiv paper on selective unlearning in LLMs. Authors propose two distinct methods: a cosine-based RMU variant for dangerous-knowledge unlearning, and a multi-layer objective for toxicity reduction. Tested on 4 open-source 7-8B models, approaches show unlearning requires function-specific objectives, analogous to LLM post-training.

AI safety Alignment Papers

arXiv cs.CL·May 27

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

Hybrid neural-symbolic pipeline extracts clinical follow-up instructions (action, date) from outpatient notes. BioBERT + BIO tagging + biaffine linker + deterministic date normalization outperforms GPT-4o-mini and fine-tuned LLaMA-3: Pair F1 0.997 (seen) vs 0.51-0.57 for baselines.

Benchmarks Code generation Reasoning

arXiv cs.CL·May 27

LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation

LATTE is a personalization framework for frozen LLMs that forecasts user preference trajectories by subtracting comparable peer profiles. A lightweight sequence predictor forecasts the next state, injected via a single anchored soft token. On Amazon Reviews 2023, LATTE achieves ROUGE-L=0.259 vs 0.219 for static profiles.

Prompt engineering Fine-tuning Benchmarks

arXiv cs.AI·May 27

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

BrickAnything generates physically buildable brick structures from 3D shapes using an autoregressive framework. The method introduces structure-aware tree tokenization to model brick dependencies, with validity-constrained decoding and preference-based alignment to improve stability and geometric fidelity.

Papers Code generation Reasoning

arXiv cs.AI·May 27

Constraint acquisition needs better benchmarks

MPMMine is a benchmark suite for evaluating Constraint Acquisition (CA) algorithms that discover, validate, and enhance Mathematical Programming models. It standardizes domain knowledge artifacts in open formats (MiniZinc, CommonMark, JSON) and provides thousands of solutions/non-solutions to improve reproducibility and cross-study comparability.

Benchmarks Papers

arXiv cs.AI·May 27

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Comparative study of three LLM approaches on 1,000 math problems (GSM-Symbolic): chain-of-thought (CoT), Program-Aided Language models (PAL), and Step-by-Step Coding (SBSC). CoT proves more robust to variations (1.3pp drop vs 1.7pp for PAL), contradicting the hypothesis that code execution improves reasoning robustness.

Reasoning Code generation Benchmarks

arXiv cs.AI·May 27

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

FAST-GOAL enhances CLIP to handle lengthy text descriptions through global-local semantic alignment. The method combines efficient local region extraction (FLISM) and token similarity-based learning (TSL). A new GLIT100k dataset with global image-caption pairs and derived local pairs validates the approach on DOCCI, DCI, MSCOCO, Flickr30k.

Vision RAG Embeddings

arXiv cs.AI·May 27

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Study on external tool use by medical AI agents under tool failures. Proposes GRPO-based RL framework with instance-level selection instead of task-level, probabilistic risk minimization rewards and disagreement-aware synergy learning. Evaluation on 7 medical benchmarks shows consistent robust improvements.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 27

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

Theoretical paper on detecting commutative factors in probabilistic factor graphs. Authors identify a flaw in the state-of-the-art algorithm: the central theorem provides only necessary, not sufficient conditions. They propose a corrected version ensuring correctness while maintaining efficiency.

Papers Reasoning

arXiv cs.AI·May 27

Automatic Layer Selection for Hallucination Detection

Study on automatic hallucination detection in LLMs. Researchers propose FEPoID (First Effective Peak of Intrinsic Dimension), a training-free method to select optimal intermediate layers. Tested on QA and summarization, it outperforms existing baselines with negligible computational overhead.

Reasoning Evals

arXiv cs.LG·May 27

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Reparametrization of Shampoo-based methods (KL-Shampoo, SOAP, KL-SOAP) enabling BFloat16 storage and reducing computational cost through subspace QR decomposition. Improves memory and time efficiency without performance degradation.

Reinforcement learning Benchmarks Papers

Hugging Face Blog·May 27

Reachy Mini goes fully local

Reachy Mini, Pollen Robotics' humanoid robot, now runs fully locally without cloud dependency. Integrates open-source models (Llama, Whisper) for vision, speech, and motor control. Deployed on embedded hardware.

Robotics Open source Llama

Vercel AI Blog·May 27

Experimental native binaries for Vercel CLI

Vercel CLI ships optional experimental native binary, faster and more secure without Node.js runtime dependency. Binaries are code-signed and credentials stored in system Keychain (macOS). Available on macOS, Linux, Windows for x64 and arm64.

Tools Infrastructure

Simon Willison·May 26

The pressure

Daniel Stenberg, curl maintainer, reports unprecedented surge in security reports: 4-5× higher than 2024, averaging over one per day. Reports are detailed and high-quality, AI-assisted. Despite extreme pressure, vulnerabilities found remain low to medium severity.

AI safety

Reddit r/LocalLLaMA·May 26

Cactus Hybrid Router: Gemma4-2B can match Gemini-3.1-Flash-Lite by routing 15-55% of tasks to Gemini And Running The Rest Locally.

Cactus Hybrid Router, a 65k parameter routing model, directs 15-55% of tasks to Gemini-3.1-Flash-Lite and runs the rest locally with Gemma4-2B. The system maintains performance even with 4-bit quantization and handles text, vision, and audio.

Gemini AI Agents Open source

ActuIA·May 26

GPT models are more confident on the difficult tasks where they err most, according to a USC/Berkeley preprint

GPT-4o, ChatGPT, and GPT-o3 display confidence exceeding their actual accuracy, with the gap widening on difficult tasks where they make the most mistakes. A USC/Berkeley preprint reveals growing divergence between stated confidence and real performance.

GPT OpenAI Evals

Reddit r/LocalLLaMA·May 26

I made a Windows app for managing llama.cpp in WSL/Ubuntu

llama.cpp Console is a Windows desktop app (WPF) for managing llama.cpp on WSL/Ubuntu without a terminal. It automates WSL/Ubuntu setup, CUDA/Vulkan installation, GGUF model downloads from Hugging Face, and llama-server launch with real-time monitoring (tokens, GPU, logs).

Llama Tools Open source

Reddit r/MachineLearning·May 26

Augmented Equivariant Mesh Networks for Anatomical Mesh Segmentation (ICML 2026 Workshops) [R]

EAMS (Equivariant Anatomical Mesh Segmentor) applies rotational equivariance to mesh networks for 3D anatomical segmentation. The model (<2M parameters) maintains performance under geometric perturbations (40° rotation) where existing methods drop 25-26 IoU points. Evaluated on 4 clinical tasks (intracranial aneurysm, intraoral segmentation, liver).

Papers Vision Reasoning

Reddit r/LocalLLaMA·May 26

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

MOSS-TTS-v1.5 improves multilingual speech synthesis (31 languages), zero-shot voice cloning, and stability. New features: explicit pause control, better long-reference short-text cloning, more stable punctuation-driven prosody. Open-source model on Hugging Face.

Voice Open source Code generation

Reddit r/LocalLLaMA·May 26

Small set of local MCP server installers for home Linux users

MCP Basic Servers: open-source bundle of Bash installer scripts for local MCP servers on Linux. Six servers included (web, files, memory, contacts, wiki_verifier, weather) with HTTP endpoints on ports 8001-8006. Designed for beginner/intermediate users in home-lab setups, tested on Arch and Ubuntu.

MCP Open source Tools

Reddit r/MachineLearning·May 26

[P] I built a system that lets you ask questions about any GitHub repo and get answers grounded in the actual source code

GitRAG lets users ask questions about any public GitHub repo and get answers grounded in source code with exact file paths and line numbers. System combines AST-aware parsing, dense embeddings, BM25 index, RRF fusion, and Cohere reranking before generation via llama-3.3-70b on Groq. Supports 15+ languages.

RAG Embeddings Code generation

The Decoder·May 26

The AI justice gap solution is slowly turning into an existential paperwork nightmare for US federal courts

MIT and USC study shows lawsuits filed without lawyers at US federal courts have nearly doubled since ChatGPT's mainstream adoption. One in five complaints now contains AI-generated text. Judges resort to drastic measures to handle the filing surge.

GPT Regulation AI safety

Reddit r/LocalLLaMA·May 26

Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

GRPO fine-tuning study on tiny models (Qwen2.5-0.5B, LFM-2.5-350M) for Reddit post summarization constrained to exactly 64 tokens. Compares staged training (length first, then quality) with joint training. The staged curriculum wins with G-Eval scores of 2.904 (LFM) and 2.817 (Qwen), vs. 2.376/2.332 for the zero-shot baseline.

Qwen Fine-tuning Reinforcement learning

Reddit r/LocalLLaMA·May 26

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

A rejected PR for llama.cpp speeds up prompt processing (PP) for MoE models by up to 30% on Qwen 3.5 MoE 35B. The gains shrink as the context window grows. The patch can be applied manually to current llama.cpp releases.

Open source Code generation Infrastructure

Reddit r/LocalLLaMA·May 26

I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

User optimized ASR (automatic speech recognition) on Intel Arrow Lake NPU via OpenVINO. Results: 4.8× faster and 10.7× less energy than CPU INT8 on 10s audio. NPU (13 TOPS) frees CPU and VRAM for other ML tasks, outperforming RTX 3060 eGPU in latency.

Code generation Voice Infrastructure

arXiv cs.LG·May 26

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Cascade-KDE is a training-free restoration method for time series corrupted by Gaussian noise and impulse outliers. It estimates temporal-amplitude density, applies Density-Truncated Robust Expectation to limit anomaly influence, then refines via exponential cascade. Tested on ECG and battery degradation, it preserves derivative peaks better than classical filters.

Benchmarks Evals

arXiv cs.CL·May 26

Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

Automated framework to detect lexical gaps (words absent in certain languages) using embeddings from multilingual LLMs. On Korean-English translation pairs, 4000 embedding spaces show gap words have weaker cross-lingual semantic alignment. Logistic classifiers achieve AUC 0.81–0.76 and retrieve 18/19 and 26/27 gap words.

Embeddings Benchmarks Papers

arXiv cs.CL·May 26

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

SLAP is a batch-aware data selection framework for instruction tuning that evaluates learnability at batch composition level rather than individual samples. Using stratified sampling and relative distance optimization with Hessian-approximated gradients, it matches full dataset performance with 20-40% less training data across LLaMA, ChatGLM, and diverse tasks (dialogue, translation, QA).

Fine-tuning Llama Benchmarks

arXiv cs.CL·May 26

CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

CUNY submits a pipeline for CLPsych 2026 shared task: classifying mental health states via ensemble of three open-weight LLMs with majority voting, predicting timeline changes with supervised classifiers, and summarizing mood dynamics through augmented in-context learning. Rankings: 1st (Task 1.1), 4th (1.2, 2), 3rd (3.1).

Benchmarks Reasoning Open source

arXiv cs.CL·May 26

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ is a benchmark evaluating LLM pragmatic reasoning on conversational implicature. Researchers reveal a generation-inference asymmetry: models generate plausible pragmatic scenarios but fail to recover intended implications at inference time. Structured prompting improves alignment for smaller models.

Benchmarks Reasoning Evals

arXiv cs.CL·May 26

Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

New arXiv paper on interpretable detection of harmful Chinese memes. Authors create Ex-ToxiCN-MM, the first explanation dataset with opposing interpretations (harmful/non-harmful), and C-HarmKB, a Chinese cultural knowledge base. They propose RIKE, an attribution analysis framework with AKE and RIR modules that outperforms baselines. Code and data are open-sourced.

Vision AI safety Evals

arXiv cs.CL·May 26

How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description

Study evaluating 6 LLM pipelines for generating bibliometric cluster descriptions. On 100 published analyses, LLMs produce descriptions semantically close to the human-written versions but hallucinate references and fail to infer bibliometric structure on their own. Performance is best in a hybrid workflow: algorithms define the clusters, LLMs generate readable descriptions.

Benchmarks Evals RAG

arXiv cs.CL·May 26

Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents

SA-RAG uses structured tables as intermediate representation to improve RAG for conversational agents. A quality-aware metadata generation framework enhances table quality from noisy data. Generation validation and direct preference optimization outperform RAG baselines on two real-world datasets.

RAG AI Agents Papers

arXiv cs.CL·May 26

Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

DaR (Decompose-and-Refine) is a framework for answering complex legal questions by decomposing them into atomic sub-questions and generating statute-aligned parametric queries. Evaluated on KoBLEX (Korean multi-hop benchmark) using Qwen3-32B and Gemma3-27B, DaR improves retrieval accuracy and answer quality while reducing hallucinations.

Reasoning RAG Qwen

arXiv cs.AI·May 26

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD is a self-alignment framework for LLMs that mines latent logical structures via consistency-regulated reward decomposition. Addresses three challenges: label noise from mimetic bias, coarse-grained supervision, and distributional collapse. Uses Variational Logic Potential and multi-agent value decomposition based on IGM principle.

Reasoning Reinforcement learning Alignment

arXiv cs.CL·May 26

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

GuardedRepair is a guarded best-of-N repair framework for LLM mathematical reasoning that selectively fixes incorrect traces while preserving correct answers. On GSM8K (95.60% → 96.89%), it fixes 17 of 58 errors with no measured broken-correct cases. On weak-reasoner ASDiv, accuracy improves from 78.40% to 87.60%.

Reasoning Evals AI safety

arXiv cs.AI·May 26

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Researchers replicate Picbreeder (an interactive image evolution platform) by replacing human users with Vision Language Models (VLMs). Results show qualitative differences from the human baseline. The study examines three causal factors: exploratory noise, behavioral diversity between agents, and memory of past actions.

Vision AI Agents Open source
