Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

Comparative study of generic vs domain-specific embeddings for multilingual clinical search (ICD-10-CM). A bi-encoder fine-tuned on Gemini-generated synthetic data (6 languages) outperforms BioBERT-ST: R@5=0.822 vs 0.790, with major gains in Portuguese (+0.115). Open recipe for LLM-based medical retrievers.

Embeddings RAG Benchmarks
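
As a rough illustration of how an R@5 figure like the ones above is computed, the sketch below implements Recall@k for a retriever. The function names, the toy rankings, and the codes used as labels are illustrative assumptions, not the paper's data or evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical toy runs: each query has a single gold code.
toy_runs = [
    # Gold code is ranked first, so this query counts as a hit.
    (["E11.9", "E10.9", "E13.9", "I10", "J45.909", "J18.9"], ["E11.9"]),
    # Gold code is ranked sixth, outside the top 5, so this is a miss.
    (["J45.909", "J44.9", "J20.9", "J06.9", "I10", "J18.9"], ["J18.9"]),
]

score = mean_recall_at_k(toy_runs, k=5)
print(f"R@5 = {score:.3f}")
```

With one gold code per query, Recall@k reduces to the share of queries whose gold code lands in the top k.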

arXiv cs.AI·Jun 1

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

LLM-FACETS is an open-source framework for evaluating LLM factuality, epistemic calibration, and reproducibility. Web interface, plugin architecture, deterministic metrics (BLEU, ROUGE, BERTScore) run locally, log-probability visualization, multi-judge consensus, RAG Triad metrics. Designed for technical experts, domain experts, and compliance officers per EU AI Act and NIST standards.

Evals AI safety Alignment

arXiv cs.LG·Jun 1

SubsurfaceGen: Procedural Generation of Field-Scale Earth Models and Seismic Data

SubsurfaceGen is a GPU-accelerated generator for 3D velocity models and seismic data at field scale. Authors release a dataset of 4,276 2D slices covering 6 geological settings (10 km × 10 km × 6.19 km at 10 m resolution). Evaluation of neural operators on wavefield prediction and end-to-end velocity inversion with out-of-distribution testing.

Benchmarks Papers Open source

arXiv cs.AI·Jun 1

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks. Built via an EHR-LLM-KB pipeline, it generates ~960k QA items covering diagnosis, treatment, and prognosis. 30+ LLMs benchmarked reveal persistent gaps toward clinical reliability.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 1

Cross-Lingual Steering for Figurative Language Generation

Activation steering study across four multilingual LLMs (5 figurative categories, 6 languages). Directions learned in one language transfer effectively to others, particularly German. Composite cross-lingual directions match or exceed native directions, providing direct evidence of reusable but target-dependent figurative signals across languages.

Reasoning Multi-agent Papers

arXiv cs.CL·Jun 1

Configurable Reward Model for Balanced Safety Alignment

CSRM (Configurable Safety Reward Model) jointly optimizes calibrated safety compliance and reward modeling to adapt LLMs to heterogeneous and evolving safety requirements. Achieves 94.6% F1 on CoSApien and 75.8% F1 on DynaBench without additional human annotation.

AI safety Alignment Reinforcement learning

arXiv cs.LG·Jun 1

Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

Researchers train a small encoder-decoder transformer on the zeta map, a classical bijection in q,t-Catalan combinatorics. Mechanistic interpretability tools (cross-attention analysis, linear probing, causal intervention) reveal a level-based mechanism. The authors translate it into an explicit peak-centered traversal algorithm (the scaffolding map), which is proven equivalent to the zeta map.

Reasoning Papers

arXiv cs.CL·Jun 1

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

ImmigrationQA is a source-grounded QA dataset of 17,058 pairs across 13 U.S. immigration law subdomains. Llama 3.2 3B was fine-tuned with LoRA on a corpus of 10,056 validated documents. The fine-tuned model scores 1.08/3.0 (16.8% fully correct) vs 0.85/3.0 (4% fully correct) for the Llama 3 8B base model, a 27% relative improvement. Cost: ~$29. Dataset, model, and code are publicly released.

Llama Fine-tuning RAG

Reddit r/LocalLLaMA·Jun 1

I bolted an 8-arm reasoning MoE onto a frozen 1.4B Mamba backbone on a single RTX 3060. Here’s the mechanistic autopsy of what broke and what worked.

A researcher built Mamba-Titan-1.4B-Reasoning (2.54B params MoE) on RTX 3060 by freezing a 1.4B Mamba-1 backbone and adding 8 trainable experts. Trained on DeepSeek CoT traces, the model developed a 'vault door' mechanism: the </think> token isolates at the smallest norm (1.991 vs 4.742 mean) to control latent reasoning termination.

Reasoning Fine-tuning Open source

Reddit r/LocalLLaMA·May 31

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

Systematic comparison of 13 abliterated Gemma 4 E2B variants across 44 GPU hours. coder3101 achieves a 96% attack success rate (ASR, measured on refusals) with full capability preservation and outperforms the base model on math. Surgical approaches preserve performance better than aggressive methods, with some losing up to 6.9 points on GSM8K.

Gemini AI safety Alignment

Reddit r/LocalLLaMA·May 30

Parallax: Parameterized Local Linear Attention for Language Modeling

Parallax is a parameterized Local Linear Attention mechanism for LLMs derived from statistical regression. It replaces softmax's local constant estimate with a linear estimate, yielding better bias-variance tradeoffs. Pretrained at 0.6B and 1.7B scales, Parallax shows consistent perplexity improvements and matches or outperforms FlashAttention 2/3 in decoding.

Reasoning Benchmarks Papers

Reddit r/LocalLLaMA·May 29

vLLM PR adding native HIP W4A16 kernel was merged

vLLM merged a PR adding native HIP W4A16 kernel for ROCm. Benchmarks show significant gains: 270.2 tk/s in fp16 (max-num-seqs=8) and 445.7 tk/s (max-num-seqs=32), outperforming previous Triton implementations.

Open source Infrastructure Benchmarks

arXiv cs.AI·May 29

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA is a framework for automatically generating complex web tasks with executable trajectories. It combines crawling, retrieval, in-context generation, and quality control across 50+ websites (e-commerce, government, forums, news). The benchmark reveals a significant performance gap between humans and AI agents.

AI Agents Benchmarks Papers

arXiv cs.LG·May 29

Representation Alignment Rests on Linear Structure

Investigation of the Platonic Representation Hypothesis through a tripartite framework: signal (universal linear object-attribute relationships), bias (architecture differences, mitigated by centering/normalization), and noise (word frequency correlation with alignment). Sparse autoencoders show stronger cross-modal alignment than dense representations.

Embeddings Papers Reasoning

arXiv cs.LG·May 29

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

PrismFlow introduces a new Flow Matching method for time-series generation using Koopman-inspired dynamical experts that learn residual corrections in latent space with a confidence-aware Winner-Take-All objective. Results: +15.6% Context-FID gain and +38.6% Discriminative Score improvement across benchmarks.

Papers Benchmarks Reasoning

arXiv cs.CL·May 29

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Preference optimization method guided by hallucination detection to improve clinical summarization reliability. On Llama-3.1-8B-Instruct, reduces hallucinations by 24% at inference and 48% after fine-tuning, preserving fluency. Evaluated on MIMIC-IV.

Llama Fine-tuning AI safety

arXiv cs.CL·May 29

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc is an automated multi-agent pipeline for generating function-calling training data. Starting from reliable tools in public benchmarks, the system produces diverse conversations with multi-stage quality control. An 8B model fine-tuned on this synthetic data outperforms similarly-sized open-source models in in-domain performance and out-of-domain generalization.

Multi-agent Code generation Fine-tuning

arXiv cs.CL·May 29

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

Study of 17 models (410M-100B+ parameters) showing instruction-tuning causes linguistic entropy collapse (amplification: 1,949-16,853%), independent of RLHF. Strong control (lambda=5.0) reduces this effect by 40.5% and outperforms frontier models by 96.7-98.2% despite 200-1000x scale disadvantage.

Papers Alignment Fine-tuning

arXiv cs.LG·May 29

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

arXiv paper proposing strategic robustness for simulator learning in MBRL. Formulates objective as minimax game between model and adversarial policy player. Proves convergence with sublinear regret bounds and Error-MDP duality. Experiments show 1.5–2.2× reduction in prediction error and simulation-trained policies matching near-optimal real-world performance.

Reinforcement learning Papers Reasoning

arXiv cs.AI·May 29

PRO-CUA: Process-Reward Optimization for Computer Use Agents

PRO-CUA introduces a process-reward optimization framework for training computer use agents (CUAs). The method decouples live environment interaction from policy optimization through iterative step-level reinforcement learning, using a process reward model (PRM) to provide dense feedback signals without relying on expert trajectories or golden answers.

AI Agents Reinforcement learning Reasoning

arXiv cs.LG·May 29

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conf-Gen adapts conformal prediction (CP) and conformal risk control (CRC) to generative models (LLMs, image generators, AI agents). The framework provides formal uncertainty guarantees for unsupervised tasks, extending conformal methodology to new domains.

Papers Evals AI safety

arXiv cs.AI·May 29

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

OpenClawBench is a dataset of 31,264 annotated trajectories to detect process-side anomalies in agent execution beyond task success. Among 31,135 passing executions, 2,904 contain anomalies (unresolved ambiguity, unsafe writes, ignored errors). A fine-tuned Gemma 3 12B detector reaches F1=0.729.

AI Agents Benchmarks Evals

arXiv cs.AI·May 29

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Reasoning models maintain factually correct chain-of-thought traces but flip their final answer under sustained adversarial pressure in multi-turn dialogue. This unfaithful capitulation affects ~50% of cases in think mode and 11-15% without reasoning. The effect correlates with reasoning architecture (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it).

Reasoning Evals AI safety

arXiv cs.LG·May 29

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Comparative study of RL vs SFT on Qwen2.5-3B-Instruct: reinforcement learning better preserves internal circuits of the base model than supervised fine-tuning (SFT), which adapts faster but destroys more prior capabilities. Proposed metric: differential circuit vulnerability at attention head level.

Reinforcement learning Fine-tuning Papers

arXiv cs.AI·May 29

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

LLM agents (Claude and GPT) automatically annotate biological phenotypes by linking free-text descriptions to ontology terms. Tested on the Dahdul et al. (2018) Gold Standard benchmark, all agents fall within inter-curator human variability and substantially outperform the Semantic CharaParser NLP tool on all four metrics.

AI Agents Claude GPT

arXiv cs.AI·May 29

Orthogonal Concept Erasure for Diffusion Models

Orthogonal Concept Erasure (OCE) proposes an editing method to remove undesired concepts from diffusion models using multiplicative orthogonal transformations. Unlike existing additive approaches, OCE preserves neuron magnitude and angular geometry while precisely erasing concepts. The approach erases up to 100 concepts in 4.3 seconds.

Papers AI safety Alignment

arXiv cs.LG·May 29

Label-Free Reinforcement Learning via Cross-Model Entropy

Cross-Model Entropy (CME) proposes a label-free reward signal for LLM post-training RL. CME uses mean log-likelihood of responses under an independent verifier model, avoiding self-consistency and reward hacking. Integrated into GRPO, CME achieves 52.5–71.4% tie-adjusted win rates on UltraFeedback/AlpacaEval 2.0 across Qwen, Llama, Gemma, OLMo.

Reinforcement learning Llama Qwen
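
The reward described above can be sketched roughly as follows. This assumes the per-token log-probabilities of each sampled response have already been obtained from the independent verifier model. The numbers, the function names, and the mean/standard-deviation normalization within a GRPO-style group are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def mean_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood of one response under the verifier."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier log-probs for three responses to one prompt.
verifier_logprobs = [
    [-0.2, -0.4, -0.3],        # verifier finds this response likely
    [-1.5, -2.0, -1.0, -1.5],  # verifier finds this response unlikely
    [-0.9, -0.9],              # in between
]

rewards = [mean_log_likelihood(lp) for lp in verifier_logprobs]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Taking the mean rather than the sum of token log-probabilities keeps longer responses from being penalized purely for their length.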

arXiv cs.AI·May 29

ReasonOps: Operator Segmentation for LLM Reasoning Traces

ReasonOps is an unsupervised method for analyzing chain-of-thought traces from LLMs. It identifies 7 recurring reasoning operators (backtracking, inferring, hypothesizing) from 44,662 traces across 12 models on 8 benchmarks. These operators enable source model identification at 70-76% accuracy and predict answer correctness before trace completion.

Reasoning Evals Papers

arXiv cs.AI·May 29

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE iteratively co-trains a dense encoder and LLM rewriter to improve tool retrieval over large API catalogs. On ToolBench (~10k tools), three rounds gain +2.5 pp NDCG@5 on standard queries and +6.3 pp on vague queries, outperforming single-component baselines.

AI Agents RAG Embeddings

arXiv cs.AI·May 29

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

A2X is a service discovery system for LLM agents that automatically organizes services into hierarchical taxonomies. It solves context scarcity by walking the hierarchy layer-by-layer, reducing token consumption by 89% while gaining 6.2 Hit Rate points over full-context dumping and +20 points over embedding baselines.

AI Agents MCP RAG

arXiv cs.LG·May 29

OISD: On-Policy Internal Self-Distillation of Language Models

OISD introduces on-policy internal self-distillation to improve language model reasoning. The final layer acts as a detached teacher for intermediate layers via logit alignment (reasoning behaviors) and attention alignment (attention patterns), without external privileged information. Positive results across four mathematical reasoning tasks.

Reinforcement learning Reasoning Papers

arXiv cs.LG·May 29

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

Study of the log-alignment ratio (LAR), a parameter-activation alignment metric. LAR predicts the memorization-to-generalization transition in grokking (effective dimension k ≈ n^(2(1-LAR))) and in 3B-parameter language model pre-training. It is computable without validation data at negligible overhead.

Papers Reasoning Evals

arXiv cs.CL·May 29

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Study on source-dependence in multi-source medical RAG systems. Authors demonstrate that the same system can produce different answers depending on retrieved source, revealing a missing evaluation axis in NLP. They introduce TransplantQA (benchmark), HERO-QA (hierarchical retrieval strategy), and a structured judge to audit inter-source relationships using a validated 5-label taxonomy.

RAG Evals Papers

arXiv cs.CL·May 29

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench evaluates 11 LLMs (3B–675B) on 5 Ukrainian legal reasoning tasks from 99.5M court decisions. Results show task-dependent few-shot effects: +38.6 pp improvement for judgment form classification, but mixed effects on outcome prediction. Accuracy is misleading on imbalanced tasks: highest accuracy model (62%) is a majority-class predictor (macro-F1: 23%).

Benchmarks Evals Papers
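
The point about accuracy on imbalanced tasks can be reproduced with a toy example. The sketch below scores a majority-class predictor on made-up labels; the class names and the 70/20/10 split are assumptions chosen for illustration and are not taken from the benchmark.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced label set: 70% / 20% / 10%.
y_true = ["granted"] * 70 + ["denied"] * 20 + ["partial"] * 10

# A predictor that always outputs the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro-F1={mf1:.3f}")
```

The predictor never identifies the two minority classes, so their F1 is zero and the macro average falls far below the accuracy.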

arXiv cs.AI·May 29

Robust and Efficient Guardrails with Latent Reasoning

COLAGUARD is a guardrail model that transfers multi-step safety reasoning into continuous latent space via a stage-wise training curriculum. Evaluated on 10 moderation tasks across 8 safety benchmarks, it improves macro-F1 by 8.24 points over Llama Guard 3 and matches GuardReasoner performance while delivering a 12.9× speedup and a 22.4× token reduction.

AI safety Reasoning Evals

arXiv cs.CL·May 29

Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models

Study of chain-of-thought (CoT) transfer across models using a provider-receiver framework. Full traces often transfer successfully, but mechanisms vary: answer extraction (AIME), receiver competence (MMLU-Pro), or partial structured information (ZebraLogic). In free-generation mode, partial CoTs improve performance, suggesting guidance for continued reasoning.

Reasoning Prompt engineering Benchmarks

arXiv cs.LG·May 29

Model Merging by Output-Space Projection

Model merging formulated as convex quadratic programme over residual updates. Subsumes existing methods (task arithmetic, model soups, TIES, DARE) and provides closed-form diagnostic predicting merge quality via fraction of residual energy captured. Consistent gains across language and vision benchmarks.

Fine-tuning Benchmarks Papers

arXiv cs.CL·May 29

Specialty-Specific Medical Language Model for Immune-Mediated Diseases

Domain-specific NER model for identifying clinical entities in immunology and infectious disease contexts. 371 manually annotated case reports by clinical specialists. Transformer-based model with clinical embeddings achieves F1=0.89, outperforming BERT and zero-shot approaches. Supports case report analysis and clinical decision support.

RAG Fine-tuning Evals

The Decoder·May 28

Anthropic ships Claude Opus 4.8 as a "modest but tangible improvement" that tops GPT-5.5 in most benchmarks

Anthropic releases Claude Opus 4.8, outperforming GPT-5.5 and Gemini 3.1 Pro on most benchmarks. The model catches its own coding errors 4× better than its predecessor. Anthropic also rolls out dynamic workflows enabling hundreds of parallel sub-agents for codebase-wide migrations.

Claude Benchmarks Code generation

Reddit r/MachineLearning·May 28

Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]

Wall-OSS-0.5 is a 4B VLA from X Square Robot with open training code. Zero-shot evaluation on 17 real-robot tasks: 4 tasks >80% progress, including Rope Tightening (82%). Post fine-tuning: 60.5% average task progress (+17.5pp vs pi0.5). Mixture-of-Transformers architecture with vision-aligned RVQ tokenizer and distributed DMuon optimizer.

Robotics Vision Code generation
