I fine-tuned Cohere Transcribe to support diarization and timestamps

Developer fine-tuned Cohere Transcribe to add diarization (speaker identification) and timestamps. The model emits a parsable output format with an average temporal precision of ±0.097s. Supports up to 4 speakers per 30s window, extensible to 32 with the diarize_long.py script. Available free on Hugging Face.

Open source Fine-tuning Voice

Reddit r/LocalLLaMA·May 22

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is an RL algorithm training language models to produce diverse solutions by anticipating multiple vector-valued reward functions. VPO replaces the GRPO advantage estimator and matches or beats scalar RL baselines across four tasks, with widening gaps as search budget grows.

Reinforcement learning Reasoning Code generation
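
For context on what VPO replaces: the standard GRPO advantage estimator scores each sampled completion against the other completions drawn for the same prompt. The sketch below shows only that scalar baseline, not VPO's vector-valued version, which the summary does not specify. The function name, the epsilon, and the use of population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Scalar rewards for four completions sampled from one prompt (made-up values).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)
```

Completions above the group mean get positive advantages and those below get negative ones, so the update needs no learned value function.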

The Decoder·May 22

OpenAI launches a ChatGPT PowerPoint plugin and warns it might accidentally delete your content

OpenAI launches a ChatGPT PowerPoint plugin in beta, able to create presentations from notes and documents, and edit existing slides. Available worldwide across all tiers. OpenAI recommends saving important decks before use due to risks of accidental deletion.

OpenAI Tools

Reddit r/LocalLLaMA·May 22

trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser

Trained prompt injection detector using ml-intern and DeepSeek v4 Flash. DistilBERT achieves F1 99%, compressed to ONNX int8 (~65 MB), runs in browser via Transformers.js v3. Total API cost under $5 with DeepSeek.

DeepSeek AI Agents AI safety

Reddit r/LocalLLaMA·May 22

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

ByteShape's CPU-5 quant for Qwen3.6-35B-A3B achieves 30% faster token generation than Unsloth UD-IQ4_XS on 6GB VRAM laptop GPU, with slightly slower prefill speed. Tested on RTX 3060 with 65536 token context.

Qwen Open source Tools

Reddit r/LocalLLaMA·May 22

I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Quantization benchmark on Qwen3-Coder-Next using 3× R9700 PRO. UD-Q5_K_M outperforms MXFP4_MOE on every reported quality metric (94% vs 89.4% top-1 accuracy, KL divergence 0.0217 vs 0.0746) at a decode speed penalty of about 10%. The author credits Unsloth's dynamic precision approach with reducing the errors that accumulate over long outputs.

Qwen Code generation Fine-tuning
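
The summary does not say how the post computed its metrics. One common approach compares the quantized model's next-token distributions against a higher-precision reference, position by position. The sketch below assumes that setup; the function names and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two probability vectors over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(dist):
    """Index of the most probable token."""
    return max(range(len(dist)), key=dist.__getitem__)

def top1_agreement(ref_dists, quant_dists):
    """Fraction of positions where both models rank the same token first."""
    same = sum(argmax(p) == argmax(q) for p, q in zip(ref_dists, quant_dists))
    return same / len(ref_dists)

# Toy next-token distributions at two positions over a three-token vocabulary.
ref = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
quant = [[0.6, 0.3, 0.1], [0.4, 0.35, 0.25]]

mean_kl = sum(kl_divergence(p, q) for p, q in zip(ref, quant)) / len(ref)
agreement = top1_agreement(ref, quant)
```

Lower mean KL and higher top-1 agreement both indicate that the quantized model tracks the reference more closely.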

Reddit r/LocalLLaMA·May 22

Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

New Qwen-27B-IQ4_KS quantization optimized for 16GB NVIDIA GPUs via ik_llama.cpp. The 14.1GB model delivers performance comparable to the previous IQ4_XS quant while running 1.5-1.75x faster, with a 105k-token context window. Tests: Needle In Haystack at 100k passed; perplexity 71.10.

Qwen Open source Tools

Reddit r/LocalLLaMA·May 22

Open source: cloned Rocky's voice from Project Hail Mary in two days, full pipeline + 2:10 of training audio + trained RVC v2 model

Rocky's voice (Project Hail Mary) cloned in two days via an open-source pipeline: audio extraction (ffmpeg + demucs), transcription (Whisper), diarization (pyannote), then RVC v2 training on 2:10 of audio. The trained .pth model (55MB) and code are public. Compared XTTS v2, YourTTS, RVC v2, and OpenVoice v2.

Voice Open source Code generation

arXiv cs.AI·May 22

Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

Neural method to estimate pairwise conditional mutual information in masked diffusion models (MDMs). Framework uses hidden states from pretrained MDMs with supervision from ground-truth MI computed from model's conditional distributions. Applied to Sudoku and protein sequence generation (ESM-C), reduces inference forward passes by 3-5x via MI-guided parallel decoding while outperforming entropy-based methods.

Papers Reasoning Code generation

arXiv cs.CL·May 22

Residual Skill Optimization for Text-to-SQL Ensembles

DivSkill-SQL optimizes Text-to-SQL ensembles through residual skill learning on failed examples, without fine-tuning. On Spider2-Lite, gains of +11.1 pts (Snowflake) and +8.3 pts (BigQuery) vs baseline. Skills transfer across SQL dialects without retraining.

Code generation AI Agents Benchmarks

arXiv cs.AI·May 22

$ECUAS_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

New ECUAS_n metric family for evaluating uncertainty-augmented systems that output predictions and uncertainty scores. Formalized as proper scoring rules, they enable tuning trade-offs between prediction errors and uncertainty imprecision per use-case. Validated on classification, generation, and TriviaQA.

Evals Benchmarks AI safety

arXiv cs.AI·May 22

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Insights Generator is a multi-agent system for diagnosing LLM agent failures at corpus scale. It formulates and tests hypotheses across execution traces to produce evidence-backed insight reports. Human experts using IG improve performance by 30.4pp; coding agents show consistent gains.

AI Agents Multi-agent Evals

arXiv cs.AI·May 22

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

AgentCo-op is a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs. Applied to genomics and coding/math benchmarks, it coordinates specialized agents without global topology search and reduces per-task cost versus multi-agent baselines.

Multi-agent AI Agents Code generation

arXiv cs.LG·May 22

Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring

ML study on 1,375 patients: predicting myocardial ischemia from non-contrast CT calcium scans. XGBoost+SHAP model combining Agatston score, 8 calcium-omics features, and age. Results: 98.9% precision, 79.2% sensitivity, 87.7% F1. Calcium-omics features significantly improve performance vs clinical variables alone (p<0.05).

Benchmarks Papers

arXiv cs.AI·May 22

Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

Novel Pseudo-Siamese architecture (FF-BPSN) for planning dialogue paths toward predefined targets. Uses two bidirectional transformer decoders with forward-focused module. Tested on DuRecDial and DuRecDial 2.0, significantly improves target-oriented proactive dialogue systems.

AI Agents Reasoning Benchmarks

arXiv cs.AI·May 22

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Study on reducing sycophancy (a model agreeing with the user even when the user is wrong) using off-the-shelf persona vectors. Vectors steered toward doubt/scrutiny achieve 68–98% of the sycophancy reduction obtained with CAA (Contrastive Activation Addition) while maintaining accuracy. The authors conclude that sycophancy is a persona-level property, not a single steerable direction.

Alignment AI safety Evals

arXiv cs.CL·May 22

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ishigaki-IDS-Bench is a benchmark for evaluating generation of Information Delivery Specification (IDS) XML files from BIM requirements. On 166 expert-validated examples in English/Japanese, the 10 best LLMs reach 65.6% macro F1 for content agreement, but only 27.7% pass the IDS Content audit. Models struggle to generate XML conforming to IDS standards and IFC vocabulary constraints.

Benchmarks Code generation Papers

arXiv cs.CL·May 22

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

Comparative study of RAG systems for Khmer. BGE-M3 outperforms Jina-Embeddings-v3 and Qwen3-Embedding in dense retrieval (Hit Rate@3: 0.285). Evaluation of 5 generators (Qwen3, Qwen3.5, Sailor2, SeaLLMs-v3, Llama-SEA-LION-v2) on 200 QA pairs using 6 RAGAS metrics. No single model dominates all criteria; retriever selection remains the bottleneck.

RAG Embeddings Benchmarks
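
Hit Rate@3, the retrieval metric quoted above, is usually defined as the share of queries for which at least one relevant document appears among the top three results. A minimal sketch under that assumption follows; the document IDs and the function name are made up for illustration.

```python
def hit_rate_at_k(retrieved, relevant, k=3):
    """Share of queries with at least one relevant document in the top-k results."""
    hits = sum(
        1
        for docs, gold in zip(retrieved, relevant)
        if any(doc in gold for doc in docs[:k])
    )
    return hits / len(retrieved)

# Ranked results per query, and the set of relevant IDs per query (toy data).
retrieved = [["d1", "d7", "d3"], ["d2", "d9", "d4"], ["d5", "d6", "d8", "d0"]]
relevant = [{"d3"}, {"d1"}, {"d0"}]

score_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
score_at_4 = hit_rate_at_k(retrieved, relevant, k=4)
```

In this toy data only the first query has a relevant document in its top three; the third query's match sits at rank four, so it counts only when k is raised to 4.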

arXiv cs.AI·May 22

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

Study on quantization of LLaMA-3.1 (8B) for qualitative analysis. 8-bit models maintain best precision; 4-bit, 3-bit, and 2-bit models suffer hallucinations. A multi-pass verification method reduces errors and stabilizes results, making low-bit models viable for qualitative research.

Llama Prompt engineering Evals

arXiv cs.CL·May 22

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Comparative study of four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, LLM-Based) for RAG on Khmer agricultural documents. Recursive chunking with 300 characters achieves best performance: L2 distance 0.4295, Answer Relevance 0.8663, Khmer IoU 0.6441. Statistically significant improvement over Sentence-Based (p=0.0121).

RAG Embeddings Benchmarks
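
Recursive chunking generally means splitting on the coarsest separator first and falling back to finer ones only when a piece is still too long. The sketch below is a generic illustration with a 300-character limit, not the paper's implementation. The separator list is an assumption, and Khmer text, which is often written without spaces between words, may need different separators.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_chunks(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "word " * 200  # 1000 characters of placeholder text
chunks = recursive_chunks(sample)
```

Each chunk stays within the limit, and neighbouring parts are merged greedily so chunks are not needlessly short.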

arXiv cs.CL·May 22

Hypergraph as Language

Hyper-Align introduces a hypergraph-native framework for LLMs. The method compiles hypergraph structures into tokens via HIDT-O (hybrid template) and HIP (incidence projector), preserving high-order associations. Evaluated on HyperAlign-Bench, it outperforms existing methods on vertex and hyperedge tasks.

Papers Reasoning Benchmarks

arXiv cs.CL·May 22

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

arXiv study on implicit Chinese toxicity attacks (CITA). Three-stage red-teaming framework (harmful intent learning, implicit toxicity enhancement, obfuscation rewriting) generating evaluation data. Seven tested detectors show 69.48% average miss-detection rate. Defense model CITD fine-tuned on CITA data improves robustness.

AI safety Alignment Evals

arXiv cs.LG·May 22

Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

New arXiv paper introduces the 'Representation Gap', a metric related to neural network generalization error with better asymptotic dynamics. Authors derive precise asymptotic equivalence governed by task intrinsic dimension, validated on synthetic and realistic datasets.

Papers Benchmarks Reasoning

arXiv cs.LG·May 22

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

Mean-field theory of dropout as perturbation of critical signal propagation at edge of chaos. Authors derive scaling laws and show smooth activations and ReLU-like activations form distinct universality classes. Front-loaded dropout schedules reduce test loss at no extra computational cost.

Papers Reasoning Evals

arXiv cs.AI·May 22

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

AiraXiv is an AI-driven open-access preprint platform designed for human and AI authors. It integrates AI-augmented analysis, reader feedback, and Model Context Protocol (MCP) for automated interactions. Deployed as submission platform for ICAIS 2025.

MCP AI Agents Papers

arXiv cs.LG·May 22

BlockFormer: Transformer-based inference from interaction maps

BlockFormer, a transformer architecture, infers parameters from interaction maps (Hi-C) by formulating the problem as a generic inverse problem. The method uses synthetically generated data from a custom simulator to accurately localize centromeres across multiple species.

Papers Reasoning

arXiv cs.CL·May 22

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

A framework uses an LLM to translate natural language queries into deterministic spatial operations against a PostGIS database. Tested on Massachusetts transportation safety data (crash records, roadway attributes, schools, bus stops), the system validates 29% of erroneous queries through a rule-based layer, preserving reproducibility while democratizing data access.

RAG AI Agents Evals

arXiv cs.LG·May 22

PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

PeakFocus is a unified framework for electricity load peak forecasting (ELPF), simultaneously predicting peak timing and intensity. It combines a peak-aware pipeline with triple hybrid loss, a multi-scale peak locator, and a location-aware decoder to overcome two-stage approach limitations. Evaluated on ELC and WLEL datasets.

Benchmarks Papers

arXiv cs.LG·May 22

Harnesses for Inference-Time Alignment over Execution Trajectories

Study of harness engineering for inference-time alignment of LLM agents. Authors decompose harnesses into task decomposition and guided execution mechanisms. They identify failure modes (over-decomposition, over-pruning) and show partial harnesses specifying only initial steps can outperform fully structured workflows.

AI Agents Prompt engineering Reasoning

arXiv cs.LG·May 22

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

Extension of Equilibrium Propagation framework to skew-gradient systems with demonstrated equivalence between deep Energy-Based Models and Hamiltonian neural networks. Applied to diffusively coupled Fitzhugh-Nagumo neuron networks, showing stationary solutions admit spatial Hamiltonian structure enabling Hamiltonian Echo Backpropagation methods.

Papers Reasoning Reinforcement learning

arXiv cs.CL·May 22

Unified Data Selection for LLM Reasoning

HES (High-Entropy Sum) is a training-free metric for selecting high-quality reasoning data in LLMs. Tested across SFT, RFT, and RL paradigms, it achieves full-dataset performance using only the top 20% of samples, significantly reducing computational overhead.

Reasoning Fine-tuning Reinforcement learning

arXiv cs.AI·May 22

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

VBFDD-Agent is a vehicle battery fault detection and diagnosis agent for electric vehicles using large language models. The system converts lithium-ion battery signals into natural language descriptions, integrates historical case retrieval and local maintenance manuals to generate structured, interpretable diagnostic results and maintenance recommendations.

AI Agents RAG Reasoning

arXiv cs.LG·May 22

Hierarchical Variational Policies for Reward-Guided Diffusion

Hierarchical variational framework for adapting pretrained diffusion models to reward-aligned objectives. Formulates test-time adaptation as a lightweight stochastic policy that amortizes per-step control. On 4x super-resolution: better perceptual quality with 5x faster inference than best baseline.

Reinforcement learning Image generation

arXiv cs.AI·May 22

Governance by Construction for Generalist Agents

CUGA introduces a modular governance system for generalist LLM agents in enterprise settings. Through five enforcement checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter), the platform enforces policies without model fine-tuning, ensuring compliance and auditability across compound workflows.

AI Agents AI safety Alignment

arXiv cs.CL·May 22

Pattern-and-root inflectional morphology: the Arabic broken plural

Computational model of Arabic inflectional morphology focused on broken plurals. Reverses traditional root-and-pattern paradigm into pattern-and-root. Applied to 3,200 noun entries with 160 inflectional classes (22 triliteral patterns, 3 quadriliteral). Formal separation of inflection, derivation, and semantics.

arXiv cs.AI·May 22

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

New inference-time method for flow models: Conflict-Aware Additive Guidance (CAR) corrects off-manifold drift when composing multiple constraints. Dynamically detects and resolves gradient conflicts. Validated on image editing, planning, and control tasks.

Reasoning Evals Code generation

arXiv cs.AI·May 22

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

DDS (Declarative Data Services) is an architecture for structured agentic discovery of data-system compositions. Addressing unbounded agentic discovery failures, the framework decomposes search into typed sub-searches via four contracts (intent, operator DAG, skills, runtime attribution). Tested on a trading-backend workload, DDS converges where unbounded approaches fail.

AI Agents Multi-agent Papers

arXiv cs.LG·May 22

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

I-SAFE is a post-hoc auditing framework for scientific AI models based on the Wasserstein Coherence Metric (WCM). It evaluates whether model predictions reflect domain structure or exploit statistical shortcuts. Tested on drug-target interaction prediction (DeepConvDTI, DeepDTA, TAPB), it reveals distinct distributional response profiles invisible to accuracy metrics.

Evals AI safety Alignment

arXiv cs.AI·May 22

High Quality Embeddings for Horn Logic Reasoning

Method for creating high-quality embeddings for Horn logic reasoning. Authors use triplet loss with three innovations: anchors with repeated terms, balanced easy/medium/hard examples, and periodic emphasis of hardest cases. Evaluation across multiple knowledge bases.

Embeddings Reasoning Papers
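
For readers unfamiliar with the objective this paper builds on, the standard triplet loss pulls an anchor toward a positive example and pushes it away from a negative one by at least a margin. The sketch below shows only that textbook form with Euclidean distance; the paper's three sampling innovations are not reproduced, and the margin and vectors are illustrative assumptions.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), Euclidean distance."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Easy triplet: the negative is already much farther away than the positive.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Hard triplet: the negative sits closer to the anchor than the positive does.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Easy triplets contribute zero loss, which is why the balance of easy, medium, and hard examples matters during training.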

arXiv cs.CL·May 22

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

GHI is a Graphormer-based framework for aspect-based sentiment analysis (ABSA). It uses a bipartite hypergraph structure to represent token-hyperedge incidence relations, integrating linguistic and semantic signals. With 247M parameters, GHI outperforms DeBERTa on six SemEval benchmarks and approaches Flan-T5 11B performance on ISE.

Papers Benchmarks Reasoning
