InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant proposes a training-free post-training quantization (PTQ) method for LLMs. It uses Peak Suppression Orthogonal Transformation (PSOT) to reshape activations into quantization-friendly distributions. On LLaMA-2 13B under W4A4KV4, it preserves 97% of floating-point accuracy and narrows the performance gap by 42% relative to the prior state of the art.

Llama Papers Benchmarks

arXiv cs.CL·May 27

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark with 1,400 turns across 300 sessions, evaluates GPT-5 mini, GPT-5.2, Claude Sonnet 4.5/4.6, and Opus 4.6. Key findings: without memory, accuracy collapses by Turn 3; working memory dominates complex architectures; Sonnet 4.6 regresses 17-33pp on SEC EDGAR vs Sonnet 4.5.

Benchmarks Code generation GPT

arXiv cs.CL·May 26

Raon-Speech Technical Report

Raon-Speech is a 9B multilingual speech language model (English/Korean) that understands and generates speech while preserving text capabilities. Trained on 1.38M hours of curated data, it outperforms 8 comparable audio models (including Qwen2.5-Omni and Fun-Audio-Chat) across 42 benchmarks. Raon-SpeechChat extends it with real-time full-duplex conversation, trained on 119K hours of dialogue.

Voice Benchmarks Open source

arXiv cs.CL·May 26

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

QUEST is a family of open-source models (2B to 35B) trained as deep research agents via a data synthesis pipeline and RL. With only 8K synthetic tasks, QUEST matches or exceeds proprietary systems across 8 research benchmarks and excels at citation grounding and report synthesis. Models, data, and training scripts are released.

AI Agents Reinforcement learning Open source

arXiv cs.LG·May 26

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

ChaosBench-Logic v2 is a 40,886-question benchmark evaluating logical reasoning of 14 LLMs on 165 dynamical systems. The CARE protocol reveals critical failures: regime-transition reasoning remains near-random (MCC=0.05), while FOL deduction reaches MCC=0.52. Qwen 2.5-32B outperforms proprietary models on indicator diagnostics.

Benchmarks Reasoning Qwen
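
For readers less familiar with the metric quoted above, here is a minimal sketch of the Matthews correlation coefficient (MCC) for a binary task. The confusion-matrix counts are hypothetical and are not taken from the benchmark; the point is only that a value near 0 means predictions are essentially uncorrelated with the labels, while 1 means perfect agreement.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to show the two ends of the scale.
perfect = mcc(tp=50, fp=0, fn=0, tn=50)      # every prediction correct
coin_flip = mcc(tp=25, fp=25, fn=25, tn=25)  # predictions unrelated to labels
print(perfect, coin_flip)
```

On this scale, the reported 0.05 for regime-transition reasoning sits close to the coin-flip case, and 0.52 for FOL deduction indicates a moderate positive correlation.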

arXiv cs.LG·May 26

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab proposes a closed-loop scientific discovery framework coupling hypothesis generation, hypothesis-conditioned experiment selection, and mechanism refinement. Evaluated on ActiveSciBench (57 enzyme-kinetics tasks, 45 gene-regulatory-network tasks), the system achieves 67.6% symbolic accuracy and 2-5x better sample efficiency than competing baselines.

Reasoning AI Agents Benchmarks

arXiv cs.AI·May 26

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

EvoCode-Bench evaluates 13 coding agents on 26 tasks with 5-15 iterative rounds. Agents must maintain a working codebase as specifications change. Results: 22-40 point gap between single-round (SR) and multi-turn (MT@4) performance, <50% success on multi-turn metrics, and progressive degradation (pass rate halved by round 5).

Code generation AI Agents Benchmarks

arXiv cs.AI·May 26

BODHI: Precise OS Kernel Specification Inference

BODHI, a domain knowledge prompting method, improves automated OS kernel specification generation via LLMs. Tested on 9 models (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), it reaches 96.73% Pass@1 with Claude Opus 4.6 versus 55.10% baseline, by structuring C-to-Python translation across pattern categories.

Prompt engineering Benchmarks Code generation

arXiv cs.AI·May 26

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

A study quantifying reasoning redundancy in LLMs finds that 61-93% of thinking steps can be truncated without affecting correct answers. The analysis covers 4 frontier models and 2 math benchmarks (including MATH-500). Redundancy is structural, stemming from length-agnostic outcome rewards, and is not a model-specific artifact.

Reasoning Benchmarks Papers

arXiv cs.CL·May 26

CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer

Study identifies 106 dedicated neural circuits in a sparse 8-layer transformer trained on Python code. Circuits organize by computational principles (atomicity, lexical ambiguity) rather than semantics. Up to 62.5% of loudest-firing neurons at mid-to-late layers are concept-specific for AST constructs.

Code generation Reasoning Papers

Reddit r/MachineLearning·May 25

Delta Attention Residuals [R]

Delta Attention Residuals improves residual connections by routing over layer deltas (vᵢ = hᵢ₊₁ − hᵢ) instead of cumulative hidden states. Results: −8.2% PPL at 7.6B, 1.8× sharper cross-layer routing (max weight 0.2→0.6), <0.01% parameter overhead. Code and paper released.

Papers Benchmarks Open source
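
The delta definition in the summary above (vᵢ = hᵢ₊₁ − hᵢ) can be illustrated with a small sketch. This is not the released code: the softmax routing and the `scores` logits are hypothetical stand-ins for whatever routing the paper actually learns, and the hidden states are toy two-dimensional vectors.

```python
import math


def layer_deltas(hidden_states):
    """Return v_i = h_{i+1} - h_i for each pair of consecutive layer outputs."""
    return [
        [b - a for a, b in zip(h_curr, h_next)]
        for h_curr, h_next in zip(hidden_states, hidden_states[1:])
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_over_deltas(hidden_states, scores):
    """Add a softmax-weighted mix of layer deltas to h_0.

    `scores` holds one hypothetical routing logit per delta.
    """
    deltas = layer_deltas(hidden_states)
    weights = softmax(scores)
    base = hidden_states[0]
    mixed = [
        sum(w * d[j] for w, d in zip(weights, deltas))
        for j in range(len(base))
    ]
    return [b + m for b, m in zip(base, mixed)]


# Toy hidden states h_0, h_1, h_2 (assumed values).
hidden = [[1.0, 0.0], [1.5, 0.5], [1.5, 2.0]]
deltas = layer_deltas(hidden)
routed = route_over_deltas(hidden, scores=[0.0, 0.0])
print(deltas, routed)
```

Because the deltas telescope, h₀ plus the sum of all deltas recovers the final hidden state; routing over deltas reweights each layer's individual contribution instead of the running total.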

arXiv cs.LG·May 25

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

FuRA introduces full-rank parameter-efficient fine-tuning via spectral preconditioning based on SVD. By freezing pretrained singular bases and optimizing only compact cores via block tensor-train factorization, FuRA outperforms full fine-tuning and LoRA on LLaMA-3-8B (+1.37 commonsense reasoning) and VLMs while maintaining LoRA-comparable efficiency.

Fine-tuning Llama Reinforcement learning

arXiv cs.CL·May 25

Brain-LLM Alignment Tracks Training Data, Not Typology

Brain-LLM alignment depends on training language dominance, not inherent English properties. Tested on 112 participants (English, Chinese, French) with 7 LLMs: a Chinese-dominant model (Baichuan2-7B) reverses the alignment gradient. Typological distance and tokenization fertility explain the remaining variation.

Benchmarks Alignment Papers

arXiv cs.CL·May 25

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Sparse autoencoders decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. Semantic features alone recover 94% of peak encoding performance (r=0.285) and align with known cortical semantic organization (ρ=0.72, p<0.001). Results generalize across English, Chinese, and French.

Papers GPT Llama

arXiv cs.AI·May 25

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

ImProver 2 is a neurosymbolic framework for automated proof optimization in Lean 4. A trained 7B-parameter model outperforms models that are orders of magnitude larger and is competitive with mid-tier frontier models. The scaffold exposes formal structure alongside lightweight informal abstractions.

Reasoning Fine-tuning Papers

arXiv cs.CL·May 25

Model Collapse as Cultural Evolution

Study showing that model collapse (progressive degradation of LLMs trained on their own outputs) follows cultural evolution laws. Tests on LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish reveal that compositionality follows a non-monotonic trajectory (rise, then fall). Task-grounded filtering, not random filtering, sustains quality.

Llama Mistral Papers

arXiv cs.AI·May 25

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

IDS (Inductive Deductive Synthesis) is a multi-agent LLM system that jointly synthesizes implementation and formal proof for distributed systems. On 7 key-value store specifications, IDS achieves 7/7 in 6.8h/$106, versus 2/7 for GPT-5.4 and Claude Opus 4.6. The result is 200x faster than expert effort and 17% cheaper than SOTA agents.

AI Agents Multi-agent Code generation

Reddit r/LocalLLaMA·May 24

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

hipEngine is an open-source (AGPLv3) LLM inference engine optimized for RDNA3 (RX 7900 XTX, W7900). Written in Python with HIP/C++ kernels, it runs Qwen 3.6 MoE faster than llama.cpp on prefill (2718 tok/s at 512 tokens vs 2436 for GGUF Q4_K_S). A near-lossless INT8 KV cache enables 256K context in <24GB.

Qwen Open source Infrastructure

Reddit r/LocalLLaMA·May 24

BitCPM-CANN: Native 1.58-Bit Large Language Model Training on Ascend NPU

BitCPM-CANN demonstrates native ternary (1.58-bit) quantization-aware training on Huawei Ascend NPU. Four models (0.5B–8B) retain 95.7–97.2% of full-precision performance across 11 benchmarks (reasoning, GSM8K, BBH). Training overhead is 4.5%. Memory reduction is 8× for weights and 6× end-to-end. It is presented as the first 1.58-bit training system scaled to 8B on a domestic NPU.

Fine-tuning Benchmarks Open source

Reddit r/LocalLLaMA·May 23

Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image

Qwen3.6 35B-A3B MTP reaches 249 t/s on an RTX 5090M (24GB), 3.4× faster than the dense 27B variant. The MoE architecture (128 experts, ~3B active params per token) combined with MTP (86.6% draft acceptance) explains the speedup. Context scales up to 262K tokens with minimal degradation.

Qwen Code generation Benchmarks

Reddit r/LocalLLaMA·May 22

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

BeeLlama v0.2.0 delivers major performance gains with DFlash optimization. On RTX 3090: Qwen 3.6 27B reaches 164 tps (4.40x speedup), Gemma 4 31B 177.8 tps (4.93x). Full Gemma 4 31B support, reduced DFlash overhead, improved prefill handling, stricter draft/target validation.

Qwen Open source Code generation

Reddit r/MachineLearning·May 22

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]

Numind releases NuExtract3, a 4B open-weight VLM based on Qwen3.5-4B under Apache-2.0 license. The model extracts structured data from complex documents (PDFs, forms, tables, invoices) to Markdown or JSON. Trained for 3 days on 8xH100, it supports multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6) and runs on 4GB VRAM minimum.

Vision Open source Code generation

arXiv cs.LG·May 22

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

AI text detectors amplify a pretrained typicality axis rather than construct an AI-vs-human boundary. On RoBERTa-base, raw projection onto centroid(AI)-centroid(HC3) achieves AUROC 0.806-0.944, matching or exceeding fine-tuning. A closed-form Jacobian predictor transfers to 16/16 third-party detectors with oracle-equivalence, reducing FPR by 57% on the OpenAI detector.

Evals Benchmarks AI safety
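
The "raw projection" baseline described above can be sketched in a few lines: take the difference between two class centroids and score each text by its dot product with that direction. The vectors below are toy two-dimensional stand-ins for RoBERTa-base embeddings, and treating the second centroid as the human-written reference set is an assumption made for illustration; the pairwise AUROC helper is also just a simple reference implementation.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def projection_scores(vectors, direction):
    """Dot product of each vector with the centroid-difference direction."""
    return [sum(x * d for x, d in zip(v, direction)) for v in vectors]


def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Toy embeddings (hypothetical values, not real model outputs).
ai_vecs = [[2.0, 1.0], [3.0, 0.0], [2.5, 0.5]]
human_vecs = [[0.0, 1.0], [1.0, 2.0], [0.5, 1.5]]

direction = [a - h for a, h in zip(centroid(ai_vecs), centroid(human_vecs))]
ai_scores = projection_scores(ai_vecs, direction)
human_scores = projection_scores(human_vecs, direction)
print(direction, auroc(ai_scores, human_scores))
```

No parameters are trained here: the only "model" is the difference of two means, which is what makes it a useful reference point against fine-tuned detectors.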

arXiv cs.LG·May 22

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Researchers train language models to forecast empirical success of research ideas before experimentation. On 11,488 idea pairs from PapersWithCode, an 8B model reaches 77.1% accuracy via SFT, outperforming GPT-5 (61.1%). RLVR approach generates interpretable justifications with 71.35% accuracy.

Reasoning Reinforcement learning Benchmarks

arXiv cs.LG·May 22

AgForce Enables Antigen-conditioned Generative Antibody Design

AgForce, an encoder-decoder architecture with a GNN, addresses three failure modes in antibody design: antigen blindness, vocabulary collapse, and inability to generate antigen-specific sequences. It uses framework dropout, gated bottlenecks, hyperbolic attention, and a Mixture Density Network. It improves amino acid recovery by 8% on CHIMERA-Bench.

Papers Benchmarks Code generation

arXiv cs.LG·May 21

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Chronicle is a 324M-parameter multimodal foundation model trained from scratch on natural language and time series in a unified architecture. Both modalities share the same transformer blocks and attention mechanisms. It matches Gemma-3-270M on 19 NLU tasks, sets new benchmarks on 24 UCR/UEA datasets, and outperforms supervised fusion baselines on Time-MMD.

Benchmarks Papers Reasoning

arXiv cs.CL·May 21

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

Two-phase RAG system for corporate credit analysis: phase 1 combines lexical and dense multilingual retrieval; phase 2 applies an adaptive controller and LLM-as-Judge scoring based on analytical utility rather than semantic similarity. It is deployed on-premise on a proprietary multilingual corpus. In production, document review time fell from hours to 3 minutes across 800+ analysts.

RAG Vector search Embeddings

arXiv cs.CL·May 21

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

Study across 11 generations of self-training on 5 models (GPT-2, Pythia, OPT). Contrary to uniform 'flattening', language restructures: surface markers (connectives, em-dashes) rise while deep syntactic structures (questions, passives, subjunctives) collapse. Structural Depth Hypothesis predicts this decay (ρ=0.540, p<10⁻⁶).

Papers Benchmarks GPT

arXiv cs.LG·May 21

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

Introspective Training (IXT) uses a thinking reward model to annotate data with natural language feedback from pre-training onward. On 7.5-12B LLMs trained up to 18T tokens, the method improves compute efficiency by 2.8x and achieves performance levels unattainable otherwise in math and code domains.

Reinforcement learning Reasoning Code generation

Reddit r/MachineLearning·May 20

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution [R]

CANTANTE solves credit assignment in LLM-based multi-agent systems by decomposing global rewards into per-agent optimization signals. Evaluated on MBPP, GSM8K, and HotpotQA, it outperforms GEPA and MIPROv2 (+18.9 pts MBPP, +12.5 pts GSM8K) with no inference overhead.

Multi-agent Prompt engineering Reinforcement learning

arXiv cs.LG·May 20

In-Context Learning Operates as Concept Subspace Learning

Mechanistic study of in-context learning (ICL) showing that structured demonstrations induce concept inference in low-dimensional subspaces. On Llama-3-8B, a 68–73-dimensional subspace of the 4096-dimensional space restores 78.8% of the clean–corrupted accuracy gap, while the complementary subspace has zero effect. Results are confirmed on Qwen2.5-7B and cross-lingual rule tasks.

Reasoning Llama Qwen

arXiv cs.AI·May 20

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

PRISM is a benchmark of 10,372 instruction-code pairs for evaluating programmatic video generation by LLMs. It proposes 4 metrics: code reliability, spatial coherence, visual complexity, and temporal density. Evaluation of 7 LLMs reveals a 41% execution-spatial gap: executable code does not guarantee spatially coherent output.

Benchmarks Code generation Video generation

arXiv cs.AI·May 20

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

IC-Q is an algorithm for decentralized multi-agent workflow learning under interface constraints. Each agent observes only a local function of the shared artifact and its private state, with no centralized access to joint trajectories. The paper gives a finite-sample convergence guarantee for neural Q-learning under decentralized partial observability.

Multi-agent Reinforcement learning AI Agents

arXiv cs.AI·May 20

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

DecisionBench is a benchmark for evaluating emergent delegation in long-horizon multi-agent workflows. The substrate includes 11 models (7 vendor families), GAIA/tau-bench/BFCL tasks, and multi-axis metrics (quality, cost, latency, routing fidelity). Results show quality alone masks orchestration signals, and delivery channel dominates description content.

AI Agents Multi-agent Benchmarks

arXiv cs.LG·May 20

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

UCCI is an LLM cascade router using uncertainty calibration to reduce inference costs. Via isotonic regression, it maps token-level margin uncertainty to per-query error probability, then selects escalation threshold via cost minimization. On 75,000 NER queries with 4B/12B models, UCCI cuts costs by 31% while reducing calibration error from 0.12 to 0.03.

AI Agents Evals Infrastructure
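
The two steps named in the summary above (isotonic calibration, then a cost-minimizing escalation threshold) can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the pool-adjacent-violators fit is the textbook algorithm, while the cost model (`cost_large` per escalated query, `error_penalty` per expected small-model error, a large model that is always right) and the toy data are hypothetical.

```python
def isotonic_fit(uncertainty, errors):
    """Pool-adjacent-violators: non-decreasing error rate vs. uncertainty."""
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    blocks = []  # each block stores [sum of errors, number of points]
    for i in order:
        blocks.append([errors[i], 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [uncertainty[i] for i in order], fitted


def pick_threshold(xs_sorted, p_err, cost_large, error_penalty):
    """Try every split point; escalate queries at or above the threshold."""
    n = len(xs_sorted)
    best_threshold, best_cost = None, None
    for k in range(n + 1):
        # Queries [0, k) stay on the small model, [k, n) are escalated.
        cost = error_penalty * sum(p_err[:k]) + cost_large * (n - k)
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_threshold = xs_sorted[k] if k < n else float("inf")
    return best_threshold, best_cost


# Toy calibration set: per-query uncertainty and whether the small model erred.
uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
errors = [0, 0, 1, 0, 1, 1]

xs_sorted, p_err = isotonic_fit(uncertainty, errors)
threshold, total_cost = pick_threshold(xs_sorted, p_err,
                                       cost_large=1.0, error_penalty=4.0)
print(p_err, threshold, total_cost)
```

Because the calibrated error probability is non-decreasing in uncertainty, the best split is a single cut point: escalate once the expected penalty of keeping the small model's answer exceeds the cost of calling the large one. Tied uncertainty values would need extra care that this sketch omits.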

arXiv cs.CL·May 20

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

ReacTOD combines neuro-symbolic methods and ReAct for task-oriented dialogue. A bounded ReAct loop with symbolic validation iteratively corrects dialogue errors (93.1% self-correction rate), eliminating hallucinations and format errors. On MultiWOZ 2.1: gpt-oss-20B reaches 52.71% JGA (+14pp), Qwen3-8B 47.34%. On SGD: Claude-Opus 80.68%, Qwen3-32B 64.09%.

AI Agents Reasoning Benchmarks

arXiv cs.LG·May 20

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Study of 63 base models reveals a hidden phase transition: below ~3.5B parameters, reasoning and truthfulness anticorrelate; above, they cooperate. Architecture, data curation, and training recipe independently shift this critical threshold. Width normalization eliminates the anticorrelation; frontier models reach r=+0.72. An open-source steering tool and diagnostic dashboard are released.

Benchmarks Alignment Reasoning

arXiv cs.LG·May 20

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

DynaTrain is a distributed training system enabling sub-second online reconfiguration of multi-dimensional parallelism. Using a Virtual Parameter Space abstraction, it reconfigures a 70B dense model in 2s and a 235B MoE model in 4.36s, outperforming existing elastic systems by up to three orders of magnitude.

Infrastructure Reinforcement learning Papers

Reddit r/LocalLLaMA·May 19

Nemotron-Labs-Diffusion from NVIDIA

NVIDIA releases Nemotron-Labs-Diffusion, a tri-mode model (AR, diffusion, self-speculation) in 3B/8B/14B sizes. Self-speculation combines diffusion drafting and AR verification with a shared KV cache: 3× higher acceptance length vs Qwen3-8B-Eagle3, 2.2× speedup, 4× speedup on GB200 (1015 tok/sec with custom CUDA kernels).

Code generation Benchmarks

arXiv cs.AI·May 19

Tongyi DeepResearch Technical Report

Tongyi DeepResearch is a 30.5B-parameter agentic LLM (3.3B activated per token) designed for autonomous long-horizon research tasks. Trained via agentic mid-training and post-training with automatic data synthesis, it achieves SOTA on Humanity's Last Exam, BrowseComp, WebWalkerQA and other benchmarks. Model, framework and solutions are open-sourced.

AI Agents Reasoning Benchmarks
