The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

Causal study on grokking: the delay before generalization depends on weight norm. Under free weight decay, networks grok at a stable critical norm Wc (CV 1–2%). When norm is clamped to ρ×Wc, delay follows T_grok ∝ exp(α·ρ) with α≈7.5 (R²=0.996 across 4 moduli). Norm controls delay 19× more than learning rate.

Reasoning Papers Benchmarks
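
To make the delay law in the summary above concrete, here is a minimal sketch that evaluates T_grok ∝ exp(α·ρ) as a ratio between two clamped norms. Only α≈7.5 comes from the summary; the function name `relative_delay` and the sample ρ values are illustrative assumptions, and the proportionality constant is left out because the summary does not give it.

```python
import math

ALPHA = 7.5  # fitted exponent quoted in the summary (alpha ≈ 7.5)


def relative_delay(rho: float, rho_ref: float = 1.0, alpha: float = ALPHA) -> float:
    """Return T_grok(rho) / T_grok(rho_ref) under T_grok ∝ exp(alpha * rho).

    The unknown proportionality constant cancels in the ratio.
    """
    return math.exp(alpha * (rho - rho_ref))


if __name__ == "__main__":
    # Hypothetical clamp ratios, chosen only for illustration.
    for rho in (0.8, 1.0, 1.2, 1.5):
        print(f"rho={rho:.1f}: delay x{relative_delay(rho):.2f} relative to rho=1.0")
```

Under this reading, clamping the norm 10% above Wc multiplies the delay by exp(0.75), roughly 2.1×.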

arXiv cs.CL·Jun 15

Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

GAMA-Bench, a benchmark of 1,298 paired scenarios, reveals systematic asymmetry: LLMs apply harsher response standards to male actors than female actors for identical misconduct. Male actors receive more punitive and blame-centered framing, while female actors receive therapeutic and empathy-oriented responses. The pattern persists across 10 models and all scenario types.

Evals AI safety Alignment

Reddit r/LocalLLaMA·Jun 13

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning

Zyphra releases ZONOS2, an open-source TTS model (Apache 2.0) with 8B parameters and 900M active at inference. Sparse MoE focused on zero-shot high-fidelity voice cloning (44.1 kHz DAC). Prosody score 88.7, outperforming Qwen 3 TTS (87.6) and ElevenLabs V3 (83.2). Trained on 6M+ audio hours, reads raw UTF-8 without phonemizer.

Voice Open source Benchmarks

Reddit r/LocalLLaMA·Jun 12

MiniMax Sparse Attention (MSA)

MiniMax introduces MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on GQA for ultra-long contexts (up to 1M tokens). On a 109B multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context, with 14.2x prefill and 7.6x decoding speedups on H800. Code and the MiniMax-M3 model have been released.

Reasoning Infrastructure Open source

Reddit r/LocalLLaMA·Jun 12

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

InfiniteKV compresses the KV cache into 104-byte searchable records stored in RAM or on disk instead of deleting old tokens. Mistral-7B correctly answers at token 76,747 (2.3× its 32,768-token training window). One million tokens require ~3 GB instead of 122 GB.

Open source Infrastructure Llama
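
The 122 GB baseline quoted above can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes the commonly published Mistral-7B attention shape (32 layers, 8 key/value heads, head dimension 128) and fp16 storage; none of these values are stated in the post, and the function names are illustrative. It does not model how InfiniteKV reaches its ~3 GB figure.

```python
# Assumed Mistral-7B attention shape (not stated in the post).
N_LAYERS = 32        # transformer layers
N_KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16 storage


def kv_bytes_per_token() -> int:
    """Bytes of uncompressed KV cache per token; the factor 2 covers keys and values."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(n_tokens: int) -> float:
    """Uncompressed KV cache size in GiB for n_tokens tokens."""
    return n_tokens * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    print(f"per token: {kv_bytes_per_token()} bytes")
    print(f"1M tokens: {kv_cache_gib(1_000_000):.1f} GiB")
    print(f"window ratio: {76_747 / 32_768:.2f}x")
```

Under these assumptions the uncompressed cache costs 131,072 bytes per token, which gives about 122 GiB for one million tokens and matches the baseline in the summary.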

arXiv cs.CL·Jun 12

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD is a 7B parameter model for mechanism-level drug-drug interaction prediction (enzyme, pharmacodynamic axis). Uses reasoning distillation with process-reward-weighted DPO and mechanism-aware retrieval. On April-2026 DrugBank: +13.9pp over best baseline, +6.7pp over GPT-4o, with robust generalization to unseen drug pairs.

Reasoning Fine-tuning Reinforcement learning

arXiv cs.CL·Jun 12

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

LEDGER is a benchmark of 4,999 digitized corporate annual reports to evaluate LLM long-context capabilities in finance. The corpus includes 31 consolidated financial KPIs, 118,048 TREC-style retrieval questions, and extraction tasks on numerically dense documents. Case study: correlation between CEO rhetoric and post-publication market impact.

Benchmarks RAG Reasoning

arXiv cs.CL·Jun 12

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel empirically characterizes the Metal 4.1 tensor compute path on the Apple M4 Max. The researchers find that fp8 (E4M3) matmul2d is emulated, not accelerated (0.94x fp16 throughput), executes on GPU shader cores without a dedicated matrix datapath, and accumulates in ≥fp32. A hand-fused GEMM+bias+GELU kernel gains +6.5-12.9% in the cache-resident regime.

Benchmarks Infrastructure Code generation

arXiv cs.AI·Jun 12

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ is a medical vision-language model pretrained on 14 datasets (~3.35M samples) covering pathology, radiology, microscopy, and clinical QA. It achieves 75.9 BLEU-1 on PathVQA (outperforming Med-PaLM M 562B) and 0.757 average macro-F1 on 8 unseen medical classification benchmarks.

Vision Benchmarks Open source

arXiv cs.AI·Jun 12

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor is a multi-agent framework introducing tree search as a cognition layer for autonomous agents. Validated on full-stack LLM inference optimization, it pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture. Arbor achieves 193% throughput-latency Pareto improvement over vendor-optimized baselines, versus 33% for a single agent that crashes within hours.

AI Agents Multi-agent Reasoning

arXiv cs.AI·Jun 12

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover is an open-source family of efficient Lean theorem provers (4B and 32B parameters, including a diffusion-based prototype). Via curriculum SFT and Augmented Lean Formalisation (ALF), the 4B model outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test (86.1% vs 82.4%) with 167x fewer parameters. The 32B achieves 93.0% on MiniF2F-Test and solves 93/672 PutnamBench problems.

Reasoning Code generation Benchmarks

arXiv cs.CL·Jun 11

AI Coding Agents Can Reproduce Social Science Findings

SocSci-Repro-Bench, a benchmark of 221 social science tasks, evaluates AI agents' ability to reproduce published findings. Claude Code substantially outperforms Codex, with reproduction rates exceeding those of previous LLM-based agent benchmarks. Agents also perform strongly on reasoning tasks that require identifying research questions, and the results do not appear to be driven primarily by memorization.

Claude Code Benchmarks Code generation

arXiv cs.AI·Jun 11

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND is a framework for multi-agent orchestration that integrates real-time infrastructure state (GPU queue depths, KV-cache pressure, latencies). Via adaptive planning, per-step routing, and intelligent scheduling, it optimizes model selection and topologies under concurrent load. Results: +7.6pp accuracy gain at low load, 7x lower latency, 99.9% SLO compliance under high load.

Multi-agent AI Agents Reinforcement learning

arXiv cs.LG·Jun 11

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo is a hierarchical flow matching framework for de novo protein generation. It combines coarse-to-fine generation (backbone then atoms), functional guidance via pretrained predictors, and SE(3)-equivariant architecture. On enzyme active site scaffolding, ProHiFlo achieves 58.9% success rate vs 41.2% for RFDiffusion, with 4× fewer sampling steps.

Papers Benchmarks Reasoning

arXiv cs.CL·Jun 11

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE is a three-stage synthesis paradigm for generating multi-turn OS-agent trajectories with live execution. 43,956 structured intents, 23,132 trajectories (avg 8.12 user turns), execution in isolated OS workspace. Fine-tuning Qwen3-8B on ISETrace: ClawEval 19.3→37.7 pass@1, outperforms zero-shot GPT-4o and Qwen3-32B.

AI Agents Benchmarks Code generation

arXiv cs.CL·Jun 11

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

First end-to-end RAG pipeline running all neural stages on mobile NPU (Snapdragon X Elite Hexagon). Embedding, reranking, LLM generation on-device. On 120-query Wikipedia benchmark: 18.1x faster LLM prefilling, 4.0x lower system energy vs CPU, answer quality parity (GPT-4.1 judge: 9.32 vs 8.95 CPU).

RAG Embeddings

Reddit r/LocalLLaMA·Jun 10

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), an inference paradigm reducing KV cache footprint to 13.5% of baseline on ultra-long contexts (500K tokens). A Neural Memory Indexer predicts future context demands and preserves only query-critical chunks in GPU memory, without loading the full backbone model. Results: +0.6% average accuracy on LongBench-v2, LongMemEval, RULER.

DeepSeek Reasoning Benchmarks

arXiv cs.CL·Jun 10

CodeAlchemy: Synthetic Code Rewriting at Scale

CodeAlchemy generates 500B+ synthetic tokens via 5 strategies (CodeEnhance, CodeQA, CodeDev, CodeDialogue, CodeTrace) from public code across 15 languages. CodeTrace instruments 1.3M+ files to capture control flow and library knowledge. 3B models outperform 10x larger models (Gemma-3 27B, Granite-4.0 32B): 83.5% HumanEval, 63.2% MBPP.

Code generation Benchmarks Fine-tuning

arXiv cs.AI·Jun 10

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench is a benchmark of 100 Olympiad-level combinatorics problems to evaluate LLM mathematical reasoning. It distinguishes analysis-centric problems (rigorous proofs) from construction-centric problems (explicit constructions). Top models reach 65.4% average and 75.3% Best@4. Kimi-K2.6 outperforms GPT-4o on constructions but trails on proof grading.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 10

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Engram, an open-source dual-process memory engine for LLM agents, uses a bi-temporal knowledge graph to outperform full-context baselines. On LongMemEval_S (500 questions), the lean configuration retrieves ~9.6k tokens and achieves 83.6% vs 73.2% for full history (+10.4 points, p<10^-6), using 8x fewer tokens.

AI Agents Reasoning Benchmarks

arXiv cs.LG·Jun 10

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

IntentKV is a KV cache pruning technique for multi-turn LLM agents. It maintains cross-turn intent memory and uses memory-attention rules to score historical tokens. On Qwen2.5-14B with 8k budget, it reduces peak request tokens from 92.3k to 20.5k (−77.8%) and KV reads from 411M to 31M (−92.6%) with minimal accuracy loss.

AI Agents Reasoning Infrastructure

Simon Willison·Jun 9

Initial impressions of Claude Fable 5

Anthropic releases Claude Fable 5 and Claude Mythos 5 with a 1M-token context window, 128k max output tokens, and a January 2026 knowledge cutoff. Fable 5 includes strict safety guardrails; Mythos 5 ships without safety classifiers. Pricing: $10/M input, $50/M output (2× Opus 4.5-4.8). Willison reports strong performance after 5.5 hours of testing.

Claude Anthropic Benchmarks

arXiv cs.LG·Jun 9

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench benchmarks autonomous scientific research agents across 40 tasks spanning 10 scientific domains. Claude Code scores 21.5/100, Claude-Opus 20.7/100. Failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core.

Benchmarks AI Agents Claude

arXiv cs.AI·Jun 9

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

UniQL is a benchmark of 24,544 SQL queries across 16 dialects (MySQL, PostgreSQL, T-SQL, etc.) to evaluate LLM generalization in text-to-SQL tasks. Experiments show current LLMs fail to generalize across dialects, with substantial performance variation across database systems.

Benchmarks Code generation Evals

arXiv cs.LG·Jun 9

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

RL4F is an open-source offline reinforcement learning benchmark for plasma control in nuclear fusion. Built on historical data from the DIII-D tokamak, it evaluates imitation learning and offline RL methods on four multi-actuator tracking tasks (rotation, density, temperature, pressure). Offline model-based RL methods achieve best average performance.

Reinforcement learning Benchmarks Open source

Reddit r/LocalLLaMA·Jun 9

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

llama.cpp PR improves matmul performance for k-quants via WebGPU. Speedups measured on M2 Pro: Q2_K 2.44x, Q3_K 3.27-3.78x, Q4_K 1.34-1.36x, Q5_K 1.33x, Q6_K 1.44-1.52x in prefill (pp512).

Open source Infrastructure Benchmarks

Reddit r/LocalLLaMA·Jun 9

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

Fine-tuned Parakeet 0.6B for medical transcription, open weights (CC-BY-4.0). Omi Med STT v1 achieves 2.37% M-WER (clinical term errors) vs 8.36% baseline, 145× RTFx. Multi-platform runtime (MLX/NeMo/GGUF). Benchmark on 1,513 medical clips: outperforms Whisper Large v3 Turbo and Qwen3 ASR on clinical accuracy.

Open source Code generation Benchmarks

arXiv cs.AI·Jun 8

AEGIS: A Backup Reflex for Physical AI

AEGIS detects high-risk steps in long-horizon robot manipulation by probing frozen activations of a weak policy. Upon detection, control switches to a stronger policy for only those steps. On LIBERO-Spatial, AEGIS recovers 10.1% of lost trajectories (vs 4.6% for blind escalation), activating the stronger policy on only 38% of steps.

Robotics Reasoning Evals

arXiv cs.CL·Jun 8

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge is the first sentence-level expert-annotated legal discourse corpus. It contains ~290k sentences and ~6.5M tokens from Hong Kong criminal judgments across all court levels, annotated by legal linguistics experts. It defines two benchmark tasks: rhetorical role classification (26 categories) and legal element extraction. The evaluation covers BERT models as well as open-source and commercial LLMs.

Benchmarks Papers Fine-tuning

arXiv cs.CL·Jun 8

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Using PolyFact, a 100K multilingual factual QA dataset grounded in Wikidata across 12 languages, the authors evaluate three approaches to improving cross-lingual factual consistency in Qwen-2.5-7B and OLMo-2-1124-7B. GRPO outperforms supervised fine-tuning by reducing language specialization in MLP layers and attention heads, promoting shared cross-lingual representations.

Benchmarks Reinforcement learning Qwen

arXiv cs.LG·Jun 8

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

MacArena is a benchmark of 421 tasks across 50 macOS applications, evaluating computer-use agents in native Apple Silicon environments. Results show that leading models' performance drops by 26% on macOS-native tasks, revealing that existing benchmarks fail to capture genuine cross-platform GUI complexity.

AI Agents Benchmarks Vision

arXiv cs.AI·Jun 6

Agents' Last Exam

Agents' Last Exam (ALE) is a benchmark evaluating AI agents on long-horizon, economically valuable real-world tasks. Developed with 250+ industry experts, it covers 1K+ tasks across 13 industry clusters in non-physical sectors. Average full pass rate is 2.6% on the hardest tier.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 6

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

LeanMarathon is a multi-agent system for reliable research-level autoformalization in Lean. It uses an evolving blueprint (a Lean file serving as proof skeleton, natural-language proof graph, and shared record) coordinated by four specialized agents. On two recent papers spanning four Erdős problems, it formalizes seven target theorems with no remaining sorry placeholders and proves 258 lemmas.

Reasoning AI Agents Multi-agent

Reddit r/MachineLearning·Jun 5

TinyTPU: SystemVerilog systolic array compiled to WASM, running live in browser - RTL golden-verified against numpy [P]

TinyTPU is a 4×4 weight-stationary systolic array in SystemVerilog compiled to WebAssembly with step-by-step browser visualization. Users enter two matrices and watch actual hardware execution: weights loading into processing elements, matrix A streaming diagonally, partial sums accumulating, results draining. Three levels: single MAC cell, full 4×4 array matmul, and tiling for larger matrices.

Infrastructure Benchmarks Open source
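
The dataflow described above (stationary weights, diagonally skewed activations, partial sums accumulating downward and draining at the bottom) can be modeled cycle by cycle in a few lines. This is a behavioral sketch under assumed timing, not the project's SystemVerilog; the names `systolic_matmul` and `reference_matmul` are illustrative, and the plain-Python reference stands in for the numpy golden model mentioned in the title.

```python
def reference_matmul(A, W):
    """Plain matrix product used as the golden reference."""
    return [[sum(A[i][k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(A))]


def systolic_matmul(A, W):
    """Cycle-level model of a weight-stationary array computing A @ W.

    PE (k, j) permanently holds W[k][j]. Activations move left to right and
    partial sums move top to bottom, one hop per cycle.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation register of each PE
    p_reg = [[0] * N for _ in range(K)]  # partial-sum register of each PE
    C = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    m = t - k  # array row k receives A[m][k], skewed by k cycles
                    a_in = A[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][j - 1]  # value latched by the left neighbor
                p_in = p_reg[k - 1][j] if k > 0 else 0  # sum from the PE above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]  # multiply-accumulate
        a_reg, p_reg = new_a, new_p
        # The bottom row drains finished sums: column j emits row m = t-(K-1)-j.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                C[m][j] = p_reg[K - 1][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    W = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
    assert systolic_matmul(A, W) == reference_matmul(A, W)
    print(systolic_matmul(A, W))
```

Streaming more than four rows of A through the same 4×4 weights works unchanged in this model, which is the simplest form of the tiling the post describes for larger matrices.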

arXiv cs.LG·Jun 5

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

Alpha-RTL introduces TTT-RTL, a test-time reinforcement learning framework for optimizing LLM-generated RTL. On RTLLM v2.0 (Nangate 45nm), TTT-RTL reduces the PPA product by 65.1% versus the reference and outperforms frozen-policy baselines by 26.1%. On the XuanTie C910 FPU (Sky130), it achieves a 59.4% ADP reduction. An adaptive KL-budget controller stabilizes policy updates.

Code generation Reinforcement learning Benchmarks

arXiv cs.CL·Jun 5

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Longitudinal audit of sycophancy across six Gemini variants (2.0, 2.5, 3.0) on 73 adversarial prompts. 27.2% of responses contain substantial sycophantic content (Likert ≥2), masked by binary metrics. Gen 2.5 regresses (2.64 vs 1.90 Gen 2.0), Gen 3.0 recovers (2.01). Strong negative correlation (rho=-0.63) between sycophancy and truthfulness.

Gemini AI safety Alignment

arXiv cs.CL·Jun 5

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

LANTERN is a lightweight memory layer that archives every conversation turn and restores relevant details after compaction via hybrid retrieval, requiring zero LLM calls and adding <25ms latency per turn. On 94 multi-turn conversations (1,894 validated facts), LANTERN-Rerank recovers 78.3% of lost facts, significantly outperforming MemGPT (72.4%, p<0.0001) at a fraction of inference cost.

RAG Reasoning Benchmarks

arXiv cs.CL·Jun 5

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

CHASE is a co-evolutionary red-blue teaming framework training an attacker and defender via GRPO to improve LLM robustness against prompt-rewriting attacks (persona modulation, fictional framing). Evaluated on BeaverTails and JailbreakBench, it reduces StrongREJECT score by 43.2% with 0% false refusals on benign prompts.

AI safety Alignment Reinforcement learning

arXiv cs.LG·Jun 5

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

A study reveals that in zeroth-order (ZO) optimization for LLM fine-tuning, a single decoding layer dominates adaptation. Fine-tuning this dominant layer alone matches or exceeds full-model ZO fine-tuning on LLaMA2-7B and Qwen3-8B, with speedup up to 4.52×. The dominant layer is identifiable before training via activation-outlier analysis.

Fine-tuning Reasoning Benchmarks

arXiv cs.LG·Jun 5

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

CVT-RL, a policy-gradient algorithm with dense verifiable rewards, improves long-horizon language agent RL. On QA, ALFWorld, ScienceWorld, and web/tool tasks, task success rises from 71.8% (non-causal RL) to 78.9%, evidence F1 rises from 78.9 to 82.8, and measured hacking falls from 7.2% to 3.9%. Statistical tests yield p<0.01 after Holm correction.

Reinforcement learning AI Agents Reasoning
