Nvidia and Hugging Face released Nemotron-Labs, a family of diffusion-based language models that generate tokens in parallel instead of decoding left-to-right. The core claim is latency: by breaking the sequential dependency of autoregressive decoding, the approach targets what the paper calls "speed-of-light" throughput. This lands the same week the club-rdna16 community published a reproducible benchmark repo for 16 GB AMD GPUs (RX 6900 XT, RX 7800 XT) using llama.cpp/ROCm, Qwen 27B and 35B-A3B, contexts up to 131k tokens, and q8 KV cache profiles. Both signals point the same way: low-latency local inference is no longer gated on high-end Nvidia hardware, and non-autoregressive architectures are starting to have testable implementations.
On the memory-efficiency front, SM1, a Mamba1 variant with d_state=1 in pure PyTorch on Blackwell (RTX 5060 Ti), cuts scan memory by 16× versus standard Mamba1 (d_state=16) and holds a 14 KB inference state for a 130M-parameter model. The reformulation is exact (closed form, not an approximation), and the model was trained on 2.5B MIDI tokens. It is not a production model, but it demonstrates that a state space model (SSM) can be reformulated with two native PyTorch ops and no custom CUDA kernels, which lowers the entry cost for experimenting with these architectures.
On data and evaluation: an empirical RAG chunking study across three production sites (Intercom, HubSpot, KPMG) shows that corpus quality varies sharply by source: 31% HIGH/MEDIUM chunks at Intercom, 32% at HubSpot, and 8% at KPMG. It also shows that tier weighting (HIGH ×1.20) changes the top-k ranking. The proposed "yield score", a corpus quality metric computed before generation, is directly actionable. LQS v3.1 applies the same logic to training data: 19 dimensions, 7-oracle consensus with real-signal recalibration, offline-verifiable Ed25519 certificates, and 263 publicly indexed datasets. Both projects converge on the same observation: data quality remains the least instrumented lever in the AI pipeline, and open tooling is starting to close that gap.
Nvidia and Hugging Face introduce Nemotron-Labs, a family of diffusion-based language models designed to accelerate text generation. The approach generates tokens in parallel, reducing latency compared to traditional autoregressive decoding.
GitHub repo for testing local LLMs on 16 GB AMD GPUs (RX 6900 XT, RX 7800 XT, etc.). Practical benchmarks with llama.cpp/ROCm: Qwen 27B and 35B-A3B, context up to 131k tokens, q8 KV cache profiles, throughput and retrieval measurements. Includes reproducible configurations and a call for community contributions.
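To see why a q8 KV cache matters at 131k tokens on a 16 GB card, here is a rough back-of-the-envelope sketch. It assumes a plain full-attention transformer, and the layer count, KV head count, and head dimension are hypothetical placeholders, not the actual Qwen 27B or 35B-A3B configurations and not values taken from the repo. The only fixed fact used is the llama.cpp q8_0 layout: 32 int8 values plus one fp16 scale per block, i.e. 34 bytes per 32 elements.

```python
# Rough KV cache size estimate for a full-attention transformer.
# All model dimensions below are HYPOTHETICAL, chosen only for illustration.

F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Return the KV cache size in GiB for the given dimensions."""
    # Factor 2 covers both the K and the V tensor in every layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30


# Hypothetical dimensions, 131072-token context.
dims = dict(n_layers=16, n_kv_heads=4, head_dim=128, n_ctx=131072)

print(kv_cache_gib(**dims, bytes_per_elem=F16_BYTES_PER_ELEM))   # 4.0
print(kv_cache_gib(**dims, bytes_per_elem=Q8_0_BYTES_PER_ELEM))  # 2.125
```

Under these assumptions the q8_0 cache takes about 53% of the f16 footprint, which on a 16 GB card can decide whether a long context fits next to the model weights. Models that mix attention with other layer types will have a smaller cache than this estimate suggests.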
Empirical RAG study on three production websites (Intercom, HubSpot, KPMG) with tiered chunking and embeddings. Results: 31% HIGH/MEDIUM chunks for Intercom, 32% for HubSpot, 8% for KPMG. Tier weighting (HIGH ×1.20) reranks top-k. Proposed metric: a "yield score" that predicts corpus quality before generation.
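Tier weighting can be pictured as multiplying each chunk's similarity score by a per-tier factor before taking the top-k. The sketch below is a minimal illustration under assumptions: only the HIGH ×1.20 factor comes from the study summary, while the MEDIUM and LOW weights, the chunk IDs, the similarity values, and the `rerank` function itself are hypothetical.

```python
# Only HIGH = 1.20 is taken from the study summary.
# MEDIUM and LOW weights are placeholder assumptions.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 1.00}


def rerank(hits, k=3):
    """Rerank (chunk_id, similarity, tier) tuples by tier-weighted score.

    Returns the chunk ids of the top-k hits after weighting.
    """
    scored = [(cid, sim * TIER_WEIGHTS[tier]) for cid, sim, tier in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored[:k]]


# Hypothetical retrieval results: (chunk id, cosine similarity, tier).
hits = [
    ("pricing-faq", 0.82, "LOW"),
    ("api-auth", 0.74, "HIGH"),
    ("changelog", 0.80, "MEDIUM"),
]

print(rerank(hits, k=2))  # ['api-auth', 'pricing-faq']
```

Without weighting, the top-2 would be `pricing-faq` and `changelog`; the ×1.20 factor lifts the HIGH chunk from 0.74 to 0.888 and moves it to first place.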
Mamba1 variant called SM1 with d_state=1, using two native PyTorch ops to replace the selective scan. Exact closed-form solution, not an approximation. Reduces scan memory 16× versus Mamba1 (d_state=16). Inference state of 14 KB for a 130M model, O(1) per token. Trained on 163K MIDI files (2.5B tokens).
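The summary does not say which two ops SM1 uses. As an illustration of how a scalar-state recurrence h_t = a_t * h_{t-1} + b_t can be solved exactly without a sequential kernel, the sketch below uses one well-known identity: a cumulative product followed by a cumulative sum. This is an assumption about the general idea, not SM1's actual implementation, and the function names are hypothetical. It is written in plain Python so it runs without dependencies.

```python
from itertools import accumulate
import operator


def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_0 = 0, one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_closed_form(a, b):
    """Same recurrence via a cumulative product and a cumulative sum.

    With P_t = a_1 * ... * a_t, the solution is h_t = P_t * sum_{i<=t} b_i / P_i.
    """
    p = list(accumulate(a, operator.mul))                       # running product P_t
    s = list(accumulate(b_t / p_t for b_t, p_t in zip(b, p)))   # running sum of b_i / P_i
    return [p_t * s_t for p_t, s_t in zip(p, s)]


a = [0.9, 0.8, 0.95, 0.7]   # per-step decay factors (illustrative values)
b = [1.0, 0.5, -0.2, 0.3]   # per-step inputs (illustrative values)

ref = scan_sequential(a, b)
fast = scan_closed_form(a, b)
assert all(abs(x - y) < 1e-9 for x, y in zip(ref, fast))
```

In PyTorch the same identity could be written with `torch.cumprod` and `torch.cumsum`. Dividing by the running product can underflow on long sequences with small decay factors, so a practical version would likely work in log space; whether SM1 does so is not stated in the source.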
LQS v3.1 is an open-source methodology for rating AI training data quality. It uses 19 dimensions (label correctness, contamination, equity, etc.), multi-oracle consensus (7 oracles) with real-world outcome recalibration, and offline-verifiable Ed25519 certificates. Free public index with 263 scored datasets.