Reddit r/MachineLearning

Fine-tuning Reasoning Evals

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

A researcher experiments with iterative targeted SFT combined with mechanistic interpretability on a 31B model. Strategy: contrastive training on specific capability dimensions, then circuit ablation to map causal dependencies between those dimensions and use them to optimize the order of future training rounds.

Image generation Open source Tools

I deployed a GAN on a Raspberry Pi 4 and built a physical NFT minting device [P]

A 128×128 DCGAN deployed on a Raspberry Pi 4 with an ESP32 display. The model was trained for 800 epochs on an M3 (4 h) using 2,480 images and exported to ONNX (53 MB). Inference takes 3 s per face. It generates hybrid faces with randomized titles and was presented as a street art installation in NYC.

Reasoning Reinforcement learning Papers

Next-Latent Prediction Transformers [R]

Microsoft Research presents Next-Latent Prediction (NextLat), a self-supervised learning method in which transformers predict their own next latent state. This improves history compression into compact belief states and data efficiency, and accelerates inference by up to 3.3x via recursive speculative decoding.

Benchmarks Infrastructure

What is Speculative Decoding? (trending on paperswithco.de) [R]

Speculative decoding is an inference optimization in which a small, fast draft model proposes several future tokens that a larger target model then verifies in parallel. SGLang published a blog post detailing state-of-the-art latencies for LLM inference serving with Modal and Z.ai's DFlash speculative decoding models.
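
A minimal sketch of the greedy variant, assuming two toy functions that map a token prefix to a next token; `speculative_decode`, `toy_target` and `toy_draft` are hypothetical names, not SGLang's or DFlash's API. Because every accepted token must equal the target's own greedy choice, the output is identical to plain greedy decoding with the target; in real systems the speedup comes from scoring all drafted positions in one batched target forward pass, which this loop only simulates.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each drafted position. A real system scores all k
        #    positions in one batched forward pass; this loop only simulates that.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted: the verification pass yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new]


def toy_target(seq: List[int]) -> int:
    # Toy deterministic "large model" over an 11-token vocabulary.
    return (sum(seq) * 7 + 3) % 11


def toy_draft(seq: List[int]) -> int:
    # Toy "small model": agrees with the target except when len(seq) is a multiple of 3.
    return toy_target(seq) if len(seq) % 3 else (toy_target(seq) + 1) % 11


print(speculative_decode(toy_target, toy_draft, [1, 2], max_new=8))
```

Sampling-based speculative decoding replaces the exact-match test with a probabilistic accept/reject rule, so that the output distribution still matches the target model's.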

Robotics Benchmarks Evals

Mel AI just shared a demo of video-native AI characters that can talk, react, and respond to camera context in real time [N]

Mel AI demonstrates video-native AI characters that talk, lip-sync, show facial reactions, and respond in real time to camera context. The system detects the user's environment and adapts its responses accordingly. This approach moves beyond text-based products such as Character AI (founded by former Google/LaMDA developers).

AI Agents Vision Voice

Reddit r/MachineLearning·Jun 16

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

A developer built a leakage-clean verifier for robot manipulation that compiles human demos into object-centric graphs and validates rollouts independently, preventing information leakage. They ask whether this addresses a real gap in VLA training or solves a non-problem, given that success metrics are already task-specific.

Reddit r/MachineLearning·Jun 16

My offline ablation said -0.19pp. The production retrain said +1.11pp. [D]

An ML engineer reports that offline ablations (retraining with and without each change) contradicted production results across four changes: the Best Offer feature (+0.12pp offline → -0.19pp in production), an auction data backfill (+0.37pp in production), outlier trimming (-0.19pp offline → +1.11pp in production), and a CatBoost encoder. Root causes: train/serve skew, unmeasured distribution shift, training population drift, and baseline instability.

Evals Benchmarks

Reddit r/MachineLearning·Jun 16

quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

quicktok is a BPE tokenizer written in C++ producing byte-identical tokens to tiktoken. Encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken itself. Supports cl100k, o200k, GPT-OSS, Llama-3, Qwen2.5/3. Optimizations: 2-byte trie, dense caches, hand-compiled pretokenizer.

Code generation Tools Open source
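
For context on what "byte-identical with tiktoken" has to reproduce, here is a minimal, deliberately slow sketch of rank-based BPE on a single pretokenized piece: repeatedly merge the adjacent pair whose concatenation has the lowest merge rank. The toy rank table is hypothetical; in tiktoken's encodings a byte sequence's merge rank also serves as its token id, and the sketch follows that convention. quicktok's trie, caches and pretokenizer are optimizations this sketch does not model.

```python
from typing import Dict, List

# Hypothetical toy merge table: byte sequence -> rank (lower merges first).
TOY_RANKS: Dict[bytes, int] = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}


def bpe_encode(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Encode one pretokenized piece by repeatedly merging the lowest-rank adjacent pair."""
    parts = [piece[i:i + 1] for i in range(len(piece))]  # start from single bytes
    while len(parts) > 1:
        best_rank, best_i = None, -1
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i < 0:
            break  # no adjacent pair forms a known token
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    # Each remaining part is a vocabulary entry; its rank is used as the token id.
    return [ranks[p] for p in parts]


print(bpe_encode(b"abcab", TOY_RANKS))  # [4, 3]
```

This naive loop rescans every pair after each merge, which is quadratic in the piece length; fast implementations exist precisely to avoid that rescan while producing the same output.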

Papers Reasoning Reinforcement learning

How the brains learn [R]

A research paper presents a unified framework for neocortical learning based on error-driven predictive learning via temporal derivatives. It is implemented in the Axon neural simulation framework using spiking neurons and tested on cognitively motivated tasks. The authors propose this mechanism as a potential alternative to backpropagation with improved training efficiency.

Qwen Fine-tuning Code generation

Cleo: trying to fit full analyst behavior in a 2B model [P]

Cleo is a Qwen 2B-Base fine-tune designed for text-to-SQL tasks. The project integrates training, evaluation, and inference in a unified system with a SQL safety layer, dialect handling, and clarification behavior. Code, model, and datasets are fully open-source.
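
The post does not describe how Cleo's SQL safety layer works. As a rough illustration of the idea only, a crude read-only gate might accept a single SELECT or WITH statement and reject write or DDL keywords; `is_safe_readonly` and its keyword list are hypothetical, and a real guard should rely on a proper SQL parser and database-level permissions.

```python
import re

# Hypothetical deny-list of write/DDL keywords; not Cleo's actual rule set.
WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|attach)\b", re.IGNORECASE)


def is_safe_readonly(sql: str) -> bool:
    """Crude read-only gate: a single SELECT/WITH statement with no write or DDL keywords."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        return False  # reject stacked statements such as "SELECT 1; DROP TABLE t"
    if not re.match(r"(select|with)\b", stmt, re.IGNORECASE):
        return False
    return WRITE_KEYWORDS.search(stmt) is None


print(is_safe_readonly("SELECT name FROM users WHERE created_at > '2024-01-01';"))  # True
print(is_safe_readonly("SELECT 1; DELETE FROM users"))  # False
```

The regex approach also rejects some harmless queries (for example a string literal containing "delete"), so it works as a first filter rather than a guarantee.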

Open source Reinforcement learning Code generation

Open weights are not enough: we need open training frameworks for research and better algorithms [P]

FeynRL, an open-source framework for RL post-training of LLMs and agents, aims to make training transparent and modifiable. The author argues that open weights alone are insufficient and that research needs explicit training codebases separating algorithms from systems. The framework supports SFT and DPO, on multi-GPU and cluster setups.
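
Since DPO is one of the supported objectives, here is the standard per-pair DPO loss as a plain function, a sketch assuming the caller already has summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The function name and argument layout are illustrative, not FeynRL's API.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # same for the rejected response
    logits = beta * (chosen_ratio - rejected_ratio)
    return _softplus(-logits)  # -log(sigmoid(z)) == softplus(-z)


# When the policy equals the reference, the loss is log(2), about 0.693.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

The loss falls below log(2) once the policy raises the chosen response's likelihood relative to the reference more than the rejected one's; `beta` controls how sharply that margin is rewarded.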

AI language models have favorite names, and we mapped them [R]

Language models exhibit model-specific biases toward particular character names. Claude, for example, frequently generates Elena Vasquez and Marcus Chen together, forming correlated ensembles that appear across dozens of websites. A preprint (arXiv:2606.02184) documents the finding, which emerged while the authors were developing a model-diffing method (CDD).

Claude Papers Evals

Concept-Vector: A design framework for human-interpretable word embeddings [P]

Concept-Vector presents a design framework for distilling word embeddings into human-interpretable concept-vectors, where each component tracks a semantic, syntactic, or statistical aspect and carries a human-readable label. It is a data design project without empirical model validation, shared for critical feedback.

Embeddings Papers

Open source Tools Fine-tuning

I implemented 10 core ML algorithms from scratch with NumPy. Here's what no tutorial taught me [P]

Implementation of 10 classical ML algorithms (regression, KNN, decision trees, XGBoost, neural networks) in pure NumPy, validated against scikit-learn and PyTorch. The open-source repo contains Jupyter notebooks runnable locally or on Colab. The author stresses the importance of a modular structure and of understanding gradient descent.
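
As a small illustration of the gradient-descent point, here is batch gradient descent for one-feature linear regression on mean squared error. It is written in plain Python to stay dependency-free, whereas the repo uses NumPy, where the list comprehensions would become array operations; the learning rate and epoch count are arbitrary choices for the toy data.

```python
from typing import List, Tuple


def fit_linear_gd(xs: List[float], ys: List[float], lr: float = 0.05,
                  epochs: int = 2000) -> Tuple[float, float]:
    """Fit y ≈ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errs = [w * x + b - y for x, y in zip(xs, ys)]            # residuals
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))  # dMSE/dw
        grad_b = 2.0 / n * sum(errs)                             # dMSE/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_linear_gd([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])  # data from y = 2x + 1
print(round(w, 4), round(b, 4))  # 2.0 1.0
```

With too large a learning rate (here, above roughly 0.15 for this data) the updates overshoot and diverge, which is the kind of behavior that only becomes tangible when the loop is written by hand.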

AI safety Alignment Evals

PrintGuard 2.0 — ShuffleNetV2 + few-shot prototypical network, TFLite via LiteRT, ≈5 MB, runs unmodified in the browser (Pyodide) and on CPython [P]

PrintGuard 2.0 is an FDM 3D-printing failure detector built on ShuffleNetV2 plus a few-shot prototypical network. The ~5 MB TFLite model runs via LiteRT, unmodified, on CPython and in the browser (Pyodide). The architecture is unified, with a single Platform implementation per runtime.

Open source
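
The few-shot part can be sketched as nearest-prototype classification: each class prototype is the mean embedding of its support examples, and a query takes the label of the closest prototype under squared Euclidean distance. In PrintGuard the embeddings would come from ShuffleNetV2; here they are hand-written 2-D vectors, and the labels "ok" and "failure" are hypothetical.

```python
from typing import Dict, List, Sequence


def prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Class prototype = element-wise mean of that class's support embeddings."""
    protos = {}
    for label, embs in support.items():
        dim = len(embs[0])
        protos[label] = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return protos


def classify(query: Sequence[float], protos: Dict[str, List[float]]) -> str:
    """Return the label of the nearest prototype under squared Euclidean distance."""
    def sq_dist(label: str) -> float:
        return sum((q - p) ** 2 for q, p in zip(query, protos[label]))
    return min(protos, key=sq_dist)


# Hypothetical few-shot support set: two labelled embeddings per class.
protos = prototypes({"ok": [[0.0, 0.0], [0.0, 1.0]], "failure": [[5.0, 5.0], [6.0, 5.0]]})
print(classify([0.2, 0.3], protos))  # ok
```

Adding a new failure class only requires a few labelled embeddings to compute its prototype, with no retraining of the backbone, which is what makes the approach attractive for a small on-device model.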

Reddit r/MachineLearning·Jun 14

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

An independent researcher shows that coherent context can shift LLMs into a different internal regime before the final output, bypassing current safety mechanisms (RLHF, output classifiers) while outward behavior stays normal. The work analyzes hidden states and residual-stream trajectories in Gemma-3-12B-IT.

Reddit r/MachineLearning·Jun 14

Help me test: do modern retrieval systems mostly retrieve consensus rather than truth? [D]

A researcher proposes LOGOS-SIE, a synthetic dataset of 500k observations/beliefs across 5k facts and 100 sources, to test whether modern retrieval systems recover consensus rather than truth. Hypothesis: BM25, dense retrieval, and rerankers favor dominant patterns even when 90% of sources are false.

RAG Evals Benchmarks

Reddit r/MachineLearning·Jun 14

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

A paper presented at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The authors distinguish safe success, unsafe success, and failure, and show that verification reduces unsafe success but also lowers task completion as the horizon grows (the "Verifier Tax"). The two-tier architecture runs deterministic policy checks followed by an LLM-based verifier.

AI Agents AI safety Evals
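
A minimal sketch of the two-tier gate, plus a toy model of why verification can cost completion on long horizons. The deny-list rules, the stubbed verifier and the independence assumption in `completion_prob` are all illustrative assumptions, not the paper's formulation: if each legitimate step is wrongly rejected with probability r, an H-step task clears verification with probability (1 - r)^H, which shrinks as H grows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str] = field(default_factory=dict)


# Hypothetical tier-1 rules.
BLOCKED_TOOLS = {"delete_database"}
PROTECTED_PREFIX = "/etc"


def policy_check(call: ToolCall) -> bool:
    """Tier 1: cheap, deterministic rules."""
    if call.tool in BLOCKED_TOOLS:
        return False
    return not call.args.get("path", "").startswith(PROTECTED_PREFIX)


def gate(call: ToolCall, llm_verifier: Callable[[ToolCall], bool]) -> bool:
    """Tier 2 (the LLM verifier) is only consulted if tier 1 passes."""
    return policy_check(call) and llm_verifier(call)


def completion_prob(false_reject_rate: float, horizon: int) -> float:
    """Toy model: each legitimate step is independently rejected with probability r."""
    return (1.0 - false_reject_rate) ** horizon


def always_approve(call: ToolCall) -> bool:
    # Stand-in for an LLM-based verifier.
    return True


print(gate(ToolCall("read_file", {"path": "/tmp/notes.txt"}), always_approve))  # True
print(round(completion_prob(0.02, 10), 2), round(completion_prob(0.02, 50), 2))  # 0.82 0.36
```

Under this toy model, a verifier that wrongly blocks only 2% of legitimate steps still cuts survival from about 0.82 at 10 steps to about 0.36 at 50 steps.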

Tools Open source Fine-tuning

I’m building a free bilingual machine-learning notebook course — looking for feedback on structure and coverage [R]

A developer is building an open-source, bilingual (English/Persian) ML course in Jupyter notebooks, covering fundamentals, preprocessing, regression, classification, trees, clustering, time series, and MLOps. They are seeking feedback on chapter order, missing classical ML topics, and how useful bilingual notebooks are for non-native learners.

Unprofessional Coauthor Behavior with Hallucinated References [D]

A researcher reports that a coauthor added LLM-hallucinated references to a paper at the last minute. Despite the coauthor's assurances, all of the new references contained errors. The paper was withdrawn after reviewers spotted them, damaging all authors' reputations.

AI safety Alignment

Code generation Open source Tools

PaddleOCR (v3/v4/v5/v6) implemented in C++ with ncnn [P]

C++ implementation of PaddleOCR (v3 through v6) using ncnn for inference. Replaces the complex official Paddle runtime with ncnn, which is lighter and faster. Code available on GitHub.

Open source Code generation Tools

Derivative-Free Neural Network Optimization: MNIST Case [R]

Derivative-free optimization of a neural network on MNIST with a 784-32-10 architecture (25,450 parameters). The method, MDP, achieves 93.7% validation and 93.4% test accuracy, outperforming Adam (91.8%/91.7%). It converges over 1M function evaluations without gradients or population-based methods.

Benchmarks Open source
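
The stated parameter count checks out, assuming one bias per unit in each dense layer: 784·32 + 32 weights and biases into the hidden layer, and 32·10 + 10 into the output layer.

```python
from typing import List


def mlp_param_count(layer_sizes: List[int]) -> int:
    """Weights plus biases of a fully connected network, e.g. [784, 32, 10]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# (784*32 + 32) + (32*10 + 10) = 25,120 + 330 = 25,450
print(mlp_param_count([784, 32, 10]))  # 25450
```

Keeping the search space this small (about 25k dimensions) is what makes a gradient-free method over 1M function evaluations feasible at all.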

Reddit r/MachineLearning·Jun 12

hubert.cpp, a C++ implementation of distilHuBERT [P]

A C++ implementation of distilHuBERT with no runtime dependencies. The weights are compiled into the library, dynamic sizes are supported, and performance is on par with onnxruntime. It integrates easily via CMake.

Reddit r/MachineLearning·Jun 11

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

An adaptive video tokenisation method that exploits temporal redundancy in a frozen tokeniser's latent space by applying a fixed threshold to per-position temporal L1 differences. A Latent Inpainting Transformer (LIT) reconstructs the dropped positions. The pipeline (a single encoder pass plus one LIT pass) is 31× faster than ElasticTok-CV and 2× faster than InfoTok on the TokenBench and DAVIS benchmarks.

Video generation Benchmarks Papers
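
A minimal sketch of the masking step as described: compute the per-position L1 difference between consecutive frames' latents and drop positions at or below a fixed threshold, to be filled in later by the inpainting model. The nested-list layout `latents[t][p]` and the choice to compare against the immediately preceding frame rather than the last kept one are assumptions; the paper may handle slow drift differently.

```python
from typing import List, Sequence

Latents = Sequence[Sequence[Sequence[float]]]  # latents[t][p] -> latent vector


def redundancy_mask(latents: Latents, threshold: float) -> List[List[bool]]:
    """keep[t][p] is True if position p of frame t changed enough to be encoded,
    False if it can be dropped and later inpainted."""
    keep = [[True] * len(latents[0])]  # the first frame is always kept
    for t in range(1, len(latents)):
        row = []
        for p, vec in enumerate(latents[t]):
            # Per-position temporal L1 difference to the previous frame.
            l1 = sum(abs(a - b) for a, b in zip(vec, latents[t - 1][p]))
            row.append(l1 > threshold)
        keep.append(row)
    return keep


frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.01], [3.0, 1.0]],  # position 0 barely moves, position 1 changes
    [[0.0, 0.01], [3.0, 1.5]],
]
print(redundancy_mask(frames, threshold=0.1))  # [[True, True], [False, True], [False, True]]
```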

Reddit r/MachineLearning·Jun 11

Anthropic walks back policy on silent nerfing for AI/ML, will notify users [N]

Anthropic reverses its policy of silently nerfing Claude on AI/ML research. The company will now notify users when it refuses requests or redirects them to less capable models for frontier AI development tasks.

Claude Anthropic AI safety

Fine-tuning Open source Tools

Pyrecall open source tool for detecting catastrophic forgetting during LLM fine-tuning [P]

Pyrecall is an open-source tool (MIT, v0.1.0) for detecting catastrophic forgetting during LLM fine-tuning. It snapshots skill scores before and after fine-tuning, flags regressions, and can roll back LoRA adapters by name. It runs fully locally, with no external APIs.
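
The before/after comparison can be sketched as a thresholded diff over per-skill scores. `flag_regressions`, the skill names, the scores and the 0.02 tolerance are hypothetical and not Pyrecall's API.

```python
from typing import Dict, List


def flag_regressions(before: Dict[str, float], after: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Skills whose score dropped by more than `tolerance` between the two snapshots."""
    return sorted(skill for skill, old in before.items()
                  if skill in after and old - after[skill] > tolerance)


before = {"math": 0.71, "code": 0.64, "trivia": 0.80}  # hypothetical scores pre-fine-tune
after = {"math": 0.70, "code": 0.52, "trivia": 0.81}   # hypothetical scores post-fine-tune
print(flag_regressions(before, after))  # ['code']
```

A tolerance is needed because eval scores are noisy; without it, small random dips like the 0.01 on "math" would trigger spurious rollbacks.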

Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

An experiment on 120 tasks tests whether weaker models match frontier models on high-verifiability tasks (per Karpathy's framework), comparing Claude Sonnet 4.6, GPT 5.5, and Mistral 3 8B. Code and structured extraction: gaps narrow with retries (Mistral rises from 87% to 95% on code). Multi-hop reasoning: a real capability gap remains (Sonnet 78%, Mistral 51%). Creative summarization: stronger models show the expected advantage.

Claude GPT Mistral

Claude Anthropic AI safety

Anthropic's new model Fable will silently handicap work on LLMs [D]

Anthropic embeds invisible limitations in Claude to slow competing model development, using prompt modification, steering vectors, and parameter-efficient fine-tuning. These safeguards target ~0.03% of traffic. Users report refusals triggered by common scientific terms ("nuclear"), raising concerns about false positives on legitimate ML work.

Benchmarks Evals Open source

Introducing Papers Without Code [P]

Hugging Face relaunches paperswithcode.co to track the AI state of the art by automatically parsing arXiv and HF papers. It offers interactive leaderboards that include closed-source models (GPT-5.5, Mythos 5), with a toggle to show only open-source evals.

I Built Paper Deck: A Better Way to Discover AI/ML Papers [P]

Paper Deck aggregates ML/AI papers from arXiv, Hugging Face and other sources into a single platform. Enables reading, bookmarking, and cross-device reading progress tracking. Free and open source.

Papers Tools Open source

RFE‑Core2 — Current Understanding (June 9th 2026) [R]

RFE-Core2: a complete analysis of bottlenecks after the full probe arc (June 2026). The generator is the dominant bottleneck (effective rank ~1.6–3 at dim 512, collinearity 0.85–0.96). The reflective loop reconstitutes toward the anchor regardless of rank. Fix 2 stays dormant on real tokens (+0.024 migration). Proposed solution: train the generator so that regime differences live in high-energy, separable directions.

Reasoning Evals Papers

Multi-agent AI Agents Infrastructure

Phinite — multi-agent OS with first-class agent identity, composable skills, behavioral evaluation [P]

Phinite launches infrastructure for multi-agent systems with first-class agent identity, versioned composable skills, and behavioral evaluation. Offers agent registry, compound reliability scoring, cloud-agnostic deployment with observability and cost attribution. SOC 2 Type II certified.

iOS 27 Siri is using WaveRNN and FastSpeech2 [D]

According to model files in Espresso format found in the iOS Simulator, iOS 27 uses WaveRNN and FastSpeech2 for Siri's text-to-speech. A CoreML logistic regression model for content ranking is also present.

Voice Tools

AI safety Alignment Papers

AI Epistemic Risks: Emerging Mechanisms & Evidence [R]

A paper co-authored by 30 experts examines epistemic risks from AI: persuasion and manipulation, cognitive offloading, and feedback loops that narrow the epistemic space. The authors propose directions for improving this trajectory through system design, human-AI interaction, institutional adaptation, and information market incentives.

Voice Benchmarks Open source

What will be the next breakthrough in ASR? [D]

ASR progress is currently driven by supervised learning: Whisper-large-v3 (5M hours) and Nvidia Parakeet v3 (660k hours) lead. Newer architectures (Transducers, Token-and-Duration Transducers, Qwen's attention encoder-decoder) are replacing CTC on top of self-supervised encoders. The question: will self-supervised methods (Data2Vec 2.0, WavLM) disappear from ASR, or will speech get its own 'DINO moment'?
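
For reference on the CTC side of that comparison, best-path (greedy) CTC decoding takes the argmax label per frame, collapses consecutive repeats, and then removes blanks. The sketch assumes the blank symbol has index 0 and that per-frame scores are plain Python lists.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Best-path CTC decoding: argmax per frame, collapse repeats, then drop blanks."""
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    out: List[int] = []
    prev = None
    for label in best:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


# Per-frame argmax is 1 1 _ 1 2 2 _ (with _ = blank), which decodes to [1, 1, 2]:
# the blank between the two 1s keeps them from being merged.
scores = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.6, 0.3],
          [0.1, 0.2, 0.7], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

Each frame is decided independently here, which reflects CTC's conditional-independence assumption; transducers instead condition each emission on previously emitted labels.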

Are privacy-preserving techniques actually being used in production ML systems? [D]

A Reddit discussion on the real-world adoption of privacy-preserving ML techniques (differential privacy, federated learning, on-device inference) in production systems. The poster notes an active research literature but questions how much is actually deployed in industry, and explores engineering challenges, performance and cost impact, and use cases.

AI safety

AI Agents Multi-agent Open source

I'd like to share an updated methodology for building agents. [P]

Spice is an open-source decision layer above AI agents. It observes context, detects conflicts, simulates options, and dispatches tasks to appropriate agents through a loop: perception → state model → simulation → decision → execution → reflection.

Code generation Prompt engineering Open source

Levi: Run AlphaEvolve on your Claude Code/Codex for dirt cheap [P]

LEVI is an open-source AlphaEvolve-like system for code and prompt optimization that is 35x cheaper than existing frameworks. It relies on smaller models (Qwen-30B), a smart search architecture, and adaptive routing between small and large models to reduce expensive Claude Opus calls.

Prompt engineering GPT Claude

LLM Relational Intelligence: A 4-Month Research Experiment on Multi-Model Behavioral Alignment with Human Communication [R]

A 4-month experiment tests whether context windows can be engineered so that frontier models (GPT, Claude, Gemini, Grok) interact in a way indistinguishable from human-to-human interaction. Gemini demonstrates the highest relational intelligence. The author treats the context window as a behavioral environment rather than a query interface, shaping it through modeling, accountability, humor, and social correction.

Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D]

An agent developer replaced semantic embeddings with BM25 for tool selection. With 140 MCP tools in production, cosine similarity over short tool descriptions (<50 tokens) reached only 64% accuracy, because the key discriminators (specific nouns) were diluted in embedding space. BM25 over a flat-text projection of the tools achieves 81% top-1 accuracy.

AI Agents MCP RAG
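
A self-contained sketch of Okapi BM25 ranking over tool descriptions, using the common k1 = 1.5, b = 0.75 defaults and a Lucene-style smoothed idf. The "flat-text projection" is assumed here to mean each tool's name and description flattened into one string, and the tool names are hypothetical. Exact-token matching lets a rare, specific noun such as "stripe" carry a large share of the score instead of being diluted.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return document ids sorted by Okapi BM25 score for `query`, best first."""
    toks = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(toks)
    avgdl = sum(len(t) for t in toks.values()) / n
    df = Counter(term for t in toks.values() for term in set(t))  # document frequency
    scores: Dict[str, float] = {}
    for doc_id, t in toks.items():
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(t) / avgdl))
        scores[doc_id] = score
    return sorted(scores, key=scores.__getitem__, reverse=True)


# Hypothetical tools, each flattened to "name + description" text.
tools = {
    "create_invoice": "create invoice: Create a new invoice for a customer in Stripe",
    "list_invoices": "list invoices: List invoices for a customer",
    "create_ticket": "create ticket: Create a support ticket in Zendesk",
}
print(bm25_rank("create a stripe invoice", tools)[0])  # create_invoice
```

This tokenizer does no stemming, so "invoice" and "invoices" do not match; whether to add stemming is a trade-off between recall and keeping discriminative nouns exact.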

Image generation Benchmarks Open source

Open image generation models are closer to closed-source quality than this sub thinks [D]

A researcher benchmarking open-source image generation models finds that the gap with closed-source APIs is much smaller than commonly assumed. The latest checkpoints handle multi-object scenes and text rendering (70–80% success rate) comparably to paid endpoints, with inference taking 2 minutes per 2 MP image on a consumer GPU.
