June 2026

2731 articles

GLM-5.2 is now 1st on Design Arena — ahead of the now unavailable Claude Fable 5.

GLM-5.2 reaches 1st place on Design Arena benchmark, surpassing the now-unavailable Claude Fable 5. Zhipu AI's model leads the design evaluation leaderboard.

Benchmarks Qwen

SIG

HYP

Google DeepMind·Jun 16

Unlocking UK house-building with AI-accelerated planning

Google DeepMind partners with UK government on an AI-powered prototype to accelerate housing planning decisions.

DeepMind Tools

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Minimax M3 (4 bit MLX) Initial Benchmark on Mac Studio M3u 512gb

Minimax M3 4-bit MLX benchmark on a Mac Studio M3 Ultra with 512GB. Results: TTFT 3.1s (pp1024/tg128), throughput 147.7 tok/s, peak memory 226.6GB. Continuous batching: 1.83x speedup at 4 parallel requests (49.9 tok/s).

Benchmarks Open source Infrastructure

SIG

HYP

Hacker News (AI)·Jun 16

GitHub Models is no longer available to new customers

GitHub Models, GitHub's service for accessing AI models, is no longer available to new customers. The platform has closed to new signups.

Business

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

GLM-5.2 released with open weights under MIT license. 1M context window, two reasoning effort modes, strong coding arena performance. Open-source model unlike API-only alternatives.

Qwen Open source Code generation

SIG

HYP

Hacker News (AI)·Jun 16

Lexar Wants to Offload Local AI Models to SSD Amid the RAMpocalypse

Lexar proposes storing local AI models on SSD instead of RAM to bypass memory constraints. The strategy aims to reduce hardware costs and enable AI inference on devices with limited RAM.

Infrastructure Tools

SIG

HYP

Hacker News (AI)·Jun 16

DeepSeek V4 Pro at 5% the cost of Claude – what it takes to close the gap

DeepSeek V4 Pro delivers Claude-comparable performance at 5% of the cost. The article examines the technological and economic gaps between the models but gives no precise benchmark figures or exact pricing.

DeepSeek Claude Benchmarks

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

GLM 5.2 API is live, weights are on HF, and ollama has it already

GLM-5.2 API live at $1.4/M input tokens, $4.4/M output. Weights released MIT-licensed on HuggingFace, Ollama support available. Benchmarks: 81.0 Terminal-Bench 2.1, 62.1 SWE-bench Pro, 74.4 FrontierSWE. 1M context window, two thinking modes (High/Max).

Open source Code generation Benchmarks

SIG

HYP

The Decoder·Jun 16

Microsoft's Copilot Cowork moves to usage-based billing and may tap DeepSeek

Microsoft is considering a fine-tuned version of DeepSeek V4 as a cheaper model option for Copilot Cowork. The company is also switching to usage-based billing, with Copilot head Charles Lamanna stating flat-rate pricing is unsustainable.

DeepSeek Business AI Agents

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Get in here: Community model build thread

A Reddit thread proposes building a community model through distributed compute using a Mixture-of-Experts (MoE) approach. The 'Branch-Train-Stitch' strategy distributes a dense prototype model to participants who train it independently on their hardware, then merge the submodels into an MoE. Key decisions include prototype size (2B or 7B) based on available VRAM.

Open source Fine-tuning

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

GLM-5.2 is the first open-weights model to cross 80% on Terminal-Bench and beats every other open model available

GLM-5.2 becomes the first open-weights model to exceed 80% on Terminal-Bench, outperforming all other open models and Gemini. Frontier-level performance at reduced cost.

Qwen Benchmarks Open source

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

GLM-5.2 Takes #2 Spot on WebDev Arena

GLM-5.2 reaches the #2 position on the WebDev Arena leaderboard. Zhipu AI's model ranks highly against major competitors.

Qwen Benchmarks

SIG

HYP

Le Big Data·Jun 16

Eno: Genesis AI's new robot would rather be useful than pretty

Genesis AI introduces Eno, a humanoid robot designed to perform complex tasks, with functional utility prioritized over aesthetics.

Robotics

SIG

HYP

The Decoder·Jun 16

Berlin court rules Google's AI Overviews are just a new search format, not original content

A Berlin court ruled that Google's AI-generated summaries are a 'new search result format' with no decisive Google influence over content. A perfume company sued because AI Overviews displayed its brands alongside counterfeit products. The ruling partly contradicts a Munich decision holding Google liable for false AI responses.

Regulation DeepMind

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

GLM-5.2 is available on HuggingChat

GLM-5.2, Zhipu AI's model, is now available on HuggingChat. No technical details provided in the announcement.

Qwen

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

A benchmark for tiny LLMs based on a real world problem: natural language file search (using monkeSearch)

Benchmark for small LLMs (<3B parameters) evaluating natural language parsing into structured JSON for file search. 9 models tested (Gemma-3 270M to DeepSeek R1 Distill 1.5B) on 80 queries covering file types, temporal context, and specificity. Results: 0.8B–1.5B models significantly outperform sub-0.5B.

Benchmarks Open source Code generation

SIG

HYP

Hacker News (AI)·Jun 16

GPT‑NL: a sovereign language model for the Netherlands

GPT-NL is a sovereign language model trained for Dutch, developed in the Netherlands. The project aims to reduce dependence on American models and preserve linguistic technological independence.

Open source Llama

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Mistral - New family of open-weight models @ July

Mistral announces a new family of open-weight models coming in July. A tweet from CEO Arthur Mensch confirms the release; the excerpt gives no further technical details.

Mistral Open source

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Glimmer 1 - Glint Research. A foundational 10,000 parameter language model

Glint Research introduces Glimmer 1, a foundational 10k parameter language model trained on 500K tokens of FineWeb-Edu. Standard Llama architecture with 16 hidden dims, 2 layers, 4 attention heads, 512 token context window. Benchmarks: arc_easy 25.46%, wikitext-2 byte perplexity 14.73.

Llama Open source Benchmarks

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

zai-org/GLM-5.2 is here!

GLM-5.2 is now available. The zai-org model improves reasoning and comprehension capabilities compared to previous versions.

Open source

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

bartowski/command-a-plus-05-2026-GGUF · Hugging Face

GGUF version of Command-A-Plus-05-2026 model released on Hugging Face. Author invites users to test with latest llama.cpp and share token/second benchmarks and feedback.

Open source Tools Benchmarks

SIG

HYP

Hacker News (AI)·Jun 16

Claude: Elevated errors across many models

Anthropic reports elevated errors affecting multiple Claude model versions. Users report malfunctions on the platform. No technical details provided in headline.

Claude Anthropic

SIG

HYP

Simon Willison·Jun 16

datasette-tailscale 0.1a0

Release of datasette-tailscale 0.1a0, experimental alpha plugin enabling Datasette server deployment via Tailscale. Uses Python bindings for the tailscale-rs library to connect a local instance to a Tailnet.

Tools Open source Infrastructure

SIG

HYP

Hacker News (AI)·Jun 16

GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

GateGPT achieves 56k tokens/sec on FPGA at 80 MHz by optimizing Transformer KV cache. Hardware acceleration demonstration for inference.

Infrastructure Benchmarks

SIG

HYP

Reddit r/MachineLearning·Jun 16

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

Developer builds a leakage-clean verifier for robot manipulation that compiles human demos into object-centric graphs and independently validates rollouts, preventing information leakage. Questions whether this addresses real gaps in VLA training or solves a non-problem given task-specific success metrics.

Robotics Benchmarks Evals

SIG

HYP

Simon Willison·Jun 16

Quoting Georgi Gerganov

Georgi Gerganov (llama.cpp creator) uses Qwen3.6-27B daily for coding tasks on M2 Ultra and RTX 5090. He integrates it via a lightweight agent (pi) with custom system prompt for ggml-org maintenance assistance.

Qwen Code generation AI Agents

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

[Article] The Case For Open-Weight Models And Why We Can't Trust Frontier Labs | provos.org

Article arguing for open-weight models against frontier labs. Criticizes power concentration among few companies and advocates for accessibility and transparency of AI model weights.

Open source Llama Alignment

SIG

HYP

The Decoder·Jun 16

SpaceX bets $60 billion on Cursor to catch OpenAI and Anthropic

SpaceX acquires Anysphere (creator of Cursor) for $60 billion, two days after its IPO. Goal: strengthen xAI to catch up with Anthropic and OpenAI in the AI model race.

Code generation Business OpenAI

SIG

HYP

Le Big Data·Jun 16

The end of quick answers? This deep research agent takes 8 hours to respond

Sakana AI launches Marlin, a deep research agent generating strategic reports exceeding 100 pages. The system takes 8 hours to produce detailed analyses, shifting the paradigm from speed to depth.

AI Agents Reasoning

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Anthropic going back on `claude -p` 3rd party usage

Anthropic reverses its ban on third-party wrappers around `claude -p`. The community suspects a PR move rather than a lasting policy shift, distinct from the previous OpenClaw and Hermes bans.

Claude Open source

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Scaling former VibeThinker-1.5B to 3B — now it reaches frontier math & coding performance

VibeThinker-3B achieves 94.3 on AIME'26, 80.2 on LiveCodeBench v6, and 96.1% pass rate on unseen LeetCode contests. The model demonstrates small models can reach frontier-level reasoning performance in math and coding through clear verification signals.

Reasoning Benchmarks Code generation

SIG

HYP

Le Big Data·Jun 16

Salesforce acquires Fin to strengthen its enterprise AI offering

Salesforce acquires Fin for $3.6 billion to strengthen its enterprise AI strategy. The acquisition aims to accelerate the development of generative AI capabilities integrated into its platform.

Business AI Agents

SIG

HYP

Interconnects (Nathan Lambert)·Jun 16

Frontier post-training recipe review with Finbarr Timbers

Interview with Finbarr Timbers on frontier model post-training recipes. Discussion of optimization techniques and current approaches to improve large language model performance.

Reasoning Reinforcement learning

SIG

HYP

The Decoder·Jun 16

DOJ invokes national security to defend xAI's unpermitted gas turbines in NAACP lawsuit

US Justice Department invokes national security to defend xAI's unpermitted gas turbines in NAACP lawsuit, claiming Grok chatbot is essential to military operations.

Regulation AI safety Business

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Qwen Robot Suite

Alibaba announces Qwen Robot Suite, a robotics software suite based on Qwen models. Technical details and capabilities not specified in excerpt.

Qwen Robotics

SIG

HYP

Le Big Data·Jun 16

Google Cloud backs Ineffable Intelligence's superintelligence ambition

Ineffable Intelligence raises $1.1 billion and partners with Google Cloud to pursue superintelligence ambitions. The partnership provides cloud infrastructure for large-scale model training.

DeepMind Funding Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Why might DiffusionGemma be better at tool calls than its benchmark quality suggests

DiffusionGemma generates 256 tokens in parallel with bidirectional attention, enabling self-correction before finalization. Unlike autoregressive models locked after each token, this architecture could improve structured tool calls despite lower base quality than Gemma 4. Testing needed to confirm if bidirectional correction compensates for lower quality.

Gemini Code generation Reasoning

SIG

HYP

GitHub Trending·Jun 16

tracel-ai / burn

Burn is a next generation tensor library and deep learning framework prioritizing flexibility, efficiency, and portability.

Open source Infrastructure

SIG

HYP

GitHub Trending·Jun 16

homarr-labs / homarr

Homarr is a modern dashboard with 40+ integrations, 20K+ built-in icons, native authentication, and drag-and-drop configuration without YAML.

Tools Open source

SIG

HYP

GitHub Trending·Jun 16

ParthJadhav / app-store-screenshots

Open-source tool for automated app store screenshot generation using AI. Automates visual marketing asset creation for mobile applications.

Image generation Tools Open source

SIG

HYP

GitHub Trending·Jun 16

nocobase / nocobase

NocoBase is an open-source AI + no-code platform for building business systems fast. AI works on production-proven infrastructure with WYSIWYG interface, combining speed and reliability.

Open source Business

SIG

HYP

GitHub Trending·Jun 16

Egonex-AI / Understand-Anything

Tool converting code into interactive, explorable knowledge graphs with search and Q&A capabilities. Works with Claude Code, Cursor, Copilot, Gemini CLI, and more.

Code generation Tools Claude Code

SIG

HYP

GitHub Trending·Jun 16

microsoft / fara

Microsoft releases Fara-7B, a 7B model optimized for agentic tasks and computer use. The model targets computational efficiency while maintaining autonomous agent capabilities.

AI Agents Code generation Open source

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Qwen3.6 27B quants

User benchmarks Qwen3.6 27B extreme quantization (IQ3 XXS turbo4) vs Q8 on code review task. IQ3 XXS (5min, 1230pp/50tg) generates comparable recommendations to Q8 (1h56m, 306pp/3tg). Finding: aggressive quantization adequate for coding tasks with good prompting.

Qwen Code generation Fine-tuning

SIG

HYP

Reddit r/MachineLearning·Jun 16

My offline ablation said -0.19pp. The production retrain said +1.11pp. [D]

ML engineer reports offline ablations (retrain with/without feature) contradicted production results. Four changes: Best Offer feature (+0.12pp offline → -0.19pp prod), auction data backfill (+0.37pp prod), outlier trimming (-0.19pp offline → +1.11pp prod), CatBoost encoder. Root causes: train/serve skew, unmeasured distribution shift, training population drift, baseline instability.

Evals Benchmarks

SIG

HYP

The Decoder·Jun 16

How easily can Russian propaganda fool AI models? A new benchmark finds out

The Institute of the Estonian Language releases a benchmark measuring how susceptible AI language models are to Russian propaganda. No technical details or quantified results provided in the excerpt.

Benchmarks AI safety Alignment

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Gemma 12b - Reasoning hardening instructions

A user shares a system instruction to improve reasoning in Gemma 12b QAT. The technique aims to reduce cognitive bias and adapt reasoning depth to context. It works well on trick questions but partially fails on certain problems depending on framing.

Gemini Prompt engineering Reasoning

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Be wary of Qwen/Claude distillations - they're often worse than the base model

Qwen/Claude distillations circulating on r/LocalLLaMA (Qwopus, Fable 5 on Qwen 3.6) use 4k-10k training samples, insufficient to improve performance. Compared to 700k samples in official DeepSeek-R1 distillations, these models don't exceed base Qwen and slightly degrade quality despite different reasoning style.

Qwen Claude Fine-tuning

SIG

HYP

Le Big Data·Jun 16

Nvidia raises $20 billion in debt to strengthen its AI push

Nvidia issues up to $25 billion in debt on the bond market to fund its AI expansion. This capital raise strengthens the semiconductor giant's position amid intensifying competition.

Business Infrastructure

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

Trace Commons initiative: collecting coding session traces under CC-BY-4.0 license to train open-source and open-weight models. Goal: counterbalance Anthropic and OpenAI's competitive advantage from proprietary data accumulated via Claude Code and Codex.

Open source Code generation AI Agents

SIG

HYP

The Decoder·Jun 16

Anthropic backs off unpopular billing overhaul as price war with OpenAI looms

Anthropic scraps its unpopular billing overhaul for the Claude Agent SDK before launch. Third-party apps will continue drawing from regular subscription limits instead of separate credits.

Claude AI Agents Business

SIG

HYP

The Decoder·Jun 16

DeepSeek takes outside money for the first time at a $50 billion valuation

DeepSeek raises 50 billion yuan ($7.4 billion) in its first external funding round, reaching a $50 billion valuation.

DeepSeek Funding Business

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

Nex-N2 Pro is the real deal

Nex-N2 Pro (rebranded as Rio-3.5) shows strong performance on coding benchmarks on a 128GB macOS machine. The user reports 100% consistency without hallucinations on private llama.cpp tests, outperforming previously tested models except GPT-5.x.

Llama Code generation Benchmarks

SIG

HYP

The Decoder·Jun 16

OpenAI burned through $34 billion last year

OpenAI spent $34 billion in the past year, significantly more than the previous year. No breakdown of cost allocation is provided.

OpenAI Business

SIG

HYP

Reddit r/LocalLLaMA·Jun 16

A fast, optimised, and open source application for running local AI easily (made for Apple Silicon only)

AeroLLM, open-source app optimized for Apple Silicon, runs local LLMs, TTS, and STT through a GUI. Uses MLX backend for native inference, downloads models from Hugging Face with RAM-based recommendations, exposes optional API endpoint. v0.1.0 released.

Open source Tools Llama

SIG

HYP

Le Big Data·Jun 16

Hydra Host raises $100 million to develop its AI-dedicated factories

Hydra Host raises $100 million led by Kindred Ventures to develop AI-dedicated data centers and accelerate expansion.

Infrastructure Funding

SIG

HYP

Le Big Data·Jun 16

Meta gives Facebook a big AI boost… by tapping public posts

Meta integrates AI into Facebook through a new search mode leveraging public posts. The platform promises faster responses to user queries.

Meta AI RAG

SIG

HYP

Simon Willison·Jun 16

The Fable 5 Export Controls Harm US Cyber Defense

Claude Fable 5 was banned under US export controls after a simple "fix this code" prompt enabled exploit generation. Katie Moussouris argues this is absurd: coding models must fix bugs, especially security vulnerabilities. Banning this capability weakens cyber defense.

Claude Regulation AI safety

SIG

HYP

Le Big Data·Jun 16

These Chinese hackers use Gemini to trap loads of people: Google strikes back!

The FBI and Google dismantled a Chinese cybercriminal network using Gemini for attacks. Google responded to these platform abuses.

Gemini AI safety Regulation

SIG

HYP

Reddit r/MachineLearning·Jun 16

quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

quicktok is a BPE tokenizer written in C++ producing byte-identical tokens to tiktoken. Encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken itself. Supports cl100k, o200k, GPT-OSS, Llama-3, Qwen2.5/3. Optimizations: 2-byte trie, dense caches, hand-compiled pretokenizer.

Code generation Tools Open source

SIG

HYP

Hacker News (AI)·Jun 16

OpenAI Losses Increased Nearly 8X in 2025, with Spending Hitting $34B

OpenAI's losses increased nearly 8x in 2025, with spending hitting $34B. The company's financial trajectory shows accelerating infrastructure and R&D investments.

OpenAI Business

SIG

HYP

arXiv cs.AI·Jun 16

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

Base Sequence Analysis framework encodes LLM-powered autonomous agent behavior into symbolic sequences (X/E/P/V). Analysis of 347 production ReAct traces reveals P-X-P pattern reduces success by 10.4% and P-ratio negatively predicts success (r=-0.256). Governor runtime intervention system achieves +6.2% absolute success increase and 44% token reduction. Validated on 2,000 SWE-agent trajectories.

AI Agents Reasoning Evals

SIG

HYP

arXiv cs.CL·Jun 16

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA introduces Nemotron 3 Ultra, a 550B-parameter (55B active) Mamba-Transformer MoE hybrid model pre-trained on 20T tokens with 1M context length. Uses SFT, RL, and multi-teacher distillation. Achieves ~6x inference throughput of public LLMs with comparable accuracy. Base, post-trained, and quantized checkpoints, training data, and recipe open-sourced on HuggingFace.

AI Agents Reasoning Open source

SIG

HYP

arXiv cs.AI·Jun 16

AI Engram: In Search of Memory Traces in Artificial Intelligence

Study introducing a geometric framework to identify 'AI engrams'—memory traces in deep neural networks analogous to biological memory units. Authors derive a closed-form estimator enabling surgical manipulation of learned knowledge (composition, erasure) via linear arithmetic without iterative optimization. Validated on MLPs and LLMs.

Reasoning Papers Alignment

SIG

HYP

arXiv cs.AI·Jun 16

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

SERAF, a time series forecasting framework, combines retrieval of historical segments with self-generated textual descriptions. Multimodal approach tested on 7 real-world datasets to improve predictions beyond numerical similarity alone.

RAG Benchmarks Papers

SIG

HYP

arXiv cs.AI·Jun 16

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

DR-DCI combines retrieval with Direct Corpus Interaction for agent-based search over large corpora. The system uses a retriever to dynamically populate a local workspace where agents execute precise operations (filtering, comparison, verification). On BrowseComp-Plus, DR-DCI achieves 71.2% accuracy (+8.3 points vs raw DCI) and remains stable up to 10M documents, where raw DCI becomes unstable.

AI Agents RAG Reasoning

SIG

HYP

arXiv cs.LG·Jun 16

AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

Bayesian study of the effects of air pollution regulation in London (2010-2020). A Bayesian LSTM model integrating PM2.5 observations, meteorology, and 32 policy measures estimates an average reduction of 1.88 µg/m³ (95% CI: 1.64-2.12), a 12.35% relative decrease. Effects strengthened over 2013-2019.
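
The two reported figures can be combined to back out the PM2.5 level they imply. This is only a rough sketch: it assumes the relative figure is the absolute reduction divided by a single no-regulation baseline mean, which the summary does not confirm, and the variable names are illustrative.

```python
# Implied counterfactual PM2.5 level, derived from the two reported figures.
# Assumption: relative reduction = absolute reduction / baseline mean.
absolute_reduction = 1.88    # µg/m³, from the summary
relative_reduction = 0.1235  # 12.35%, from the summary

implied_baseline = absolute_reduction / relative_reduction
implied_observed = implied_baseline - absolute_reduction

print(f"implied baseline without regulation: {implied_baseline:.2f} µg/m³")
print(f"implied level with regulation: {implied_observed:.2f} µg/m³")
```

If the paper averages per-period ratios instead, the implied baseline would differ from this estimate.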

Papers Reasoning Evals

SIG

HYP

arXiv cs.AI·Jun 16

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

Mechanistic interpretability audit of LLaMA 3.1-8B-Instruct on 54 moral prompts using Transluce platform. Reveals Situational Anchor Effect: domain-specific representations dominate activation rankings regardless of ethical content. Ethics capacity remains constant but salience is highly sensitive to prompt's interpretive frame. Identifies candidate ethics neuron (L16/N3837) stable across temperatures.

Llama Alignment Evals

SIG

HYP

arXiv cs.AI·Jun 16

CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

CogGuard is a proactive-warning framework for edge intelligent services using offline LLMs to build cognitive and operational profiles, then online SLMs for real-time scoring. Achieves 48% reduction in profile construction time and 19% in distributed fine-tuning on heterogeneous clusters. Reduces prediction error by 15.4% vs strongest baseline on educational datasets.

Reasoning Fine-tuning Benchmarks

SIG

HYP

arXiv cs.AI·Jun 16

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

VIBEMed is a multi-agent framework with self-evolution mechanism for clinical decision support. Three specialized agents (diagnostic, therapeutic, evolution manager) integrate patient session history and past outcomes to iteratively improve medical decisions. Results on oncology planning and complex cases.

Multi-agent AI Agents Reasoning

SIG

HYP

arXiv cs.AI·Jun 16

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Metric Match is a method to evaluate LLM judge reliability with fewer human annotations. It selects a subset of samples whose synthetic labels match population reliability metrics. Across 15 datasets, it reduces estimation error by 18.7% and annotation needs by 32.5%, saving $1,041.67 in a medical case study.

Evals Benchmarks Papers

SIG

HYP

arXiv cs.AI·Jun 16

Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

DAG-SHAP, a novel feature attribution method based on edge intervention in directed acyclic graphs. Improves existing Shapley-based methods by capturing both externality and exogenous influence of features simultaneously. Code available on GitHub.

Evals Papers

SIG

HYP

arXiv cs.CL·Jun 16

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

ASAG, a training-free method analyzing attention distributions, detects overthinking in reasoning models and adaptively stops generation. Tested on DeepSeek-R1-Distill and Qwen3, it improves accuracy by 3.2% while reducing generated tokens by 40% on Qwen3-8B.

Reasoning DeepSeek Qwen

SIG

HYP

arXiv cs.LG·Jun 16

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

TruDi enables diffusion policies for massively parallel on-policy RL by integrating trust-region optimization with KL-divergence constraints over entire diffusion trajectories. Evaluated on 73 tasks across 4 benchmarks: outperforms baselines on standard tasks, achieves clear gains on challenging humanoid control.

Reinforcement learning Reasoning Robotics

SIG

HYP

arXiv cs.LG·Jun 16

Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Edu-Theater is an LLM-powered multi-agent system for scalable learner behavior simulation. It uses a cohort-aware approach with targeted diagnostic queries instead of dense per-learner histories, reducing LLM calls and data requirements. Tested on two real-world datasets, it improves simulation accuracy and downstream applications like adaptive testing.

AI Agents Multi-agent Reasoning

SIG

HYP

arXiv cs.LG·Jun 16

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Study of convergence properties of Monte Carlo Exploring Starts (MCES) in tabular reinforcement learning. Authors construct counterexamples showing MCES can converge to suboptimal solutions despite initial exploration. A modification scaling learning rates inversely to update frequencies guarantees convergence to optimality.

Reinforcement learning Papers Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

M-CTX: Exact and Scalable Spatial Context Retrieval for Trajectory Analytics

M-CTX is a spatial context-retrieval framework for trajectory analytics. It replaces three brute-force stages (OSM range retrieval, SDF computation, moving-vessel neighbor lookup) with index-backed operators. On a 5.48M-anchor maritime corpus, it reduces context construction from 17 CPU-days to 1.8 hours (226x speedup), with exact reproduction of reference context.
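
The reported speedup can be sanity-checked with a few lines of arithmetic. This sketch assumes the 17 CPU-days and the 1.8 hours are directly comparable, which the summary implies but does not state; the variable names are illustrative.

```python
# Back-of-the-envelope check of the reported M-CTX speedup.
HOURS_PER_DAY = 24

baseline_cpu_days = 17  # brute-force context construction, from the summary
indexed_hours = 1.8     # index-backed context construction, from the summary

baseline_hours = baseline_cpu_days * HOURS_PER_DAY
speedup = baseline_hours / indexed_hours

print(f"baseline: {baseline_hours} CPU-hours, speedup: {speedup:.1f}x")
```

The ratio comes out just under 227, in line with the quoted 226x.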

Benchmarks Infrastructure Open source

SIG

HYP

arXiv cs.CL·Jun 16

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Practical evaluation method for long-form simultaneous speech-to-speech translation (SimulS2ST) on continuous input. Uses ASR, forced alignment, and sentence embeddings to recover timestamps and align target text to source sentences, then computes sentence-level latency and quality metrics (YAAL, xCOMET). Reveals substantial latency accumulation in current systems on long speech.

Voice Evals Benchmarks

SIG

HYP

arXiv cs.CL·Jun 16

Simplifying the Modeling of Arbitrary Conditionals in Natural Language

AC-GPT modifies causal Transformers to evaluate and sample from arbitrary conditionals (past, future, or mixed contexts) in a single forward pass. The method preserves left-to-right ordering and the next-token prediction objective, so existing LLMs can be fine-tuned without degrading standard performance.

Reasoning Papers Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

Towards a Unified Generative Model for Scarce Time Series with Domain Experts

TimeMoDE, a framework combining Diffusion Transformers and Mixture-of-Experts, generates realistic time series under data scarcity. Pre-trained on multi-domain datasets, it uses Domain Prompts to condition expert assignment and incorporates diffusion timestep signals for adaptive denoising. Outperforms existing methods in few-shot generation settings.

SIG

HYP

arXiv cs.AI·Jun 16

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

CONCORD is an asynchronous sparse aggregation framework for device-cloud RAG with document isolation. It uses waiting debt control and certificate-guided minimal supplementation to reduce synchronization and data transfer. Improves end-to-end throughput by 1.66× to 2.15× on Natural Questions and WikiText-2 while reducing per-token communication by over 100×.

RAG Papers Infrastructure

SIG

HYP

arXiv cs.AI·Jun 16

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

ChatPlanner is a framework using fine-tuned LLMs with RAG to extract user preferences from natural language and integrate them into public transit routing optimization. Evaluated on 8 personas and 5 contexts, the system combines fine-tuning (output structure) and RAG (query-specific context) to identify solutions overlooked by existing planners.

RAG Fine-tuning Prompt engineering

SIG

HYP

arXiv cs.LG·Jun 16

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

PolyKV optimizes KV cache compression by applying heterogeneous strategies per transformer layer instead of a uniform policy. On LLaMA-3.1-8B and Qwen3-8B with a 512-token KV budget, PolyKV recovers 54.5% and 25.7%, respectively, of the LongBench performance gap versus FullKV.

Benchmarks Infrastructure Reasoning

SIG

HYP

arXiv cs.LG·Jun 16

Controlled Dynamics Attractor Transformer

CDAT couples Transformer attention with continuous attractor neural network (CANN) dynamics. The model combines von Mises-Fisher attention energy with Hopfield refinement and excitation-inhibition modulation. Achieves state-of-the-art results on graph anomaly detection and graph classification benchmarks.

Reasoning Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 16

Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards

Contextual multi-armed bandit framework to optimize stimulated word-of-mouth in social networks. The approach learns individual spillover probabilities and ranks connected users to maximize rewards. Experiments on real-world network datasets show improved targeting precision and rewards compared to baseline methods that ignore spillover heterogeneity.

Reinforcement learning Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

Theoretical study demonstrating that neural networks trained with gradient-based methods can achieve optimal computational-statistical tradeoff for Gaussian single-index models. Proposed algorithm (two-layer network) achieves sample complexity Õ(d^{s*/2} ∨ d) matching SQ lower bounds, with extension to k-sparse case via weight perturbation technique.

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation

TriAdReview proposes a triangular adversarial architecture with two reviewer models (engineering and security perspectives) to improve technical document generation. Across 75 experiments, the three-model configuration scores 10.1% higher than the baseline (26.2 vs 23.8 out of 50, p<0.05), with strong gains on security audit (+27.6%), code generation (+20.8%), and architecture design (+15.6%), but a 7.5% degradation on requirements analysis.
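
The headline percentage is a relative gain over the baseline score, not a difference in points on the 50-point scale. A minimal sketch of that arithmetic, using only the two scores quoted above (the helper name `relative_gain` is hypothetical):

```python
def relative_gain(new_score: float, base_score: float) -> float:
    """Percentage improvement of new_score over base_score."""
    return (new_score - base_score) / base_score * 100.0

baseline = 23.8   # baseline score out of 50, from the summary
triple = 26.2     # three-model score out of 50, from the summary

gain_pct = relative_gain(triple, baseline)
gain_points = triple - baseline

print(f"relative gain: {gain_pct:.1f}%  absolute gain: {gain_points:.1f} points")
```

The absolute difference is 2.4 points, which is about a tenth of the baseline score.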

Multi-agent Code generation Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

An Integrable Token Mixing Layer from the Generalized Yang Baxter Equation

YB Mixer is a token mixing layer derived from free fermion and generalized Yang-Baxter structures. It uses the Ising exchange algebra to create an orthogonal norm-preserving fermionic structure with commuting transfer matrices enabling order-free inference. A spectral circulant generator ensures generalization to longer sequences.

Reasoning Papers

SIG

HYP

arXiv cs.AI·Jun 16

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

IRTS-ToolBench, a benchmark of 1,700 questions across 10 task types and 13 domains, evaluates how LLMs and AI agents handle irregular time series (asynchronous observations, informative missing values, variable sampling frequencies). It bridges the gap between existing TSQA benchmarks, which use regular data, and real-world deployments.

AI Agents Benchmarks Reasoning

SIG

HYP

arXiv cs.LG·Jun 16

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

Study on evaluating reasoning models beyond accuracy alone. Authors introduce two metrics: susceptibility (whether bias breaks a previously correct answer) and acknowledgment (whether the trace explicitly references injected biased content). On GSM8K, GPT-4o and Claude Sonnet 4 show similar susceptibility rates (1.3% vs 1.2%) but substantially different acknowledgment rates (13.0% vs 75.0%).
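
The two metrics can be sketched as simple rates over paired clean and biased runs. The denominators here are assumptions, since the summary does not give them: susceptibility is taken over items answered correctly without bias, and acknowledgment over all biased runs. The `Trial` record and the toy data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    correct_clean: bool    # answer was correct without injected bias
    correct_biased: bool   # answer was correct with injected bias
    acknowledged: bool     # trace explicitly references the injected content

def susceptibility(trials):
    """Share of previously correct answers that the bias breaks."""
    base = [t for t in trials if t.correct_clean]
    if not base:
        return 0.0
    return sum(not t.correct_biased for t in base) / len(base)

def acknowledgment(trials):
    """Share of biased runs whose trace mentions the injected content."""
    if not trials:
        return 0.0
    return sum(t.acknowledged for t in trials) / len(trials)

# Toy data for illustration only, not results from the paper.
toy = [
    Trial(True, True, True),
    Trial(True, False, False),
    Trial(True, True, False),
    Trial(False, False, True),
]
print(susceptibility(toy), acknowledgment(toy))
```

Separating the two rates shows how models with nearly identical susceptibility can still differ widely in how often they surface the bias in their reasoning.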

Evals Reasoning AI safety

SIG

HYP

arXiv cs.LG·Jun 16

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR synergizes Monte Carlo Tree Search with test-time reinforcement learning for optimization modeling. The framework decomposes modeling into four stages, refines a transient LoRA adapter via GRPO at each node, and employs an unsupervised multi-faceted reward system. Achieves state-of-the-art results across five optimization benchmarks with a 4B backbone.

Reasoning Reinforcement learning Fine-tuning

SIG

HYP

arXiv cs.AI·Jun 16

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Mask-Proof is an automated pipeline converting real mathematical proofs into verifiable masked-step tasks. The benchmark contains 292 curated problems. Testing 17 models shows reasoning-enhanced models outperform standard models by 12-27%. The evaluator achieves 96.8% agreement with expert annotators.

Benchmarks Reasoning Evals

SIG

HYP

arXiv cs.LG·Jun 16

Transformers Learn the Mestre-Nagao Heuristic

Two-layer transformers classify rational elliptic curves (rank 0 vs 1) with >99% accuracy from 128 Frobenius traces. Mechanistic interpretability analysis reveals a sparse circuit of 20 neurons implements the Mestre-Nagao heuristic (weights log(p)/(p·log B), r=0.997), autonomously discovering an analytic number theory result.
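
The weights quoted above can be written out as a weighted sum over Frobenius traces. This is a sketch under assumptions: B is read as the upper bound on the primes used, the sign and normalization conventions of Mestre-Nagao sums vary between references, and the function names are hypothetical. Traces are supplied by the caller as a mapping from prime p to a_p.

```python
import math

def primes_up_to(n: int) -> list[int]:
    """Sieve of Eratosthenes returning all primes <= n (n >= 2)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def mestre_nagao_weights(bound: int) -> dict[int, float]:
    """Weight log(p) / (p * log B) for each prime p <= B."""
    log_b = math.log(bound)
    return {p: math.log(p) / (p * log_b) for p in primes_up_to(bound)}

def weighted_trace_sum(traces: dict[int, int], bound: int) -> float:
    """Sum of w_p * a_p over primes p <= bound present in traces."""
    weights = mestre_nagao_weights(bound)
    return sum(weights[p] * a_p for p, a_p in traces.items() if p in weights)

weights = mestre_nagao_weights(20)
print(sorted(weights))
```

Small primes dominate the sum because the weight decays roughly like log(p)/p, which fits the report that a sparse set of neurons is enough to implement the heuristic.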

Reasoning Evals Papers

SIG

HYP

arXiv cs.AI·Jun 16

Towards End-to-End Automation of AI Research

The AI Scientist automates the entire research lifecycle: idea generation, coding, experiments, data analysis, manuscript writing, and peer review. An AI-generated manuscript passed the first round at a major ML conference workshop (70% acceptance rate). The system leverages foundation models within a complex agentic architecture.

AI Agents Multi-agent Papers

SIG

HYP

arXiv cs.LG·Jun 16

A Comparative Study of Graph Neural Network Layer Selection for Interaction Modelling in Driving Trajectory Prediction

Comparative study of 19 GNN layer types for trajectory prediction in autonomous driving. ARMA, Chebyshev, and topology-aware layers consistently outperform others. Sum-based aggregation, multi-head attention, and distance-weighted hops significantly improve prediction accuracy.

Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 16

Leveraging Physiological Signals to Predict Exam Outcomes with Machine Learning

Study comparing ML models (logistic regression, random forest, SVM, transformers, LSTM, GRU) to predict exam outcomes from physiological signals (electrodermal activity, heart rate, skin temperature). Random forests outperform deep learning models in computational efficiency and interpretability.

Benchmarks Reasoning

SIG

HYP

arXiv cs.LG·Jun 16

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix automates data mixture optimization for model training via gradient descent. The method reformulates mixture selection as a bilevel optimization problem, jointly optimizing mixture coefficients and model parameters. A single proxy model suffices, drastically reducing search cost compared to prior approaches.

Fine-tuning Benchmarks Papers

SIG

HYP

arXiv cs.LG·Jun 16

Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

New Separable Neural Architecture (SNA) combining neural approximation with tensor decomposition to solve high-dimensional PDEs. Variational framework (VSNA) guarantees well-posedness and convergence. Demonstrates 150,000x speedup vs FEM on A100 GPU for 7D parametric simulation and real-time thermal inversion of Inconel 718 (<100ms).

Papers Reasoning Benchmarks

SIG

HYP

arXiv cs.AI·Jun 16

Relational Structural Causal Models

Theoretical paper on relational structural causal models (RSCM), extending Pearl's SCM to settings with varying objects and relations. Introduces symbolic identification criteria and relational neural causal models, validated on simulated traffic scenes with variable cars, signals, and pedestrians.

Reasoning Papers Benchmarks

SIG

HYP

arXiv cs.LG·Jun 16

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

Theoretical paper on dynamic routing of queries to multiple embedding models. Formalizes the problem as an adversarial contextual linear bandit with low-rank experts. Proposes the Hypentropy Policy Gradient (HPG) algorithm, which achieves Õ(s√MT) linearized policy regret without a curse of dimensionality.

Benchmarks Reasoning Reinforcement learning

SIG

HYP