Archives

May 2026

3147 articles

Reddit r/MachineLearning·

The famous METR AI time horizons graph contains numerous severe errors [D]

Nathan Witkin (NYU Stern) harshly critiques METR's AI time horizons graph. Errors cited include human baselines that were estimated rather than measured, hourly-paid benchmarkers with an incentive to work slowly, a sample biased toward the authors' peers, and no adjustment for the familiarity advantage (5-18x faster). Witkin concludes that the graph contains too many compounding errors to be salvaged.

Benchmarks · Evals · AI safety
SIG
75
HYP
45
Reddit r/MachineLearning·

DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]

A DCGAN with 12.6M parameters runs on a RISC-V CH32H417 microcontroller (512KB SRAM). It generates 64×64 cat faces in 26 seconds using a pure C inference engine with int8 per-channel quantization. Weights are streamed from an SD card via double buffering. The Z vector is seeded with 200 bytes of quantum random data (ANU QRNG). No existing frameworks (TFLite, CMSIS NN) were used; the engine was built from scratch.

Code generation · Benchmarks · Open source
SIG
78
HYP
25
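
The int8 per-channel quantization mentioned in the DCGAN item above can be illustrated with a minimal sketch. The project itself is written in C and its exact scheme is not described here, so this Python version assumes symmetric quantization with one scale per output channel; the function names `quantize_per_channel` and `dequantize_per_channel` are hypothetical.

```python
from typing import List, Tuple


def quantize_per_channel(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per channel (row).

    Assumption: each row of `weights` holds the weights of one output channel.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max((abs(w) for w in row), default=0.0)
        # Map the largest magnitude in the channel to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights from int8 values and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
    q, s = quantize_per_channel(w)
    print(q, s)
    print(dequantize_per_channel(q, s))
```

Giving each channel its own scale keeps small-magnitude channels (the second row here) from being crushed to zero by a scale chosen for a larger one.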
Reddit r/MachineLearning·

We gave an LLM a structural graph of a codebase before exploring. It used 54% MORE context than without one. Paper + explanation inside [R]

Controlled study on a TypeScript codebase (25 sections, 3,250 files): an LLM (Kimi K2.6) equipped with a structural graph (Blueprint: Universal Ctags + ast-grep + BM25) consumed 54% more input tokens (63,541 vs 41,327) but explored deeper (6 turns vs 5). The graph itself costs ~6,500 tokens and increases the model's navigational confidence.

Code generation · RAG · Benchmarks
SIG
75
HYP
25
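
The token figures in the structural-graph item above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the summary; the ~6,500-token graph cost is approximate, so the share computed from it is a rough estimate.

```python
# Figures quoted in the summary above.
with_graph = 63_541      # input tokens with the structural graph
without_graph = 41_327   # input tokens without it
graph_cost = 6_500       # approximate size of the graph itself

extra = with_graph - without_graph
pct_increase = 100 * extra / without_graph
graph_share = 100 * graph_cost / extra

print(f"extra tokens: {extra}")
print(f"increase: {pct_increase:.1f}%")
print(f"share of the increase explained by the graph itself: {graph_share:.1f}%")
```

Under these numbers the graph accounts for well under half of the additional tokens, which is consistent with the summary's point that the model also explored more.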
Reddit r/LocalLLaMA·

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

RTPurbo transforms full-attention LLMs into sparse models in hundreds of training steps. The method exploits three observations: only certain heads require full attention, long-range retrieval uses a 16D subspace, and token selection is query-dependent. Results: 9.36x prefill speedup at 1M context, 2.01x decode speedup, accuracy preserved.

Reasoning · Benchmarks · Infrastructure
SIG
78
HYP
25
GitHub Trending·

hardikpandya / stop-slop

Stop-slop is a skill file designed to detect and remove common AI-generated text markers from prose, such as repetitive phrases and generic formulations.

Prompt engineering · Tools
SIG
35
HYP
45
GitHub Trending·

garrytan / gstack

Gstack: 23 opinionated Claude Code tools configured from Garry Tan's setup, covering CEO, designer, engineering manager, release manager, doc engineer, and QA roles.

Claude Code · AI Agents · Tools
SIG
45
HYP
55
GitHub Trending·

affaan-m / ECC

Agent harness performance optimization system. Integrates skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, and Cursor.

AI Agents · Claude Code · Code generation
SIG
35
HYP
55
GitHub Trending·

anthropics / claude-cookbooks

Anthropic releases claude-cookbooks, a collection of notebooks and recipes demonstrating practical and creative ways to use Claude.

Claude · Prompt engineering
SIG
65
HYP
25
GitHub Trending·

moeru-ai / airi

Airi is a self-hosted, open-source AI companion capable of real-time voice chat, Minecraft and Factorio gameplay. Supports Web, macOS, and Windows. Inspired by Neuro-sama.

AI Agents · Voice · Open source
SIG
35
HYP
65
GitHub Trending·

AlexsJones / llmfit

llmfit: CLI tool to test hundreds of LLM models and providers on your hardware. One command to identify what runs locally.

Tools · Open source · Infrastructure
SIG
65
HYP
25
GitHub Trending·

Zackriya-Solutions / meetily

Meetily is an open-source, self-hosted meeting assistant built in Rust. 4x faster transcription than Whisper/Parakeet, speaker diarization, Ollama-based summarization. 100% local processing, no cloud required.

Open source · Voice · Tools
SIG
65
HYP
35
GitHub Trending·

nearai / ironclaw

IronClaw is an Agent OS emphasizing privacy, security, and extensibility. Open-source project hosted on GitHub.

AI Agents · Open source · AI safety
SIG
35
HYP
25
GitHub Trending·

NateBJones-Projects / OB1

OB1 (Open Brain) offers a unified infrastructure layer: one database, one AI gateway, one chat channel. Compatible with any AI model, no middleware or SaaS required.

Infrastructure · AI Agents · Open source
SIG
35
HYP
65
GitHub Trending·

CodebuffAI / codebuff

CodebuffAI: command-line tool for code generation. Generates code directly from the terminal.

Code generation · Tools
SIG
35
HYP
45
GitHub Trending·

garrytan / gstack

Gstack: Garry Tan's Claude Code setup with 23 opinionated tools automating CEO, designer, engineering manager, release manager, doc engineer, and QA roles.

Claude Code · AI Agents · Code generation
SIG
45
HYP
65
GitHub Trending·

moeru-ai / airi

Airi is a self-hosted AI companion supporting real-time voice chat, Minecraft and Factorio gameplay. Web, macOS and Windows support. Open-source project inspired by Grok and Neuro-sama.

Open source · Voice · AI Agents
SIG
35
HYP
65
GitHub Trending·

OpenBB-finance / OpenBB

OpenBB is an open-source financial data platform for analysts, quants and AI agents. It provides unified access to market data through a GitHub-hosted repository.

Open source · AI Agents · Tools
SIG
45
HYP
25
GitHub Trending·

sansan0 / TrendRadar

TrendRadar is an AI-driven trend monitor aggregating multi-platform news via RSS with smart alerts. Filters by keywords, translates and analyzes articles via AI, supports MCP for natural language dialogue, Docker deployment with local/cloud data, integrations with WeChat/Feishu/DingTalk/Telegram/Slack.

AI Agents · MCP · RAG
SIG
45
HYP
55
Reddit r/LocalLLaMA·

I built a computer use sandbox framework for codex on headless linux. GPU passthrough, computer use, and sudo access for codex all work. It's the perfect dev sandbox to allow full auto work while minimizing the "rm -rf /" risk

Developer builds sandbox framework for AI agents on headless Linux with GPU passthrough, sudo access, and host OS isolation. VM-based architecture enables autonomous web browsing, Docker execution, and parallel sessions. Code released on GitHub.

AI Agents · Code generation · Infrastructure
SIG
72
HYP
28
arXiv cs.AI·

KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

KPI2KVI transforms natural language service descriptions into Key Value Indicator (KVI) estimates using a deterministic multi-agent LLM workflow. The system elicits missing context, extracts relevant KVI categories, generates service-specific KPIs, collects values through interactive dialogue with intelligent estimation, and computes interval-valued KVIs with traceable explanations.

AI Agents · Multi-agent · Prompt engineering
SIG
72
HYP
25
arXiv cs.AI·

The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation

Theoretical paper proposing a Cognitive Kardashev Scale to quantify AI compute capacity civilisations could sustain. Based on four parameters (total power, cognition share, energy efficiency, brain reference), the study estimates current humanity at K≈0.73 (Type I). At Type I with 1% power allocation, each human would have access to one personal AI's worth of cognition.

Reasoning · Benchmarks
SIG
45
HYP
25
arXiv cs.AI·

Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

Ontological Knowledge Blocks (OKBs): programmable governance infrastructure compiling regulatory obligations into machine-checkable constraints over structured evidence graphs. Uses RDF/OWL, SHACL, and PROV-O. Prototype evaluated on HPC resource allocation with 24 validation runs and 4 governance profiles. SHACL validation latency: 12.6–100.3 ms.

Regulation · AI safety · Alignment
SIG
72
HYP
15
arXiv cs.CL·

DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

DFKI-MLT applies activation steering to multilingual LLMs to improve cultural awareness in SemEval-2026 Task 7. The method adds language-specific steering vectors to the residual stream without parameter updates. Result: 86.96% accuracy on MCQ track (7th/17), but modest and heterogeneous improvements varying by language-region pair and layer selection.

Prompt engineering · Reasoning · Fine-tuning
SIG
72
HYP
18
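
The activation steering described in the DFKI-MLT item above amounts to adding a vector to a hidden state at inference time, leaving the weights untouched. The sketch below assumes the steering vector is a difference of mean activations between two sets of examples, which is a common construction but is not confirmed by the summary. The names `steering_vector`, `apply_steering`, and the strength parameter `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts: List[Vector], baseline_acts: List[Vector]) -> Vector:
    """Assumed construction: mean target activation minus mean baseline activation."""
    t, b = mean_vector(target_acts), mean_vector(baseline_acts)
    return [ti - bi for ti, bi in zip(t, b)]


def apply_steering(hidden: Vector, vec: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one residual-stream state; no weights change."""
    return [h + alpha * v for h, v in zip(hidden, vec)]


if __name__ == "__main__":
    target = [[1.0, 2.0], [3.0, 4.0]]    # toy activations for the target language
    baseline = [[0.0, 0.0], [2.0, 2.0]]  # toy activations for a baseline
    v = steering_vector(target, baseline)
    print(apply_steering([0.5, 0.5], v, alpha=2.0))
```

In a real model the same addition would be applied at a chosen layer for every token position, which is why the summary notes that results depend on layer selection.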
arXiv cs.LG·

Uncovering the Latent Potential of Deep Intermediate Representations

Study showing task-relevant information is distributed non-monotonically across layers in foundation models. Introduces LOES (Layer-wise Optimal Embedding Selection), a spectral method identifying task-discriminative subspaces, and GeoReg, geometric regularization enforcing simplicial structure on class manifolds. Consistent improvements across architectures and modalities.

Fine-tuning · Embeddings · Papers
SIG
72
HYP
15
arXiv cs.LG·

Steered Generation via Gradient-Based Optimization on Sparse Query Features

Prototype-Based Sparse Steering applies Sparse Autoencoders to LLM attention query activations to decompose representations into interpretable features. Gradient-based optimization during inference aligns sparse representations with target behavior prototypes. Validated on Textualized Gridworld (planning constraints) and educational domain (cognitive complexity via Bloom's Taxonomy).

Reasoning · Fine-tuning · Papers
SIG
72
HYP
18
arXiv cs.LG·

A mathematical theory of balancing relational generalization and memorization

Theoretical study on balancing relational generalization and memorization in learning systems. The authors introduce a transitive-inference-with-exceptions task and analytically characterize kernel ridge regression models across representations. Validation on pretrained language models shows successful generalization depends on representational geometry, with systematic errors predicted by theory.

Papers · Reasoning · Evals
SIG
72
HYP
15
arXiv cs.CL·

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

MaR (Metacognition-as-Reward) is an RL framework improving LLM reasoning via two dimensions: metacognitive knowledge (identifying task-relevant information) and metacognitive regulation (planning the reasoning process). Tested on 22 benchmarks, Qwen3.5-9B + MaR achieves up to 7.7% gain over base model and 11.0% over vanilla DAPO, surpassing GPT-OSS-120B on average.

Reinforcement learning · Reasoning · Qwen
SIG
78
HYP
25
arXiv cs.CL·

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Study of 16 language models (1.5B–72B parameters) showing representational convergence does not extend to reasoning processes. Models align more on collectively failed problems (CKA=0.897) than solved ones (CKA=0.830). Post-decision representations diverge sharply (CKA=0.274), and shared information exerts minimal causal influence (1.5–5.5% flip rate).

Papers · Reasoning · Evals
SIG
78
HYP
15
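
The CKA scores quoted in the convergence item above measure how similar two sets of representations are, on a scale from 0 to 1. The summary does not say which CKA variant the study uses, so the sketch below assumes linear CKA, computed as ||Yᵀ X||² / (||Xᵀ X|| · ||Yᵀ Y||) with Frobenius norms on column-centered matrices. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = examples, columns = features


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, summed over all column pairs."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two representations of the same examples."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den if den > 0 else 0.0


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
    # A rotated and rescaled copy of x: linear CKA should treat it as identical.
    y = [[2.0 * b, -2.0 * a] for a, b in x]
    print(linear_cka(x, x), linear_cka(x, y))
```

Linear CKA is invariant to rotations and uniform rescaling of the feature space, which is what makes it usable across models with different hidden dimensions and bases.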