Archives

May 2026

3147 articles

arXiv cs.LG·

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

SAECache introduces a semantic-aware eviction policy for LLM prefix caches. Not all tokens are equally worth caching: different token types (system prompts, user queries, tool outputs, reasoning) show up to 756x variation in reuse rates. SAECache uses a multi-queue architecture with online learning to adapt priorities, achieving 1.4x-2.7x TTFT improvement over production baselines.

Reasoning · Infrastructure · Benchmarks
SIG 78 · HYP 15
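As a rough illustration of the multi-queue idea in the SAECache summary above, the sketch below keeps one LRU queue per token type and evicts from the type with the lowest estimated reuse rate. Everything here is an assumption made for illustration: the class name `MultiQueueCache`, the exponential-moving-average update, and the `alpha` parameter are hypothetical and are not taken from the paper.

```python
from collections import OrderedDict


class MultiQueueCache:
    """Toy cache with one LRU queue per token type (hypothetical design)."""

    def __init__(self, capacity, token_types, alpha=0.1):
        self.capacity = capacity  # assumed to be at least 1
        self.alpha = alpha        # learning rate for the online update
        self.queues = {t: OrderedDict() for t in token_types}
        # Estimated reuse rate per type, started at a neutral 0.5.
        self.priority = {t: 0.5 for t in token_types}

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def get(self, token_type, key):
        queue = self.queues[token_type]
        hit = key in queue
        # Online update: move the estimate toward 1 on a hit, 0 on a miss.
        self.priority[token_type] += self.alpha * (float(hit) - self.priority[token_type])
        if hit:
            queue.move_to_end(key)  # mark as most recently used
            return queue[key]
        return None

    def put(self, token_type, key, value):
        queue = self.queues[token_type]
        if key in queue:
            queue[key] = value
            queue.move_to_end(key)
            return
        if len(self) >= self.capacity:
            self._evict()
        queue[key] = value

    def _evict(self):
        # Evict the least recently used entry of the lowest-priority type.
        candidates = [t for t, q in self.queues.items() if q]
        victim_type = min(candidates, key=lambda t: self.priority[t])
        self.queues[victim_type].popitem(last=False)


cache = MultiQueueCache(2, ["system_prompt", "tool_output"])
cache.put("system_prompt", "sp", 1)
cache.put("tool_output", "t1", 2)
cache.get("system_prompt", "sp")       # hit raises the system_prompt estimate
cache.get("tool_output", "missing")    # miss lowers the tool_output estimate
cache.put("tool_output", "t2", 3)      # cache is full, so "t1" is evicted
```

In this toy run the system prompt entry survives because its type has the higher estimated reuse rate, while the older tool output is dropped.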
arXiv cs.LG·

VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

VCR is a self-supervised framework learning robust representations from incomplete wearable sensor signals. It uses an orthogonal tokenizer to disentangle shared semantics from modality-specific residuals, combined with a missing-aware mixture-of-experts backbone. VCR improves performance on health monitoring tasks under single and multiple missing modalities.

Papers · Embeddings · Reinforcement learning
SIG 72 · HYP 18
arXiv cs.LG·

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Study of 63 base models reveals hidden phase transition: below ~3.5B parameters, reasoning and truthfulness anticorrelate; above, they cooperate. Architecture, data curation, and training recipe independently shift this critical threshold. Width normalization eliminates anticorrelation; frontier models reach r=+0.72. Open-source steering tool and diagnostic dashboard released.

Benchmarks · Alignment · Reasoning
SIG 82 · HYP 25
arXiv cs.LG·

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Analysis of 34 frontier models (2024-2026) showing reasoning and coding capabilities cooperate (r=+0.72) but vary by lab. DeepSeek shifted from reasoning-rich to coding-first (+11.2→-4.7); Google maintains balance; Anthropic oscillates. SWE-bench saturating while HLE and instruction-following remain discriminative. Seven falsifiable predictions for next 12 months with interactive dashboard.

Benchmarks · Evals · Reasoning
SIG 78 · HYP 22
arXiv cs.AI·

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Learn-by-Wire Guard (LBW-Guard) is an autonomous governance layer that supervises the AdamW optimizer during language-model training. Tested on Qwen2.5-7B with WikiText-103, LBW-Guard reduces final perplexity from 13.21 to 10.74 (−18.7%) and accelerates training by 1.10×. Under extreme learning-rate stress (LR=3e-3), AdamW fails (perplexity 1885.24) while LBW-Guard remains stable (11.57).

Qwen · Reinforcement learning · Benchmarks
SIG 72 · HYP 25
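The relative perplexity change quoted in the LBW-Guard summary above can be checked with a few lines of arithmetic. The helper name `relative_change` is hypothetical; the two perplexity values are the ones given in the summary.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


# Final perplexity values quoted in the summary: AdamW baseline vs. LBW-Guard.
ppl_change = relative_change(13.21, 10.74)
print(f"{ppl_change:.1%}")  # prints -18.7%
```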
arXiv cs.AI·

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench is a diagnostic benchmark assessing privacy-utility trade-offs in LLM agents. A trusted model with privacy policy interacts with an adversarial third-party model across 10 domains and 7,852 samples. Frontier models withhold 99% of protected attributes, but open-weight models in the 1–30B range commonly used for on-device private inference leak up to 50% of sensitive data.

AI Agents · AI safety · Alignment
SIG 78 · HYP 25
arXiv cs.LG·

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

CPSS (Constraint Projection Safety Shield) converts cumulative safety budgets into adaptive state-level control constraints for nonstationary reinforcement learning. The mechanism dynamically adjusts risk thresholds based on context, guarantees per-state threshold satisfaction, and reduces safety violations in highway merging scenarios.

Reinforcement learning · AI safety · Reasoning
SIG 72 · HYP 18
arXiv cs.AI·

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Self-evolving skill libraries suffer silent degradation termed 'library drift': unbounded accumulation without lifecycle management. Study isolates mechanism via ablations, provides trace-level diagnostics, and validates fix (outcome-driven retirement + bounded active-cap + meta-skill prior) lifting pass@1 from 0.258 baseline to 0.584 on MBPP+ hard-100.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 15
arXiv cs.AI·

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Formal Skill is a runtime abstraction for LLM agents that structures reusable capabilities via JSON metadata, action schemas, Python executors, and hook-governed control logic. Implemented in FairyClaw (open-source event-driven runtime), it replaces natural-language procedures with executable state machines, reducing token usage while improving reliability on Harness-Bench.

AI Agents · MCP · Code generation
SIG 78 · HYP 25
arXiv cs.LG·

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

LILAC+ proposes a framework for safe continual reinforcement learning in nonstationary environments. The system combines three adaptive mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state enforcement. Evaluated in simulated driving, it reduces safety violations under distribution shift while maintaining competitive task performance.

Reinforcement learning · AI safety · Alignment
SIG 72 · HYP 18
Reddit r/MachineLearning·

I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]

AXON visualizes real-time concept activations in GPT-2 through a 3D force-directed graph. A Sparse Autoencoder decomposes the residual stream into interpretable features (geography, cities, languages) per generated token. Stack: TransformerLens + SAELens (backend), FastAPI WebSocket, Three.js (frontend). ~35ms/token on GPU.

GPT · Open source · Tools
SIG 72 · HYP 35
Reddit r/LocalLLaMA·

PrivateScribe.ai - Fully local, MIT licensed, free AI transcription built with HIPAA/legal safeguards in mind - One Year Update!

PrivateScribe.ai, a fully local open-source transcription platform (MIT license), announces v1 with a signed macOS app. Stack: FasterWhisper, pyannote, Ollama, Vite/Flask/SQLite. 256-bit encryption, zero network calls, audit trail, speaker diarization. Built for clinics, law firms, and therapists, with HIPAA and legal safeguards in mind.

Open source · Voice · Code generation
SIG 72 · HYP 28
Reddit r/LocalLLaMA·

A tool I built to generate 3D objects with functional, articulated parts. It's on github, and is mostly LLM-agnostic.

Open-source tool to generate 3D objects with articulated, functional parts. Instead of diffusion (point-cloud blobs), the pipeline uses an LLM as a structured code compiler, generating native Blender Python code targeting specific scene graph nodes. Flutter/Three.js frontend, model-agnostic. Gemini recommended; local models still hallucinate on complex matrix transforms.

Code generation · Open source · Tools
SIG 72 · HYP 35
Reddit r/LocalLLaMA·

unpopular opinion: cursor and claude code arent getting dumber, their agent loops are structurally blind and suffocating your context window

A user critiques the architecture of code agents (Cursor, Claude Code): models aren't degrading, but their exploration loops are structurally blind. They dump massive files into context, generate noise (logs, MCP definitions), and lose project memory per session, saturating the context window before reasoning begins.

Claude Code · AI Agents · Code generation
SIG 45 · HYP 55
Reddit r/MachineLearning·

Graph spectral analysis (Fiedler value + Scheffer CSD indicators) predicts grokking 21k steps before loss function - five reproducible experiments [R]

Graph spectral analysis (Fiedler value + Scheffer critical slowing down) predicts grokking 21k steps before loss convergence. Five reproducible CPU experiments: early detection, distinct structural fingerprints for grokking vs catastrophic forgetting, guided intervention preserves 91.7% vs 2.6%, 48x acceleration across sequential tasks. Limited to 2-layer MLPs and 1-layer transformers.

Papers · Evals · Reasoning
SIG 72 · HYP 28
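For readers unfamiliar with the Fiedler value mentioned in the grokking summary above, it is the second-smallest eigenvalue of a graph Laplacian and measures how well connected the graph is. The sketch below computes it with NumPy for a toy adjacency matrix. How the post builds its graph from a network is not stated in the summary, so the 3-node path graph and the function name `fiedler_value` are illustrative assumptions only.

```python
import numpy as np


def fiedler_value(adjacency: np.ndarray) -> float:
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    # eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order.
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return float(eigenvalues[1])


# Toy undirected path graph: 0 - 1 - 2 (Laplacian eigenvalues 0, 1, 3).
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])

# Same nodes with the edge 1 - 2 removed, leaving node 2 disconnected.
split = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])

path_value = fiedler_value(path)    # 1.0 up to floating-point error
split_value = fiedler_value(split)  # 0.0 up to floating-point error
```

A value of zero indicates a disconnected graph, which is why the quantity is often tracked as a connectivity signal.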
GitHub Trending·

alirezarezvani / claude-skills

Repository of 313+ skills for Claude Code and 8 other coding agents, including Codex, Gemini CLI, and Cursor. Covers engineering, marketing, product, compliance, research, operations, and productivity.

Claude Code · AI Agents · Tools
SIG 35 · HYP 55
GitHub Trending·

multica-ai / andrej-karpathy-skills

A CLAUDE.md file based on Andrej Karpathy's observations to improve Claude Code behavior and address common LLM coding pitfalls.

Claude · Claude Code · Prompt engineering
SIG 45 · HYP 35
GitHub Trending·

colbymchenry / codegraph

Codegraph: pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode. Reduces tokens and tool calls, runs 100% locally.

Claude Code · Code generation · RAG
SIG 65 · HYP 25
GitHub Trending·

HKUDS / CLI-Anything

CLI-Anything converts command-line interfaces to make them compatible with AI agents. The project aims to make all software "agent-native" through a unified CLI approach.

AI Agents · Tools · Open source
SIG 45 · HYP 55
GitHub Trending·

Alishahryar1 / free-claude-code

Tool enabling free Claude Code usage via terminal, VSCode extension, or Discord with voice support, inspired by OpenClaw.

Claude Code · Tools · Open source
SIG 35 · HYP 55
GitHub Trending·

humanlayer / 12-factor-agents

12-factor-agents outlines principles for building production-ready LLM-powered agents. The GitHub project adapts 12-factor methodology to establish best practices for autonomous AI systems deployed to customers.

AI Agents · Open source
SIG 45 · HYP 35
GitHub Trending·

msitarzewski / agency-agents

Agency-agents: open-source framework to deploy a multi-agent AI agency with specialized experts. Each agent has distinct roles (frontend, community management, validation) with defined processes and deliverables.

Multi-agent · AI Agents · Open source
SIG 45 · HYP 65