Andrej Karpathy joins Anthropic
Andrej Karpathy joins Anthropic as Senior Research Scientist. The OpenAI co-founder and former director of AI at Tesla brings deep learning and AI systems expertise to the company.
NVIDIA Labs releases Sana, a linear diffusion transformer for efficient high-resolution image synthesis. Architecture reduces computational complexity while maintaining visual quality.
Anthropic expands Claude Managed Agents with self-hosted sandboxes and MCP tunnels. Companies can now execute their AI agents' tools in their own infrastructure while keeping agent control with Anthropic.
UniversalRAG is a multi-modal RAG framework that retrieves and integrates knowledge from heterogeneous sources (text, images, videos) at variable granularities. It introduces modality-aware routing to avoid intra-modal bias and organizes each modality into granularity levels. Validated on 10 benchmarks, it outperforms single-modality and unified baselines.
Critical study of LLM-based trading agents (FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, FLAG-Trader). Reported Sharpe ratios do not constitute deployment evidence: temporal contamination, unmodeled frictions, and insufficient predictive calibration invalidate claims. Proposes P1-P6 protocol and modular architecture with LLM as audit interface.
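To illustrate why unmodeled frictions matter for reported Sharpe ratios, here is a minimal sketch. It is not the study's P1-P6 protocol. The cost model (a fixed proportional cost per unit of turnover), the zero risk-free rate, the toy return series, and the names `net_returns` and `sharpe` are all assumptions made for illustration.

```python
import math
import statistics

def net_returns(asset_returns, positions, cost_per_turnover):
    """Per-period strategy returns after a proportional trading cost.

    Assumption: cost is charged on |change in position| each period,
    starting from a flat (zero) position.
    """
    out, prev = [], 0.0
    for r, pos in zip(asset_returns, positions):
        turnover = abs(pos - prev)
        out.append(pos * r - cost_per_turnover * turnover)
        prev = pos
    return out

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

# Toy data (hypothetical): a strategy that flips its position every period.
asset_returns = [0.01, -0.005, 0.012, -0.004, 0.008, -0.006]
positions = [1, -1, 1, -1, 1, -1]

gross_sharpe = sharpe(net_returns(asset_returns, positions, 0.0))
net_sharpe = sharpe(net_returns(asset_returns, positions, 0.001))  # 10 bps per unit turnover

print(f"gross Sharpe: {gross_sharpe:.2f}, net Sharpe: {net_sharpe:.2f}")
```

Even a small per-trade cost lowers the ratio for a high-turnover strategy, which is one reason a frictionless backtest figure is weak evidence on its own.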
AMARIS enhances rubric-based RL by integrating persistent evaluation memory. The system accumulates evaluation diagnostics over time, retrieves them via static and semantic search, and continuously adapts reward rubrics. Experiments show performance gains with ~5% time overhead.
GPRL (General Preference Reinforcement Learning) is a new method that replaces scalar reward models with a General Preference Model (GPM) using k skew-symmetric subspaces. Tested on Llama-3-8B-Instruct, it reaches a 56.51% win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by preventing single-axis reward hacking.
RAC (Robust Agent Compensation) is a log-based recovery paradigm integrated into agent frameworks (LangGraph, LangChain) to prevent unintended side effects, and it requires no code changes. On τ-bench and REALM-Bench, it delivers 1.5-8× better latency and token efficiency than state-of-the-art LLM recovery approaches.
Semantic Generative Tuning (SGT) aligns visual understanding and generation in unified multimodal models by using image segmentation as a generative proxy. High-level semantic tasks improve feature linear separability and visual-textual attention allocation, outperforming decoupled training approaches.
Unified framework for disease trajectory modeling in clinical AI, integrating factual forecasting, counterfactual estimation, and policy evaluation. Addresses treatment assignment bias, time-varying confounding, and observation bias to transform static predictions into treatment-sensitive dynamic estimates.
Systematic study of code as AI agent infrastructure. Three layers: harness interface (code connects reasoning, action, environment modeling), mechanisms (planning, memory, tool use, feedback), and multi-agent scaling. Applications: coding assistants, GUI/OS automation, embodied agents, scientific discovery, DevOps.
Comparative study of tabular foundation models (TFMs) vs classical models on credit default prediction. On Home Credit and Lending Club datasets, context construction strategy (balanced vs uniform sampling) explains more AUC-ROC variance than model choice: +3-4 AUC points. With 5K-10K balanced examples, TFMs match classical GBDTs while improving default-class recall.
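The difference between uniform and balanced context construction can be sketched as follows. This is only an illustration under stated assumptions, not the study's code: the function names, the 50/50 class split, and the toy 8% default rate are all hypothetical.

```python
import random

def uniform_context(labels, k, rng):
    """Pick k row indices uniformly at random, preserving the base default rate on average."""
    return rng.sample(range(len(labels)), k)

def balanced_context(labels, k, rng):
    """Pick k row indices with (up to) half from each class of a binary label.

    Assumption: label 1 = default, label 0 = non-default.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = min(k // 2, len(pos))
    n_neg = min(k - n_pos, len(neg))
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)

# Toy label vector (hypothetical): 1,000 loans with an 8% default rate.
labels = [1] * 80 + [0] * 920
rng = random.Random(0)

uniform_idx = uniform_context(labels, 100, rng)
balanced_idx = balanced_context(labels, 100, rng)

print("defaults in uniform context: ", sum(labels[i] for i in uniform_idx))
print("defaults in balanced context:", sum(labels[i] for i in balanced_idx))
```

With a rare positive class, the uniform context contains only a handful of defaults, while the balanced context gives the model as many default examples as non-default ones.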
arXiv study demonstrating that analytic choices (model selection, sampling parameters, prompt format, demographic data) materially affect the fidelity of "silicon samples" (synthetic datasets generated by LLMs). Across 252 configurations tested, correlations with human data range from r=.23 to r=.84, revealing a major risk of analytic flexibility.
SAME is an autoencoder for stereo music and general audio achieving 4096× temporal compression while maintaining reconstruction quality. The architecture combines a transformer backbone, semantic regularization, phase-aware reconstruction losses, and improved discriminators. Two variants (SAME-L and SAME-S) are released with open weights.
Researchers demonstrate typographic attacks against household manipulation robots that use CLIP. By placing adversarial stickers, they achieve a 67.8% attack success rate on the HomeRobot benchmark in Habitat simulation, causing the robot to grasp and transport the wrong objects.
VGGT-CD is a training-free pipeline for 3D change detection from multi-view images. It decouples cross-temporal registration from dynamic-change interference via joint keyframe inference and dense reconstruction purification. On the World Across Time benchmark, it reduces Absolute Trajectory Error by 44% outdoors and 59% indoors while running 6× faster.
RoboMME is a standardized benchmark for evaluating memory in vision-language-action (VLA) models for long-horizon robotic manipulation. 16 tasks test temporal, spatial, object, and procedural memory. 14 memory-augmented VLA variants built on π0.5 show effectiveness is highly task-dependent.
RAT (Randomized Advantage Transformation) estimates Tikhonov-regularized natural policy gradients via direct backpropagation without explicit Fisher matrix construction. The method applies the Woodbury formula and randomized block Kaczmarz iterations on on-policy mini-batches. Results match or exceed established natural-gradient methods on continuous and visual control benchmarks.
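For readers unfamiliar with how the Woodbury formula avoids building a Fisher matrix, the identity can be written out as below. This is a generic sketch, not RAT's derivation: it assumes an empirical Fisher of the form F = (1/n) JᵀJ, where J (n × d) stacks per-sample score vectors, g is the policy gradient, and λ > 0 is the Tikhonov parameter. All of these symbols are assumptions introduced here.

```latex
% Assumption: F = \tfrac{1}{n} J^\top J with J \in \mathbb{R}^{n \times d}, \lambda > 0.
\[
\left(\lambda I_d + \tfrac{1}{n} J^\top J\right)^{-1} g
  \;=\; \frac{1}{\lambda}\left[\, g - J^\top \left(n\lambda I_n + J J^\top\right)^{-1} J g \,\right]
\]
```

Under these assumptions, the d × d inverse on the left is replaced by an n × n linear system in the mini-batch dimension, and a system of that kind is the sort of problem an iterative solver such as block Kaczmarz can handle.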
CrossView Suite introduces CrossViewSet (1.6M multi-view samples), CrossViewBench (evaluation benchmark), and CrossViewer (three-stage framework: Perception → Alignment → Reasoning) to enhance cross-view spatial reasoning in MLLMs. A multi-agent data engine generates annotated data covering 17 fine-grained task types.
Key-Gram is a conditional-memory framework separating linguistic knowledge from visual reasoning for embodied control. It decomposes instructions into task-specific key-grams, retrieves linguistic priors via O(1) hashed lookup, and injects them into hidden layers. Achieves 29.5% gains on RoboTwin2.0, 35.8% on LIBERO-Plus, 15.4% on real-world tasks.
STT-Arena is a benchmark of 227 interactive tasks measuring LLMs' ability to detect and adapt to spatio-temporal changes. Claude-4.6-Opus achieves under 40% accuracy. Authors identify three recurring failure modes and propose STT-Agent-4B combining iterative trajectory refinement with online RL.
MoleCode is an LLM-native molecular language representing molecules as explicit graphs with typed entities and direct relations, replacing implicit SMILES strings. Training-free, it improves frontier LLMs on molecular reasoning, editing and generation tasks, especially for unfamiliar molecules, topology-sensitive operations and larger structures.
Theoretical paper on generative model evaluation. Authors show standard criteria (marginal matching) don't guarantee covariance structure preservation. They introduce D_Sigma = ||Sigma_P - Sigma_Q||_F to measure dependence fidelity, with formal proofs and validation on Fashion-MNIST VAE, RNA-seq (TCGA-BRCA, n=1111), and Alzheimer's data (n=113).
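The metric D_Sigma = ||Sigma_P - Sigma_Q||_F can be computed directly from two samples. The sketch below uses only the standard library. The use of the unbiased sample covariance, the function names, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math

def covariance(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def d_sigma(p_rows, q_rows):
    """Frobenius norm of the difference between the two covariance matrices."""
    sp, sq = covariance(p_rows), covariance(q_rows)
    return math.sqrt(sum((a - b) ** 2
                         for row_p, row_q in zip(sp, sq)
                         for a, b in zip(row_p, row_q)))

# Toy data (hypothetical): both samples have identical per-column values,
# so every marginal matches, but the pairing (dependence) is reversed.
p = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
q = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]

print(round(d_sigma(p, q), 3))
```

In this toy case a marginal-matching criterion sees no difference between the two samples, while D_Sigma is clearly non-zero because the sign of the covariance flips.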
Six modern tabular foundation models form a highly redundant ensemble (mean Q-statistic 0.961). On 153 OpenML classification tasks, the best ensemble (two-level cascade stacking) gains +0.18% accuracy at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best single model in the same equivalence group. Greedy selection is recommended as practical default.
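A mean Q-statistic close to 1 means the models tend to be right and wrong on the same examples. The sketch below assumes the study uses the standard pairwise Yule's Q diversity measure, which is an assumption. The function names and the toy correctness vectors are hypothetical.

```python
from itertools import combinations

def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-example correctness flags (1 or 0)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1      # both correct
        elif not a and not b:
            n00 += 1      # both wrong
        elif a:
            n10 += 1      # only the first correct
        else:
            n01 += 1      # only the second correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q is undefined when n11*n00 + n01*n10 == 0")
    return (n11 * n00 - n01 * n10) / denom

def mean_pairwise_q(correct_by_model):
    """Average Q over all unordered pairs of models."""
    pairs = list(combinations(correct_by_model, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)

# Toy correctness vectors (hypothetical) for three models on six examples.
m1 = [1, 1, 1, 0, 0, 1]
m2 = [1, 1, 1, 0, 0, 0]
m3 = [1, 1, 0, 0, 1, 1]

print(mean_pairwise_q([m1, m2, m3]))
```

Q ranges from -1 to 1; values near 1 indicate that the models make largely the same errors, which leaves little for an ensemble to correct.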
Researchers train KinGPT (25M parameters) on chess data and demonstrate that the high benchmark scores of chess-trained LLMs stem primarily from pattern-matching rather than genuine rule understanding. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% best-move accuracy. Training code, datasets, and model checkpoints are open-sourced.
CheckSupport is an open-source system using locally-deployed LLMs to automate reporting checklist recommendation and completion for scientific manuscripts. Evaluated on peer-reviewed manuscripts, it achieves 90% accuracy for checklist recommendations and 88% for item-level completion, processing each manuscript in 12.5 seconds on CPU-only hardware.
Critical audit of SAEBench, the de-facto standard evaluation suite for sparse autoencoders (SAEs). TPP and SCR metrics fail multiple reliability tests and should not be used. Other metrics show higher reseed noise and lower discriminability than assumed. Only sae-probes demonstrates acceptable reliability, but struggles to distinguish architecture variants.
TABOM, a post-training method for Diffusion Language Models, aligns optimization with the multi-step easy-to-hard decoding trajectory observed at inference. Via Boltzmann modeling of unmasking preferences, it derives a tractable pairwise ranking objective that reduces training-inference discrepancy and improves performance on new domains.
Focused Forcing optimizes KV caches in autoregressive video diffusion generation by selecting relevant historical frames per-frame and per-head. The method combines attention scores with diversity scores, achieving 1.48× end-to-end acceleration without training while improving visual quality and text alignment.
Neuro-symbolic study comparing LLMs constrained by deterministic logic scaffolds (Rulemapping) versus unconstrained prompting for hate speech moderation under German Criminal Code (§130). Rulemapping achieves precision 0.80-0.86 and recall 0.82-0.89 versus 0.34-0.49 with unconstrained prompting, eliminating conflation of moral offense with legal illegality.
NavOne reformulates vision-language navigation (VLN) as one-step global path planning on top-down maps. The framework directly predicts dense path probabilities via a single end-to-end forward pass, achieving an 8× speedup over map-based methods and 80× over egocentric approaches. It also introduces the R2R-TopDown dataset.
StructLens analyzes the internal organization of representations in language models using maximum spanning trees built on residual streams. The framework reveals that middle layers strongly organize nearby tokens, and that smaller local units emerge before larger units during pre-training.
SWIM aligns vision-language representations for fine-grained video object understanding from text prompts alone. Uses mask supervision during training to guide cross-modal attention. Constructs NL-Refer dataset with precise natural language referring expressions. Outperforms visual-prompt-based methods on fine-grained benchmarks.
Context Codec introduces a formal framework for compressing LLM context while preserving semantic commitments (goals, constraints, decisions, evidence). It defines metrics (Critical Atom Recall, Commitment Density) and CCL, an ASCII-first compact rendering language, to make context compression verifiable and auditable.
MARS is a multimodal system for the CASTLE 2026 challenge that reasons over 4 days of activity, 15 synchronized perspectives, transcripts, and auxiliary modalities (photos, videos, gaze, thermal imagery, heart rate). The approach uses DeepSeek for video summaries and a GPT-5.4 agent to select evidence sources. The system achieved second place on the final leaderboard.
Benchmark of LLMs on multi-label legal precedent treatment classification. Expert-annotated dataset of 239 real-world citations. Gemini 2.5 Flash achieves 79.1% on high-level classification, GPT-5-mini 67.7% on fine-grained schema. Novel Average Severity Error metric to measure practical impact of misclassifications.
TinySAM 2 compresses SAM 2 for efficient video segmentation. It combines a memory quality management mechanism with joint spatial-temporal token compression, achieving 90% of SAM 2.1 performance with 7% of the memory tokens and 3% of the training data. It also reduces parameters, computational load, and deployment costs.
EmoMind decodes affective captions directly from brain fMRI signals. The system first retrieves a neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector extracted from the same fMRI recording. Evaluated on two independent emotion fMRI datasets, EmoMind outperforms GPT-4 with discrete emotion labels across all validation axes.
Asynchronous RAG framework predicting when and what to retrieve using three components (retrieval predictor, context monitor, query generator). Achieves 43.5% end-to-end latency reduction and 62.4% time-to-first-token improvement while maintaining answer quality.
Guided Topology Diffusion (GTD) uses graph diffusion models to dynamically generate optimal communication topologies for multi-agent LLM systems. The iterative framework, guided by a proxy model predicting multi-objective rewards (accuracy, utility, cost), adapts topologies to tasks without gradient-based optimization, outperforming static approaches.