Sandbox persistence is now GA
Vercel Sandboxes now enable filesystem persistence by default at GA. Snapshots are automatic, and sandboxes resume from the latest saved state. New additions: the fork(), getOrCreate(), and delete() methods, custom tags, and lifecycle hooks.
Nathan Witkin (NYU Stern) harshly critiques METR's AI time horizons graph. The errors he cites include: human baselines that were estimated rather than measured, hourly-paid benchmarkers incentivized to work slowly, a sample biased toward the authors' peers, and failure to account for the familiarity advantage (5-18x faster). Witkin concludes the graph contains too many compounding errors to be salvaged.
Controlled study on TypeScript codebase (25 sections, 3,250 files): LLM (Kimi K2.6) equipped with structural graph (Blueprint: Universal Ctags + ast-grep + BM25) consumed 54% more input tokens (63,541 vs 41,327) but explored deeper (6 turns vs 5). Graph costs ~6,500 tokens and increases model's navigational confidence.
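BM25 is the lexical ranking component named in the Blueprint stack above. The sketch below is a generic Okapi BM25 scorer, not Blueprint's implementation; the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy symbol tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    docs is a list of token lists; returns one float per document.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical symbol tokens standing in for indexed files.
docs = [
    ["parse", "config", "file"],
    ["render", "button"],
    ["config", "loader", "config"],
]
scores = bm25_scores(["config"], docs)
```

A document that repeats the query term ranks above one that mentions it once, and a document without the term scores zero.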
CUDA implementation of Fast Walsh-Hadamard Transform (FWHT) for llama.cpp optimizing KV-cache quantization. 1-2% speedup on prefill, 7-9% on token generation with RTX 5090 and q8_0 quantization.
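For readers unfamiliar with the transform itself, here is a minimal reference version of the Fast Walsh-Hadamard Transform in plain Python. It is only an illustration of the butterfly structure, not the CUDA kernel from the llama.cpp work; the function name `fwht` and the unnormalized convention are assumptions.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a sequence.

    The length must be a power of two. Runs in O(n log n) additions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Each pass combines pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


x = [1, 0, 1, 0, 0, 1, 1, 0]
print(fwht(x))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the unnormalized transform twice returns the input scaled by its length, which gives a cheap correctness check for any faster implementation.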
Call for papers for the 2nd Workshop on Efficient Reasoning at COLM 2026 (October 9). Deadline: July 12, 2026. Topics: multimodal reasoning under efficiency constraints, dataset curation, algorithmic innovations, fast inference (pruning, compression, KV-cache), benchmarks, on-device deployment, safety, real-time applications (healthcare, robotics).
Endeavour Energy, a major Australian electricity distributor, migrated its outage map to Next.js on Vercel. Results: sub-1s page loads during peak storm traffic, 5-minute data sync cycles, 38% faster deployments. Supabase handles the real-time data layer.
Small specialized models (Gemma 4 31B at 86.4% on tau2-bench, Qwen 27B outperforming 397B models) now dominate agentic benchmarks. Yet the industry keeps deploying expensive frontier models: frontier labs profit from per-token billing, creating misalignment between technical performance and market adoption.
NuExtract3 is a 4B vision-language model for document understanding. It combines structured extraction (text/images + JSON template → JSON output) and image-to-Markdown conversion, with multilingual support and reasoning/non-reasoning modes. Available in GGUF, NVFP4, MLX, and vLLM.
llama.cpp PR #22929 optimizes checkpoint creation to avoid full context re-processing when conversation history is edited. Use case: agentic coding with 70k tokens. Improves responsiveness by reprocessing only the changed portions. Tested for 2 weeks.
Method using aligned bipartite graphs and graph neural networks to detect hallucinations in LLMs. Trains a GNN on alignment structure between source documents and model outputs. SOTA results on 4 hallucination and QA datasets, outperforming GPT-4o.
arXiv paper proposing a methodology for designing AI benchmarks suited to knowledge work (coding, research, healthcare). Authors critique current evaluations that don't reflect real-world conditions and propose a 3-step framework: define the activity, specify the setting (tools, roles, constraints), score the final product. Analysis of 3 cases: GDPval, OfficeQA Pro, APEX-SWE.
Theoretical study showing that exact certification of threshold circuits (depth ≥2) and log-precision Transformers becomes exponentially hard with minimal overparametrization. Adding a single gate or constant architectural overhead forces certificate sizes exponential in input dimension. Empirical validation on binary addition recognition.
Paper introduces parallel context compaction for long-horizon LLM agents to address latency and unpredictability of sequential summarization. Enables fine-grained control over summary volume and targeted prompt engineering per block. Evaluated on HotpotQA and LoCoMo benchmarks across 8B-120B models (dense and MoE architectures).
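The idea of compacting context blocks concurrently, each with its own size budget, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's method: `summarize_block` is a hypothetical placeholder that truncates text where a real system would call an LLM, and the block size and per-block budgets are made-up parameters.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(messages, block_size):
    """Group consecutive messages into fixed-size blocks."""
    return [messages[i:i + block_size] for i in range(0, len(messages), block_size)]


def summarize_block(block, budget_chars):
    """Hypothetical stand-in for an LLM summarization call.

    It truncates the joined text to the block's budget so the sketch runs offline.
    """
    return " ".join(block)[:budget_chars]


def compact_parallel(messages, block_size, budgets):
    """Summarize every block concurrently, one budget per block."""
    blocks = split_into_blocks(messages, block_size)
    if len(budgets) != len(blocks):
        raise ValueError("need exactly one budget per block")
    with ThreadPoolExecutor() as pool:
        # map() preserves block order even though calls run concurrently.
        return list(pool.map(summarize_block, blocks, budgets))


history = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
summaries = compact_parallel(history, block_size=2, budgets=[5, 8])
```

Because each block is summarized independently, total latency is bounded by the slowest block rather than the sum of all blocks, and each budget can be tuned separately.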
MedExpMem is an experience memory framework enabling medical vision-language models to accumulate differential diagnosis expertise. Unlike RAG, it memorizes discriminative experiences from past diagnostic failures as pairwise differential notes. Evaluated across 11 radiology subspecialties, it improves accuracy by up to 7.0% across diverse models.
Study of verbatim memorization during Fill-in-the-Middle (FIM) pretraining on Llama 3.2. FIM recovers more short or partial spans compared to standard LTR, with extraction growing linearly with repetitions. Suffix context is insufficient: memorization remains anchored in prefix context.
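To make the prefix/suffix distinction concrete, the sketch below builds a Fill-in-the-Middle training string in prefix-suffix-middle order. It is an assumption-laden illustration: the sentinel strings, the function name `make_fim_example`, and the character-level random split are hypothetical and are not taken from the study or from Llama 3.2's tokenizer.

```python
import random

# Hypothetical sentinel markers; real tokenizers define their own special tokens.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def make_fim_example(text, rng):
    """Split text at two random cut points and emit a PSM-ordered FIM string.

    Returns the training string and the (prefix, middle, suffix) parts.
    """
    lo, hi = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # The model sees prefix and suffix first, then learns to generate the middle.
    example = f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    return example, (prefix, middle, suffix)


rng = random.Random(0)
example, (prefix, middle, suffix) = make_fim_example("def add(a, b): return a + b", rng)
```

In this layout the middle span is predicted with both prefix and suffix in context, which is the setting in which the study compares extraction against standard left-to-right training.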
Researchers use Transcoders to interpret how vision-language models transform images into text. Applied to Gemma 3-4B-IT, the framework decomposes the model into computational pathways linking image patches to token generation. Transcoder attributions outperform SAEs in identifying hallucinations (AUC 0.68).
Two-stage federated recommendation pipeline for mobile devices: collaborative filtering on non-sensitive app-context data in cloud, then on-device re-ranking with sensitive mobile signals. Validated on MovieLens, UCI HAR, and proprietary dataset. Production-ready Kotlin Multiplatform library for Android/iOS.
Theoretical study showing that network depth induces implicit low-rank bias, promoting alternatives to neural collapse. Analysis of unregularized deep UFM (unconstrained feature model) training dynamics reveals how depth favors softmax codes over classical structured geometries.
Datasette 1.0a30 introduces a customizable "Jump to..." menu accessible via the `/` key. The new `jump_items_sql()` plugin hook allows plugins to add their own items to the searchable menu.
Datasette-agent 0.1a4 integrates an agent chat interface into the Jump menu (/ key) using the new makeJumpSections() JavaScript hook from Datasette 1.0a30. Enables natural language database queries directly from the UI.
DeepSeek makes its 75% discount on V4-Pro permanent: $0.435 per million input tokens, 11.5× cheaper than GPT-4.5 on input, 34× cheaper on output. This aggressive pricing could squeeze Western providers, especially for agentic systems.
Meituan releases LongCat-Video-Avatar 1.5, an open-source framework for audio-driven human avatar video generation. Upgrades audio encoder from Wav2Vec2 to Whisper-Large, supports Audio-Text-to-Video and Video Continuation with 8-step inference. Human evaluation on 508 image-audio pairs across 6 scenarios and 2 languages.
Nvidia and Hugging Face introduce Nemotron-Labs, diffusion-based language models to accelerate text generation. The approach parallelizes token generation, reducing latency compared to traditional autoregressive methods.
llama.cpp adds Programmatic Dependent Launch (PDL) support for Nvidia Blackwell GPUs (CC >= 90). PDL improves kernel execution: +5-6% token generation speedup on Qwen 35B and Gemma 26B, no prefill gains. Enable with '-D GGML_CUDA_PDL=ON' at build time.
OpenAI generated $5.7 billion in Q1 2026 revenue but lost $1.22 per dollar earned, with an adjusted operating margin of -122%.
SupraLabs releases Supra-50M, a 50M-parameter model trained on 20B tokens of high-quality educational text. Llama-style architecture with 32k vocab. Outperforms GPT-2 (124M) and SmolLM-135M on multiple benchmarks (BLiMP 76.3%, SciQ 77.2%, ARC-Easy 52.2%). Roadmap includes Supra-124M and Supra-350M.
Trump cancels an AI safety executive order after last-minute calls from Musk, Zuckerberg, and Sacks. The order would have established a voluntary review system for frontier models with a 90-day pre-release window.
DeepSeek raises $10.29 billion. Founder Liang Wenfeng commits to continuing open-source AI model development over short-term commercialization. Company targets AGI.
GitHub releases a multi-platform SDK for integrating Copilot Agent into third-party apps and services. Enables developers to access Copilot's AI capabilities through a standardized API.
Microsoft releases a governance toolkit for autonomous AI agents. Includes policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering. Covers all 10 risks in the OWASP Agentic Top 10.
TimesFM is a pretrained foundation model developed by Google Research for time-series forecasting. The GitHub repository provides an open-source implementation of this specialized model.
FTC requires Cox Media Group and two other firms to pay nearly $1 million to settle charges they deceived customers about an "Active Listening" AI marketing service. The service claimed to listen to conversations via smart devices for ad targeting, but actually used no voice data at all.
Theoretical study of why critic-free RL methods (PPO, GRPO) improve LLMs. Authors show actor updates are value-gradient-like in expectation, and autodifferentiation through attention produces empirical costates approximating the value signal. Decomposition of RL impact into value-gradient signal and reachable reward headroom.
Mahjax is a fully vectorized Riichi Mahjong simulator in JAX for reinforcement learning on GPU. Achieves 2 million steps/sec on 8 NVIDIA A100s with no-red rules and 1 million steps/sec with red rules. Demonstrates training agents from scratch without supervised pre-training.
FlyRoute is a self-evolving agent profiling framework that improves enterprise query routing. Via a data flywheel mechanism, it collects capability evidence from real traffic, distills learned descriptions, and injects them into an LLM router with BM25-retrieved successes. On a proprietary dataset, FlyRoute improves from 72.57% (zero-shot) to 89.83% accuracy after 7,211 labeled queries.
Optimization study for industrial agent pipelines (AssetOpsBench). Proposes temporal semantic cache and MCP optimizations (tool discovery, parallel execution): 1.67x speedup, 40% latency reduction, 30.6x on cache hits. Identifies failure modes of pure semantic caching for parameter-rich queries.
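Speedup factors and latency reductions describe the same change in two units. The small helper below, a hypothetical function written for this note, converts one into the other; it assumes the 1.67x and 40% figures refer to the same measurement, which the summary does not state explicitly.

```python
def latency_reduction(speedup):
    """Convert a speedup factor into the fraction of latency removed."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


overall = latency_reduction(1.67)    # about 0.40, i.e. a 40% reduction
cache_hit = latency_reduction(30.6)  # about 0.967 on cache hits
```

Under that assumption the two headline figures agree: a 1.67x speedup removes roughly 40% of latency, while a 30.6x speedup on cache hits removes roughly 97%.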
FlowLM converts pre-trained diffusion language models into flow matching models via efficient fine-tuning. By realigning curved diffusion trajectories into straight-line flows, FlowLM achieves high-quality few-step text generation rivaling 2,000-step diffusion sampling. Performance saturation reached with half the training epochs compared to training from scratch.
Psy-Chronicle is a data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. Authors create CPCD, a Chinese dataset of 90,000 dialogues across 100 student profiles spanning a semester, with a benchmark evaluating long-horizon memory and causal reasoning. Code and data open-sourced.
Faithful-MR1 is a training framework for MLLMs improving multimodal reasoning via reinforcement learning. It anchors visual attention directly on image regions (not via textual descriptions) and reinforces faithful use through counterfactual image intervention. Results on Qwen2.5-VL-Instruct 3B/7B with substantially less training data.
Comparative study of 5 classifiers (logistic regression, random forest, XGBoost, SVM, naive Bayes) for chronic kidney disease risk prediction. All achieve AUROC 1.00 internally (UCI, 400 patients) but collapse on external MIMIC-IV data (AUROC 0.48-0.58). Calibration and conformal coverage severely degraded. No model meets clinical deployment criteria.
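AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 1.00 means perfect ranking and roughly 0.5 means chance. The sketch below computes it directly from that definition; the function name `auroc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def auroc(labels, scores):
    """Compute AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: perfectly separated scores, then uninformative constant scores.
print(auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
print(auroc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

Read this way, external scores of 0.48-0.58 mean the models rank patients little better than a coin flip outside the dataset they were trained on.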