Function invocations now billed per unit
Vercel shifts to per-unit billing for function invocations. Pro customers are billed $0.0000006 per invocation, arithmetically the same rate as the previous $0.60 per million. Change effective next billing cycle.
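The two prices quoted in the Vercel item can be checked against each other with a few lines of arithmetic. The sketch below is only an illustration: the function name and the sample volume are hypothetical, and `Decimal` is used so that the very small per-unit rate is not distorted by floating-point rounding.

```python
from decimal import Decimal

# Rates as quoted in the item above.
PER_INVOCATION = Decimal("0.0000006")  # USD per invocation
PER_MILLION = Decimal("0.60")          # USD per million invocations

def invocation_cost(invocations: int) -> Decimal:
    """Cost in USD when each invocation is billed individually."""
    return PER_INVOCATION * invocations

# One million invocations at the per-unit rate cost exactly the per-million price.
assert invocation_cost(1_000_000) == PER_MILLION

# Hypothetical monthly volume, for illustration only.
print(invocation_cost(2_500_000))
```
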
Anthropic raises $65 billion in Series H funding, reaching a $965 billion valuation. One of the largest funding rounds in the AI sector.
Liquid AI releases LFM2.5-8B-A1B, 8B model with 128K context window, 38T pre-training tokens, and large-scale RL. Doubled vocabulary for non-Latin languages. Supports tool chaining and complex tasks on entry-level laptops.
Empirical study on LLM-generated reviews for scientific papers (ACL Rolling Review 2025 data). Findings: limited alignment between LLM and human reviews, substantial variation across prompts and models. Authors can 'game' LLM reviews through iterative revision workflows, increasing scores for up to 35% of tested papers.
OmniRetrieval is a framework unifying retrieval across heterogeneous knowledge sources (unstructured text, relational tables, knowledge graphs). It translates natural-language queries into source-native queries, evaluated on 13 datasets and 309 knowledge bases.
Longitudinal analysis of ~12,000 Microsoft Bing Copilot users reveals individual behavior patterns remain sticky over time despite population-level trends. Active users achieve higher success rates and tackle complex, professional tasks. WildChat-4.8M dataset skewed toward proficient power users.
Masked diffusion models (MDMs) with confidence-based decoding fail on complex reasoning tasks. Confidence-aligned training amplifies errors by an order of magnitude on multi-digit addition. Random masking better preserves the logical trajectories required for reasoning.
Large-scale literature search study: Deep Research pipeline increases recall from below 20% to above 80% on RollingEval-Jun25 (250-paper benchmark). Critical analysis of human reference lists as ground truth: only 51% judged moderately relevant vs 86-88% for best AI re-rankers. Humans cite direct collaborators 2.5x more often.
Empirical study of behavioral reproducibility in LLM agents with tool-calling capabilities. Researchers measure whether agents select the same tools, in the same order, with identical parameters, across repeated identical invocations. Focus on structured tool-calling interfaces with typed parameters and consequential side effects.
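One way to make the three questions in the reproducibility item concrete (same tools, same order, identical parameters) is to score repeated runs against the most common trace at each level. This is a minimal sketch, not the study's metric: the trace format (a list of dicts with `tool` and `args` keys) and the modal-agreement score are assumptions made for illustration.

```python
import json
from collections import Counter
from typing import Hashable

def agreement(keys: list[Hashable]) -> float:
    """Fraction of runs that match the most common value."""
    counts = Counter(keys)
    return counts.most_common(1)[0][1] / len(keys)

def reproducibility(traces: list[list[dict]]) -> dict[str, float]:
    """Score repeated runs at three increasingly strict levels."""
    # Level 1: which tools were used, ignoring order and arguments.
    tool_sets = [frozenset(call["tool"] for call in t) for t in traces]
    # Level 2: the sequence of tool names, ignoring arguments.
    tool_orders = [tuple(call["tool"] for call in t) for t in traces]
    # Level 3: tool names plus arguments, serialized with sorted keys
    # so that argument order inside a call does not matter.
    full_calls = [
        tuple((call["tool"], json.dumps(call["args"], sort_keys=True)) for call in t)
        for t in traces
    ]
    return {
        "same_tools": agreement(tool_sets),
        "same_order": agreement(tool_orders),
        "same_params": agreement(full_calls),
    }

# Three hypothetical runs of the same prompt.
runs = [
    [{"tool": "search", "args": {"q": "a"}}, {"tool": "write", "args": {"path": "x"}}],
    [{"tool": "search", "args": {"q": "a"}}, {"tool": "write", "args": {"path": "y"}}],
    [{"tool": "write", "args": {"path": "x"}}, {"tool": "search", "args": {"q": "a"}}],
]
print(reproducibility(runs))
```

In this example all runs use the same tools, two of three agree on order, and no two runs agree on every parameter, so the three scores fall as the criterion tightens.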
Hybrid ML-expert framework for evaluating organic synthesis routes. DeepSets model trained on tree edit distance, fine-tuned with chemist annotations. Produces quantitative scores and explainable categories (Good/Plausible/Bad). Spearman correlation 0.78, top-1 accuracy 60.2% vs 17.5% baseline.
Aryabhata 2 is a STEM reasoning language model trained via reinforcement learning on GPT-OSS-20B. Developed by PhysicsWallah, it outperforms its base model on JEE/NEET competitive exams while reducing output tokens by up to 64%. Evaluated on AIME, HMMT, MMLU-Pro, and GPQA.
S3MEM introduces a structured scene-event episodic memory framework for long-horizon interactive agents. The system structures trajectories into organized memory units and uses anchor-sensitive retrieval to improve spatiotemporal question answering. Evaluated on Crafter, Jericho, SciWorld, and ALFWorld, S3MEM outperforms Vanilla RAG and Graph-NoReader in accuracy while using fewer evidence tokens.
Study of lossy semantic text compression in which an encoder strategically deletes parts of the text and an LLM reconstructs the original content. Benchmarks 6 deletion strategies (including uniform, frequency, entropy, LP-optimized, and hybrid) on BBC News. WordFreq provides the best cost/performance ratio; semantic methods excel at moderate compression; QLoRA fine-tuning competes with Gemini 2.0 Flash.
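A frequency-based deletion encoder of the kind the compression item mentions could look roughly like the sketch below. It is not the paper's WordFreq method: it assumes that the most frequent words are deleted first, and it estimates frequency from the input text itself rather than from a reference corpus. The function name and `ratio` parameter are hypothetical, and the LLM reconstruction step is omitted.

```python
from collections import Counter

def delete_frequent_words(text: str, ratio: float) -> str:
    """Drop roughly `ratio` of the words, most frequent first (assumed strategy)."""
    words = text.split()
    counts = Counter(w.lower() for w in words)
    n_delete = int(len(words) * ratio)
    # Rank word positions by descending frequency; sorted() is stable,
    # so ties keep their original left-to-right order.
    ranked = sorted(range(len(words)), key=lambda i: -counts[words[i].lower()])
    drop = set(ranked[:n_delete])
    # Keep the surviving words in their original order.
    return " ".join(w for i, w in enumerate(words) if i not in drop)

print(delete_frequent_words("the cat sat on the mat near the door", 0.34))
```
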
Empirical analysis of 11 DeFi agents on Solana: treasuries retain $30M in paper gains while token holders collectively lost $191.7M. Top 1% of wallets capture 81.4% of gains. Token valuations disconnected from fundamentals (market-cap-to-AUM ratios >10,000x). Median returns negative across all platforms.
Knowledge editing methods ROME and MEMIT modify transformer MLP weights. Authors identify a common subset of weights targeted across diverse edits using a binary mask that reverses 80% of edits on training set and 70% on test set. The mechanism suppresses rather than overwrites knowledge, explaining why changes fail to propagate to related facts.
Datasette 1.0a31 adds two major features: execution of write queries (INSERT/UPDATE/DELETE) and saving stored queries (private or shared). Permissions control access to sensitive operations like CREATE TABLE.
Anthropic reports annualized run-rate revenue of $47 billion as of May 2026, up from $9 billion end-2025. Growth accelerates: $14 billion in February, $30 billion in April. Metric disclosed during $65 billion Series H funding round.
StepFun releases Step 3.7 Flash, a 196B/11B active MoE multimodal model with built-in 1.8B ViT. SWE-Bench Pro: 56.26% (beats DeepSeek V4 Flash 55.6%), DeepSearchQA F1: 92.82%. Runs locally on 128GB RAM.
Anthropic releases Claude Opus 4.8, described as a "modest but tangible improvement" over 4.7. The model excels in honesty: 4x less likely to let code flaws pass unremarked, and abstains more on uncertain questions. Pricing unchanged: $5/M input tokens, $25/M output.
Anthropic raises $65 billion in Series H funding at $965 billion post-money valuation. This major funding round reflects continued investor confidence in the company's AI development trajectory.
AgingBench, a new longitudinal deployment benchmark, shows that swapping Claude Sonnet 4.6 for Opus 4.7 in the Claude Code CLI agent drops PyTest pass rate by ~15%. Memory policy alone drives a 4.5x spread in agent half-life across scenarios, larger than any model swap tested.
Zai replaced the network architecture on a 1000-GPU cluster running GLM-5.1 from ROFT to ZCube (developed with Tsinghua and HarnetsAI). Results: switch/optical costs down 33%, GPU throughput up 15%, P99 first-token latency down 40.6%. ZCube removes the Spine layer for full bipartite interconnect, eliminating asymmetric traffic hotspots inherent to Prefill-Decode disaggregated inference.
MONET, an Apache 2.0 dataset of 104.9M high-quality images with captions and metadata, released on Hugging Face. Built from 2.9B images and refined. Includes paper, UMAP visualization, text/image retrieval tool, and codebase for training T2I models.
Claude Code is an agentic coding tool in the terminal that understands your codebase and executes routine tasks, explains complex code, and handles git workflows through natural language commands.
Microsoft releases RAMPART, a pytest-native safety and security testing framework for agentic AI applications. Enables evaluation of security and safety risks in multi-agent systems.
Qwen3.6-35B-A3B-APEX quantized by mudler achieves 37 t/s generation with 72K filled context on RTX 3060 12GB via 17.3GB offloading. Spiritbuun's CUDA optimizations (fused MMA, TurboQuant, fattn) + APEX I-Compact quantization yield PPL 3.25. 128K context supported, degrades to 28 t/s @129K.
Cognition raises $1B in Series D at $26B valuation. The company behind Devin, an AI coding agent, positions coding as a market with an uncapped TAM.
Claude Opus 4.8 is now available on Vercel AI Gateway. The model excels at long-horizon agentic execution and complex multi-step coding tasks. AI Gateway provides unified API access with usage tracking, performance optimizations, and transparent pricing with no markup.
Nvidia releases LocateAnything, a 3B vision-language grounding model. Uses parallel box decoding, 10x faster than Qwen3-VL. Code and demo available on HuggingFace.
Training Qwen models (1.7B, 4B, 8B) on Codenames game to improve creativity via Reinforcement Learning with Verifiable Rewards (RLVR). 8B model gains creativity (+8/10 benchmarks) with minor reasoning degradation, while smaller models prioritize precision. Study on creativity-precision trade-off across model scales.
CAROL is a probabilistic framework for test-time hallucination reduction in LLMs. It defines semantic uncertainty based on consistency between generated responses and trusted context, formulating mitigation as a Markov chain accept-reject process with convergence guarantees. Results on QA and multi-agent reasoning benchmarks show significant hallucination reduction.
C-MIG introduces a multi-view information gain-based RAG framework for clinical diagnosis reasoning. It replaces exact-match binary rewards with information gain estimation from two views (retrieved documents and document refinement) to better supervise LLM reasoning. Experiments on four medical benchmarks show improvements over RAG-RL baselines in both in-domain and out-of-domain settings.
BBC (Beta-Bernoulli Calibrator) converts point forecasts from any LLM into probability distributions using supervision from binary outcomes and aggregated human forecasts. The model captures epistemic uncertainty through variance, outperforming post-hoc calibration and specialized fine-tuning approaches.
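The idea of turning a point forecast into a distribution whose variance expresses uncertainty can be illustrated with the standard mean/concentration parameterization of a Beta distribution. This is a generic textbook sketch, not the BBC model: the `concentration` parameter and all function names are assumptions, and the paper's way of learning from binary outcomes and human forecasts is not reproduced here.

```python
def beta_from_forecast(p: float, concentration: float) -> tuple[float, float]:
    """Beta(alpha, beta) with mean p; larger concentration means less spread."""
    return p * concentration, (1.0 - p) * concentration

def beta_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)

def beta_variance(alpha: float, beta: float) -> float:
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1.0))

def observe(alpha: float, beta: float, outcome: int) -> tuple[float, float]:
    """Conjugate Beta-Bernoulli update after one binary outcome (0 or 1)."""
    return alpha + outcome, beta + (1 - outcome)

# Same point forecast, two hypothetical confidence levels.
weak = beta_from_forecast(0.7, concentration=10.0)
strong = beta_from_forecast(0.7, concentration=100.0)
print(beta_variance(*weak), beta_variance(*strong))
```

Both distributions have the same mean of 0.7, but the higher-concentration one has the smaller variance, which is the sense in which variance can carry uncertainty that a bare point forecast cannot.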
FLUID efficiently adapts autoregressive (AR) language models to diffusion-based generation through strictly causal alignment and elastic horizons. The framework reduces training costs by orders of magnitude by reusing existing GPT checkpoints while maintaining state-of-the-art performance.
NVIDIA's GB10 edge AI hardware (ASUS Ascent GX10) lacks CPU energy counters and monitoring interfaces (IPMI, SCMI). Only instantaneous GPU power is exposed via NVML. Agentic workloads consume 4.33x more energy than linear baselines. Per-process energy attribution remains impossible on this platform unlike x86/RAPL.
Theoretical study of fundamental limits in card payment fraud detection. Authors formalize payment authorization as a sequential decision problem with delayed, censored, and corrupted feedback. They derive a minimax regret lower bound showing that improving data quality has greater impact than increasing model complexity.
Researchers show Local SGD exposes anisotropic loss geometry through worker disagreement. Worker-average gaps provide a Hessian-free estimator of dominant spectral directions. Validated on MLPs, CNNs, and Transformers.
Paper proposing SMARt, a formal framework for managing autonomy in agentic AI systems. Introduces managed autonomy theory based on epistemic drift detection, reasoning suspension, and escalation to human control. Uses timed Petri nets to guarantee safety and governance properties.
PAST2HARM is an adaptive jailbreak attack exploiting past tense reformulation to bypass safeguards in multimodal text-to-image models. Tested on Gemini Nano, GPT Image 2, and SD XL, it achieves 83%, 67%, and 100% success rates. The attack generates explicit sexual content, political disinformation, and hate speech.
UserHarness proposes a framework to improve agent Theory-of-Mind by explicitly reconstructing user mental state. The system decomposes user observations, beliefs, intentions, and actions. Across five benchmarks, UserHarness achieves 95.94% macro accuracy, a relative improvement of over 15% over existing methods.