GPT-4
OpenAI releases GPT-4, a multimodal model accepting image and text inputs. Achieves human-level performance on professional and academic benchmarks, though less capable than humans in many real-world scenarios.
OpenAI introduces ChatGPT, a model trained to interact conversationally. The dialogue format enables ChatGPT to answer follow-up questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests.
Hugging Face releases MTEB, a massive benchmark for evaluating text embedding models. Covers 8 tasks (retrieval, clustering, classification, etc.) across 58 datasets and 112 languages. Enables systematic comparison of embedding model performance.
OpenAI releases Whisper, a speech recognition model trained on 680,000 hours of multilingual data. The system handles multiple languages, accents, and background noise with robustness exceeding existing models.
Hugging Face introduces BLOOM, the world's largest open multilingual language model. Trained on 46 languages, BLOOM matches proprietary state-of-the-art models in performance while ensuring open accessibility.
OpenAI releases Triton 1.0, an open-source Python-like GPU programming language. It enables researchers without CUDA experience to write efficient GPU code, matching expert-level performance in most cases.
OpenAI introduces DALL·E, a neural network that generates images from text captions in natural language, covering a wide range of expressible concepts.
OpenAI introduces CLIP, a neural network that efficiently learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by simply providing category names, without task-specific training.
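The zero-shot step described above can be sketched as follows: embed each category name as text, embed the image, and pick the category whose text embedding is most similar to the image embedding. This is a minimal illustration, not CLIP itself: the three-dimensional vectors are made-up placeholders standing in for real encoder outputs, and `cosine` and `zero_shot_classify` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose text embedding best matches the image embedding."""
    scores = {label: cosine(image_embedding, emb)
              for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Placeholder embeddings (assumption): a real system would obtain these
# from a text encoder and an image encoder trained into a shared space.
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

Adding a new category only requires adding another text prompt to the dictionary, which is why no task-specific training is needed.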
OpenAI publishes foundational research on few-shot learning capabilities in language models. LLMs can perform tasks with minimal examples without fine-tuning, revealing emergent rapid adaptation capacity.
OpenAI publishes research on scaling laws for neural language models, establishing predictable relationships between model size, training data, and performance. Results enable optimization of compute resource allocation.
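The relationships in that work take the form of power laws, which become straight lines in log-log space. The sketch below fits loss = a * N^(-alpha) by least squares on the logarithms and extrapolates to a larger model. The data points are synthetic, generated from illustrative constants chosen here, and are not values from the paper; `fit_power_law` is a hypothetical helper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by linear regression on log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic measurements (assumption): illustrative constants only.
true_a, true_alpha = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** (-true_alpha) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted_loss = a * 1e10 ** (-alpha)  # extrapolate one order of magnitude
print(f"alpha={alpha:.3f}, predicted loss at 1e10 params={predicted_loss:.3f}")
```

A fit of this kind is what makes it possible to estimate, before training, how much loss improvement a larger model is likely to buy.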
OpenAI introduces the Sparse Transformer, a deep neural network setting new records in sequence prediction (text, images, sound). Its improved attention mechanism processes sequences 30x longer than previously possible.
OpenAI trained a large-scale unsupervised language model generating coherent paragraphs, achieving state-of-the-art performance on multiple language modeling benchmarks, and performing reading comprehension, machine translation, question answering, and summarization without task-specific training.
OpenAI created a bot that defeats world-class Dota 2 professionals in 1v1 matches under standard tournament rules. The bot learned through self-play without imitation learning or tree search, advancing toward AI systems achieving well-defined goals in complex real-world environments.
A systematic audit of the FOLIO and MALLS benchmarks finds errors in 39% and 36% of their FOL formalizations, respectively. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, reaching 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of +9 to +22 percentage points.
Comprehensive benchmark of 8 tiny LLMs (135M–1B) on Jetson Orin Nano Super 8GB with llama.cpp CUDA across 4 power modes (7W–MAXN). 25W mode optimal: SmolLM2-135M achieves 165 tok/s and 22.6 tok/J; LFM2.5-1.2B best in ~1B class (54.1 tok/s). 384 benchmark cells, raw datasets published.
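The two SmolLM2-135M figures can be related by simple arithmetic, assuming tok/J is defined as throughput divided by average power draw (the summary does not state the definition). Under that assumption, the implied draw is tok/s divided by tok/J. `implied_power_watts` is a hypothetical helper name.

```python
def implied_power_watts(tokens_per_second, tokens_per_joule):
    """Average power implied by throughput and energy efficiency.

    (tokens / second) / (tokens / joule) = joules / second = watts.
    """
    return tokens_per_second / tokens_per_joule

# Figures quoted above for SmolLM2-135M in the 25W mode.
smollm2_power = implied_power_watts(165.0, 22.6)
print(f"{smollm2_power:.2f} W")  # about 7.30 W under the stated assumption
```

If the assumption holds, the measured draw sits well below the 25 W cap of that power mode, which would suggest the cap is not the limiting factor for a model this small.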
BitsMoE introduces spectral-energy-guided bit allocation for MoE LLM quantization. Using SVD, it keeps the shared basis unquantized and applies fine-grained quantization to expert-specific factors, with bit widths assigned via integer linear programming. On Qwen3-30B at 2-bit, it improves accuracy by 27.83 percentage points and increases decoding speed 1.76× over GPTQ.
CSRP, a three-stage framework for Chinese grammatical error correction, combines continual pre-training (5.9M samples), Chain-of-Thought fine-tuning, and policy optimization with efficiency-aware rewards. Achieves 50.99 F₀.₅ on NACGEC and outperforms GPT-4 on spelling correction (59.61 F1).
LithoGRPO combines flow matching with GRPO-based reinforcement learning to optimize lithography masks in semiconductor manufacturing. The framework integrates explicit physics-based reward functions and proposes a fast shot-counting algorithm achieving 130x speedup. Reports state-of-the-art results against both optimization-based and learning-based methods.
Delayed per-step reward attribution method for training LLM agents in multi-agent strategic interaction. An 8-billion-parameter open-source model trained with this approach matched or surpassed GPT-5 and won both Open and Efficient tracks at MindGames Arena benchmark (NeurIPS 2025).
Multi-domain red teaming framework evaluating 11 LLMs across 690 clinical scenarios. Results: substantial variance (scores 0.791–0.984), safety-critical failures masked by aggregate accuracy, 10-20% error amplification on equity tasks. Hybrid evaluation (automated + human validation) essential.
llama.cpp b9455 merges a major fix for KV cache quantization in tensor mode on multi-GPU. The solution extends the meta backend to properly handle tensor flattening without losing shape information, avoiding changes to compute graphs.
mistral.rs v0.8.2 achieves up to 2.8x faster CUDA inference than llama.cpp on Gemma 4 (dense and MoE) across GB10, B200, and H100. Reproducible results are published with Q4K and eQ8_0 support, and the release includes an OpenAI-compatible server.
Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.
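A toy model gives some intuition for why averaging dilutes a watermark. The sketch assumes a green-list style scheme in which each model boosts the logits of its own pseudo-random half of the vocabulary. That scheme, the vocabulary size, the boost `DELTA`, and all helper names are assumptions made here, not details of WASH or of the six schemes tested, and z-score detection itself is not modeled.

```python
import math
import random

VOCAB = 1000
DELTA = 2.0  # logit boost for green tokens (illustrative value)

def green_list(seed):
    """Pseudo-random half of the vocabulary, keyed by a per-model seed."""
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB), VOCAB // 2))

def watermarked_probs(base_logits, green):
    """Softmax over logits after boosting the green tokens by DELTA."""
    logits = [l + DELTA if i in green else l for i, l in enumerate(base_logits)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def green_mass(probs, green):
    """Probability mass that falls on a given green list."""
    return sum(probs[i] for i in green)

base_logits = [0.0] * VOCAB                       # uniform base model
greens = [green_list(seed) for seed in range(5)]  # five independent keys
dists = [watermarked_probs(base_logits, g) for g in greens]

# Linear ensemble: average the five next-token distributions.
ensemble = [sum(d[i] for d in dists) / len(dists) for i in range(VOCAB)]

single_mass = green_mass(dists[0], greens[0])
ensemble_mass = green_mass(ensemble, greens[0])
print(f"single: {single_mass:.3f}, ensemble: {ensemble_mass:.3f}")
```

In this toy setting a detector keyed to the first model sees its green tokens strongly over-represented when that model samples alone, and much closer to the unwatermarked rate of one half once the five distributions are averaged.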
GLIDE is an open-source Python library unifying prediction-powered inference methods (PPI++, Stratified PPI, Predict-Then-Debias) for evaluating agentic systems. It combines human annotations and LLM judgments into unbiased estimates with valid confidence intervals, reducing annotation costs while maintaining precision.
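The basic prediction-powered estimate of a mean can be sketched in a few lines: average the LLM judge's scores over the large unlabeled set, then correct that average with the mean human-minus-judge gap measured on the small labeled set. This is a sketch of the plain PPI mean estimator with a normal-approximation interval, not GLIDE's API; the function name and the toy scores are hypothetical.

```python
import math
import statistics

def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled, z=1.96):
    """Prediction-powered mean estimate with an approximate 95% interval."""
    # Rectifier: how far the judge is off, measured where humans also labeled.
    gaps = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = statistics.fmean(judge_on_unlabeled) + statistics.fmean(gaps)
    # Variance adds the judge-score term and the rectifier term.
    variance = (statistics.variance(judge_on_unlabeled) / len(judge_on_unlabeled)
                + statistics.variance(gaps) / len(gaps))
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)

# Toy pass/fail scores (assumption): 5 human-labeled items, 10 judge-only items.
human_labels = [1, 0, 1, 1, 0]
judge_on_labeled = [1, 1, 1, 1, 0]
judge_on_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]

estimate, (low, high) = ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled)
print(f"estimate={estimate:.2f}, interval=({low:.2f}, {high:.2f})")
```

Here the judge alone would report a 0.8 pass rate, while the correction from the labeled items pulls the estimate down to 0.6. With samples this small the interval is wide; the approach pays off when the unlabeled set is large and the judge is reasonably accurate.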
A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.
VeriGate extends GRPO by combining verifier rewards with step-level supervision. The method uses a Process Reward Model (PRM) to assign fine-grained credit to tokens, avoiding gradient collapse when all trajectories receive identical rewards. On MATH with Qwen2.5-Instruct (1.5B/7B), VeriGate improves accuracy by ~20% and ~12% respectively.
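The collapse mentioned above follows directly from how group-relative advantages are computed: each trajectory's reward is normalized against the mean and standard deviation of its group, so a group with identical rewards yields all-zero advantages and no learning signal. The sketch shows only this failure mode, not VeriGate's step-level remedy; the use of population standard deviation and the `eps` value are assumptions, since implementations differ.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Verifier rewards for four sampled solutions to the same prompt.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])      # some right, some wrong
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])  # all wrong: identical rewards

print(mixed)
print(collapsed)
```

In the second group every advantage is zero, so the policy gradient for that prompt vanishes. Step-level scores that differ between trajectories, even when the final outcomes tie, are one way to restore a usable signal.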
Eywa is a provenance-grounded memory architecture for persistent AI agents, storing immutable source evidence before deriving facts and validating memories against typed signals. Retrieval uses a deterministic multi-route read path with zero LLM calls. Results: 90.19% judge accuracy on LoCoMo C1-C4, 88.2% on LongMemEval-S, 81.45% mean nugget score on BEAM.
Multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) on linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.
LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.
NVIDIA Parakeet speech-to-text ported to C++/ggml without Python or PyTorch. Byte-for-byte identical output to NeMo, up to 5x faster on GPU for larger models, 600x realtime on audio clips. Quantized GGUFs (f16, q8_0, q6_k, q5_k, q4_k), flat C API, integrated in LocalAI with OpenAI-compatible endpoint.
Flash Attention optimization for llama.cpp on RDNA3 GPUs: 47% VRAM reduction vs Vulkan f16. Packs four 8-bit K-values into native sudot4 instructions without lossy quantization. At 128k context with MTP draft: 21.76 GiB vs 23.18 GiB (1.42 GiB savings). Quality preserved: mean KLD 0.00455 (q4_0 V), 97.06% identical top tokens.
Optimized monokernel for LLM inference on AMD MI300X: 3,300 output tokens/s per request (batch 1, no speculative decoding). Architecture mapped to GPU physical topology. Initial support for 2B model, frontier MoE planned.
Research on probe-targeted fine-tuning (LoRA) for verbal confidence calibration in LLMs. Models internally detect correct answers (0.76–0.88 AUROC) but output 99% confidence uniformly. Fine-tuning across 8 models (7B–70B) with causal activation patching (ρ=0.976). Code and pre-registration available.
BenchTrace is a benchmark for evaluating self-evolution ability in LLM agents. Built on 1,821 annotated episodes across six tasks, it measures reflection quality and tests whether agents avoid past failures. Experiments on Qwen3-32B and GPT-4.1 show a <30% pass rate on reflection evaluation; agents forget early lessons and fail to generalize reflections.
Mathematical framework for surrogate modeling of oil reservoirs (Norne, 46×112×22 grid) using Fourier Neural Operators (FNO) and physics-informed variant (PINO). Empirical validation: R²>0.99 (oil), R²>0.90 (gas), R²≈0.80 (pressure) over 3298 days. 10⁴× speedup vs OPM simulator, 1000-member ensemble in <1 min on B200 GPU.
Comprehensive evaluation of 14 open-source safety guard models on 79,331 samples across 8 NIST AI Risk Framework categories. Qwen Guard (4B) achieves highest recall (83.97%), outperforming Llama Guard (12B) and GPT-OSS Safeguard (20B). Model size does not correlate with safety detection performance.
MechELK is a mechanistic interpretability framework for extracting latent knowledge from LLMs. Through three stages (localization via SAE, verification by causal probing, elicitation via representation engineering), it achieves 84.7% accuracy on TruthfulQA, outperforming CCS by 6.2%, and identifies 78.3% of hidden knowledge when model output is incorrect.
Systematic study comparing state space models (SSMs) for time series classification. S4D outperforms Mamba variants in accuracy and efficiency. Authors introduce MS4 and MS4N, lightweight S4D variants with linear input projection and channel-mixing. Evaluation on 59 datasets (MONSTER, UEA): MS4N matches models with 10× more parameters.
HQMQ, a calibration-free KV cache compression method for LLMs, quantizes each 4-element chunk as a Hurwitz quaternion. Tested on Mistral-7B, Llama-3-8B, Qwen2.5/3-8B, and gpt-oss-20b: matches fp16 quality at ~5 bits, achieves up to 5.05× compression (Llama-3-70B: 43 GB → 8.5 GB), outperforms naive int4 by 3–1900×.
Laguna M.1 (225.8B parameters, 23.4B activated) and Laguna XS.2 (33.4B total, 3B activated) are two MoE foundation models trained end-to-end for agentic coding. Competitive on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0. XS.2 released under Apache 2.0.