Introducing GPTs

OpenAI launches GPTs, custom versions of ChatGPT combining instructions, extra knowledge, and various skills without requiring coding.

GPT OpenAI Tools

Hugging Face Blog·Sep 6

Spread Your Wings: Falcon 180B is here

Hugging Face announces the release of Falcon 180B, an open-source large language model with 180 billion parameters. The model is available in base and instruction-tuned versions, designed for complex text generation and reasoning tasks.

Open source Llama Benchmarks

Hugging Face Blog·Jul 18

Llama 2 is here - get it on Hugging Face

Meta releases Llama 2, an open-source language model available on Hugging Face. The model comes in multiple sizes and can be used freely for research and commercial applications.

Llama Open source Meta AI

OpenAI Blog·Mar 14

GPT-4

OpenAI releases GPT-4, a multimodal model accepting image and text inputs. It achieves human-level performance on professional and academic benchmarks, though it remains less capable than humans in many real-world scenarios.

GPT OpenAI Vision

OpenAI Blog·Nov 30

Introducing ChatGPT

OpenAI introduces ChatGPT, a model trained to interact conversationally. The dialogue format enables ChatGPT to answer follow-up questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests.

OpenAI GPT

Hugging Face Blog·Oct 19

MTEB: Massive Text Embedding Benchmark

Hugging Face releases MTEB, a massive benchmark for evaluating text embedding models. Covers 8 tasks (retrieval, clustering, classification, etc.), 58 datasets, and 112 languages. Enables systematic comparison of embedding model performance.

Embeddings Benchmarks Evals

OpenAI Blog·Sep 21

Introducing Whisper

OpenAI releases Whisper, a speech recognition model trained on 680,000 hours of multilingual data. The system handles multiple languages, accents, and background noise with robustness exceeding existing models.

OpenAI Voice Open source

Hugging Face Blog·Jul 12

Introducing The World's Largest Open Multilingual Language Model: BLOOM

Hugging Face introduces BLOOM, the world's largest open multilingual language model. Trained on 46 languages, BLOOM matches proprietary state-of-the-art models in performance while ensuring open accessibility.

Open source Llama Benchmarks

OpenAI Blog·Jul 28

Introducing Triton: Open-source GPU programming for neural networks

OpenAI releases Triton 1.0, an open-source Python-like GPU programming language. It enables researchers without CUDA experience to write efficient GPU code, matching expert-level performance in most cases.

Open source Infrastructure Code generation

OpenAI Blog·Jan 5

DALL·E: Creating images from text

OpenAI introduces DALL·E, a neural network that generates images from text captions in natural language, covering a wide range of expressible concepts.

OpenAI Image generation Vision

OpenAI Blog·Jan 5

CLIP: Connecting text and images

OpenAI introduces CLIP, a neural network that efficiently learns visual concepts from natural language supervision. CLIP enables zero-shot visual classification by simply providing category names, without task-specific training.

OpenAI Vision Benchmarks

OpenAI Blog·May 28

Language models are few-shot learners

OpenAI publishes foundational research on few-shot learning capabilities in language models. LLMs can perform tasks with minimal examples without fine-tuning, revealing emergent rapid adaptation capacity.

GPT OpenAI Prompt engineering

OpenAI Blog·Jan 23

Scaling laws for neural language models

OpenAI publishes research on scaling laws for neural language models, establishing predictable relationships between model size, training data, and performance. Results enable optimization of compute resource allocation.

OpenAI Benchmarks Papers

OpenAI Blog·Apr 23

Generative modeling with sparse transformers

OpenAI introduces the Sparse Transformer, a deep neural network setting new records in sequence prediction (text, images, sound). Its improved attention mechanism processes sequences 30x longer than previously possible.

OpenAI Reasoning Benchmarks

OpenAI Blog·Feb 14

Better language models and their implications

OpenAI trained a large-scale unsupervised language model generating coherent paragraphs, achieving state-of-the-art performance on multiple language modeling benchmarks, and performing reading comprehension, machine translation, question answering, and summarization without task-specific training.

OpenAI GPT Benchmarks

OpenAI Blog·Aug 11

Dota 2

OpenAI created a bot that defeats world-class Dota 2 professionals in 1v1 matches under standard tournament rules. The bot learned through self-play without imitation learning or tree search, advancing toward AI systems achieving well-defined goals in complex real-world environments.

OpenAI Reinforcement learning AI Agents

arXiv cs.CL·Jun 18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 3.0, the reference tool since 2016 for forced speech-to-text alignment, achieves state-of-the-art performance on English, Japanese, and Korean with boundary errors <15ms. New capabilities: model adaptation, cross-language phone remapping, expanded language/dialect coverage, harmonized IPA dictionaries.

Voice Benchmarks Open source

arXiv cs.AI·Jun 18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP is a verified benchmark evaluating AI agents on small-molecule preclinical pharmacology. 100 evaluations span mechanism-of-action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves 59.3% success rate, GPT-5.5 55.3%. No system reliably masters these decisions.

AI Agents Benchmarks Claude

arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals

arXiv cs.CL·Jun 18

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus is a morphology-aware neural tokenizer for agglutinative Turkish. The model uses differentiable Poisson-binomial dynamic programming to segment morphemes with 1.425 bits-per-character compression and MorphScore macro-F1 of 0.61 (vs ~0.32 for subword tokenizers). Lossless by construction: decode(encode(w)) = w.

Embeddings Papers Open source

arXiv cs.LG·Jun 18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

ThousandWorlds is an ML benchmark for climate emulation of potentially habitable exoplanets. The dataset contains ~1800 simulations from 5 global climate models mapping 8 planetary parameters to 3D atmospheric fields. Three nested subsets and two evaluation protocols test 7 baselines; GP-based methods outperform standard deep learning.

Benchmarks Papers Reasoning

arXiv cs.CL·Jun 18

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow improves speculative decoding by combining parallel drafting efficiency with branch-wise causal conditioning. On H100 GPUs, it achieves 9.64x speedup on MATH-500 and 4.58x on open-ended conversations, outperforming existing tree-based methods on dense and MoE Qwen3 models.

Benchmarks Code generation Open source

arXiv cs.AI·Jun 18

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM is an agentic LLM pipeline deployed at DiDi to extract semantic user profiles from massive behavioral logs. The system uses 27 analytical tools to mine platform-scale data and generates utility-aligned profiles, achieving +6.14% AUC improvement and +0.47% GMV gain in A/B testing.

AI Agents Llama RAG

Simon Willison·Jun 17

GLM-5.2 is probably the most powerful text-only open weights LLM

Z.ai released GLM-5.2 (753B parameters, 40 active via MoE) under the MIT license on June 16th. It is a text-only model with a 1M-token context window. It ranks 1st on the Artificial Analysis Intelligence Index v4.1 (score 51), ahead of DeepSeek V4 Pro and Kimi K2.6, and 2nd on Code Arena WebDev, behind Claude Fable 5.

Open source Benchmarks Code generation

arXiv cs.AI·Jun 17

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

EComAgentBench is a benchmark of 662 e-commerce tasks evaluating LLM-based shopping agents on hidden intents distributed across query, user profile, and clarifications. Requirements are scattered and agents must uncover them within 100 tool calls. The strongest model achieves only 57.1% accuracy.

AI Agents Benchmarks Evals

arXiv cs.AI·Jun 17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

PreAct compiles successful runs of computer-using agents into small state-machine programs, replayed 8.5-13x faster with no per-step LLM calls. An independent evaluator validates each program before storage. Across three benchmarks (mobile, desktop, web), this verification prevents faulty program accumulation (+1.75-2.6 tasks).

AI Agents Code generation Benchmarks

arXiv cs.LG·Jun 17

Rift: A Conflict Signature for Deception in Language Models

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).

AI safety Alignment Evals

arXiv cs.LG·Jun 17

ProCUA-SFT Technical Report

ProCUA-SFT is a dataset of 3.1M step-level SFT samples generated automatically from 93K synthetic trajectories across 2,484 application combinations. Fine-tuning UI-TARS 7B on ProCUA-SFT achieves 45.0% on OSWorld, a +18.7 percentage-point improvement over the base model and +35% above AgentNet. The pipeline uses Kimi-K2.5 as task generator, precondition judge, and trajectory executor.

AI Agents Benchmarks Fine-tuning

arXiv cs.AI·Jun 17

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, STEP representation, point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.

Code generation Benchmarks Vision

arXiv cs.CL·Jun 17

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena evaluates LLMs in multi-turn clinical consultation across 8 competence dimensions using EHR-grounded knowledge graphs. On 437 patients, models excel in questioning (4.43-4.99/5) and ethical conduct (4.38-4.93/5), but fail in diagnostic accuracy (2.63-3.55/5) and information coverage (2.08-3.02/5). Weaknesses include repetitive questioning, omitted medical history, inadequate uncertainty handling.

Evals Reasoning AI safety

arXiv cs.AI·Jun 17

How Inference Compute Shapes Frontier LLM Evaluation

This study measures how inference compute affects the evaluation of 12 frontier models across seven benchmarks. It tests three interventions: larger token budgets, context compaction, and repeated submission attempts. Larger budgets substantially improve performance on FrontierMath, Humanity's Last Exam, and TerminalBench, and fixed-budget evaluations increasingly understate the capabilities of newer models.

Benchmarks Evals Reasoning

arXiv cs.LG·Jun 17

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

Researchers show that a transformer learning modular multiplication uses a multiplicative character transform rather than the standard DFT. On a·b mod 113, the spectrum becomes sparse (Gini 0.58 vs 0.07), with 96.9% of MLP neurons tuned to a single frequency. The algorithm implements a "Discrete-Log Clock" that reduces multiplication to addition in discrete-log space.

Reasoning Papers Evals
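
The arithmetic identity behind "multiplication becomes addition in discrete-log space" can be sketched in a few lines of Python. This is only an illustration of the number-theoretic fact for the modulus 113 mentioned in the summary, not the paper's code or the transformer's learned mechanism; the names `find_generator`, `exp_table`, `log_table`, and `mul_via_logs` are hypothetical.

```python
P = 113  # prime modulus from the summary


def find_generator(p):
    """Return the smallest g whose powers cover every nonzero residue mod p."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no generator found; p must be prime")


G = find_generator(P)

# exp_table[k] = G**k mod P; log_table[a] = k such that G**k == a (mod P)
exp_table = [pow(G, k, P) for k in range(P - 1)]
log_table = {a: k for k, a in enumerate(exp_table)}


def mul_via_logs(a, b):
    """Multiply nonzero residues by adding their discrete logs mod P - 1."""
    return exp_table[(log_table[a] + log_table[b]) % (P - 1)]


# Compare against ordinary modular multiplication for one pair.
print(mul_via_logs(17, 42) == (17 * 42) % P)
```

Zero has no discrete logarithm, so the sketch covers only the residues 1 through 112.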

Reddit r/MachineLearning·Jun 16

quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

quicktok is a BPE tokenizer written in C++ producing byte-identical tokens to tiktoken. Encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken itself. Supports cl100k, o200k, GPT-OSS, Llama-3, Qwen2.5/3. Optimizations: 2-byte trie, dense caches, hand-compiled pretokenizer.

Code generation Tools Open source

arXiv cs.LG·Jun 16

Transformers Learn the Mestre-Nagao Heuristic

Two-layer transformers classify rational elliptic curves (rank 0 vs 1) with >99% accuracy from 128 Frobenius traces. Mechanistic interpretability analysis reveals that a sparse circuit of 20 neurons implements the Mestre-Nagao heuristic (weights log(p)/(p·log B), r=0.997), autonomously discovering an analytic number theory result.

Reasoning Evals Papers

arXiv cs.CL·Jun 16

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA introduces Nemotron 3 Ultra, a 550B-parameter (55B active) Mamba-Transformer MoE hybrid model pre-trained on 20T tokens with 1M context length. Uses SFT, RL, and multi-teacher distillation. Achieves ~6x inference throughput of public LLMs with comparable accuracy. Base, post-trained, and quantized checkpoints, training data, and recipe open-sourced on HuggingFace.

AI Agents Reasoning Open source

arXiv cs.AI·Jun 16

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

PrologMCP exposes Prolog as a stateful tool via Model Context Protocol for LLM agents. Tested on PARARULE-Plus with Claude Sonnet 4.6, GPT-4.1, and o4-mini, the system achieves 1.00 accuracy on the general set and 0.99–1.00 on the challenging set, outperforming reasoning models on deductive tasks.

MCP AI Agents Reasoning

arXiv cs.AI·Jun 16

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH is the first benchmark jointly evaluating code and data intelligence in AI agents. Built on the Kaggle ecosystem with 1,009 tasks and ~980 files per environment, it reveals that top agents achieve only 61.1% success rate when integrating data discovery with code execution.

AI Agents Benchmarks Code generation

arXiv cs.CL·Jun 15

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

CacheRL trains small agent models (Qwen3-4B-Thinking) achieving 92% accuracy on multi-step tool-calling tasks with 100× less compute than GPT-5 (94%). Three innovations: hybrid thinking trajectory pipeline with LLM-generated reasoning, three-tier fuzzy cache eliminating live execution costs, cache-tier-aware rewards. SFT + GRPO improve validation reward from 0.43 to 0.78.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·Jun 15

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Every Eval Ever introduces a unified schema and community repository to standardize AI evaluation results. The system ingests 22,235 models and 2,273 benchmarks through a single JSON format, with automatic converters from popular harnesses and leaderboards. Solves fragmentation of results scattered across incompatible formats.

Evals Benchmarks Open source

arXiv cs.CL·Jun 15

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

A reliability study of LLM-as-a-Judge finds that GPT-4o-mini and GPT-4.1-mini are significantly unstable: preferences flip 13.6% of the time on average, and 28% of questions exceed a 20% flip rate. The judges also show position bias (72% A-majority). Cross-judge agreement is 76% (κ=0.51), and 11 repeated trials are needed for 95% confidence.

Evals GPT OpenAI
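
To see how a 76% raw agreement can sit next to κ=0.51, here is a minimal Python sketch of Cohen's kappa for two judges. It assumes the reported κ is Cohen's kappa, which the summary does not state; the function names and the toy verdicts are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges always give the same single label
    return (p_o - p_e) / (1 - p_e)


def implied_chance_agreement(p_o, kappa):
    """Invert kappa = (p_o - p_e) / (1 - p_e) to recover p_e."""
    return (p_o - kappa) / (1 - kappa)


# Toy verdicts (hypothetical): which answer each judge preferred per question.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]
print(cohen_kappa(judge_1, judge_2))

# Chance agreement implied by the reported figures, under the Cohen assumption.
print(implied_chance_agreement(0.76, 0.51))
```

Under that assumption, the reported figures imply that roughly half of the agreement would be expected by chance alone.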
