PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.

Multi-agent AI Agents Prompt engineering

arXiv cs.CL·May 19

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD introduces regional-to-global self-distillation to improve fine-grained visual understanding in MLLMs. The framework transfers the model's privileged perception on evidence-centered crops to its full-image policy via token-level KL divergence minimization on on-policy rollouts. Competitive results on fine-grained visual understanding benchmarks without external models or ground-truth labels.
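
As a rough illustration of the token-level KL objective mentioned above, the sketch below averages a per-token KL divergence between two sets of next-token logits. It is not the paper's implementation: the function names, the toy logits, and the direction of the divergence (teacher to student) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Assumption: the crop-conditioned pass plays the teacher and the
    full-image pass plays the student; the source does not state the direction.
    """
    per_token = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits)
        q = softmax(s_logits)
        per_token.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)))
    return sum(per_token) / len(per_token)


# Hypothetical logits for a 2-token rollout over a 3-word vocabulary.
crop_view = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
full_view = [[1.0, 1.0, 0.1], [0.2, 0.4, 1.2]]
print(f"token-level KL: {token_level_kl(crop_view, full_view):.4f}")
```

In a training loop this quantity would be minimized with respect to the student only, with the teacher pass treated as a fixed target.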

Vision Reinforcement learning Papers

arXiv cs.LG·May 19

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

ReTAMamba proposes a Mamba-based architecture for predicting irregular clinical time series. The model estimates observation reliability from missingness and elapsed time, integrates short/long-term information via Chronological Weaving, and uses a budgeted token router. On MIMIC-IV, eICU, and PhysioNet 2012, AUPRC gains of 7.51%, 7.80%, and 10.15% respectively.

Benchmarks Reasoning Papers

arXiv cs.CL·May 19

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

TabTrim, a novel table pruning framework for TableQA, replaces sequential revisions with gold trajectory-supervised parallel search. The system uses intermediate sub-tables from gold SQL queries to train a pruner and verifier. TabTrim-8B achieves 73.5% average accuracy, outperforming strongest baselines by 3.2% (79.4% on WikiTQ, 61.2% on TableBench).

Benchmarks Reasoning Papers

arXiv cs.AI·May 19

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP fuses physical and language feedback to learn robot reward functions in real time using a Bayesian framework. LLMs extract reward feature attention masks and preference shifts from free-form utterances, which are integrated with physical corrections via a closed-form update rule. It achieves a 70% error reduction versus physical-only and heuristic multimodal baselines in a semi-autonomous driving simulator.

AI Agents Reinforcement learning Reasoning

arXiv cs.AI·May 19

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN is a vision-language framework to infer precise accident coordinates from Bangla news reports and map-based cues. Using an agentic architecture combining OCR, LLM, and vision-language models, the system reduces localization error from 10.9 km to 0.593 km on validation data and 0.465 km on official Dhaka Metropolitan Police records.

Vision AI Agents Multi-agent

Reddit r/MachineLearning·May 18

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

Custom CUDA runtime for small-batch inference (robotics, VLA, world models). The bottleneck is not GEMM alone but runtime overhead: kernel fragmentation, layout transitions, precision conversions (FP8/FP4), and Python scheduling. Results: Pi0.5 on RTX 5090 ~17.6 ms, GR00T N1.6 ~12.5-13.1 ms, Qwen 27B ~129 tok/s.

Code generation Infrastructure Robotics

Reddit r/MachineLearning·May 18

Sub-JEPA: a simple fix to LeCun group's LeWorldModel that consistently improves performance [P]

Sub-JEPA improves LeWorldModel (LeCun's group, NYU) by applying Gaussian regularization within frozen random orthogonal subspaces instead of globally. Gains up to +10.7 pp on Two-Room, straighter latent trajectories, better physical state decodability. Code and paper released.

Reasoning Papers Benchmarks

Vercel AI Blog·May 1

How GitBook serves 30,000 sites with sub-second content updates

GitBook hosts 30,000 documentation sites on Vercel, serving 120 million monthly page views. The platform uses Next.js `use cache` directive to invalidate cache in under 300ms per site, processing 40,000 daily invalidations. 41% of traffic comes from AI crawlers.

Infrastructure Code generation Tools

OpenAI Blog·Feb 2

Snowflake and OpenAI partner to bring frontier intelligence to enterprise data

OpenAI and Snowflake announce a $200M partnership to embed frontier AI models directly into Snowflake's data platform. Customers can deploy AI agents and extract insights without data movement. Native integration of OpenAI models within Snowflake's ecosystem.

OpenAI Business AI Agents

OpenAI Blog·Jan 9

OpenAI and SoftBank Group partner with SB Energy

OpenAI and SoftBank Group partner via SB Energy to build multi-gigawatt AI data center campuses, including a 1.2 GW Texas facility supporting the Stargate initiative.

OpenAI Infrastructure Business

OpenAI Blog·Dec 11

Advancing science and math with GPT-5.2

OpenAI releases GPT-5.2, its strongest model for math and science, achieving state-of-the-art results on GPQA Diamond and FrontierMath benchmarks. The model solves an open theoretical problem and generates reliable mathematical proofs.

GPT OpenAI Benchmarks

OpenAI Blog·Nov 13

Introducing GPT-5.1 for developers

OpenAI releases GPT-5.1 in API with faster adaptive reasoning, extended prompt caching, improved coding performance, and new apply_patch and shell tools.

GPT OpenAI Code generation

Hugging Face Blog·Sep 25

Llama can now see and run on your device - welcome Llama 3.2

Meta releases Llama 3.2 with vision-capable models (11B and 90B) that process images and text, alongside lightweight text-only 1B and 3B variants optimized for on-device execution.

Llama Vision Open source

Hugging Face Blog·Jul 31

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope

Google releases Gemma 2 2B, a lightweight model optimized for inference. ShieldGemma provides protection against harmful content. Gemma Scope offers interpretability tools to analyze model internals.

Gemini Open source AI safety

Hugging Face Blog·Jul 1

Our Transformers Code Agent beats the GAIA benchmark 🏅

Hugging Face's Transformers Code Agent achieves 92% accuracy on the GAIA benchmark, outperforming Claude 3.5 Sonnet (92%) and GPT-4o (87.9%). The agent combines web search, code execution, and multi-step reasoning to solve complex tasks.

AI Agents Code generation Benchmarks

Hugging Face Blog·Apr 15

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Hugging Face releases Idefics2, an open-source 8B vision-language model handling images and text. Supports up to 1024×1024 pixel resolution and runs on standard hardware. Available under Apache 2.0 license with public weights and code.

Vision Open source Tools

Hugging Face Blog·Feb 28

StarCoder2 and The Stack v2

Hugging Face releases StarCoder2, an open-source code model trained on The Stack v2 (a dataset of 17B code tokens). StarCoder2 outperforms CodeLlama on the HumanEval and MBPP benchmarks. Available in 3B, 7B, and 15B variants with open weights and source code.

Code generation Open source Benchmarks

OpenAI Blog·Dec 14

Weak-to-strong generalization

OpenAI explores leveraging deep learning's generalization properties to control strong models with weak supervisors. New research direction for superalignment with promising initial results.

OpenAI Alignment Reasoning

OpenAI Blog·Jun 23

Learning to play Minecraft with Video PreTraining

OpenAI trains a neural network to play Minecraft using Video PreTraining (VPT) on unlabeled human gameplay videos. The model learns to craft diamond tools (a 24,000-action task) with minimal labeled data. It uses the native human interface (keyboard and mouse) and represents progress toward general computer-using agents.

OpenAI AI Agents Vision

Vercel AI Blog·Jun 18

The Agent Stack

Vercel introduces 'The Agent Stack', a complete framework for building production-grade AI agents. It combines the AI SDK (unified multi-model interface) and AI Gateway (centralized routing and billing), enabling calls to Claude, GPT, and other models without vendor lock-in.

AI Agents Claude GPT

arXiv cs.AI·Jun 18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Xcientist is a research harness that externalizes research synthesis and experimental validation for AI scientists into inspectable, contract-governed processes. It organizes literature evidence, idea states, implementation plans, and repair traces as persistent research artifacts, eliminating claim drift, in which runnable artifacts no longer support the originally claimed mechanism.

AI Agents Reasoning Evals

arXiv cs.AI·Jun 18

Skill-Guided Continuation Distillation for GUI Agents

SGCD, an iterative self-improvement framework, addresses off-trajectory states in GUI agents. The system first runs a plain policy, then uses a skill-guided policy to generate successful continuations. On OSWorld-Verified, SGCD improves success rates of three base models from ~30% to over 50%.

AI Agents Reinforcement learning Papers

arXiv cs.LG·Jun 18

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

Causal framework for sleep recovery scoring from multimodal polysomnography. Uses DAG learning on two cohorts (MESA n=1540, MrOS n=825) to identify five physiological domains (respiratory burden, hypoxia, fragmentation, architecture, autonomic regulation). Sleep Recovery Score (SRS) achieves 2.5× stronger alignment with perceived recovery than standard AHI.

Papers Reasoning Evals

arXiv cs.LG·Jun 18

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero uses LLM agents with tree search to discover adaptive RL training strategies. The system identifies that capacity parameters accumulate monotonically while regularization parameters oscillate. Across 4 GRPO tasks, discovered strategies outperform the base model by 9-140% and grid search by 6-15%.

Reinforcement learning AI Agents Reasoning

arXiv cs.LG·Jun 18

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA is an air traffic control training simulator automating pilot roles through speech recognition, instruction interpretation, and response generation. The system reduces Word Error Rate from 107.80% to 23.45% on Singaporean-accented aviation speech, and evaluates trainee radiotelephony communications achieving 91.7% accuracy, 88.2% brevity, and 86.9% completeness scores.
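
A baseline Word Error Rate above 100%, such as the 107.80% reported above, is possible because WER divides the number of word-level edits (substitutions, deletions, insertions) by the reference length, and a recognizer that inserts many extra words can accumulate more edits than the reference has words. A minimal sketch of the standard computation follows; the function name and the example transcripts are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: the hypothesis garbles four words and adds three.
reference = "climb flight level three two zero"
hypothesis = "client like level tree to zero please confirm now"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

Here the six-word reference needs seven edits to match the hypothesis, so the ratio exceeds 1.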

Voice Fine-tuning Evals

arXiv cs.LG·Jun 18

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE is a post-hoc method to improve selective unlearning in LLMs. It corrects final update vectors by suppressing components damaging retention, without rerunning the original unlearning pipeline. Tested across multiple methods and scales, SAGE reduces the forget-retain trade-off.

Alignment Papers

arXiv cs.CL·Jun 18

LLM Parameters for Math Across Languages: Shared or Separate?

Mechanistic analysis of mathematical reasoning in multilingual LLMs. Math-associated parameters exhibit partial cross-lingual overlap, concentrated in intermediate layers. English produces the largest set of math-relevant parameters, while lower-resource languages reveal smaller parameter sets.

Reasoning Papers Benchmarks

arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning

arXiv cs.CL·Jun 18

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Framework for customization and efficient deployment of LLM-based multi-agent systems in enterprise settings. Combines continual pretraining, supervised fine-tuning, and preference optimization to adapt compact models to specialized domains. Integrates speculative decoding and FP8 quantization to reduce latency and costs. Achieves 4.48x throughput speedup while maintaining performance.

Multi-agent Fine-tuning Business

arXiv cs.AI·Jun 18

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Novel LLM personalization: store user facts as surgical edits in a hash-keyed memory table (Engram) instead of global LoRA. Reduces memory footprint by 33,000x, improves indirect-reasoning accuracy by 5.6x on average, and enables stacking multiple users without cross-contamination.

Fine-tuning Reasoning Papers

The Decoder·Jun 17

Amazon, Nvidia, and AMD bet $310 million on AI startup building 3D world models

Amazon, Nvidia, and AMD invest $310 million in Odyssey ML, a 3D world model startup valued at $1.45 billion. IQT fund and Google's Jeff Dean join the round. World models are emerging as the next major AI bet after language models.

Funding Reasoning Vision

The Decoder·Jun 17

Zhipu AI's GLM-5.2 closes in on closed-source leaders in coding marathons

Zhipu AI releases GLM-5.2 under MIT license with stable 1-million-token context. On FrontierSWE benchmark for long-duration coding tasks, the open-source model trails Anthropic's Claude Opus 4.8 by just one percentage point. Significant gap remains on reasoning versus closed-source rivals.

Open source Code generation Benchmarks

Reddit r/LocalLLaMA·Jun 17

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Gemma 4 E2B runs in-browser at 255 tokens/sec using WebGPU kernels optimized by Fable 5. Demo and kernels released on Hugging Face.

Gemini Code generation Open source

Vercel AI Blog·Jun 17

Vercel Ship 2026 recap

Vercel unveils agent-first infrastructure at Ship 2026 in London. Three core components: Agent Stack (building blocks for agents), Vercel Connect (secure external tool access without persistent tokens), and eve (open-source framework for production agents with durable execution, sandboxed compute, approvals, and evals).

AI Agents Infrastructure Tools

The Decoder·Jun 17

Nvidia research shows robots that train themselves through AI coding agents

Researchers from Nvidia, Carnegie Mellon University, and UC Berkeley use AI coding agents to teach robots dexterous grasping in real-world conditions. A fleet of eight robots achieves 99% success rate on complex tasks.

AI Agents Code generation Robotics

Le Big Data·Jun 17

DeepSeek closes a giant funding round of more than $7 billion

DeepSeek closes a funding round exceeding $7 billion, among the largest in the AI sector. Record amount for the Chinese startup specializing in language models.

DeepSeek Funding Business

GitHub Trending·Jun 17

DeusData / codebase-memory-mcp

High-performance code intelligence MCP server. Indexes codebases into persistent knowledge graph in milliseconds. Supports 158 languages, sub-ms queries, 99% fewer tokens. Single static binary, zero dependencies.

MCP Code generation RAG

GitHub Trending·Jun 17

google-research / timesfm

TimesFM is a pretrained foundation model developed by Google Research for time-series forecasting. The GitHub repository provides an open-source implementation of this specialized model.

DeepMind Open source Benchmarks

GitHub Trending·Jun 17

bytedance / UI-TARS-desktop

ByteDance releases UI-TARS-desktop, an open-source multimodal AI agent stack connecting cutting-edge AI models and agent infrastructure. Platform for building agents capable of interacting with user interfaces.

AI Agents Multi-agent Open source
