7679 articles

microsoft / RD-Agent

Microsoft releases RD-Agent, an autonomous AI system to automate R&D processes in data science and ML. The agent drives experiments, data analysis, and model iterations without human intervention.

AI Agents Multi-agent Open source

arXiv cs.AI·Jun 17

Dissecting model behavior through agent trajectories

Study of harness-model alignment via 138k agent trajectories. Authors introduce Simple Strands Agent (SSA), a generic harness tested on Claude, Gemini, GPT, Grok, Qwen across SWE-Pro, SWE-Verified, and Terminal-Bench-2. Beyond pass@1 scores, analysis reveals fine-grained behavioral differences: edit frequency, testing activity, phase transitions.

AI Agents Benchmarks Code generation

arXiv cs.LG·Jun 17

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

Generative model based on GPT architecture for inverse design of heterogeneous catalysts. Pretrained on 133 million structures, fine-tuned on ~460,000 optimized structures. Achieves 98% structural validity, 95% optimization validity, and improves screening efficiency 1.5–4× for reaction-targeted catalyst discovery.

Papers Benchmarks Fine-tuning

arXiv cs.CL·Jun 17

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Automated prompt optimization framework for LLM agents in interactive environments. Decomposes the observation-to-action pipeline into descriptor and action-selection agents, then iteratively refines their prompts via an LLM-driven evolutionary loop guided by environment returns. On BabyAI/BALROG: improves from 0% to 72.5% success on PutNext without fine-tuning.

AI Agents Prompt engineering Reinforcement learning

arXiv cs.AI·Jun 17

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

DeepInsight is a unified evaluation infrastructure for Physical AI stacks, spanning three orders of magnitude from foundation-model decoding to full-body physics simulation. It uses three invariant abstractions (task, resource, result) to preserve regime heterogeneity while enabling cross-layer regression diagnostics impossible with federated per-segment harnesses.

Reasoning Evals Robotics

arXiv cs.AI·Jun 17

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

SkillChain-Gym is a benchmark for reskilling-aware production-inventory control. The environment models skill decay, certification lapses, training actions, and capacity constraints. Evaluation of production-only, reactive adaptive, and static-insurance policies over 60-shift horizons with operational and resilience metrics.

Benchmarks Reinforcement learning AI Agents

arXiv cs.AI·Jun 17

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero, an AlphaZero-based agent, masters WallGo, a strategic board game popularized by Netflix's The Devil's Plan (2025). On a 7×7 board, the agent defeats professional Go players with 1.98x more territory on average. Authors analyze game fairness and identify key strategies.

Reinforcement learning Benchmarks Papers

arXiv cs.AI·Jun 17

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench evaluates how language models handle off-procedure inputs in industrial diagnostic dialogue. A dataset of 1,676 multi-turn conversations derived from 50 diagnostic flowcharts reveals models often select a real but contextually inadequate step rather than hallucinate, exposing a vulnerability: plausible but wrong advice grounded in documentation.

Benchmarks Evals Reasoning

arXiv cs.LG·Jun 17

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

Quantized integer-only transformer implementation for jet tagging on AMD Versal AI Engine (AIE). Reusable software framework automatically converts Python model descriptions to Vitis graph code for low-latency, resource-constrained deployment. Open-source release.

Vision Benchmarks Open source

arXiv cs.CL·Jun 17

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench evaluates coding agents' ability to generate playable games end-to-end in Godot. The benchmark comprises 140 tasks across 15 game families. Top agents achieve only 41.46% success, revealing struggles to produce complete games with sufficient content and coherent visual feedback.

Code generation AI Agents Benchmarks

arXiv cs.AI·Jun 17

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

LongWebBench is a benchmark evaluating long-horizon webpage generation by vision-language models. It contains 490 real-world pages for structural evaluation and 507 goal-oriented interaction tasks over 129 pages. Experiments show structural fidelity degrades with webpage length, and visually plausible generations often fail to support multi-step executable interactions.

Vision Benchmarks AI Agents

arXiv cs.LG·Jun 17

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

Systematic evaluation of representations from five foundation models (FMs) on computational pathology tasks using whole-slide images and transcriptomic profiles (IH-BC, IH-NSCLC cohorts). Multimodal fusion improves performance when no single modality dominates. Conformal prediction shows the true diagnosis remains recoverable in prediction sets for the majority of failed predictions.
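
For readers unfamiliar with prediction sets, here is a minimal sketch of standard split conformal prediction for classification. It is not the paper's procedure, which the summary does not specify. The function names, the `1 - p(true class)` score, and the toy numbers are illustrative assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the score threshold from a held-out calibration set.

    The nonconformity score is 1 - p(true class) (an assumed, common choice).
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data: two classes, four calibration examples.
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
cal_labels = [0, 1, 0, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
print(prediction_set([0.7, 0.3], threshold))
```

A failed top-1 prediction counts as "recoverable" in this sense when the true class still appears in the returned set.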

Vision Benchmarks AI safety

arXiv cs.AI·Jun 17

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG improves graph-based retrieval-augmented generation through a multi-granularity heterogeneous graph (passages, summaries, sentences, entities) and frequency-aware weighted flow module. This enhances semantic recall and explicit reasoning for complex multi-hop tasks.

RAG Reasoning Benchmarks

Vercel AI Blog·Jun 17

Introducing eve

Vercel introduces eve, an open-source agent framework for building and deploying agents in production. eve provides built-in infrastructure (model management, fallbacks, logging); developers define only behavior through files (agent.ts, instructions.md, tools). eve aims to standardize agent building the way Next.js standardized web applications.

AI Agents Open source Tools

arXiv cs.AI·Jun 17

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

SEAGym is an evaluation environment for measuring self-evolving LLM agent harness updates (prompts, memory, tools, interaction loop). The study compares ACE, TF-GRPO, and AHE on Terminal-Bench 2.0 and HLE, showing frequent updates don't guarantee held-out performance gains and source diversity affects harness reliability.

AI Agents Reinforcement learning Evals

arXiv cs.CL·Jun 17

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Empirical study of cross-lingual transfer in In-Context Learning (ICL) spanning 7 tasks, 6 models, and typologically diverse languages. Results show that fine-tuning-based expectations do not consistently apply in the ICL regime, proposing alternative heuristics for effective source language selection.

Benchmarks

Vercel AI Blog·Jun 17

Introducing Vercel Connect

Vercel Connect, now in Public Beta, replaces long-lived stored tokens with runtime credential exchange. Agents receive short-lived, task-scoped credentials through reusable connectors (Slack, GitHub, etc.), eliminating risks from permanent token leaks.
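
The general idea of short-lived, task-scoped credentials can be sketched as below. This is a generic illustration, not Vercel Connect's API: `SECRET`, `issue`, `check`, the token layout, and the scope strings are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-signing-key"  # hypothetical; a real broker would hold this in a secret store


def issue(task, scope, ttl_s=300, now=None):
    """Mint a signed token limited to one task, a list of scopes, and a short lifetime."""
    now = time.time() if now is None else now
    claims = {"task": task, "scope": scope, "exp": now + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig


def check(token, needed_scope, now=None):
    """Accept the token only if the signature, expiry, and scope all hold."""
    now = time.time() if now is None else now
    body, _, sig = token.partition(".")
    payload = base64.urlsafe_b64decode(body.encode())
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return claims["exp"] > now and needed_scope in claims["scope"]


token = issue("triage-ticket", ["slack:post"], ttl_s=60, now=1000.0)
print(check(token, "slack:post", now=1030.0))  # within lifetime and scope
print(check(token, "slack:post", now=1061.0))  # expired
```

Because the token expires after `ttl_s` seconds and names its scopes, a leaked copy is useful only briefly and only for the task it was minted for.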

AI Agents Tools Infrastructure

OpenAI Blog·Jun 17

Introducing LifeSciBench

OpenAI introduces LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating AI systems on real-world life science research tasks and decisions.

Benchmarks OpenAI Evals

Vercel AI Blog·Jun 17

Vercel Passport is now in Public Beta

Vercel Passport, an access control tool for deployments, enters public beta. It centralizes authentication via Okta, Auth0, or OIDC providers. Pricing: $100/project/month, unlimited external users.

Tools Infrastructure

Vercel AI Blog·Jun 16

Vercel for Enterprise Apps and Agents

Vercel launches Enterprise Apps and Agents platform to safely deploy internal AI agents. Vercel Passport authenticates access via identity providers (Okta, Entra, Auth0), while a credential management solution consolidates OAuth, OIDC, and secret injection.

AI Agents Infrastructure AI safety

Simon Willison·Jun 16

datasette 1.0a34

Datasette 1.0a34 adds tools to insert, edit and delete rows directly in the web interface. These long-overdue features are available on table and row pages and were inspired by Datasette Agent, which now supports SQL write operations.

Tools Open source

Reddit r/LocalLLaMA·Jun 16

Scaling former VibeThinker-1.5B to 3B — now it reaches frontier math & coding performance

VibeThinker-3B achieves 94.3 on AIME'26, 80.2 on LiveCodeBench v6, and 96.1% pass rate on unseen LeetCode contests. The model demonstrates small models can reach frontier-level reasoning performance in math and coding through clear verification signals.

Reasoning Benchmarks Code generation

The Decoder·Jun 16

DeepSeek takes outside money for the first time at a $50 billion valuation

DeepSeek raises 50 billion yuan ($7.4 billion) in its first external funding round, reaching a $50 billion valuation.

DeepSeek Funding Business

arXiv cs.CL·Jun 16

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Comparative study of memory and skill modules for web agents. On WebArena and WorkArena, a vanilla baseline with equivalent token budget matches or exceeds AWM, ASI, and ReasoningBank. Results across Gemini 3 Flash, GPT-4o-mini, Qwen 3.6-27B show apparent gains vanish against a budget-matched actor.

AI Agents Benchmarks Reasoning

arXiv cs.CL·Jun 16

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Practical evaluation method for long-form simultaneous speech-to-speech translation (SimulS2ST) on continuous input. Uses ASR, forced alignment, and sentence embeddings to recover timestamps and align target text to source sentences, then computes sentence-level latency and quality metrics (YAAL, xCOMET). Reveals substantial latency accumulation in current systems on long speech.

Voice Evals Benchmarks

arXiv cs.CL·Jun 16

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

The AgentViSS benchmark evaluates the visual social intelligence of multimodal agents in social simulations. Its 240 scenarios, 585 roles, and 2,340 instances test whether MLLMs use visual cues (expressions, posture, gaze) to guide interactions. An evaluation of seven models shows a gap: expression and conflict handling are near saturation, while interaction regulation and visually grounded outcomes remain substantially harder.

Benchmarks Vision AI Agents

arXiv cs.AI·Jun 16

S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

S1-DeepResearch introduces a unified trajectory construction paradigm for deep research agents combining closed-ended QA and open-ended exploration. The 32B model achieves SOTA performance among open-source models on 20 benchmarks spanning complex reasoning, knowledge synthesis, report generation, and file understanding.

AI Agents Reasoning Benchmarks

arXiv cs.CL·Jun 16

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

Telegraph English, a readable symbolic format, rewrites retrieved passages into structured entity-relation statements for context compression. On MuSiQue, TwoWiki, and HotpotQA, it outperforms three matched-budget baselines (deletion, truncation, sub-sampling) by 13–20 F1 points, and exceeds coherent prose summaries on the hardest dataset.

RAG Reasoning Benchmarks

arXiv cs.LG·Jun 16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS optimizes flow-matching and diffusion policies at inference time via Q-steering. The method projects noisy intermediate actions to clean action estimates before evaluating the critic, avoiding numerical instability. Results: 90% success rate across 50 offline-to-online tasks, and outperforms existing approaches on 6 manipulation tasks with frozen VLA models.
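
The projection step can be illustrated with a small sketch. It assumes a linear flow path with t = 0 as noise and t = 1 as the clean action, which the summary does not confirm. `clean_action_estimate`, `steering_score`, `velocity_fn`, and `critic_fn` are hypothetical names, not the paper's API.

```python
def clean_action_estimate(noisy_action, t, velocity_fn):
    """One-step estimate of the final action from an intermediate flow state.

    Assumes a_t = (1 - t) * noise + t * action, so action ~= a_t + (1 - t) * v(a_t, t).
    """
    v = velocity_fn(noisy_action, t)
    return [a + (1.0 - t) * vi for a, vi in zip(noisy_action, v)]


def steering_score(state, noisy_action, t, velocity_fn, critic_fn):
    """Evaluate the critic on the clean estimate instead of the noisy action."""
    return critic_fn(state, clean_action_estimate(noisy_action, t, velocity_fn))


# Toy setup: the flow moves from [0, 0] to [1, 2], so the exact velocity is constant.
velocity = lambda a, t: [1.0, 2.0]
# Toy critic: highest value when the action equals the state vector.
critic = lambda s, a: -sum((x - y) ** 2 for x, y in zip(a, s))

print(clean_action_estimate([0.25, 0.5], 0.25, velocity))
```

The critic is typically trained on clean actions only, so scoring the projected estimate keeps its input in the range it was trained on.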

Reinforcement learning AI Agents Reasoning

arXiv cs.AI·Jun 16

Towards End-to-End Automation of AI Research

The AI Scientist automates the entire research lifecycle: idea generation, coding, experiments, data analysis, manuscript writing, and peer review. An AI-generated manuscript passed the first round at a major ML conference workshop (70% acceptance rate). The system leverages foundation models within a complex agentic architecture.

AI Agents Multi-agent Papers

arXiv cs.LG·Jun 16

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR synergizes Monte Carlo Tree Search with test-time reinforcement learning for optimization modeling. The framework decomposes modeling into four stages, refines a transient LoRA adapter via GRPO at each node, and employs an unsupervised multi-faceted reward system. Achieves state-of-the-art results across five optimization benchmarks with a 4B backbone.

Reasoning Reinforcement learning Fine-tuning

arXiv cs.LG·Jun 16

Rational Sparse Autoencoder

Sparse autoencoders (SAEs) for mechanistic interpretability rely on fixed activations (ReLU, JumpReLU, TopK). This paper introduces Rational Sparse Autoencoder (RSAE), replacing the fixed encoder activation with a trainable rational function. RSAE improves reconstruction and sparsity trade-offs across three open-weight language models while maintaining feature-level interpretability.
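
As a rough illustration of a trainable rational activation, the sketch below evaluates P(x) / (1 + |D(x)|), a common "safe" form whose denominator cannot reach zero. The summary does not give RSAE's exact parameterization, so this form, the coefficient layout, and the function names are assumptions. The sketch covers only the forward pass, not training or any sparsity mechanism.

```python
def rational_activation(x, num, den):
    """Evaluate P(x) / (1 + |D(x)|).

    num: coefficients a_0, a_1, ... of P(x) = sum a_i * x**i
    den: coefficients b_1, b_2, ... of D(x) = sum b_j * x**j (no constant term)
    """
    p = sum(a * x ** i for i, a in enumerate(num))
    d = sum(b * x ** (j + 1) for j, b in enumerate(den))
    return p / (1.0 + abs(d))


def encode(pre_activations, num, den):
    """Apply the rational activation elementwise to encoder pre-activations."""
    return [rational_activation(z, num, den) for z in pre_activations]


# With num = [0, 1] and an empty denominator the activation is the identity.
print(rational_activation(2.0, [0.0, 1.0], []))
print(encode([-1.0, 0.0], [0.0, 1.0], [1.0]))
```

In a trainable version, `num` and `den` would be learned parameters updated together with the encoder weights.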

Papers Evals Open source

arXiv cs.AI·Jun 16

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

IRTS-ToolBench, a benchmark of 1,700 questions across 10 task types and 13 domains, evaluates how LLMs and AI agents handle irregular time series (asynchronous observations, informative missing values, variable sampling frequencies). It bridges the gap between existing TSQA benchmarks, which use regular data, and real-world deployments.

AI Agents Benchmarks Reasoning

arXiv cs.LG·Jun 16

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Study of convergence properties of Monte Carlo Exploring Starts (MCES) in tabular reinforcement learning. Authors construct counterexamples showing MCES can converge to suboptimal solutions despite initial exploration. A modification scaling learning rates inversely to update frequencies guarantees convergence to optimality.

Reinforcement learning Papers Benchmarks

arXiv cs.CL·Jun 16

Pepti-Agent: An AI Agent for Peptide Design and Optimization

Pepti-Agent is an AI framework for therapeutic peptide design using Model Context Protocol (MCP). An LLM controller orchestrates independent tools: generation via PeptideGPT, property prediction (solubility, hemolysis, fouling) via ProtBERT, and residue-by-residue mutation. The system traces each decision to enable multi-objective benchmarking and experimental validation.

AI Agents MCP Reasoning

arXiv cs.CL·Jun 16

T-Mem: Memory That Anticipates, Not Archives

T-Mem proposes a long-term conversational memory architecture that overcomes lexical and vector similarity bounds. The system introduces write-time triggers to enable two recall modes: descriptive (surface features) and associative (latent semantic arcs). T-Mem achieves state-of-the-art on LoCoMo and LoCoMo-Plus benchmarks.

AI Agents RAG Benchmarks

arXiv cs.CL·Jun 16

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC is an open dependency-parsed treebank of Greek spanning 8 diachronic periods (Archaic to Modern) under PROIEL XML 2.0 schema. Verse-level cross-alignment of New Testament with Latin, Gothic, Old Church Slavonic, and Classical Armenian. Annotation via Stanford Stanza, sentence alignment via LaBSE, word alignment via multilingual-BERT. v0.4 released open-source.

Benchmarks Open source Embeddings

Hacker News (AI)·Jun 15

Prediction and Entropy of Printed English - Claude Shannon (1950) [pdf]

Reposting of Claude Shannon's foundational 1950 paper on prediction and entropy of printed English. Classic theoretical work in information theory, foundational to modern language models.

Papers Reasoning

Simon Willison·Jun 15

datasette-agent 0.3a0

datasette-agent 0.3a0 introduces execute_write_sql, a new tool enabling AI agents to modify databases with user approval and permission management. Example: inserting pelican sighting data with confirmation before execution.

AI Agents Tools Open source

Vercel AI Blog·Jun 15

Vercel Functions can now run up to 30 minutes

Vercel Functions now support execution durations up to 30 minutes (up from 800 seconds) for Node.js and Python on Pro/Enterprise plans. Fluid Compute bills active CPU only, which suits LLM calls, database queries, and document processing.

Infrastructure AI Agents Reasoning
