LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

LinTree improves LLM reasoning by explicitly representing the tree structure of search traces. Researchers show raw access to search history alone fails to reliably outperform LLM-guided heuristic search. Adding parent pointers to explicitly represent the linearized tree structure improves performance and search efficiency on Blocks World, grid Navigation, and Sokoban.
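
The parent-pointer idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `Node` class, the `linearize` and `path_to_root` helpers, and the text format are all hypothetical, and the sketch assumes a search trace is simply a list of expanded states in visit order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One expanded state in a search trace (hypothetical structure)."""
    state: str
    parent: Optional[int]  # index of the parent in the trace, None for the root


def linearize(trace: List[Node], with_parents: bool = True) -> str:
    """Render a search trace as text, one line per node in visit order.

    With with_parents=True each line carries an explicit pointer to the
    index of its parent, so the tree can be rebuilt from the flat text.
    """
    lines = []
    for idx, node in enumerate(trace):
        if with_parents:
            parent = "root" if node.parent is None else str(node.parent)
            lines.append(f"[{idx}] parent={parent} state={node.state}")
        else:
            lines.append(f"[{idx}] state={node.state}")
    return "\n".join(lines)


def path_to_root(trace: List[Node], idx: int) -> List[int]:
    """Follow parent pointers from node idx back to the root."""
    path = [idx]
    while trace[path[-1]].parent is not None:
        path.append(trace[path[-1]].parent)
    return path[::-1]


# A tiny trace: A is the root, B and C are children of A, D is a child of C.
trace = [
    Node("A", None),
    Node("B", 0),
    Node("C", 0),
    Node("D", 2),
]
print(linearize(trace))
print(path_to_root(trace, 3))
```

Without the pointers, the flat listing of A, B, C, D does not say whether D was reached from B or from C; with them, the branch structure is recoverable from the text alone.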

Reasoning Papers

arXiv cs.LG·Jun 1

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

NumLeak measures memorization of public benchmarks in frontier LLMs. Models recall Fama-French data (r=0.97-0.99), US unemployment, and NOAA temperature with high fidelity. On recent unseen data, parse rate drops to 21-57% but r stays ~0.99 for answered months. A one-line system-prompt defense blocks 99.8% of attacks.

Benchmarks Evals AI safety

arXiv cs.CL·Jun 1

Exploring Autonomous Agentic Data Engineering for Model Specialization

arXiv paper on autonomous agentic data engineering for model specialization. GPT-5.2 constructs a training curriculum improving a student model by 57.29% through iterative, agent-driven data adaptation. Formalizes a novel task evaluating LLMs as autonomous data engineers.

AI Agents Fine-tuning Benchmarks

arXiv cs.AI·Jun 1

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

DecomposeR, a deep research framework, trains Qwen3-8B in two RL stages: planner RL learns typed DAG structures and query decomposition, then answerer RL learns branch execution and synthesis. Achieves 5.1-8.0 point improvements on long-form benchmarks through explicit planning and structured rewards.

Qwen Reinforcement learning Reasoning

arXiv cs.CL·Jun 1

Can LLM Teams Play What? Where? When?

Study on LLM teams playing ChGK (collective reasoning quiz). Three strategies tested: Voting, Silent Team (captain observes answers), Talkative Team (captain observes answers + rationales). On 572 questions from 2025, teams outperform single models (+20 points). Best team: 44.23% accuracy, approaching human performance. Sharing rationales mitigates errors.
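
As a rough illustration of the simplest of the three strategies, Voting, here is a minimal majority-vote sketch. The normalization step and the tie-breaking rule (earliest answer wins) are assumptions made for this example, not details taken from the study.

```python
from collections import Counter
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: List[str]) -> str:
    """Return the most common normalized answer; ties go to the earliest one."""
    if not answers:
        raise ValueError("no answers to vote on")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Scan in original order so that ties are broken deterministically.
    for answer in normalized:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


team_answers = ["Paris", "paris ", "Lyon"]
print(majority_vote(team_answers))
```

The Silent Team and Talkative Team strategies replace this fixed rule with a captain model that reads the answers (and, in the latter case, the rationales) before deciding.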

Multi-agent Reasoning Benchmarks

arXiv cs.AI·Jun 1

Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models

GRiD, a diffusion-model framework, generates graph-like rules for knowledge graph reasoning. Combines supervised pre-training and reinforcement learning to discover complex rules (cycles, branches) beyond simple chains. Evaluated on 6 benchmarks with open-source code.

Papers Reasoning Reinforcement learning

arXiv cs.AI·Jun 1

MAVEN: Improving Generalization in Agentic Tool Calling

MAVEN is a lightweight symbolic reasoning scaffold to improve generalization of LLM agents in tool-calling tasks. Evaluated on BFCL v3, TauBench, Tau2Bench, AceBench and a new MAVEN-Bench benchmark, it increases GPT-OSS-120b accuracy from 48% to 71% without additional training, at roughly 1/10 the cost of proprietary baselines.

AI Agents Reasoning Benchmarks

arXiv cs.LG·Jun 1

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

Untrained neural networks match early visual cortex better than trained networks. Study on 720 THINGS images and fMRI from 3 subjects shows one training epoch reduces V1 alignment by 25-90% depending on learning rule. Backpropagation degrades most (Δr = -0.080), while predictive coding and STDP preserve alignment better (Δr ~ -0.04).

Papers Reasoning Alignment

arXiv cs.CL·Jun 1

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

Method to automatically generate fine-grained evaluation rubrics without human annotation, tested on four benchmarks. Training-free approach, then iterative fine-tuning via meta-judge reward signals. A fine-tuned 14B rubric generator outperforms larger proprietary models.

Evals Fine-tuning Reinforcement learning

arXiv cs.AI·Jun 1

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Study on harness self-evolution (prompts, skills, memories, tools) in LLM agents. Analyzes two capabilities: harness-updating (producing useful updates) and harness-benefit (benefiting from them). Findings: harness-updating is capability-agnostic (Qwen3.5-9B matches Claude Opus gains), while harness-benefit is non-monotonic (mid-tier models benefit most).

AI Agents Prompt engineering Benchmarks

arXiv cs.LG·Jun 1

DisjunctiveNet: Neural Symbolic Learning via Differentiable Convexified Optimization Layers

DisjunctiveNet introduces a neuro-symbolic framework to embed hard mixed-integer linear constraints and logical rules directly into neural networks using differentiable optimization layers. Through hierarchical convex relaxations, the approach ensures exact rule satisfaction while maintaining strong predictive performance on real-world datasets.

Reasoning Papers

Vercel AI Blog·Jun 1

Vercel Blob now supports OIDC authentication

Vercel Blob now supports OIDC authentication as default for new projects. Vercel-issued OIDC tokens are short-lived and auto-rotating, eliminating the need for long-lived tokens. Vercel Functions and CLI automatically receive the token.

Infrastructure Tools

Vercel AI Blog·May 31

Chat SDK adds Lark and Feishu support

Vercel AI Chat SDK adds support for Lark and Feishu via a new official vendor adapter. Bots can post, edit, and delete messages, stream replies via Lark's native cardkit typewriter API, send interactive cards, and react with emojis. Connection uses Lark's WebSocket transport without requiring HTTP webhook exposure.

Tools AI Agents Code generation

GitHub Trending·May 31

Comfy-Org / ComfyUI

ComfyUI is a modular GUI for diffusion models with a node/graph-based interface, providing API and backend capabilities for image generation.

Image generation Open source Tools

Reddit r/LocalLLaMA·May 31

PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)

PolyRange is a cybersecurity AI benchmark that dynamically generates fresh web targets for each evaluation, eliminating training corpus contamination. The author addresses consensus from labs (Anthropic, OpenAI, DeepMind): static benchmarks are saturated and real-world defenses are missing. MIT-licensed, independent from the author's commercial project.

Benchmarks AI safety Evals

Vercel AI Blog·May 31

MiniMax M3 on AI Gateway

MiniMax M3, MiniMax's first model with 1M-token context window and native multimodality, is now available on Vercel AI Gateway. M3 excels at software engineering, terminal-based tool use, and agentic web browsing, optimized for multi-turn collaboration.

AI Agents Code generation Vision

Reddit r/LocalLLaMA·May 31

mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released!

Mudler releases APEX GGUF quantizations of Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled with bundled MTP (multi-token prediction) head. Files enable self-speculative decoding via llama.cpp without separate draft model. Size +2.5% vs non-MTP version, MTP head quantized Q8_0 for high draft accuracy.

Qwen Code generation Open source

Simon Willison·May 30

How we contain Claude across products

Anthropic publishes detailed documentation on sandboxing techniques across Claude.ai, Claude Code, and Cowork. Uses gVisor (Claude.ai), Seatbelt/Bubblewrap (Claude Code local), and full VMs (Cowork). Includes process sandboxes, filesystem boundaries, and egress controls to prevent credential exfiltration.

Claude Claude Code Anthropic

Reddit r/LocalLLaMA·May 30

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

NVIDIA quantized Alibaba's Qwen3.6-35B-A3B model to NVFP4 (4-bit) using Model Optimizer. Weight reduction from 16 to 4 bits per parameter cuts GPU memory and disk size by ~3.06x. Benchmark results show minimal accuracy loss: MMLU Pro 85.6→85.0, GPQA Diamond 84.9→84.8.

Qwen Fine-tuning Benchmarks

Hacker News (AI)·May 30

OpenRouter raises $113M Series B

OpenRouter raises $113M Series B. The LLM API aggregation platform strengthens funding to expand model offerings and infrastructure capabilities.

OpenAI Business Infrastructure

The Decoder·May 30

Making AI chatbots helpful weakens their ability to simulate human behavior, large-scale study finds

Large-scale study (208,000 participants, 26 million responses) finds that training language models to be helpful weakens their ability to replicate human behavior. The effect worsens with each model generation. Prompting with demographic profiles (the "persona trick") provides no meaningful benefit for individual predictions.

Alignment Evals Papers

The Decoder·May 30

OpenAI's Codex can now operate your Windows PC autonomously, hunting bugs and testing apps on its own

OpenAI deploys Codex on Windows 11 with a 'Computer Use' feature that lets the AI autonomously control programs, test applications, and detect bugs. The ChatGPT mobile app allows users to launch and monitor these tasks remotely.

OpenAI Code generation AI Agents

ActuIA·May 29

Anthropic at $965B: $65 billion Series H, no European public funds in the round

Anthropic raises $65 billion in Series H funding, reaching a $965 billion valuation. No European public funds participated in the funding round.

Anthropic Funding Business

GitHub Trending·May 29

anthropics / claude-code

Claude Code is an agentic coding tool in the terminal that understands your codebase and executes routine tasks, explains complex code, and handles git workflows through natural language commands.

Claude Claude Code AI Agents

Vercel AI Blog·May 29

Function invocations now billed per unit

Vercel shifts to per-unit billing for function invocations. New rate: $0.0000006 per invocation (previously $0.60 per million) for Pro customers. Change effective next billing cycle.
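
The two rates quoted above describe the same price at different granularity, which a short calculation confirms. This is a sketch only: the `invocation_cost` helper is hypothetical, and it ignores any included allowances or other plan details not mentioned in the summary.

```python
from decimal import Decimal

# Per-invocation rate quoted in the announcement summary.
RATE_PER_INVOCATION = Decimal("0.0000006")


def invocation_cost(invocations: int) -> Decimal:
    """Cost in dollars for a number of invocations at the per-unit rate."""
    return RATE_PER_INVOCATION * invocations


# One million invocations at the per-unit rate matches the old
# per-million price of $0.60.
print(invocation_cost(1_000_000) == Decimal("0.60"))

# Per-unit billing also prices partial millions directly.
print(invocation_cost(250_000))
```

`Decimal` is used instead of `float` so that the very small per-unit rate does not pick up binary rounding error.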

Infrastructure Business

Le Big Data·May 29

Anthropic exceeds $965 billion thanks to its Series H

Anthropic raises $65 billion in Series H funding, reaching a $965 billion valuation. One of the largest funding rounds in the AI sector.

Anthropic Funding Business

Reddit r/LocalLLaMA·May 29

Liquid AI releases LFM2.5-8B-A1B

Liquid AI releases LFM2.5-8B-A1B, an 8B model with a 128K context window, 38T pre-training tokens, and large-scale RL. Doubled vocabulary for non-Latin languages. Supports tool chaining and complex tasks on entry-level laptops.

Open source Code generation AI Agents

arXiv cs.CL·May 29

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem introduces a structured scene-event episodic memory framework for long-horizon interactive agents. The system structures trajectories into organized memory units and uses anchor-sensitive retrieval to improve spatiotemporal question answering. Evaluated on Crafter, Jericho, SciWorld, and ALFWorld, S3Mem outperforms Vanilla RAG and Graph-NoReader in accuracy while using fewer evidence tokens.

RAG AI Agents Reasoning

arXiv cs.LG·May 29

Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation

Hybrid ML-expert framework for evaluating organic synthesis routes. DeepSets model trained on tree edit distance, fine-tuned with chemist annotations. Produces quantitative scores and explainable categories (Good/Plausible/Bad). Spearman correlation 0.78, top-1 accuracy 60.2% vs 17.5% baseline.

Papers Benchmarks Fine-tuning

arXiv cs.AI·May 29

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Large-scale literature search study: Deep Research pipeline increases recall from below 20% to above 80% on RollingEval-Jun25 (250-paper benchmark). Critical analysis of human reference lists as ground truth: only 51% judged moderately relevant vs 86-88% for best AI re-rankers. Humans cite direct collaborators 2.5x more often.

RAG Evals Benchmarks

arXiv cs.LG·May 29

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

Knowledge editing methods ROME and MEMIT modify transformer MLP weights. Authors identify a common subset of weights targeted across diverse edits using a binary mask that reverses 80% of edits on training set and 70% on test set. The mechanism suppresses rather than overwrites knowledge, explaining why changes fail to propagate to related facts.

Papers Reasoning AI safety

arXiv cs.AI·May 29

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Empirical analysis of 11 DeFi agents on Solana: treasuries retain $30M in paper gains while token holders collectively lost $191.7M. Top 1% of wallets capture 81.4% of gains. Token valuations disconnected from fundamentals (market-cap-to-AUM ratios >10,000x). Median returns negative across all platforms.

AI Agents Benchmarks Business

arXiv cs.CL·May 29

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

Study of lossy semantic text compression where an encoder strategically deletes text parts and an LLM reconstructs original content. Benchmarks 6 deletion strategies (uniform, frequency, entropy, LP-optimized, hybrid) on BBC News. WordFreq provides best cost/performance ratio; semantic methods excel at moderate compression; QLoRA fine-tuning competes with Gemini 2.0 Flash.

Benchmarks Reasoning Fine-tuning

arXiv cs.CL·May 29

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Empirical study of behavioral reproducibility in LLM agents with tool-calling capabilities. Researchers measure whether agents select the same tools, in the same order, with identical parameters, across repeated identical invocations. Focus on structured tool-calling interfaces with typed parameters and consequential side effects.
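
One way to picture the three levels of agreement described above (same tools, same order, identical parameters) is the sketch below. It is not the paper's metric: the `ToolCall` shape, the level names, and the "share of runs matching the most common behaviour" score are all assumptions chosen for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple

# A run is an ordered list of (tool name, parameters) pairs (hypothetical shape).
ToolCall = Tuple[str, Dict[str, Any]]


def canonical(run: List[ToolCall], level: str) -> str:
    """Reduce a run to a comparable string at the requested strictness level."""
    if level == "set":          # same tools, order ignored
        key: Any = sorted({name for name, _ in run})
    elif level == "sequence":   # same tools in the same order
        key = [name for name, _ in run]
    elif level == "exact":      # same order and identical parameters
        key = [[name, params] for name, params in run]
    else:
        raise ValueError(f"unknown level: {level}")
    # sort_keys makes parameter dicts compare equal regardless of key order.
    return json.dumps(key, sort_keys=True)


def agreement(runs: List[List[ToolCall]], level: str) -> float:
    """Fraction of runs that match the most common behaviour at this level."""
    if not runs:
        raise ValueError("need at least one run")
    keys = [canonical(run, level) for run in runs]
    most_common_count = Counter(keys).most_common(1)[0][1]
    return most_common_count / len(keys)


runs = [
    [("search", {"q": "x"}), ("fetch", {"id": 1})],
    [("search", {"q": "x"}), ("fetch", {"id": 2})],
    [("fetch", {"id": 1}), ("search", {"q": "x"})],
]
for level in ("set", "sequence", "exact"):
    print(level, agreement(runs, level))
```

In this toy data all three runs use the same set of tools, two of the three call them in the same order, and no two runs pass identical parameters, so the score drops as the comparison gets stricter.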

AI Agents Benchmarks AI safety

arXiv cs.AI·May 29

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

Masked diffusion models (MDMs) with confidence-based decoding fail on complex reasoning tasks. Confidence-aligned training amplifies errors by an order of magnitude on multi-digit addition. Random masking better preserves the logical trajectories required for reasoning.

Reasoning Papers Benchmarks

arXiv cs.AI·May 29

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Empirical study on LLM-generated reviews for scientific papers (ACL Rolling Review 2025 data). Findings: limited alignment between LLM and human reviews, substantial variation across prompts and models. Authors can 'game' LLM reviews through iterative revision workflows, increasing scores for up to 35% of tested papers.

Evals Benchmarks Alignment

arXiv cs.AI·May 29

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

Longitudinal analysis of ~12,000 Microsoft Bing Copilot users reveals that individual behavior patterns remain sticky over time despite population-level trends. Active users achieve higher success rates and tackle complex, professional tasks. The WildChat-4.8M dataset is skewed toward proficient power users.

Evals Benchmarks

arXiv cs.CL·May 29

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Aryabhata 2 is a STEM reasoning language model trained via reinforcement learning on GPT-OSS-20B. Developed by PhysicsWallah, it outperforms its base model on JEE/NEET competitive exams while reducing output tokens by up to 64%. Evaluated on AIME, HMMT, MMLU-Pro, and GPQA.

Reinforcement learning Reasoning Benchmarks

arXiv cs.CL·May 29

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval is a framework unifying retrieval across heterogeneous knowledge sources (unstructured text, relational tables, knowledge graphs). It translates natural-language queries into source-native queries, evaluated on 13 datasets and 309 knowledge bases.

RAG Vector search Papers

Simon Willison·May 29

datasette 1.0a31

Datasette 1.0a31 adds two major features: execution of write queries (INSERT/UPDATE/DELETE) and saving stored queries (private or shared). Permissions control access to sensitive operations like CREATE TABLE.

Tools Open source
