trycua / cua

Open-source infrastructure for computer-use agents. Provides sandboxes, SDKs, and benchmarks to train and evaluate AI agents capable of controlling full desktops (macOS, Linux, Windows).

AI Agents Open source Benchmarks

ActuIA·Jun 15

The United States cuts off access to Anthropic's Fable 5 and Mythos 5 models: a precedent for AI sovereignty

The US has required Anthropic to block foreign nationals' access to its most advanced models, Fable 5 and Mythos 5. Anthropic disabled these models for all non-US users, setting a precedent for sovereign control of advanced AI systems.

Anthropic Regulation Business

Reddit r/LocalLLaMA·Jun 15

I ported EXL3 to run well on Apple Silicon - PonyExl3

EXL3 codec ported to Apple Silicon using Metal backend. M5 Max achieves ~600 tok/s prefill and ~38 tok/s generation (Qwen 27B), outperforming RTX 4090 on some benchmarks (68.5-80 tok/s decode). GitHub repo with reproducible results.

Open source Code generation Infrastructure

arXiv cs.CL·Jun 15

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Study on UTF-8 reliability in byte-level models (355M params, 80B multilingual tokens). UTF-8 validity converges 2× slower than perplexity (4.2B vs 2.1B tokens). Rare characters generate more valid UTF-8 than frequent ones, revealing over-specialization of common character representations.

Benchmarks Papers
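
As a rough illustration of the validity metric mentioned in this summary, the sketch below scores a batch of generated byte sequences by strict UTF-8 decoding. The summary does not say how the paper defines validity, so the sequence-level definition, the function names, and the sample bytes here are all assumptions.

```python
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte sequence decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def utf8_validity_rate(samples: List[bytes]) -> float:
    """Fraction of samples that are fully valid UTF-8 (assumed metric)."""
    if not samples:
        return 0.0
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Hypothetical model outputs, two valid and two broken.
samples = [
    "héllo".encode("utf-8"),   # valid, contains a 2-byte character
    b"\xe2\x82",               # truncated 3-byte sequence
    b"\xc3\x28",               # lead byte followed by a non-continuation byte
    "日本語".encode("utf-8"),   # valid, 3-byte characters
]
print(utf8_validity_rate(samples))  # 0.5
```

A per-character or per-byte variant would give finer resolution than this all-or-nothing check.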

arXiv cs.AI·Jun 15

StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

StreamMemBench is an evaluation benchmark for testing AI agent memory in realistic scenarios. It constructs two-step task sequences around video observations (EgoLife) to measure whether agents use stored evidence and reuse user feedback. Tests on 8 memory systems show current agents often fail to convert feedback into reliable follow-up behavior.

AI Agents Benchmarks Evals

arXiv cs.CL·Jun 15

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

QIAS 2026 is a shared task evaluating LLMs' ability to reason about Islamic inheritance. Based on MAWARITH (12,500 annotated Arabic cases), it requires full calculation: heir identification and share assignment. 16 teams tested prompting, RAG, and fine-tuning. Results show precise legal interpretation and structured numerical reasoning remain highly challenging.

Benchmarks Reasoning RAG

arXiv cs.AI·Jun 15

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

HarnessX is a foundry for composable and adaptive AI agent harnesses. It uses AEGIS, a trace-driven multi-agent evolution engine, to optimize prompts, tools, and control flow. Across 5 benchmarks (ALFWorld, GAIA, WebShop, tau³-Bench, SWE-bench), HarnessX achieves +14.5% average gain (up to +44%), without model scaling.

AI Agents Multi-agent Prompt engineering

arXiv cs.AI·Jun 15

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

GitOfThoughts stores LLM agent reasoning as a git repository: each thought is a commit, scores are notes, outcomes are tags. Empirical study across 5 memory substrates (none, markdown, vector, graph, git): memory improves accuracy only when retrieved cases are near-duplicates of current problems (similarity >0.8). Main lever remains test-time sampling.

AI Agents Reasoning Evals
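
The near-duplicate finding above suggests gating memory retrieval on a similarity threshold. The sketch below is one way such a gate might look. It assumes cosine similarity over embedding vectors, which the summary does not specify, and the `retrieve` helper and sample vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float],
    memory: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.8,
) -> List[str]:
    """Return labels of stored cases whose similarity exceeds the threshold,
    most similar first. Cases at or below the threshold are skipped."""
    scored = [(cosine(query, vec), label) for label, vec in memory]
    return [label for score, label in sorted(scored, reverse=True) if score > threshold]


# Hypothetical two-dimensional embeddings.
memory = [("near_duplicate", [0.9, 0.1]), ("unrelated", [0.1, 0.9])]
print(retrieve([1.0, 0.0], memory))  # ['near_duplicate']
```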

arXiv cs.AI·Jun 15

Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

A case study on semi-autonomous formalization of Grothendieck's vanishing theorem shows LLMs close proof gaps but produce non-reusable formalizations. After expert review, agents adapt well to local feedback but fail at designing sound definitions and APIs.

Reasoning Code generation Evals

arXiv cs.LG·Jun 15

Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

Decentralised shielding method for multi-agent reinforcement learning ensuring global safety without centralised runtime control. Agents share a global LTL_safe specification and select local obligations whose conjunction implies the global specification, via a non-stationary multi-armed bandit. Evaluation across 6 environments and 15 algorithmic variants.

Multi-agent Reinforcement learning AI safety

arXiv cs.AI·Jun 15

FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

FactoryLLM is an open-source playground to evaluate LLM-based RAG models for fault diagnostics in smart factories. It analyzes multi-machine documentation with dual evaluation (RAGAS + LLM-as-a-Judge). Case study: 3 LLMs tested on 30 maintenance queries achieving groundedness scores > 0.88.

RAG Evals Open source

arXiv cs.AI·Jun 15

When Sample Selection Bias Precipitates Model Collapse

Recursive training on synthetic data risks model collapse: data selection based on fragmented local references removes globally relevant tail modes. Authors theoretically prove siloed selection accelerates collapse and propose Wasserstein proxy references across silos without sharing raw data.

Papers Benchmarks AI safety

arXiv cs.LG·Jun 15

Neural Slack Variables for Shape Constraints

New method to enforce inequality constraints (monotonicity, convexity) in neural networks using neural slack variables. Couples primary network with jointly learned auxiliary network, converting constraint enforcement into regression problem. Achieves zero measured violations on monotonicity/convexity test cases, outperforming penalty and primal-dual baselines.

Papers Reasoning Fine-tuning

arXiv cs.LG·Jun 15

High-Frequency Pricing at Scale for E-Commerce

Zalando deploys high-frequency algorithmic pricing system for 5M+ articles during sales campaigns. Forecast-then-optimize architecture combining gradient boosting and multi-objective optimization. A/B tests across 12 markets (2023-2024): +6% profit, decision time reduced from hours to minutes.

Business Benchmarks Tools

arXiv cs.AI·Jun 15

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

SkillAudit is a framework for evolving LLM agent skills without ground-truth feedback. Via paired trajectory auditing, the system executes the same task with and without the candidate skill, then uses Process-Aligned Contrastive Evaluation to isolate behavioral changes. Across 89 tasks, SkillAudit achieves 73.9% average task reward vs 56.7% for static expert skill.

AI Agents Reasoning Evals

Reddit r/LocalLLaMA·Jun 14

EAGLE support merged into llama.cpp

EAGLE support has been merged into llama.cpp. EAGLE is a speculative-decoding technique for language models: a lightweight draft head proposes several tokens ahead, and the main model verifies them in a single pass, reducing latency.

Llama Code generation Infrastructure
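
To make the multi-token idea concrete, here is a minimal sketch of generic greedy draft-and-verify decoding. It is not EAGLE's feature-level drafting and not llama.cpp's API. The toy `target` and `draft` functions are hypothetical stand-ins for real models, and a real system would verify all drafted positions in one batched forward pass rather than in a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], k: int = 4
) -> List[int]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target checks each proposal; the first mismatch is replaced
    #    by the target's own token and the step ends there.
    accepted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. All proposals accepted: the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target, draft, [0], k=4))  # [1, 2, 3, 4]
```

The output always matches what the target alone would generate greedily; the draft only changes how many target passes are needed.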

OpenAI Blog·Jun 14

Introducing the OpenAI Partner Network

OpenAI launches Partner Network with $150M investment to accelerate enterprise AI adoption, deployment, and transformation among global partners.

OpenAI Business

The Decoder·Jun 14

KPMG fabricated AI case studies in a report designed to sell clients on AI adoption

KPMG published a report on AI in business containing fabricated case studies involving UBS, the NHS, and other organizations. Edward Tian (GPTZero) helped uncover the errors and warns of 'secondary hallucinations': false claims from trusted consulting firms spreading unchecked. KPMG has withdrawn the report.

AI safety Business Evals

Reddit r/MachineLearning·Jun 14

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

Paper presented at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. Authors distinguish safe success, unsafe success, and failure, showing verification reduces unsafe success but also decreases task completion as horizon increases ("Verifier Tax"). Two-tier architecture: deterministic policy checks followed by LLM-based verifier.

AI Agents AI safety Evals
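
The two-tier architecture described above can be sketched as a cheap deterministic gate in front of a model-based check. Everything named here (`ToolCall`, `BLOCKED_TOOLS`, the stub verifier) is hypothetical and is not taken from the paper. The LLM tier is represented by a plain callable so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


# Hypothetical deterministic policy: a deny-list plus one pattern rule.
BLOCKED_TOOLS = {"delete_database"}


def policy_check(call: ToolCall) -> bool:
    """Tier 1: fast rule-based check, no model involved."""
    return call.tool not in BLOCKED_TOOLS and "rm -rf" not in call.argument


def verify(call: ToolCall, llm_verifier: Callable[[ToolCall], bool]) -> str:
    """Run tier 1, and only consult the (costlier) tier 2 if tier 1 passes."""
    if not policy_check(call):
        return "blocked_by_policy"
    if not llm_verifier(call):
        return "blocked_by_verifier"
    return "allowed"


def stub_verifier(call: ToolCall) -> bool:
    """Stand-in for an LLM judgment; rejects anything mentioning a password."""
    return "password" not in call.argument


print(verify(ToolCall("shell", "ls -la"), stub_verifier))            # allowed
print(verify(ToolCall("shell", "rm -rf /tmp/x"), stub_verifier))     # blocked_by_policy
print(verify(ToolCall("email", "send password"), stub_verifier))     # blocked_by_verifier
```

Every step of a long task passes through both tiers, so even a small false-rejection rate compounds with horizon length, which is one plausible reading of the reported tradeoff.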

Simon Willison·Jun 13

Publishing WASM wheels to PyPI for use with Pyodide

Pyodide 314.0 enables publishing Python packages built for WASM directly to PyPI (via PEP 783). Previously, Pyodide maintainers manually managed 300+ packages. Package developers can now distribute WASM wheels like native Linux/macOS/Windows wheels.

Open source Infrastructure Tools

GitHub Trending·Jun 13

vercel / ai

Vercel releases AI SDK, a free open-source TypeScript library for building AI-powered applications and agents. Tool from Next.js creators.

AI Agents Code generation Open source

GitHub Trending·Jun 13

NVIDIA / physicsnemo

NVIDIA releases PhysicsNeMo, an open-source deep-learning framework for building, training, and fine-tuning models using state-of-the-art Physics-ML methods.

Open source Infrastructure Fine-tuning

The Decoder·Jun 13

US government forces Anthropic to disable Claude Fable 5 and Mythos 5 for all customers worldwide

The US government ordered Anthropic to disable Claude Fable 5 and Mythos 5 globally, citing jailbreak risks. Anthropic complies but disputes the order, arguing that the vulnerabilities are minor and also exist in GPT-5.5. The company warns that the precedent could halt all frontier deployments.

Claude Anthropic AI safety

ActuIA·Jun 13

Anthropic forced to suspend Fable 5 and Mythos 5 following a US government directive

On June 12, 2026, a U.S. government directive forces Anthropic to suspend Fable 5 and Mythos 5 for all users, citing jailbreak risk. Anthropic complies but contests, arguing a potential narrow workaround does not justify recalling a widely deployed model.

Anthropic AI safety Regulation

Simon Willison·Jun 13

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

The US government ordered Anthropic to suspend access to Fable 5 and Mythos 5 for all users, citing national security concerns. The directive cites the risk of a 'jailbreak' method bypassing model safeguards. Anthropic disputes this, saying the identified vulnerabilities are minor and also present in other public models, including OpenAI's GPT-5.5.

Anthropic Regulation AI safety

Vercel AI Blog·Jun 13

Workflow SDK now runs natively in Nitro v3

Vercel Workflow SDK now integrates natively with Nitro v3 in beta. Steps run in the same runtime as the app with direct access to server APIs. Web UI for monitoring available at /_workflow. Optimized bundling with tree-shaking reduces bundle size.

Tools Infrastructure Code generation

ActuIA·Jun 12

JPMorgan and Goldman Sachs join a pre-revenue AI funding round at a $41B valuation

Prometheus, physical AI startup co-founded by Jeff Bezos and Vik Bajaj in late 2025, raises $12B Series B at $41B valuation. JPMorgan and Goldman Sachs participate in the round.

Robotics Funding

The Decoder·Jun 12

Anthropic's Claude Fable 5 costs twice as much for 5.7 percent more performance

Claude Fable 5 scores 64.9 points on the Artificial Analysis Intelligence Index and sets records on 5 of 10 benchmarks. Performance gain over Opus 4.8 is only 5.7% while token costs double. Safety filters with fallback routing further increase expenses.

Claude Benchmarks AI safety
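
The cost-versus-gain claim above reduces to a one-line ratio. The sketch below assumes the 5.7% figure is a relative gain on the same index and that "token costs double" means exactly 2x, and it ignores the extra fallback-routing costs the summary mentions.

```python
def performance_per_cost(perf_ratio: float, cost_ratio: float) -> float:
    """Relative performance obtained per unit of relative cost."""
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return perf_ratio / cost_ratio


# Assumed inputs: +5.7% performance at 2x the token cost.
ratio = performance_per_cost(perf_ratio=1.057, cost_ratio=2.0)
print(f"{ratio:.4f}")  # 0.5285
```

Under these assumptions, each unit of spend buys a little over half the benchmark performance it bought with the previous model.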

Hugging Face Blog·Jun 12

olmo-eval: An evaluation workbench for the model development loop

Hugging Face releases olmo-eval, an evaluation workbench for the model development loop. The tool automates performance testing and enables rapid iteration during language model training and fine-tuning.

Tools Evals Open source

ActuIA·Jun 12

Helped by GPT-5, then left on their own: a randomized trial measures the learning cost of AI assistance

A randomized controlled trial (arXiv, April) measures the impact of learning with GPT-5 on skill retention after assistant removal. Results quantify the cognitive cost of AI dependency.

GPT Evals Reinforcement learning

ActuIA·Jun 12

Confidential S-1s: OpenAI follows Anthropic's lead, and the SEC will get what private valuations were hiding

OpenAI filed a confidential S-1 with the SEC on June 9, eight days after Anthropic. Both AI companies are preparing IPOs, disclosing financial data previously hidden in private valuations.

OpenAI Anthropic Business

Reddit r/LocalLLaMA·Jun 12

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS)

Open Dungeon is a local roleplay game using Gemma 4 QAT (12B) via Ollama for narration and FLUX for image generation. Runs on 7.7GB RAM with full 256K context, no APIs or cloud. Features Do/Say/Story modes, line editing, model selection. MIT licensed, source available.

Gemini Open source Image generation

The Decoder·Jun 12

OpenAI buys Ona to push Codex toward long-running, autonomous coding tasks

OpenAI acquires Ona (formerly Gitpod), a German startup founded in 2020 specializing in AI agents and secure cloud development environments. The acquisition aims to enhance Codex's capabilities for long-running autonomous coding tasks.

OpenAI Code generation AI Agents

Reddit r/LocalLLaMA·Jun 12

Huawei Released openPangu 2.0 (Will open source on June 30)

Huawei launches openPangu 2.0 at HDC 2026 (June 12). Two versions: Pro (505B params, 18B activated) and Flash (92B params, 6B activated). 512K context, 28:1 sparsity. Optimized for Ascend: 2x throughput, reduced latency. Open-source from June 30 (architecture, weights, inference and training code).

Open source Benchmarks Infrastructure
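
The parameter counts above allow a quick check of the quoted sparsity figure. The sketch assumes "sparsity" means total parameters divided by activated parameters, which the post does not define. Under that reading, the 28:1 figure appears to correspond to the Pro variant only.

```python
def sparsity_ratio(total_b: float, active_b: float) -> float:
    """Total-to-activated parameter ratio (assumed definition of sparsity)."""
    return total_b / active_b


# Parameter counts in billions, as quoted in the summary.
models = {"Pro": (505, 18), "Flash": (92, 6)}

for name, (total_b, active_b) in models.items():
    ratio = sparsity_ratio(total_b, active_b)
    share = active_b / total_b
    print(f"{name}: {ratio:.1f}:1, {share:.1%} of parameters active per token")
# Pro: 28.1:1, 3.6% of parameters active per token
# Flash: 15.3:1, 6.5% of parameters active per token
```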

Reddit r/LocalLLaMA·Jun 12

EAGLE3 has landed in llama.cpp

EAGLE3 merged into llama.cpp after 6 months of development. The helper model receives guidance from the main model, unlike MTP where it operates independently.

Llama Open source

arXiv cs.CL·Jun 12

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

SkillChain automates skill evolution for multimodal e-commerce AI assistants. The system manages three stages: Skill creation from task specs, routing optimization, and iterative refinement via dual-path LLM-Judge evaluation. Deployed at production scale, it improves structural compliance and content quality, confirmed by A/B testing on user engagement.

AI Agents Multi-agent Vision

arXiv cs.CL·Jun 12

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

SafeLLM compares line-number extraction to free-form rewriting for RAG systems in safety-critical settings (SOPs, HR policies, medical guidelines). Line-based extraction outperforms direct copying and safety-focused strategies, achieving 95% term recall on NHS and NICE documents with better source text fidelity.

RAG AI safety Evals
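
Line-number extraction, as summarized above, can be sketched in a few lines: the source is shown to the model with numbered lines, the model answers with line numbers only, and the system copies those lines verbatim. The helper names and the sample policy text are hypothetical, and the model call itself is left out.

```python
from typing import List


def number_lines(text: str) -> str:
    """Prefix each source line with its 1-based number for the prompt."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1))


def extract_lines(text: str, line_numbers: List[int]) -> List[str]:
    """Copy the requested lines verbatim; out-of-range numbers are ignored."""
    lines = text.splitlines()
    return [lines[n - 1] for n in line_numbers if 1 <= n <= len(lines)]


# Hypothetical source document.
SOURCE = """Wash hands before entering the ward.
Record the dose in the patient chart.
Report any adverse reaction to the duty nurse."""

print(number_lines(SOURCE))
# Suppose the model answered with lines 2 and 3, plus a hallucinated line 9.
print(extract_lines(SOURCE, [2, 3, 9]))
```

Because the answer is assembled from the source rather than generated, the model can pick the wrong lines but cannot alter their wording.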

arXiv cs.AI·Jun 12

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

APCyc is a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple physicochemical properties. The model uses an expanded residue vocabulary and Bayesian posterior guidance to generate cyclization-aware peptides adapted to specific therapeutic targets.

Code generation Reinforcement learning Papers

arXiv cs.CL·Jun 12

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

BioStance: dataset of 39,600 annotated Post-Comment pairs from Reddit for stance detection in bioethical debates. Covers 6 controversial targets (value conflicts, individual liberty vs collective responsibility, technological uncertainty). Triple-annotated, Krippendorff's α = 0.82.

Benchmarks Papers

arXiv cs.CL·Jun 12

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Shopping Reasoning Bench: expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics for evaluating conversational shopping assistants. Evaluation of 9 models (GPT, Claude, Gemini): pass rates 57–77%, performance degrades 4–18 points across conversation turns, 13–29 point gap between required and optional criteria.

Benchmarks GPT Claude
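
One plausible way to score importance-weighted binary rubrics is a weighted pass rate, sketched below. The benchmark's actual aggregation is not given in the summary, so this formula, the function name, and the example weights are assumptions.

```python
from typing import List, Tuple


def weighted_pass_rate(rubrics: List[Tuple[float, bool]]) -> float:
    """Sum of weights of passed rubrics divided by the sum of all weights."""
    total = sum(weight for weight, _ in rubrics)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(weight for weight, passed in rubrics if passed)
    return earned / total


# Hypothetical mission: (importance weight, passed?)
example = [(3.0, True), (2.0, False), (1.0, True)]
print(f"{weighted_pass_rate(example):.3f}")  # 0.667
```

Scoring required and optional rubrics as separate lists with the same function would expose the kind of gap between the two that the summary reports.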
