Vector Linking via Cross-Model Local Isometric Consistency

Method to establish correspondences between embedding vectors from different black-box encoders. Exploits the local geometric consistency of independently trained contrastive encoders: short-range distances are preserved up to a scale factor. Uses iterative reference-based geometric embedding hashing with paired anchors to recover vector links. Code released.

Embeddings Vector search Benchmarks

arXiv cs.AI·4d ago

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

Novel transformer-based architecture for autonomous resource management in heterogeneous satellite clusters (optical and SAR). Uses model-free reinforcement learning for real-time decision-making in Earth Observation missions. Demonstrates significant performance improvements and transferability across varying cluster sizes.

Multi-agent Reinforcement learning Reasoning

arXiv cs.AI·4d ago

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

Persona-based evaluation framework for pluralistic alignment in generative AI. Replaces monolithic benchmarks with structured manifold of synthetic cognitive profiles representing diverse human perspectives. Reveals systematic degradation of persona coherence under sequential inference, suggesting need for dynamic regulatory mechanisms.

Alignment Evals Benchmarks

arXiv cs.AI·4d ago

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

COMPASS is a safety alignment framework for multi-step LLM search agents. It combines Cognitive Tree Exploration (CTE) to synthesize stealthy attack trajectories and Introspective Step-wise Alignment (ISA) to supervise risky intermediate actions. Results: favorable safety-utility trade-off requiring substantially less training data.

AI Agents AI safety Alignment

arXiv cs.CL·4d ago

Counterfactual Graph for Multi-Agent LLM Calibration

Multi-agent LLM systems assume agreement between agents indicates reliability. Authors show communication induces correlated failures and false consensus. They propose CAGE-CAL, a counterfactual agent-graph calibration framework comparing post-communication dependencies with no-communication scenarios to adjust confidence accordingly.

Multi-agent AI Agents Reasoning

arXiv cs.CL·4d ago

AI for Monitoring and Classifying Data Used in Research Literature

Method to detect and classify dataset usage in research literature using a multitask GLiNER framework. Combines dataset mention extraction, relation identification, and usage-context classification. Leverages synthetic data generation and LLM-based revalidation to address label scarcity.

Papers Benchmarks RAG

arXiv cs.CL·4d ago

Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study

Comparative study of zero-shot multi-label topic classification using knowledge graphs extracted from documents. Framework tested on 15 LLMs and 8 datasets: keyword-enhanced variant outperforms baseline, graph augmentation helps small models but hurts large ones, and self-consistency decoding increases costs fivefold without performance gains.

RAG Benchmarks Papers

arXiv cs.AI·4d ago

Procedural Generation of First Person Shooter Maps using Map-Elites

Study applying MAP-Elites (quality diversity algorithm) to procedural generation of FPS levels. Two novel representations (Point-Line, Spatial-Layout) improve map characterization. Topological and emergent metrics defined. MESB generates map populations with higher diversity and quality than previous approaches.

Benchmarks Papers

Reddit r/LocalLLaMA·4d ago

Built a fun weekend project: An MCP server for generating Mandelbrot visualizations

Developer built an MCP server enabling LLMs to explore the Mandelbrot set with rendering tools, presets for interesting regions (Seahorse Valley, Elephant Valley), inspection tools for iteration counts and viewport settings, color palette selection, and HTML gallery generation. Tested with Qwen 3.6-35B. GitHub: openmandel.

MCP Qwen Tools
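
The iteration counts that inspection tools of this kind report are usually escape-time counts. A minimal sketch of that standard computation follows; it is not taken from the openmandel code, and `escape_time` and `max_iter` are illustrative names.

```python
def escape_time(c: complex, max_iter: int = 100) -> int:
    """Count iterations of z -> z*z + c until |z| exceeds 2.

    Returns max_iter when the orbit stays bounded for the whole budget,
    i.e. c is treated as inside the Mandelbrot set at this resolution.
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


# Points inside the set exhaust the budget; points outside escape quickly.
inside = escape_time(-1 + 0j)
outside = escape_time(1 + 0j)
```

A renderer would call this once per pixel and map the returned count to a color palette.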

Simon Willison·4d ago

datasette 1.0a32

Datasette 1.0a32 fixes a bug with INSERT ... RETURNING queries via the new /db/-/execute-write endpoint and multiple base_url issues found during Service Worker experiments.

Tools Open source

Reddit r/LocalLLaMA·4d ago

Qwen3.6-35B vs Gemma4-26B on 7900 XTX

Benchmark on a Radeon 7900 XTX: Qwen3.6-35B vs Gemma4-26B with reasoning enabled. Qwen generates about 2x more tokens (14,811 vs 7,386), so Gemma finishes ~20% sooner end-to-end (95.6s vs 118.8s). Qwen with MTP reaches 130 tok/s vs 78 tok/s, but its higher token count becomes the bottleneck. Overall quality is close, with results that differ by task.

Qwen Gemini Reasoning
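
The trade-off in this benchmark can be checked with a few lines of arithmetic. The sketch below uses only the token counts and wall times quoted in the summary; the variable and function names are illustrative.

```python
# Figures quoted in the summary above.
runs = {
    "Qwen3.6-35B": {"tokens": 14_811, "seconds": 118.8},
    "Gemma4-26B": {"tokens": 7_386, "seconds": 95.6},
}


def effective_tok_per_s(run):
    """Generated tokens divided by end-to-end wall time."""
    return run["tokens"] / run["seconds"]


qwen, gemma = runs["Qwen3.6-35B"], runs["Gemma4-26B"]
token_ratio = qwen["tokens"] / gemma["tokens"]        # how many more tokens Qwen emits
time_saved = 1 - gemma["seconds"] / qwen["seconds"]   # Gemma's wall-time reduction

print(f"token ratio: {token_ratio:.2f}x")
print(f"Gemma wall-time reduction: {time_saved:.1%}")
for name, run in runs.items():
    print(f"{name}: {effective_tok_per_s(run):.1f} tok/s end-to-end")
```

Qwen's per-token rate stays higher even end-to-end, but producing roughly twice as many tokens more than cancels that advantage in wall time.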

Reddit r/LocalLLaMA·5d ago

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama)

mlx-Chronos is an open-source CLI tool and community leaderboard to compare MLX inference engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama). Measures TTFT, throughput, RAM, and thermal state with a standardized methodology. The leaderboard currently holds only M2 8GB results; the author is seeking M3/M4 submissions.

Open source Benchmarks Infrastructure

Reddit r/MachineLearning·5d ago

Built an AI Accelerator and opensourced it. [P]

Developer open-sources an AI accelerator on FPGA (AWS F2), based on RocketChip/RISC-V, with the attention mechanism built into the hardware. Reported speedups: 225× on vanilla attention, 96× on TinyBERT, 50× on ViT, 30× on GPT-2 prefill. Native BF16 support.

Infrastructure Open source Benchmarks

The Decoder·5d ago

Ask AI what goes with chicken and the answer depends on whether it learned from recipes or molecules

Kaikaku.AI releases Epicure, three AI models separating ingredients by recipe compatibility or chemical similarity. Trained on 4.14 million multilingual recipes and FlavorDB, they generate different recommendations per source. The chemistry-only model outperforms recipe-based variants on taste and nutrition classification without direct data.

Fine-tuning Benchmarks Tools

Reddit r/LocalLLaMA·5d ago

Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models

llama.cpp benchmark comparing Windows 11 and Linux (Ubuntu 26.04) on Nvidia GPUs (RTX 5080 + 2× RTX 5060 Ti). No significant performance difference: Qwen 3.5 122B achieves PP 300/TG 28 on Windows vs PP 290/TG 28.5 on Linux; Qwen 3.5 397B achieves PP 140/TG 16 vs PP 150/TG 15.2 (PP = prompt processing, TG = token generation, in tokens per second). Tests were repeated 4 times with a recent llama.cpp build that includes VRAM optimization.

Llama Qwen Benchmarks

The Decoder·5d ago

Anthropic study finds men use AI coding agents more than twice as often as women in social science research

An Anthropic study finds researchers with typically male names use AI coding agents more than twice as often as those with typically female names, controlling for discipline and career level. Economists lead at 39%, education researchers at 4%. The gender gap for coding agents far exceeds that for general AI use.

Anthropic AI Agents Code generation

Reddit r/MachineLearning·5d ago

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama) [P]

mlx-Chronos is an open-source CLI tool and community leaderboard to benchmark local LLM inference engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama). Measures TTFT, throughput, RAM, and thermal state with standardized methodology. Currently populated only with M2 8GB results.

Open source Benchmarks Infrastructure

The Decoder·5d ago

AI search agents often confirm what they already know instead of actually researching the web

AI search agents like GPT-5.4 and Kimi K2.6 mostly confirm their training knowledge rather than genuinely researching the web. Researchers at Harbin Institute of Technology demonstrated this using LiveBrowseComp, a benchmark based on events from the last 90 days. Without relying on training memory, performance collapses.

Benchmarks AI Agents GPT

Reddit r/LocalLLaMA·5d ago

Cost Analysis of my $6.4k Local LLM Server

TCO analysis of a $6.4k local LLM server with 4x MI100 32GB GPUs and EPYC 48-core CPU. Runs 4 llama.cpp instances with Qwen 3.6 27B on ROCm. Processes 20.4M input tokens and 1.32M output tokens daily. Equivalent API cost: $3,701/year ($308/month). Author emphasizes proper hardware depreciation accounting for realistic TCO.

Open source Infrastructure Llama
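
The headline figures can be turned into a rough payback estimate. The sketch below uses the $6.4k hardware price and the $3,701/year API-equivalent cost from the summary; the running-cost parameter is a hypothetical placeholder, since the summary gives no power or depreciation figures.

```python
SERVER_COST_USD = 6_400              # hardware price from the post title
API_EQUIVALENT_USD_PER_YEAR = 3_701  # equivalent API spend from the summary

monthly = API_EQUIVALENT_USD_PER_YEAR / 12
naive_payback_years = SERVER_COST_USD / API_EQUIVALENT_USD_PER_YEAR


def payback_years(hardware_usd, api_usd_per_year, running_usd_per_year):
    """Years until avoided API spend covers the hardware, net of running costs.

    running_usd_per_year (power, maintenance) is an assumption supplied by
    the caller; it is not a figure from the original post.
    """
    net_saving = api_usd_per_year - running_usd_per_year
    if net_saving <= 0:
        return float("inf")  # the server never pays for itself
    return hardware_usd / net_saving


print(f"API-equivalent per month: ${monthly:.0f}")
print(f"payback ignoring running costs: {naive_payback_years:.2f} years")
```

Any nonzero running cost lengthens the payback period, which is why the author's point about accounting for depreciation matters for a realistic TCO.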

Simon Willison·5d ago

Running Python ASGI apps in the browser via Pyodide + a service worker

Simon Willison used Claude Opus 4.8 via Claude Code to implement running Python ASGI apps in the browser via Pyodide and Service Workers. This approach replaces the previous Web Workers implementation, enabling JavaScript execution and fixing Datasette Lite limitations. Working demos are available.

Claude Code Code generation Tools

The Decoder·6d ago

Attackers abuse shared ChatGPT and Claude chats to spread malware

Attackers exploit ChatGPT and Claude's chat-sharing features to distribute malware. Fake chats mimic error messages or installation guides and bypass security tools by being hosted on trusted domains.

AI safety

Reddit r/LocalLLaMA·6d ago

Vidai Community is now available: one Rust binary for cost attribution, guardrails and multi-provider routing on every LLM call

Vidai Community, an open-source Rust binary, unifies cost attribution, guardrails, and multi-provider routing for LLM calls. Integration is a one-line change of base_url (OpenAI/Anthropic/Google). Offers per-user/team/model cost tracking and hard budgets, with a reported 1.95ms median overhead and 21,803 RPS on a single node.

Tools Infrastructure Open source

Reddit r/MachineLearning·6d ago

What I learned building a debugger for PyTorch training loops and how it changed how I think about failure diagnosis [D]

Developer built NeuralDBG, a PyTorch debugger that automatically detects training failures (vanishing/exploding gradients, data anomalies). Key insight: failures are layer-localized, not global. Effective monitoring: gradient norm transitions per layer rather than raw histograms. Open-source tool available on PyPI.

Tools Code generation Open source
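
The idea of watching per-layer gradient norm transitions rather than raw histograms can be sketched in plain Python. This is an illustration of the general technique, not NeuralDBG's API: `flag_norm_transitions`, the `ratio` threshold, and the sample history are all hypothetical. In a real PyTorch loop the norms would come from something like `param.grad.norm()` after `backward()`.

```python
def flag_norm_transitions(history, ratio=10.0, eps=1e-12):
    """Flag (layer, step, kind) where a layer's gradient norm jumps or collapses.

    history maps a layer name to its gradient L2 norms, one per training step.
    A transition is flagged when consecutive norms differ by more than `ratio`
    in either direction.
    """
    flagged = []
    for layer, norms in history.items():
        for step in range(1, len(norms)):
            prev, curr = norms[step - 1] + eps, norms[step] + eps
            change = curr / prev
            if change > ratio:
                flagged.append((layer, step, "exploding"))
            elif change < 1.0 / ratio:
                flagged.append((layer, step, "vanishing"))
    return flagged


# Hypothetical per-layer norm traces over four steps.
history = {
    "encoder.0": [1.0, 0.9, 1.1, 1.0],      # healthy
    "encoder.5": [1.0, 0.8, 0.01, 0.0001],  # collapses from step 2
    "head": [1.0, 1.2, 30.0, 28.0],         # spikes at step 2
}
flags = flag_norm_transitions(history)
print(flags)
```

Because each flag carries a layer name, the output localizes the failure instead of reporting a single global statistic.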

Reddit r/LocalLLaMA·6d ago

made a local voice AI for windows you can talk to in any language. open source, bring your own key

Shadow AI is an open-source (AGPL-3.0) local voice assistant for Windows. Natural multilingual conversations, local web search via SearXNG, persistent memory, optional Google integrations (Gmail, Calendar, Drive). Uses user's free Gemini API key, zero remote servers.

Voice Gemini Open source

Reddit r/LocalLLaMA·6d ago

Anyone using Flash Attention 2 (ai-bond) on their V100's? How is the performance?

User benchmarks Flash Attention 2 (ai-bond) on V100. Results show 7-24x speedup in backward pass, memory reduction up to 91.9% (323.4 MB saved). Thinking time before answering minimized. Numerical validation passes on causal and non-causal configurations.

Infrastructure Benchmarks Open source

Reddit r/LocalLLaMA·6d ago

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

MTP (Multi-Token Prediction) benchmark on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp. Result: 3.34x speedup (132.52 vs 39.69 tok/s). vLLM outperforms llama.cpp on Gemma 4; llama.cpp solid on Qwen. No confirmed quality degradation, VRAM overhead negligible.

Gemini Qwen Code generation

The Decoder·6d ago

OpenAI is giving away its life sciences AI model to help governments prepare for the next pandemic

OpenAI is offering its life sciences AI model GPT-Rosalind for free through the Rosalind Biodefense program to help governments prepare for future pandemics. Early partners include Lawrence Livermore National Laboratory, Johns Hopkins, and CEPI.

OpenAI GPT AI safety

Le Big Data·May 29

Airbus teams up with Mistral AI to develop sovereign AI for aerospace

Airbus partners with Mistral AI to develop sovereign artificial intelligence in the aerospace sector. The partnership aims to integrate secure AI models into the group's operations and processes.

Mistral Business AI safety

ActuIA·May 29

Why Nvidia is betting on Decart, an AI start-up that can also optimize competitors' chips

Nvidia invests $300M in Decart, a startup focused on world models and software optimization. Nvidia's participation aims to control an optimization layer capable of running on its chips and those of competitors.

Infrastructure Business Funding

ActuIA·May 29

HR tools and artificial intelligence: Europe pushes back high-risk obligations to December 2027

The EU postpones enforcement of the obligations for high-risk AI systems in HR tools to December 2027. A provisional political agreement reached on May 7, 2026 on the Digital Omnibus on AI amends the timeline of Regulation 2024/1689.

Regulation AI safety

arXiv cs.CL·May 29

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

Defect grading framework for power transmission equipment using MLLM. In-context learning on commercial models, chain-of-thought Q&A generation to reduce manual annotation, then fine-tuning Qwen3-VL-8B via LoRA. SOTA on three grading tasks.

Qwen Vision Fine-tuning

arXiv cs.AI·May 29

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Study on long chain-of-thought traces used for LLM supervised fine-tuning. Researchers identify "harmful continuation": when reasoning continues after the answer is sufficiently supported. Removing these continuations improves fine-tuning outcomes. They propose HCC (Harmful Continuation Cut), a lightweight proxy to detect these boundaries.

Reasoning Fine-tuning Papers

arXiv cs.AI·May 29

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

BEAMS establishes benchmarks to evaluate AI tools for modeling and simulation. The open-source sd ai project tests multiple LLMs on tasks including causal translation, model iteration, and causal reasoning. Results show AI tools perform better at qualitative discussion than causal reasoning and quantitative error fixing.

Benchmarks Evals Reasoning

arXiv cs.CL·May 29

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

M2R (Micro-Macro Retrieval) is a retrieve-while-generate framework reducing hallucinations in long-form LLM generation. It combines macro retrieval (external evidence) and micro retrieval (key information from reasoning) to maintain proximity between factual data and outputs. Trained via reinforcement learning with rule-based rewards.

RAG Reinforcement learning

arXiv cs.CL·May 29

Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

SovSim, a multi-agent simulation framework, evaluates how 11 state-of-the-art LLMs manage shared resources under asymmetric power structures. Finding: introducing an agent with disproportionate power (boss/king) causes an 87.3% degradation in survival rate, along with cooperation breakdowns, compared to symmetric settings.

Multi-agent AI Agents Benchmarks

arXiv cs.CL·May 29

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

LLMBridge is an LLM-based system for end-to-end referential bridging resolution in English. The pipeline combines heuristic pre/post-processing with LLM natural language inference capabilities. Evaluated on ISNotes, BASHI, and GUMBridge, it outperforms previous state-of-the-art systems on all three datasets in both end-to-end and gold anaphor settings.

Papers Benchmarks Reasoning

arXiv cs.AI·May 29

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Theoretical paper on stabilizing off-policy temporal-difference learning with function approximation. Proposes BA-TDC and BA-TDRC, replacing TDC's auxiliary matrix with behavior Bellman matrix. Linear analysis with convergence proof under Hurwitz stability condition; experiments on Markov chains and classical counterexamples.

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 29

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

STHTD-MP, a new off-policy temporal-difference method, replaces the covariance metric with the behavior-policy Bellman matrix in the primal-dual saddle-point formulation. Formal convergence analysis and spectral comparison with GTD2-MP show potential gains on benchmarks (Random Walk, Boyan Chain).

Reinforcement learning Papers Benchmarks

arXiv cs.AI·May 29

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Redpanda introduces an Agentic Data Plane architecture using out-of-band metadata channels to enforce security policies, data classifications, and behavioral constraints outside the agent's read/write path. These channels prevent hallucinations and adversarial manipulation while maintaining tamper-proof audit trails. Demonstrated with a multi-agent portfolio rebalancing system.

AI Agents Multi-agent AI safety

arXiv cs.AI·May 29

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT), a 306M-parameter model augmenting GPT-2 Small, incorporates category-theoretic and cognitive-science-inspired components. On WikiText-103, CCT achieves 21.27 validation perplexity versus 24.19 for GPT-2 Small baseline, a 12% relative reduction (2.92 PPL). Ablations show simplicial message passing accounts for 84% of the improvement.

GPT Papers Benchmarks
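
The reported perplexity figures are mutually consistent, as a quick check shows. The sketch below uses only numbers from the summary; translating the 84% ablation share into perplexity points assumes component contributions add linearly, which the summary does not state.

```python
baseline_ppl = 24.19  # GPT-2 Small validation perplexity, from the summary
cct_ppl = 21.27       # CCT validation perplexity, from the summary

absolute_gain = baseline_ppl - cct_ppl
relative_gain = absolute_gain / baseline_ppl

# Share of the improvement attributed to simplicial message passing.
# Converting it to PPL points assumes additive contributions (an assumption).
simplicial_share = 0.84
simplicial_gain = simplicial_share * absolute_gain

print(f"absolute reduction: {absolute_gain:.2f} PPL")
print(f"relative reduction: {relative_gain:.1%}")
print(f"simplicial message passing: about {simplicial_gain:.2f} PPL")
```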
