5859 articles

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

Theoretical paper proving that architecture alone sets an accuracy ceiling (the Deterministic Horizon), located between 19 and 31 layers across 12 transformers. Past that ceiling, further training does not improve accuracy. Converts 16 impossibility results (Turing, Arrow, No Free Lunch) into design rules for trustworthy AI systems, with computable bounds and quantified violation costs.

Reasoning Evals AI safety

arXiv cs.AI·May 25

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

VDSS, a multi-agent system for ventilator decision support, coordinates modular components through structured interfaces and contextual bandit preference learning from clinician feedback. Structured rejection triggers targeted replanning. Retrospective ICU validation shows higher recommendation acceptability and fewer interaction cycles.

Multi-agent Reinforcement learning AI Agents

arXiv cs.CL·May 25

Cultural Adaptation in Large Language Models for Political Discourse

Paper formalizing cultural adaptation for LLMs in political discourse analysis. Identifies English dominance bias and systematic failures across linguistic/institutional contexts. Proposes evaluation matrix (cultural fidelity, calibration, democratic safety) and methodologies: participatory datasets, culturally-aware transfer learning, culturally-measurable benchmarks.

Benchmarks Evals AI safety

arXiv cs.CL·May 25

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

ClimateChat-300K: dataset of 299,329 public Facebook posts on climate change (May 2020–May 2024), collected via CrowdTangle. 41 metadata features, 26,000+ global pages. Topic modeling and sentiment analysis identify 10 themes across 5 domains; emotionally charged and visually rich content drives highest engagement. Open resource for studying polarization and misinformation.

Benchmarks Papers Open source

arXiv cs.CL·May 25

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

Researchers show that LLM-generated texts contain hidden human-like spans that complicate detection. They propose a stacked, model-agnostic framework that uses a hard-EM procedure to iteratively filter out human-like subsequences and enhance existing detectors; it also works in a training-free mode.

Evals AI safety Papers

arXiv cs.CL·May 25

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

SCID-anchored benchmark of 555 semi-structured interviews evaluates 5 LLMs (including GPT-4.1 Mini and GPT-5 Mini) on psychiatric screening (anxiety, depression, PTSD). Accuracy 0.49–0.86, MCC 0.16–0.38. False negatives show that models downweight symptoms when functioning is preserved or social support is present, so clinical validation is required before deployment.

Benchmarks GPT AI safety

arXiv cs.CL·May 25

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Comparison of Structural Topic Models (STM) and BERTopic for analyzing short open-ended survey responses. BERTopic produces higher topic coherence, strengthened by contextual augmentation (a strategy introduced to enrich very short responses). STM offers better support for inferential covariate analysis, BERTopic for interpretability.

Embeddings Benchmarks Papers

arXiv cs.CL·May 25

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Unified framework for cost-performance optimization in LLM context management. Jointly evaluates task performance, token cost, and preprocessing reuse on 5,000 HotpotQA instances. Reduces effective token usage by 25% at comparable performance (F1≈0.78) and achieves 50% lower token cost with memory compression versus full-context prompting.

RAG Benchmarks Infrastructure

arXiv cs.CL·May 25

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Empirical study on curriculum effects for RL memory agents in multi-session dialogue with external memory banks. Three training conditions tested (LoCoMo only, LoCoMo + LongMemEval, LongMemEval only) show curriculum composition shapes specialized skills rather than uniform performance scaling. Mixed curriculum achieves strongest overall F1.

Reinforcement learning AI Agents Reasoning

arXiv cs.CL·May 25

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Comparative study of 7 LLMs (Gemini, Claude, GPT) to estimate professional expertise from Slack logs. On 27,188 messages from 43 users, Gemini 2.5 Flash achieves lowest error (MAE 21.13%). Accuracy depends only weakly on message volume.

Benchmarks Gemini Claude

arXiv cs.CL·May 25

Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

Knowledge distillation framework for Text-to-SQL in low-resource settings. Constructs knowledge base (schema semantics, abbreviations, business logic) injected during training and inference. Generates contextually grounded synthetic data. Evaluated on 7 benchmarks: improves open-source and closed-source LLMs, especially on domain-specific datasets.

Code generation Fine-tuning RAG

arXiv cs.LG·May 25

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

GPLD adds gradient-penalized latent dynamics regularization to DreamerV3 to encourage smooth transition learning in latent space. Tested on DeepMind Control, GPLD improves sample efficiency, with strong gains on complex locomotion and quadruped tasks.

Reinforcement learning Papers Benchmarks

Reddit r/LocalLLaMA·May 25

I shipped a windows desktop app for running local LLMs with a button that turns your "no thats wrong" into actual LoRA training data

SEELS, a Windows desktop app for local LLMs, lets users correct model replies via a "Teach" button that accumulates corrections into a JSONL corpus, then triggers PEFT LoRA fine-tuning without terminal access. Includes local STT/TTS (Whisper/Piper), a hardware dashboard, and a 0.6B model pre-trained on 110 examples. The stable version is free; a pro tier (image/video gen, MCP) and a max tier (workflows, multi-GPU) are on the roadmap.

Fine-tuning Open source Tools

Simon Willison·May 24

datasette-fixtures 0.1a0

Release of datasette-fixtures 0.1a0, a plugin leveraging the new datasette.fixtures.populate_fixture_database() API introduced in Datasette 1.0a30. Enables creation of fixture databases for plugin test suites.

Tools Open source

The Decoder·May 24

ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training

ByteDance Seed demonstrates that a 7B model answers questions on long, image-heavy documents more reliably than much larger models, even on documents 4× longer than training data. Key finding: learning via question-answering outperforms text transcription approaches.

Vision Benchmarks Fine-tuning

Reddit r/MachineLearning·May 24

PapersWithCode new features - week 1 [P]

Niels (Hugging Face) announces week 1 of paperswithcode.co, a revival tracking SOTA across AI domains. New features: multiple metrics per benchmark (WER/RTFx for ASR, mAP/FPS for detection), external paper support (GitHub, blogs, BioRxiv), paper lineage (predecessors/successors), new methods (Gated DeltaNet, Kimi Delta Attention, Mamba-2).

Benchmarks Papers Open source

Reddit r/MachineLearning·May 24

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA [D]

Benchmark on 30 long PDFs (171 questions) comparing native vision-LLMs vs OCR pipelines for document QA. Claude Sonnet 4.5 used. LlamaCloud premium achieves 59.6% accuracy ($0.1885/query), native vision 52% ($0.2552/query, most expensive). Vision underperforms on charts/tables; premium OCR more robust. Vision-LLM has 7% intrinsic failure rate vs 0% for OCR after retries.

Vision Benchmarks RAG

Reddit r/LocalLLaMA·May 24

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

Benchmark on 30 long PDFs (171 questions) comparing vision LLMs vs OCR for document QA. Claude Sonnet 4.5 native PDF: 52% accuracy, $0.2552/query (5th/6). LlamaCloud premium + OCR: 59.6%, $0.1885/query. Vision underperforms on charts/tables; premium OCR more robust. Vision LLM has 7% intrinsic failure rate vs 0% for OCR after retry.

Claude Vision RAG

Reddit r/LocalLLaMA·May 24

llampart 1.0.0 - I released a standalone local web UI for llama-server with translations, extended settings and a polished conversation sidebar

llampart 1.0.0, standalone local web UI for llama-server, released as MIT open-source. Features extended settings, 6-language localization, two-column conversation sidebar, MCP integration, interface modes (dark/light/Frosted Glass), local import/export, and Caddy deployment guide.

Llama Open source Tools

Reddit r/LocalLLaMA·May 23

llama.cpp server have built-in native tools (exec_shell, edit_file, etc.)

llama.cpp server natively supports agent tools (exec_shell, edit_file, read_file, grep_search, write_file, apply_diff, get_datetime) via the --tools flag. The experimental feature turns the server into a mini-agent harness without external dependencies, but it currently lacks security sandboxing.

Llama AI Agents Tools

Reddit r/MachineLearning·May 23

Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]

WordDetectorNet uses per-pixel bounding-box distance regression + DBSCAN for handwritten word detection. Each pixel classified as a word pixel regresses 4 scalar distances, generating thousands of candidates merged via DBSCAN with distance = 1 − IoU. Architecture: ResNet18 → FPN-style decoder → 6 output channels per pixel (2 segmentation logits + 4 distances). Trained on IAM, 448×448 → 224×224.

Vision Code generation Open source
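
The merging step in the WordDetectorNet summary (many per-pixel candidate boxes clustered with distance = 1 − IoU) can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the `eps` and `min_samples` values, and the choice to average each cluster into one box are all assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cluster_boxes(boxes: List[Box], eps: float = 0.5, min_samples: int = 2) -> List[int]:
    """Plain DBSCAN with distance = 1 - IoU. Returns one label per box (-1 = noise).

    eps and min_samples are illustrative values, not taken from the post.
    """
    n = len(boxes)
    # Neighborhoods include the point itself (distance 0 to itself).
    neighbors = [
        [j for j in range(n) if 1.0 - iou(boxes[i], boxes[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    visited = [False] * n
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless a cluster claims it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])  # expand through core points only
        cluster += 1
    return labels


def merge_clusters(boxes: List[Box], labels: List[int]) -> List[Box]:
    """Reduce each cluster to one box by averaging coordinates (an assumed choice)."""
    merged = []
    for c in range(max(labels, default=-1) + 1):
        members = [b for b, l in zip(boxes, labels) if l == c]
        merged.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return merged
```

With thousands of candidates, the quadratic neighbor search above would be slow; a library DBSCAN fed a precomputed distance matrix is the more likely practical route.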

Reddit r/MachineLearning·May 23

I fine-tuned an LLM to be C-3PO to test which training data format works best for persona injection [P]

LoRA fine-tuning experiment comparing three data formats for C-3PO persona injection: chat demos, first-person statements, and synthetic Wikipedia docs. First-person statements win on generalization. Synthetic docs produce paradoxical behavior: model knows C-3PO is anxious but expresses it only 37% of the time.

Fine-tuning Prompt engineering Papers

Reddit r/MachineLearning·May 23

AgentLantern: exposing the hidden graph of AI agent projects [P]

AgentLantern is an open-source devtool making AI agent projects inspectable before and during runtime. Three components: Lantern Docs (auto-generated documentation), Lantern Lint (static checking), and Lantern Play (runtime viewer). Initial support for CrewAI.

AI Agents Multi-agent Tools

Reddit r/LocalLLaMA·May 23

Did a 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32gb VRAM GPU - two models tested Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else

User ran 30 llama.cpp benchmarks on an MI60 32GB GPU to optimize Gemma 4 26B Q4_1 and Qwen3.6 35B Q4_0 for Frigate and HomeAssistant. Results: voice commands <1.2s, video summaries <18s. Systematic testing across KV cache depths (0, 1000, 6000 tokens) with a 512-token prompt and 128-token generation.

Llama Benchmarks Code generation

Reddit r/MachineLearning·May 23

Interesting tension this week, the same companies racing to go public are also the ones making safety promises [N]

OpenAI and Anthropic accelerate IPO timelines while research exposes technical gaps: frontier models' performance degrades on extended task chains, and agents with tool access underperform in several cases. Tension emerges between safety commitments and public market pressure for growth.

AI Agents AI safety Business

Reddit r/LocalLLaMA·May 23

$16 refactor, 400 steps, 95% routed to open MoE

Developer cuts Claude Opus costs from $160 to $16 by routing 95% of steps to Hunyuan Hy3 (21B MoE) via vLLM routing layer. On 400-step Python refactoring, Hy3 handles 380 steps at $0.02 each ($7.60), Opus handles remaining 20 ($8). 93.4% success rate, but fails on complex dependency graphs.

AI Agents MCP Code generation
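
The cost figures in the summary can be checked with a few lines of arithmetic. The sketch below works in integer cents to avoid float rounding. The $0.40 per Opus step is not stated in the post; it is derived here from "$8 for 20 steps", and it happens to reproduce the $160 all-Opus baseline (400 × $0.40). The function name is hypothetical.

```python
def routed_cost_cents(total_steps: int, cheap_steps: int,
                      cheap_cents: int, premium_cents: int) -> int:
    """Total cost in cents when cheap_steps go to the cheap model and the rest to the premium one."""
    premium_steps = total_steps - cheap_steps
    return cheap_steps * cheap_cents + premium_steps * premium_cents


TOTAL_STEPS = 400
HY3_STEPS = 380        # 95% of 400, as stated in the summary
HY3_CENTS = 2          # $0.02 per step, as stated
OPUS_CENTS = 40        # assumption: $8 / 20 steps = $0.40 per step

routed = routed_cost_cents(TOTAL_STEPS, HY3_STEPS, HY3_CENTS, OPUS_CENTS)
baseline = routed_cost_cents(TOTAL_STEPS, 0, HY3_CENTS, OPUS_CENTS)
savings = 1 - routed / baseline

print(f"routed: ${routed / 100:.2f}, all-Opus: ${baseline / 100:.2f}, savings: {savings:.1%}")
```

The routed total comes to $15.60, which the post rounds to $16. The 93.4% success rate is not priced in here; retries on failed steps would raise the real cost.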

GitHub Trending·May 23

mukul975/Anthropic-Cybersecurity-Skills

Repository of 754 structured cybersecurity skills for AI agents, mapped to 5 frameworks (MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND, NIST AI RMF). Compatible with Claude Code, GitHub Copilot, Cursor, Gemini CLI and 20+ platforms. 26 security domains. Apache 2.0 license.

AI Agents Claude Code AI safety

GitHub Trending·May 23

OpenPipe/ART

OpenPipe/ART: reinforcement learning framework for multi-step agents using GRPO. Enables on-the-job training across Qwen, GPT-OSS, Llama and other models.

AI Agents Reinforcement learning Open source

Reddit r/LocalLLaMA·May 23

First AI to Beat Every Human in a Programming Competition - Agentic GRPO Explained

Agentic GRPO, an RL algorithm adapted for multi-stage agentic workflows, enables AI agents to beat humans in programming competitions. Key innovation: immediate rewards at each step (hypothesis, code, tests, debug) with retroactive correction once final outcome is known, instead of waiting for complete workflow completion.

AI Agents Reinforcement learning Code generation

The Decoder·May 23

Alibaba's latest AI model ran autonomously for 35 hours to optimize code for its own custom chip

Alibaba releases Qwen3.7-Max, a proprietary model designed for long-running autonomous agent tasks. It matches Claude Opus 4.6 on benchmarks and outperforms DeepSeek V4 Pro and Kimi K2.6. The model ran autonomously for 35 hours to optimize code for Alibaba's custom chip.

Qwen AI Agents Benchmarks

The Decoder·May 23

Anthropic warns Claude Mythos Preview finds bugs faster than developers can patch them

Anthropic's Claude Mythos Preview discovered over 10,000 critical vulnerabilities in system-critical software through Project Glasswing with 50 partners. Bugs accumulate faster than developers can patch them. Anthropic warns no company has built sufficient safeguards against misuse.

Claude AI safety AI Agents

Reddit r/MachineLearning·May 23

I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

SM1 is a Mamba1 variant with d_state=1 that replaces the selective scan with two native PyTorch ops. The result is an exact closed-form solution, not an approximation. Scan memory drops 16x versus Mamba1 (d_state=16). Inference state is 14 KB for a 130M model, O(1) per token. Trained on 163K MIDI files (2.5B tokens).

Open source Code generation Reasoning
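
To see why a scalar state can admit an exact closed form, consider the recurrence h_t = a_t · h_{t-1} + u_t with h_0 = 0. It unrolls to h_t = P_t · Σ_{s≤t} u_s / P_s, where P_t is the running product of the a values, so one cumulative product and one cumulative sum suffice. The post does not say which two ops SM1 uses or how its recurrence is parameterized; the sketch below is only an assumed illustration of the general idea, written with `itertools.accumulate` rather than PyTorch.

```python
from itertools import accumulate
from operator import mul
from typing import List


def scan_sequential(a: List[float], u: List[float]) -> List[float]:
    """Reference: step through h_t = a_t * h_{t-1} + u_t one token at a time."""
    h, out = 0.0, []
    for a_t, u_t in zip(a, u):
        h = a_t * h + u_t
        out.append(h)
    return out


def scan_closed_form(a: List[float], u: List[float]) -> List[float]:
    """Same result from two cumulative ops (requires every a_t to be nonzero)."""
    p = list(accumulate(a, mul))                                # cumulative product P_t
    s = list(accumulate(u_t / p_t for u_t, p_t in zip(u, p)))   # cumulative sum of u_s / P_s
    return [p_t * s_t for p_t, s_t in zip(p, s)]
```

Dividing by a shrinking running product is numerically fragile on long sequences, so a practical version would more likely work in log space.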

Reddit r/MachineLearning·May 23

Tested chunking + embeddings data from 3 production websites. [P]

Empirical RAG study on 3 production websites (Intercom, HubSpot, KPMG) with tiered chunking and embeddings. Results: 31% HIGH/MEDIUM chunks for Intercom, 32% HubSpot, 8% KPMG. Tier weighting (HIGH ×1.20) reranks top-k. Proposed metric: 'yield score' predicts corpus quality before generation.

RAG Embeddings Evals
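
The tier-weighted reranking mentioned in the summary can be sketched as a score multiplier applied before the top-k cut. Only the HIGH ×1.20 weight comes from the post; the neutral 1.0 weights for MEDIUM and LOW, the `rerank` name, and the tuple layout are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (chunk_id, similarity, tier)
Chunk = Tuple[str, float, str]

TIER_WEIGHTS: Dict[str, float] = {
    "HIGH": 1.20,    # from the post
    "MEDIUM": 1.0,   # assumed neutral weight
    "LOW": 1.0,      # assumed neutral weight
}


def rerank(chunks: List[Chunk], k: int) -> List[str]:
    """Return the ids of the top-k chunks by similarity multiplied by tier weight."""
    ranked = sorted(
        chunks,
        key=lambda c: c[1] * TIER_WEIGHTS.get(c[2], 1.0),
        reverse=True,
    )
    return [chunk_id for chunk_id, _, _ in ranked[:k]]
```

For example, a HIGH chunk with similarity 0.70 scores 0.84 after weighting and overtakes a LOW chunk at 0.80.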

Reddit r/LocalLLaMA·May 23

club-rdna16: practical 16GB AMD/Radeon local LLM testing repo

GitHub repo for testing local LLMs on 16GB AMD GPUs (RX 6900 XT, RX 7800 XT, etc.). Practical benchmarks with llama.cpp/ROCm: Qwen 27B and 35B-A3B, context up to 131k tokens, q8 KV cache profiles, throughput and retrieval measurements. Reproducible configurations and call for community contributions.

Open source Code generation Benchmarks

Reddit r/MachineLearning·May 23

LQS v3.1 — an open methodology for rating AI training data (multi-oracle consensus + signed certificates) [P]

LQS v3.1 is an open-source methodology for rating AI training data quality. It uses 19 dimensions (label correctness, contamination, equity, etc.), multi-oracle consensus (7 oracles) with real-world outcome recalibration, and offline-verifiable Ed25519 certificates. Free public index with 263 scored datasets.

Evals Open source AI safety

Reddit r/LocalLLaMA·May 22

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Qwen3.6 27B quantized to Q4_K_M fits in 16 GB VRAM (15.4 GB MTP, 15.1 GB non-MTP). MTP version reaches 40 tok/s generation speed, non-MTP 24 tok/s. GGUF available on HuggingFace for llama.cpp.

Qwen Open source Tools

Reddit r/LocalLLaMA·May 22

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

User achieves 30+ tokens/sec with Qwen3.6-35B-A3B at Q4 quantization on an 8GB RTX 3070 Ti with 262k context. Key: the MoE model only needs 3.5B active in VRAM. Linux Server yields +25% tps vs Windows 11. Contexts up to 1M are possible, but throughput slows beyond 150k.

Qwen Open source

Simon Willison·May 22

The memory shortage is causing a repricing of consumer electronics

HBM demand for AI data centers is set to rise from 2% to 20% of wafer capacity by the end of 2026. The three major memory manufacturers favor under-provisioning over over-provisioning. Result: budget smartphones (< $100) and other mobile devices will see significant cost increases.

Business Infrastructure

Reddit r/LocalLLaMA·May 22

I fine-tuned Cohere Transcribe to support diarization and timestamps

Developer fine-tuned Cohere Transcribe to add diarization (speaker identification) and timestamps. Model outputs parsable format with average temporal precision of ±0.097s. Supports up to 4 speakers per 30s, extensible to 32 with diarize_long.py script. Available free on Hugging Face.

Open source Fine-tuning Voice
