5831 articles

ChildEval: When large language models meet children's personalities

ChildEval is a benchmark with 29K synthesized child personality profiles (ages 3-6) to evaluate LLMs' ability to infer and follow child-centered preferences in long-context conversations. The dataset covers 5 top-level and 14 sub-level categories of daily life. Results show that fine-tuning on ChildEval enhances child-centered performance.

Benchmarks Fine-tuning Evals


arXiv cs.CL·May 28

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

New ASR approach for Vietnamese using syllabic-structure phoneme-based decoding. Model captures phonological composition of syllables instead of orthographic units, reducing vocabulary size. Outperforms PhoWhisper and Wav2Vec2 on LSVSC and UIT-ViMD benchmarks.

Voice Benchmarks Papers


arXiv cs.LG·May 28

Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

A physics-informed machine learning (PIML) framework for thermal modeling of a lunar rover. An adaptive neural network determines 3D finite-difference meshing based on thermal loads, improving accuracy by 50% vs coarse-mesh physics models and 39% vs pure ANN, while being 3x faster than high-fidelity simulations.

Reasoning Benchmarks Robotics


arXiv cs.LG·May 28

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

Comparative study of Muon vs Adam optimizers on equivariant neural networks (ModelNet40, molecular data). Muon consistently outperforms Adam. Hessian and spectral analysis shows Muon produces more regular loss surfaces and learned representations with higher effective rank.

Benchmarks Reasoning


arXiv cs.LG·May 28

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

New framework enabling user collectives to correct algorithmic disparities without platform intervention. Test-Time Collective Action (TTCA) uses universal perturbations derived from a proxy model to improve fairness without training access. Validation on CIFAR-10, CIFAR-100, and FairFace demonstrates closure of subgroup accuracy gaps and improved worst-group accuracy.

AI safety Alignment Evals


arXiv cs.LG·May 28

SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training

SparseOpt, a sparsity-aware optimizer, addresses gradient skew induced by Batch Normalization in Dynamic Sparse Training. Experiments on ResNet (CIFAR-100, ImageNet) show faster convergence and improved generalization. First systematic study of interactions between Batch Normalization, sparse layers, and DST.

Papers Benchmarks Fine-tuning


arXiv cs.AI·May 28

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

WIRE is an evaluation pipeline that diagnoses rule conflicts within a single LLM agent prompt policy. Across 6 public policies, it extracts 276 rules and identifies 170 hard-collision rule pairs. Only 35.4% of tested cases comply with both rules simultaneously; 64.6% violate at least one source rule.

AI Agents Prompt engineering Evals


arXiv cs.LG·May 28

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

Theoretical paper decomposing the pre-softmax attention matrix QK^T into symmetric and skew-symmetric components. The symmetric part governs the energy landscape, while the skew-symmetric part drives circulation. The authors propose Hopfield-style stability measures to quantify fidelity-diversity trade-offs in generation, along with a controllable mechanism to modulate this trade-off.
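The split mentioned in this summary rests on a standard identity: any square matrix A equals (A + Aᵀ)/2 plus (A − Aᵀ)/2. The sketch below only illustrates that identity on a made-up 2×2 score matrix; the function name `decompose` and the toy values are assumptions, and the paper's stability measures are not reproduced here.

```python
def decompose(a):
    """Split a square matrix into its symmetric and skew-symmetric parts."""
    n = len(a)
    sym = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


# Toy stand-in for a pre-softmax score matrix (values are invented).
scores = [[1.0, 4.0], [2.0, 3.0]]
sym, skew = decompose(scores)

# The two parts always add back up to the original matrix.
rebuilt = [[sym[i][j] + skew[i][j] for j in range(2)] for i in range(2)]
```

The symmetric part is unchanged by transposition, while the skew-symmetric part flips sign, which is why the two can play different roles in an energy-based reading of attention.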

Reasoning Papers Vision


arXiv cs.AI·May 28

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

GraD-IBD reformulates longitudinal ICD trajectories as temporally directed graphs to detect inflammatory bowel disease risk early. A context-aware time-decay message passing mechanism captures temporal dependencies with reduced complexity. Robust results on real-world clinical data.


arXiv cs.LG·May 28

Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

Bayesian framework for validating deployment of learned autonomous landing controllers. Uses Bayesian inference to quantify uncertainty about true policy capability beyond empirical metrics (reward, success rate). Experiments with PPO and SAC show that empirical optimization metrics are overconfident, while Bayesian inference yields a better-calibrated assessment of deployment readiness.
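To make the idea of uncertainty under a finite number of rollouts concrete, here is a minimal sketch using a Beta-Bernoulli model of landing success. This is a common textbook choice and an assumption on our part; the summary does not say which model the paper uses. The function names, the uniform prior, and the 19-out-of-20 example are all hypothetical.

```python
import random


def success_posterior(successes, rollouts, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a success probability (uniform prior by default)."""
    return prior_a + successes, prior_b + rollouts - successes


def lower_credible_bound(a, b, level=0.95, samples=20000, seed=0):
    """Monte Carlo estimate of the one-sided lower credible bound of Beta(a, b)."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(samples))
    return draws[int((1.0 - level) * samples)]


# Hypothetical validation run: 19 successful landings out of 20 rollouts.
a, b = success_posterior(successes=19, rollouts=20)
empirical_rate = 19 / 20
posterior_mean = a / (a + b)
bound = lower_credible_bound(a, b)
```

With so few rollouts, the lower credible bound sits well below the empirical success rate, which is the kind of gap an approval rule based on the raw rate alone would miss.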

Reinforcement learning AI safety Robotics


arXiv cs.CL·May 28

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

TARQ is a post-training quantization method for ASR that rebalances calibration toward rare words (names, numerals, domain-specific terms). Without labels or validation, it improves rare-WER across 8 backbones and 6 datasets at W4G128 without aggregate-WER regression.

Benchmarks Papers Code generation


arXiv cs.AI·May 28

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Study localizing logical reasoning mechanisms in LLMs. Researchers identify attention heads responsible for individual reasoning steps via causal mediation analysis. Finding: ~3% of heads handle fact/rule retrieval, while higher layers coordinate global information integration and graph traversal strategies.

Reasoning Papers


arXiv cs.AI·May 28

A Query Engine for the Agents

Hyperparam introduces three JavaScript libraries (<70 KB) to query Parquet and Apache Iceberg directly from object storage in client-side applications (Claude Code, Cursor). The system runs LLM-shaped async UDFs 300x faster than DuckDB-WASM on filter-bounded queries and reduces costs of a ten-task agent analyst suite by two-thirds.

AI Agents Claude Code Tools


arXiv cs.AI·May 28

Auditable Decision Models with Learned Abstention and Real-Time Steering

EvaluatorDPT is a bounded decision-control model predicting YES, NO, or TBD (learned deferral). Using a transformer encoder with structured auxiliary heads, it achieves Accuracy=0.8260 and Macro F1=0.8252 on 44,597 test samples. The interface enables inspectable routing and auditable decision control for production AI systems.

Reasoning Evals AI safety


arXiv cs.LG·May 28

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

GeoTransolver, a geometry-aware operator learning framework, accurately predicts industrial-scale automotive crash dynamics. On bumper beam and full-vehicle crash datasets, it captures plastic deformations and acceleration profiles. A FLARE-based modification reduces memory overhead by 2x while improving accuracy for high-frequency transients.

Papers Benchmarks Reasoning


arXiv cs.AI·May 28

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

EAPO is an adaptive policy optimization method for training reasoning models in open-ended QA. It dynamically adjusts positive/negative sample weights based on current-to-initial entropy ratio to preserve exploration and stability. Tests on two medical QA datasets show improvements in diversity and stability versus fixed-weight baselines.

Reinforcement learning Reasoning Evals


arXiv cs.CL·May 28

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

LCO (LLM-based Constraint Optimization) is a framework reducing in-context reward hacking (ICRH) in autonomous LLMs without fine-tuning. It has two modules: self-thought for integrating safety constraints, and evolutionary sampling to keep actions in the safe solution space. On GPT-4, it achieves a 39% reduction in toxicity growth rate and a 15.23% reduction in ICRH occurrence.

AI Agents AI safety Alignment


arXiv cs.LG·May 28

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

SC-SDPO improves LLM self-distillation by weighting losses with √[p(1-p)], creating an implicit curriculum. Experiments on Qwen3-8B (+3.2/+4.3 mean@16/maj@16) and OLMo-3-7B (+1.8/+3.0) show stable gains with zero computational overhead.
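The weighting term quoted above can be sketched in a few lines. Only the formula √[p(1-p)] comes from the summary; the function name `pass_rate_weight` and the way the weight would be multiplied into a per-prompt loss are assumptions for illustration.

```python
import math


def pass_rate_weight(p):
    """Weight sqrt(p * (1 - p)) for a prompt whose empirical pass rate is p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return math.sqrt(p * (1.0 - p))


# Weights for prompts that are never, sometimes, and always solved.
weights = {p: pass_rate_weight(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

The weight vanishes at p = 0 and p = 1 and peaks at p = 0.5, so prompts of intermediate difficulty dominate the loss, which is presumably the implicit curriculum the summary refers to.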

Reasoning Reinforcement learning Papers


arXiv cs.AI·May 28

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Modular LLM-based architecture to detect and quantify the intensity of human values in text. Three coordinated modules: generating value specifications from theoretical frameworks, labeling texts, and assigning graded support/resistance based on rhetorical and semantic evidence. Evaluated on the ValueEval dataset with multiple LLMs, demonstrating pipeline generality.

Alignment Evals Reasoning


arXiv cs.LG·May 28

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

SDR (Supervised Distributional Reduction) combines optimal transport and dependence maximization to learn target-aware representations. The algorithm extends the Fused Gromov-Wasserstein objective with an explicit dependence term, producing compact embeddings that capture both geometric structure and predictive signal. Application to Gaussian Process modelling with adaptive kernels.

Papers


arXiv cs.AI·May 28

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Hierarchical framework for compact LLMs in resource-constrained agentic systems. Model distillation is paired with an oracle-controller loop that monitors protocol validity, projects histories into a feasible prompt domain, and triggers lightweight fine-tuning under drift. The design separates schema learning from semantic adaptation. Evaluated on Multi-Fidelity Bayesian Optimization with improved reliability and cost-efficiency.

AI Agents Fine-tuning Prompt engineering


arXiv cs.CL·May 28

UniMaia: Steering Chess Policies with Language for Human-like Play

UniMaia is a framework controlling a chess policy (Lc0) via natural language prompts without full multimodal retraining. A lightweight text encoder and ControlNet-style mechanism enable gameplay modulation (opening selection, strength). UniMaia-Aux adds temporal and behavioral prediction objectives. SOTA results on prompt-conditioned benchmarks.

Prompt engineering Reasoning Fine-tuning


arXiv cs.CL·May 28

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench evaluates LLM agents on collaborative multi-turn search in real-world context. The benchmark comprises 200 bilingual (Chinese/English) tasks across 20 domains with schema-free knowledge graphs. Seven frontier models tested achieve max F1 of 30.30, exposing gaps in long-context reasoning and proactive intent elicitation.

Benchmarks AI Agents Reasoning


arXiv cs.LG·May 28

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

RL framework for sim-to-real policy transfer via probabilistic latent embeddings and dynamic adaptation. Uses meta-RL and CMDPs to infer a latent environment representation, with a distributional RL formulation that dynamically adjusts risk levels based on the accuracy of latent context estimation.

Reinforcement learning Robotics AI safety


arXiv cs.LG·May 28

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

Theoretical paper on spectral control of stochastic gradient noise via entry-wise clipping. Shows that simple entry-wise clipping balances matrix structure and computational cost, with O(ε⁻⁴) convergence guarantees under Cauchy-contaminated noise. Empirical gains: ~7% token savings on NanoGPT with smooth shrinkage, ~2% additional when combined with Muon.
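As a rough illustration of what entry-wise clipping means, the sketch below applies a hard clip to each entry of a gradient matrix independently. The threshold `c`, the function name, and the example values are assumptions; the smooth-shrinkage variant mentioned in the summary is not shown, and the paper's exact operator may differ.

```python
def clip_entrywise(grad, c):
    """Clamp every entry of a gradient matrix to the interval [-c, c]."""
    if c <= 0:
        raise ValueError("clipping threshold must be positive")
    return [[max(-c, min(c, g)) for g in row] for row in grad]


# Made-up gradient with two heavy-tailed outliers.
grad = [[0.5, -3.0], [10.0, 0.1]]
clipped = clip_entrywise(grad, c=1.0)
```

Each entry is treated on its own, so the cost is linear in the number of parameters and no matrix decomposition is needed, unlike approaches that act on singular values directly.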

Papers Reinforcement learning Benchmarks


arXiv cs.AI·May 28

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn is an open-source platform for deploying AI agents in production. It provides a stateful serverless runtime on Kubernetes, agent definition via Terraform, and a zero-trust security model. Agyn is agent-, model-, and cloud-agnostic.

AI Agents Open source Infrastructure


arXiv cs.LG·May 28

Gradient Transformer: Learning to Generate Updates for LLMs

Gradient Transformer, a data-free knowledge distillation framework, generates LLM update vectors from TinyLMs fine-tuned on private data. The model captures correlation between gradient vectors of both models, enabling collaborative adaptation without accessing sensitive data.

Fine-tuning Reasoning


arXiv cs.LG·May 28

E³-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

E³-Agent is an executable and evolving agent for edge generative inference resource management. It pairs a fast-path router (millisecond dispatch) with a slow-path LLM meta-controller driven by events, learning online from execution feedback. Evaluated in simulation, it reduces latency by 65-73% versus static baselines across dynamic scenarios (semantic shifts, device churn, hidden drift).

AI Agents Reasoning Infrastructure


arXiv cs.LG·May 28

Evaluating Local Explainability Metrics for Machine Learning Models on Tabular Data

Comparative study of local explainability techniques (LIME, SHAP, Feature Ablation) reliability across 32 tabular datasets. Results show explanation quality does not systematically correlate with model predictive performance, but depends instead on dataset complexity and feature distributions.

Evals RAG


Reddit r/MachineLearning·May 28

Diffusion models for sketch-guided trajectory simulation [R]

Diffusion models applied to basketball trajectory simulation conditioned on partial sketches of player movements. The model jointly refines all player trajectories, producing more natural simulations than autoregressive generation. Code and model fully open-sourced.

Video generation Open source


Reddit r/LocalLLaMA·May 27

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

CPU inference at 10.33 tokens/s on Qwen 3.5 35B quantized to Q4_K_M on a $300 Lenovo Ideapad Slim 3i (i3-1215U, 8GB RAM). Uses llama.cpp with BIOS optimizations, core pinning, MTP speculative decoding, and Q8_0 K/V cache quantization.

Qwen Code generation Open source


The Decoder·May 27

Microsoft's MAI-Image-2.5 pulls even with Google's Nano Banana 2 on benchmarks

Microsoft MAI-Image-2.5 ranks third on Arena's text-to-image leaderboard, matching Google Nano Banana 2 but trailing OpenAI Image-2. The model shows clear improvements in rendering text within images and commercial visuals.

Image generation Benchmarks


The Decoder·May 27

AI coding agent Devin maker Cognition more than doubles its valuation to $26 billion in under nine months

Cognition, maker of AI coding agent Devin, raises over $1 billion at a valuation exceeding $26 billion. The funding round reflects massive investor interest in AI coding agents, despite ongoing debate about their real-world value.

Code generation AI Agents Funding


The Decoder·May 27

Robinhood lets AI agents trade shares and make credit card purchases for customers

Robinhood enables customers to connect AI agents like Anthropic's Claude to investment accounts via MCP for autonomous stock trading. US regulator FINRA flags this as a new risk area. Robinhood acknowledges the product isn't suitable for all users.

Claude AI Agents MCP


The Decoder·May 27

YouTube will try to automatically flag AI videos starting this month

YouTube is deploying an automatic detection system to flag AI-generated or heavily AI-altered content starting May 2026. Labels will display more prominently: below the player for long videos and as an overlay on Shorts. Recommendations and monetization are unaffected.

Regulation Video generation


Reddit r/MachineLearning·May 27

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

noisekit is an open-source CLI to generate annotated degraded speech datasets for realistic STT benchmarking (telecom G.711, ambient noise, reverb). It addresses a gap: public datasets (FLEURS, CommonVoice) are too clean to evaluate production performance. It is compatible with HuggingFace AudioFolder and includes PESQ/SNR/NISQA metrics.

Voice Evals Benchmarks


Reddit r/MachineLearning·May 27

EMA-Gated Temporal Sequence Compression in Vision Transformers [P]

NeuroFlow is a dynamic routing framework for Vision Transformer video inference. It exploits temporal redundancy via Exponential Moving Average (EMA) of patch-level embeddings to eliminate stationary tokens. Architecture B achieves 55.80× wall-clock speedup (678 ms → 11.9 ms on SigLIP 1792p) at 97.37% embedding fidelity. Code released.
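The gating idea described here can be sketched as follows: keep a running EMA of each patch embedding and only mark a patch for recomputation when the new embedding drifts away from that average. This is a minimal sketch, not NeuroFlow's code; the L2 distance, the `alpha` and `threshold` values, and the function name `ema_gate` are all assumptions.

```python
import math


def ema_gate(frames, alpha=0.9, threshold=0.1):
    """Return, per frame, the indices of patches that moved away from their EMA.

    frames: list of frames, each a list of patch embeddings (lists of floats).
    """
    ema = [list(patch) for patch in frames[0]]
    kept = [list(range(len(frames[0])))]  # first frame: every patch is processed
    for frame in frames[1:]:
        active = []
        for i, patch in enumerate(frame):
            dist = math.sqrt(sum((x - m) ** 2 for x, m in zip(patch, ema[i])))
            if dist > threshold:
                active.append(i)  # patch changed enough to be recomputed
            # Update the running average regardless of the gating decision.
            ema[i] = [alpha * m + (1 - alpha) * x for x, m in zip(patch, ema[i])]
        kept.append(active)
    return kept


# Two frames, two patches: patch 0 is stationary, patch 1 changes.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 1.0]],
]
active_patches = ema_gate(frames)
```

In this toy run only patch 1 is kept for the second frame, so a downstream model would see half as many tokens; the reported speedup presumably comes from the same effect on largely static video.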

Vision Papers Open source


Reddit r/MachineLearning·May 27

Cross-species RSA: same learning rules (BP, PC, STDP, FA) tested against both human fMRI and macaque electrophysiology [P]

Cross-species comparison of learning rules (BP, PC, STDP, FA) tested on human fMRI and macaque electrophysiology (V1/V2/V4/IT). STDP and PC dominate V1/V2 (ρ ≈ 0.30/0.28), preserving the pattern seen in humans. In IT, alignment depends on model capacity (ResNet-50: ρ ≈ 0.25) rather than on the learning rule. Code and two papers (arxiv 2604.16875, 2605.22401) are available.

Papers Benchmarks Reasoning


Reddit r/LocalLLaMA·May 27

Turning every "no thats not what i meant" in chat into actual LoRA training data

A developer built TideForge, a desktop app that converts chat corrections into LoRA training data. Each model reply has a "Teach" button; corrections accumulate as JSONL and trigger PEFT fine-tuning on your base model. Initial test: 110 hand-written corrections on Qwen 0.6B, loss dropped 4.25→0.73, adapter maintained identity across ~30 jailbreak prompts. Free, Windows, GGUF-compatible.

Fine-tuning Open source Tools


Reddit r/LocalLLaMA·May 27

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

An Engram module (O(1) hash-keyed associative memory) injected into Transformers for autoregressive image generation on ImageNet 256×256 fails to improve quality (FID) despite FLOP gains. Gate-clamp, donor-probe, and frozen-table experiments show the module acts as a gated architectural side-pathway, not a content-addressed retrieval mechanism.

Papers Image generation Benchmarks
