Early Stopping Chain-of-thoughts in Large Language Models

ES-CoT detects answer convergence during chain-of-thought generation to stop inference early. The method reduces inference tokens by 16.08% on average across six reasoning benchmarks while maintaining comparable accuracy to standard CoT.

Reasoning Prompt engineering Benchmarks

arXiv cs.CL·May 19

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

EnoTab is a dual denoising framework for TableQA that addresses complex questions and large-scale tables. It decomposes questions into minimal semantic units and prunes tables via an explicit evidence tree, with a post-order node rollback mechanism for abnormal states. Reports strong performance on complex TableQA tasks.

Reasoning RAG Benchmarks

arXiv cs.CL·May 19

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

AMBS (Adaptive Multi-Branch Steering) aligns LLMs on three simultaneous objectives (Helpfulness, Harmlessness, Honesty) via a 1-to-N Transformer framework. A shared representation is replicated into N objective-specific pathways with constrained transformations. Results: 56.5% average win rate (WR) on LLaMA-2-7B at 189 tokens/s.

Alignment AI safety Reasoning

arXiv cs.AI·May 19

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

COCOCO, a post-hoc framework, integrates Conformal Prediction into neuro-symbolic concept-based models (NeSy-CBMs) to improve reliability. It conformalizes concepts and labels jointly via a single deduction-abduction revision step, ensuring consistency, coverage, and conciseness with distribution-free guarantees. Tested on 8 datasets.

Reasoning AI safety Alignment

arXiv cs.AI·May 19

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

Theoretical study of multi-objective evolutionary algorithms for multi-party multi-objective optimization problems (MPMOPs). On the MP-JCG benchmark, payoff-guided mutation requires Θ(n²) fitness evaluations to cross a gap region, while CPR-NSGA-II achieves O(n log n) via cross-party recombination. Also gives a runtime analysis on BPBOMST (multi-party minimum spanning tree) with instance-parameterized bounds.

Multi-agent Benchmarks Papers

arXiv cs.AI·May 19

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

TTE-Flash replaces explicit Chain-of-Thought traces with latent think tokens to accelerate reasoning-aware multimodal representations. TTE-Flash-2B outperforms explicit-CoT counterparts on MMEB-v2 while maintaining constant inference cost. Latent tokens remain interpretable both textually and visually.

Reasoning Vision Embeddings

arXiv cs.CL·May 19

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

arXiv study evaluating ChatGPT's consistency in coding communication data across demographic groups (gender, race). Authors adapt an automated scoring framework and test ChatGPT on three collaborative task types. Result: ChatGPT coding shows consistency comparable to human raters across groups.

GPT Evals Benchmarks

arXiv cs.AI·May 19

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ is an evaluation framework for generative AI combining expert-designed multi-dimensional rubrics and LLM evaluator calibration on small high-quality annotation sets. Tested on text and image generation, QQJ shows stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators.

Evals Benchmarks Alignment

arXiv cs.CL·May 19

Evaluating Language Models' Evaluations of Games

arXiv paper evaluating how language and reasoning models assess board games. Testing 100+ games with 450 human judgments, reasoning models align better with humans than standard LLMs for evaluating game fairness and fun. Paradox: as models approach game-theoretic optimality, their fit to human judgments weakens.

Reasoning Evals Benchmarks

arXiv cs.CL·May 19

Unlocking the Potential of Diffusion Language Models through Template Infilling

Template Infilling (TI) is a conditioning methodology for Diffusion Language Models that aligns structural anchors across the entire response space, replacing prefix prompting. Evaluated on mathematical reasoning, code generation, and trip planning, TI achieves a 9.40% improvement and accelerates multi-token generation.

Prompt engineering Code generation Reasoning

arXiv cs.AI·May 19

PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

PIPER is a content-driven retrieval method for tabular datasets using table profiles and LLM-generated pseudoqueries for dense retrieval. It outperforms metadata-based baselines and existing TableQA methods in poor-metadata settings.

RAG Embeddings Benchmarks

arXiv cs.AI·May 19

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Peak-Detector leverages instruction-tuned LLMs for peak detection across physiological signals (ECG, PPG, BCG, BSG) with explainability. A "peak-representation" technique compresses time-series while preserving critical events. The model is optimized via supervised fine-tuning then multi-objective reinforcement learning, evaluated on 7 datasets (6 public benchmarks + 1 real-world cohort).

Reasoning Fine-tuning Reinforcement learning

arXiv cs.CL·May 19

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona introduces a multi-level hypergraph framework for text-based automatic personality prediction. The model represents documents, sentences, and words as hyperedges and nodes, capturing global, local, and lexical dependencies. Evaluated on Big Five personality dimensions, it outperforms existing baselines by leveraging textual hierarchy.

Papers Benchmarks

arXiv cs.AI·May 19

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

New IBAL method to strengthen MARL robustness against inter-agent interaction disruptions. The framework uses an information-theoretic approach to construct attacks that degrade coordination by perturbing observations and actions, then trains agents to remain reliable under them. Demonstrates improvement over existing baselines, including in agent-missing scenarios.

Multi-agent Reinforcement learning

arXiv cs.AI·May 19

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

CardioThink, a physician-inspired MLLM framework, structures ECG diagnosis through explicit reasoning stages (rhythm, conduction, morphology, impression) to enhance interpretability. Structured Set Policy Optimization (SSPO) aligns clinical reasoning without manual annotations, outperforming direct prediction approaches across ECG benchmarks.

Reasoning Vision Reinforcement learning

arXiv cs.AI·May 19

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

TAPE, a training-free pruning method for diffusion-based video generation, reduces computational overhead by intelligently removing tokens while maintaining temporal coherence across frames. It applies temporal smoothing between frames, performs token reselection per layer, and uses timestep-level budget scheduling to prune aggressively at early steps and relax during refinement.

Video generation

arXiv cs.AI·May 19

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

RGB-only framework for active 3D scene graph (3DSG) generation using fixed external cameras as common prior maps. The system fuses onboard and external camera observations in a single hardware-agnostic pipeline, guiding robots toward regions of high semantic uncertainty. A single external camera increases initial object recall by 79%.

Vision Robotics AI Agents

arXiv cs.AI·May 19

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Response-free item difficulty modelling for multiple-choice questions using fine-tuned transformers. End-to-end approach on item wording eliminates manual feature engineering. Multi-task variant with auxiliary QA objective delivers significant improvements in small-sample regimes.

Fine-tuning Benchmarks

arXiv cs.CL·May 19

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

LISTEN is an agentic LLM framework for selecting among multiple options with competing objectives. Two iterative algorithms: LISTEN-U refines a parametric utility function, LISTEN-T uses tournament-style selection on small batches. Evaluated on flight booking, shopping, exam scheduling. Code available.

AI Agents Prompt engineering Reasoning

arXiv cs.AI·May 19

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

BiKD introduces a bilevel framework to dynamically balance hard and soft losses in knowledge distillation on imbalanced data. A weight generation network produces adaptive per-sample weights guided by a small balanced validation set. Experiments on long-tailed CIFAR-10/100 show improvements over recent balanced distillation methods.

Fine-tuning Benchmarks Papers

arXiv cs.AI·May 19

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

Video inference enhancement method using a fuzzy controller (FC-r) to dynamically adapt model sizes based on available device resources. Leverages spatiotemporal correlation across adjacent frames. Balances inference performance and resource efficiency without scaling up model complexity.

Video generation Reasoning Infrastructure

arXiv cs.CL·May 19

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind combines GNN and LLM for multi-step mathematical reasoning. The framework models reasoning as an evolving heterogeneous graph where nodes (conditions, theorems, conclusions) and edges (logical dependencies) enable dynamic theorem selection and iterative conclusion generation. Improved results on QA benchmarks.

Reasoning AI Agents Benchmarks

arXiv cs.AI·May 19

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D is a self-supervised 3D Vision Transformer framework for brain MRI. It aligns global and local tokens in a student-teacher paradigm and enforces fine-grained structural reconstruction. Evaluated on hippocampal segmentation and classification tasks (sex, Alzheimer's), it outperforms random baselines and demonstrates improved transferability across domain shifts.

Vision Papers

arXiv cs.AI·May 19

CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

QE-Catalytic-V2 is a unified graph-text multimodal LLM for catalytic materials. It integrates property prediction and inverse design in a single shared representation space, eliminating distribution shifts between decoupled models. Demonstrates superior performance on relaxed-energy prediction and inverse design tasks.

Papers Benchmarks Vision

arXiv cs.AI·May 19

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

CAREBench is a benchmark evaluating LLMs' emotion understanding through cognitive appraisal reasoning. Tested on 6 models with complete inferential chain annotations (first/third-person perspectives), it shows stronger models match humans on some tasks but fall short on appraisal reasoning and positive emotion recognition.

Benchmarks Evals Reasoning

arXiv cs.AI·May 19

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Shallow neural network agents master the card game Schnapsen through reinforcement learning. RLBot, trained via asynchronous Monte Carlo updates, outperforms MLPBot (supervised imitation) and achieves statistically significant wins against RdeepBot, a search-based baseline. Combining learned value functions with deeper lookahead during gameplay improves performance.

Reinforcement learning Benchmarks Papers

arXiv cs.AI·May 19

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Distinguishable Deletion (D²) unifies knowledge deletion and refusal for LLM unlearning. The method uses an energy index to erase undesirable knowledge in latent representations rather than specific tokens, avoiding biased deletion and re-emergence of harmful content. Energy-based Unlearning Alignment (EUA) applies this mechanism at training and inference.

AI safety Alignment Papers

arXiv cs.AI·May 19

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Vision Inference Former (VIF) is a lightweight architectural module improving visual consistency in multimodal models. It continuously injects visual semantics during generation to counter weakening vision-language alignment over long sequences. Tested on 14 benchmarks (reasoning, OCR, tables), VIF improves performance with minimal overhead.

Vision Multi-agent Alignment

arXiv cs.AI·May 19

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

CSRNet optimized with parameter-free attention mechanisms for crowd counting in public transport. Evaluation of PFCA, SA, and SimAM modules on ShanghaiTech dataset. Novel PFCASA combination (PFCA+SA) outperforms parameterized approaches while reducing model size, applicable to edge-deployed systems.

Vision Benchmarks Infrastructure

arXiv cs.CL·May 19

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Study evaluating 8 multimodal models (Gemini-2.5-Pro, o3, etc.) on robustness against cognitive biases in Chinese short-video misinformation. Manually annotated dataset of 200 videos across 4 health domains. Gemini-2.5-Pro achieves 71.5/100, o3 scores 35.2. Models are susceptible to social cues like authoritative channel IDs.

Vision Benchmarks AI safety

arXiv cs.CL·May 19

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Annotation framework using multilingual Llama3.1 as teacher model to distill expert knowledge for medical text tagging in Polish. DistilBERT achieves F1 > 0.80 across 5 clinical categories (Radiology, Oncology, Cardiology, Hypertension, Pathology) with 500× fewer parameters and 300× lower GPU VRAM than LLMs.

Llama Fine-tuning Code generation

arXiv cs.AI·May 19

Dynamics of collective creativity in AI art competitions

Analysis of 130,882 images from 368 remix parties on Artbreeder (13 months). Images converged toward common thematic attractors (steampunk, alien architecture) while becoming simpler. Paradox: more novel parents produced more complex, liked children, yet users preferred remixing less novel images.

Image generation Papers Evals

arXiv cs.CL·May 19

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

COMPACT, a multi-teacher CoT distillation framework, adaptively fuses supervisions from multiple LLMs into compact student models. It dynamically weights teacher gradients using three metrics: graph-based consensus, mutual-information-based adaptability, and loss-based difficulty. Achieves SOTA results across benchmarks while mitigating catastrophic forgetting.

Reasoning Fine-tuning Papers

Reddit r/MachineLearning·May 19

We built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]

swm is an open-source tool automating framework installation (ComfyUI, Ollama, OpenWebUI, vLLM) on cloud GPUs in one command. It aggregates pricing across 10+ providers (RunPod, Vast.ai, Lambda), syncs workspaces via S3, and auto-terminates idle instances after 30 min to cut costs.

Tools Open source Infrastructure

Reddit r/LocalLLaMA·May 19

club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes

Updated club-5060ti project: a structured benchmark and recipe repo for local LLMs on the RTX 5060 Ti. Includes a static results explorer, schema-validated JSON, single- and dual-card recipes, and llama.cpp/vLLM support. Baseline: RTX 5060 Ti 16GB. Recommends llama.cpp/GGUF for mixed GPUs and notes that the vLLM NVFP4/MTP recipes are Blackwell-specific.

Open source Benchmarks Infrastructure

Reddit r/MachineLearning·May 18

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

Free multilingual corpus of 9.8M documents across 11 languages, mostly Indic (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, English). 8.4B tokens, CC0 license, available on HuggingFace.

Open source Embeddings

Google DeepMind·May 18

Fast-tracking genetic leads to reverse cellular aging

Google DeepMind uses Co-Scientist, an AI agent, to identify genetic factors that successfully rejuvenate human cells. Researchers discovered novel genes involved in cellular aging processes.

DeepMind AI Agents Papers

The Decoder·May 18

Cursor's Composer 2.5 matches Opus 4.7 and GPT-5.5 benchmarks at a fraction of the cost

Cursor releases Composer 2.5, a coding model built on Kimi K2.5 and trained on 25x more synthetic tasks than its predecessor. It matches Opus 4.7 and GPT-5.5 benchmark performance at a fraction of the cost.

Code generation Benchmarks Kimi

Hugging Face Blog·May 18

Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation

Hugging Face releases a guide for fine-tuning NVIDIA Cosmos Predict 2.5, a robot video generation model, using LoRA/DoRA. The method reduces GPU resource requirements while maintaining generation quality for specialized robotics use cases.

Fine-tuning Video generation Robotics

Reddit r/MachineLearning·May 18

Scaling LLMs horizontally: hidden-state coupling without weight modification [R]

Residual Coupling (RC) connects frozen language models in parallel via lightweight learned linear projections, without weight modification. Linear bridges read hidden states from one model and inject additive updates into another's residual stream. On medical data, RC reduces perplexity to 11.02 vs 56.80 for MoE (reported as an 80.7% improvement) and improves TruthfulQA by 9.1 percentage points.

Llama Multi-agent Fine-tuning
