Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

Adversarial defense for SAR image classification using semantic smoothing. Replaces isotropic noise with structured geometric transformations generated by novel view synthesis, conditioned on acquisition geometry. Improves robustness against FGSM, PGD, OTSA, SMGAA while increasing clean classification accuracy.

AI safety Vision Evals

arXiv cs.AI·May 19

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

Researchers test LLM agents' negotiation abilities in a controlled multi-attribute bargaining environment. Agents accurately model counterparty preferences but fail to convert this knowledge into winning strategy. Final agreements are driven by opening anchors rather than actual utility weights.

Reasoning AI Agents Evals

arXiv cs.AI·May 19

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Large reasoning models (LRMs) generate redundant chains of thought uncorrelated with correctness. The paper finds that LRMs implicitly know when to stop thinking. SAGE (Self-Aware Guided Efficient Reasoning) exploits this via a novel sampling paradigm, improving accuracy and efficiency on mathematical benchmarks.

Reasoning Reinforcement learning Benchmarks

arXiv cs.CL·May 19

Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

Novel LGPT (Learnable Graph Pooling Token) method to integrate graphs into LLMs. Uses learnable tokens to represent graphs without information loss. 4.13% improvement on GraphQA benchmark without LLM fine-tuning.

Prompt engineering RAG Benchmarks

arXiv cs.AI·May 19

Online Algorithms with Unreliable Guidance

New arXiv paper introducing OAG (Online Algorithms with Unreliable Guidance), a model for ML-augmented online decision-making separating predictive and algorithmic components. Presents DTB (drop-or-trust-blindly) compiler converting standard online algorithms into learning-augmented versions. Demonstrates optimal guarantees on bipartite matching, caching, and uniform metrical task systems.

Reasoning Benchmarks Papers

arXiv cs.AI·May 19

Latency-Aware Deep Learning Benchmark for Real-Time Cyber-Physical Attack and Fault Classification in Inverter-Dominated Power Grids

Latency-aware benchmark for 8 deep learning architectures (MLPs, Transformers) in anomaly detection on inverter-dominated power grids. Real-time classification < 15 ms per cycle, but end-to-end latency 50-90 ms (3+ cycles). Critical gap between algorithmic capability and protection-grade deployment identified.
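
The cycle figures in this summary can be checked with a little arithmetic. The sketch below assumes a 60 Hz grid, which the summary does not state (at 50 Hz one cycle lasts 20 ms, and the counts change accordingly); the helper name `cycles` is illustrative.

```python
def cycles(latency_ms: float, grid_hz: float = 60.0) -> float:
    """Convert a latency in milliseconds to a number of AC cycles."""
    cycle_ms = 1000.0 / grid_hz  # duration of one AC cycle in milliseconds
    return latency_ms / cycle_ms

# Assumption: 60 Hz nominal frequency, so one cycle lasts about 16.67 ms.
inference = cycles(15.0)                 # classification time, under one cycle
low, high = cycles(50.0), cycles(90.0)   # end-to-end latency range in cycles
print(f"inference: {inference:.2f} cycles, end-to-end: {low:.1f} to {high:.1f} cycles")
```

Under that assumption the classifier itself fits within a single cycle, while the end-to-end path takes at least three, which matches the "3+ cycles" figure.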

Benchmarks Reasoning

arXiv cs.CL·May 19

ANVIL: Analogies and Videos for Lecturers

ANVIL is a multimodal generative system automating production of analogy-based instructional animations for computer science. Given a concept definition, it generates textual analogies, compiles them into structured visual screenplays, and produces executable Manim code. Evaluation combines teacher judgments and LLM-based automated screening.

Code generation Vision Evals

arXiv cs.CL·May 19

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

Implementation study of Google NotebookLM generating videos, podcasts, and infographics in an English for Academic Purposes course (106 students, Hong Kong). Students rated perceived usefulness and ease of use highly and preferred visual/multimodal content. Video preference correlated positively with academic performance, but higher cognitive load was negatively associated with grades.

RAG Tools Evals

arXiv cs.AI·May 19

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Novel mechanistic interpretability approach for Vision-Language-Action (VLA) robot policies. Authors propose sparse autoencoders (SAE) grounded in behavioral events rather than text contexts. Evaluation on OpenVLA and π₀.₅ across simulation and real-robot experiments, with code released.

Vision Robotics AI Agents

arXiv cs.AI·May 19

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule is a multimodal, multilingual benchmark for moderating pluralistic communities on social media. It covers 13,371 rule violations across 1,989 Reddit communities and 2,885 rules in 9 languages. State-of-the-art vision-language models, including GPT-4.5 with advanced reasoning, only marginally outperform a trivial baseline, revealing that pluralistic moderation remains a fundamental challenge.

Benchmarks Vision AI safety

arXiv cs.AI·May 19

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

Study comparing OpenAI o3 and Google Gemini 2.5 Pro as models of human driving behavior in a simplified merging scenario. LLMs reproduce intermittent operational control and tactical dependencies, but fail to capture responses to dynamic velocity cues. Prompt ablations reveal model-specific inductive biases that do not transfer across LLMs.

GPT Gemini Reasoning

arXiv cs.AI·May 19

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

HAAS is a framework for adaptive task allocation between humans and AI systems in software engineering and manufacturing. It combines rule-based governance constraints with contextual-bandit learning. Results show governance is not binary but a tunable design variable: moderate governance improves operational performance and reduces fatigue in manufacturing while remaining competitive as the learner gains experience.
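
One way to picture rule-based governance combined with contextual-bandit learning is an epsilon-greedy bandit that may only choose among actions a rule has approved. This is a minimal sketch under that assumption, not the HAAS implementation; the rule, the `risk` context field, the reward scheme, and all names are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ["human", "ai"]  # hypothetical allocation targets

def allowed_actions(context: dict) -> list:
    """Hypothetical governance rule: high-risk tasks must go to a human."""
    if context["risk"] == "high":
        return ["human"]
    return list(ACTIONS)

class GovernedBandit:
    """Epsilon-greedy contextual bandit restricted to rule-approved actions."""

    def __init__(self, epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context key, action) -> number of pulls
        self.values = defaultdict(float)  # (context key, action) -> mean reward

    def choose(self, context: dict) -> str:
        options = allowed_actions(context)       # governance is applied first
        if self.rng.random() < self.epsilon:
            return self.rng.choice(options)      # explore within the allowed set
        key = context["risk"]
        return max(options, key=lambda a: self.values[(key, a)])  # exploit

    def update(self, context: dict, action: str, reward: float) -> None:
        k = (context["risk"], action)
        self.counts[k] += 1
        self.values[k] += (reward - self.values[k]) / self.counts[k]  # running mean

bandit = GovernedBandit()
for _ in range(500):
    ctx = {"risk": bandit.rng.choice(["low", "high"])}
    act = bandit.choose(ctx)
    # Toy reward: AI does better on low-risk tasks, humans on high-risk ones.
    reward = 1.0 if (ctx["risk"] == "low") == (act == "ai") else 0.2
    bandit.update(ctx, act, reward)
```

Tightening or loosening `allowed_actions` changes how much of the action space the learner can explore, which is one reading of governance as a tunable design variable.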

AI Agents Multi-agent Reinforcement learning

arXiv cs.AI·May 19

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

arXiv paper identifies a new jailbreak attack class: "wide-net-casting" where adversaries query multiple large models simultaneously to bypass safeguards. Researchers develop a tailored jailbreak method achieving 100% success rate on unprotected models in some experiments, exposing significant safety risks.

AI safety Alignment Benchmarks

arXiv cs.AI·May 19

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

New framework to stabilize temporal predictions in surgical phase recognition. Introduces TEC loss (training), EGTP (inference), and TFI (metric). Reduces prediction fragmentation on Cholec80 and AutoLaparo while maintaining frame-wise accuracy.

Vision Reasoning Evals

arXiv cs.AI·May 19

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG is a multi-agent framework for retrieval-augmented generation applied to medical reasoning. It decomposes the process into three specialist agents: clinical interpretation, iterative document exploration, and evidence adjudication. Tested on 5 benchmarks and 5 LLM backbones, it improves baselines by +6.46 accuracy points on average.

Multi-agent RAG Reasoning

arXiv cs.AI·May 19

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

Approach to process body-worn camera (BWC) video into 10-second windows labeled by operational context and motion intensity. Models trained with CLIP and optical flow: 78.75% accuracy for context, 88.33% for activity. Privacy-conscious protocol to speed up incident review and officer training workflows.

Vision Benchmarks AI safety

arXiv cs.AI·May 19

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

VLATIM, a new benchmark based on The Incredible Machine 2, evaluates Vision-Language Models' logical reasoning in point-and-click puzzle games. Results reveal a significant gap: large proprietary models excel at planning but struggle with precise visual grounding, failing to match human-level problem-solving.

Vision Reasoning Benchmarks

arXiv cs.AI·May 19

Causal Bias Detection in Generative Artificial Intelligence

arXiv paper proposing a theoretical framework for detecting causal bias in generative AI models. Authors formalize causal fairness specific to generative models (vs standard ML), derive causal decompositions to quantify bias impacts across different causal pathways, and demonstrate their methodology by analyzing race and gender bias in large language models.

Papers AI safety Alignment

arXiv cs.CL·May 19

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

Study on LLMs' (including OpenAI o1) ability to solve and generate linguistic puzzles from Linguistic Olympiads. Models outperform humans on most puzzle types except writing systems and understudied languages. Automated puzzle generation could expand interest in linguistics and support rare language dissemination.

GPT OpenAI Benchmarks

arXiv cs.CL·May 19

Prompt reinforcing for long-term planning of large language models

Prompt optimization framework inspired by reinforcement learning to improve long-term planning in LLM multi-turn interactions. Method modifies only task instruction via turn-by-turn feedback and experience replay. Significant improvements on text-to-SQL and task-oriented dialogue, generalizes across LLM agents.

Prompt engineering Reinforcement learning AI Agents

arXiv cs.AI·May 19

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

A PPE framework uses one-class density estimators with fused text embeddings to detect contextual data leakage in RAG systems. The T3+OCSVM detector achieves 0.93+ AUROC, reduces false positives by 44-55 percentage points, and maintains millisecond latency, outperforming supervised MLP classifiers and 14B-parameter LLM judges.

RAG AI safety Embeddings

arXiv cs.AI·May 19

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Behavior Foundation Models (BFMs) enable scalable imitation learning but fail under dynamics shifts (friction, actuation, noise). This paper formulates BFM task-inference as robust minimax optimization, enabling adaptation to worst-case dynamics perturbations without retraining. The framework outperforms standard BFM and robust offline IL baselines under dynamics shifts.

Reinforcement learning Papers Evals

arXiv cs.AI·May 19

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

arXiv study on long-term time series forecasting (LTSF). Authors show that a simple linear layer (affine mapping) dominates performance on standard benchmarks. Analysis reveals models learn similar transition matrices, capture periodic patterns well but fail on non-periodic signals. Code available.
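
To see why a single linear map can capture periodic patterns, consider a hand-built weight matrix that copies values from one period earlier. This is an illustrative sketch only: the paper studies learned linear layers, whereas the matrix here is fixed by hand, and the sizes and function names are assumptions.

```python
import math

PERIOD, LOOKBACK, HORIZON = 8, 16, 4  # illustrative sizes, not from the paper

def shift_matrix(lookback: int, horizon: int, period: int) -> list:
    """Weights W such that forecast[h] = window[lookback - period + h]."""
    return [[1.0 if j == lookback - period + h else 0.0 for j in range(lookback)]
            for h in range(horizon)]

def forecast(weights: list, window: list) -> list:
    """Apply the linear map: one dot product per forecast step."""
    return [sum(w * x for w, x in zip(row, window)) for row in weights]

# A sine wave with an integer period of 8 samples.
series = [math.sin(2 * math.pi * t / PERIOD) for t in range(LOOKBACK + HORIZON)]

W = shift_matrix(LOOKBACK, HORIZON, PERIOD)
pred = forecast(W, series[:LOOKBACK])
truth = series[LOOKBACK:LOOKBACK + HORIZON]
max_err = max(abs(p - t) for p, t in zip(pred, truth))  # near zero for this signal
```

Because the signal repeats every `PERIOD` samples, copying the value from one period back reproduces the future up to floating-point error. A trained linear layer is free to find weights of this kind on its own.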

Benchmarks Papers Reasoning

arXiv cs.AI·May 19

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

SEED is a framework representing experimental conditions as typed actor-flow graphs to study multi-agent systems and human-AI workflows. It enables describing conditions, evaluating structural novelty, and generating candidate designs under constraints. Empirical test on medical-triage task shows SEED-guided designs provide clearer interaction changes, assumptions, and governance checks.

AI Agents Multi-agent Evals

arXiv cs.AI·May 19

SynC: Synergistic Boosting of Structure and Representation for Deep Graph Clustering

SynC, a deep graph clustering framework, leverages the synergistic relationship between representation learning and structure augmentation via a Transform Input Graph Auto-Encoder (TIGAE). The model shares weights across two stages to reduce parameters and improves generalization on low-homophily graphs.

Benchmarks Papers

arXiv cs.CL·May 19

Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

Psycholinguistic study of reading dynamics for noisy-channel garden-path sentences. Readers make targeted eye-movement regressions toward regions likely containing errors, confirming a noisy-channel processing model with reanalysis.

Reasoning

arXiv cs.AI·May 19

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD proposes a compression method for physics foundation models (PFMs) that preserves physical fidelity. The technique models layer sensitivity in the output function space, avoiding severe performance degradation caused by conventional methods. Experiments show substantial gains in compression ratios while maintaining accuracy across multiple models and datasets.

Papers Benchmarks Infrastructure

arXiv cs.AI·May 19

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

Visual anomaly detection system using unsupervised learning on Raspberry Pi. Training and inference in 90 seconds with 10 normal images, F1 score >0.95. Deployment via Anomalib and OpenVINO for SMEs.

Vision Open source Tools

arXiv cs.AI·May 19

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

SkillTTA synthesizes task-specific textual skills by retrieving relevant training trajectories, with adaptation through context only—no parameter updates. Evaluated on SpreadsheetBench, ALFWorld, and BigCodeBench: Pass@1 improves from 0.397 to 0.505 on SpreadsheetBench, from 0.517 to 0.651 on BigCodeBench.
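
The reported Pass@1 figures can be restated as absolute and relative gains. The short sketch below only reworks the numbers quoted in the summary; the helper name `gains` is illustrative.

```python
def gains(before: float, after: float) -> tuple:
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before

for name, before, after in [("SpreadsheetBench", 0.397, 0.505),
                            ("BigCodeBench", 0.517, 0.651)]:
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

Both benchmarks show a relative improvement of roughly a quarter over the baseline Pass@1.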

AI Agents Prompt engineering Benchmarks

arXiv cs.AI·May 19

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

Method to extend ECG foundation models (pretrained on 10-second segments) to longer and variable-length recordings. A lightweight plug-in module adds long-sequence processing and temporal modeling without retraining the backbone. Results on multiple long-horizon ECG tasks outperform sliding-window and pooling baselines.

Papers Fine-tuning Vision

arXiv cs.AI·May 19

Latent Action Control for Reasoning-Guided Unified Image Generation

LAC (Latent Action Control) makes reasoning actionable in unified generative models by representing planning and diagnosis as continuous hidden actions. Integrated into BAGEL-7B-MoT, LAC improves compositional and knowledge-grounded generation via variational alignment and GRPO, with major gains on spatial relations and attribute binding.

Image generation Reasoning Code generation

arXiv cs.AI·May 19

Uncertainty Quantification as a Principled Foundation for Explainable Artificial Intelligence: A Case Study of Counterfactual Explanations

arXiv paper proposing counterfactual explainability grounded in uncertainty quantification. Authors demonstrate that integrating foundational AI concepts—particularly uncertainty—improves robustness and reliability of explanations, achieving competitive performance despite radically simple design.

arXiv cs.CL·May 19

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Whisper, an iterative prompting framework, reduces response length in large reasoning models (LRMs) via black-box persuasive prompting. 3x reduction on GSM8K for Qwen3; ~40% average token savings across benchmarks. Claude-3.7 and Gemini-2.5 see reductions of 46% to 50% on MATH-500.

Prompt engineering Reasoning Benchmarks

arXiv cs.AI·May 19

Adaptive Camera Sensor for Vision Models

Lens, a camera sensor control method, adapts acquisition parameters in real-time to improve vision model performance. Using VisiT, a training-free quality indicator based on confidence scores, Lens compensates for domain shift without extensive model modification. ImageNet-ES Diverse benchmark introduced.

Vision Benchmarks Evals

arXiv cs.AI·May 19

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

Metric-guided fusion approach combining complementary features from visual foundation models (SAM2, DINOv3) for dense prediction tasks. Two label-free metrics (Structural Coherence, Edge Fidelity) assess encoders and select complementary pairs. Consistent performance gains across multiple tasks without complex architectural changes.

Vision Benchmarks Open source

arXiv cs.AI·May 19

Long Context Modeling with Ranked Memory-Augmented Retrieval

ERMAR (Enhanced Ranked Memory Augmented Retrieval) is a framework for effective long-context management in language models. It employs a novel relevance scoring mechanism and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques. Reports state-of-the-art results on standard benchmarks, with improved scalability.

RAG Reasoning Benchmarks

arXiv cs.AI·May 19

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

XDiffuser combines state-space graph planning with diffusion to improve long-horizon planning. The model first computes a classical plan serving as a lightweight connectivity oracle, then uses it to guide denoising of a single trajectory. Outperforms diffusion baselines on long-horizon tasks, multi-agent coordination, and TSP-style reasoning.

Reasoning

arXiv cs.CL·May 19

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

SemAlign enables cross-scale knowledge transfer between language models via latent semantic alignment. Instead of direct parameter copying, the method uses activations as transfer medium, pairing source and target layers and optimizing through semantic supervision. Evaluated on four benchmarks.

Fine-tuning Reasoning Papers

arXiv cs.AI·May 19

Content-Style Identification via Differential Independence

New arXiv paper introducing CSDI (content-style differential independence) to identify content and style factors in multi-domain generative models. Relaxes prior statistical independence conditions via blockwise orthogonality constraints on Jacobian subspaces. Demonstrates identifiability even with dependent content/style and dense Jacobians.

Papers Image generation Reasoning

arXiv cs.CL·May 19

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

T-FIX is an evaluation framework to measure alignment of LLM-generated explanations with expert reasoning in specialized domains (surgery, astronomy, therapy). Covers seven scientific tasks across three domains with expert-defined criteria. Enables automatic, generalizable evaluation without ongoing expert annotation.

Evals Reasoning AI safety
