arXiv cs.AI

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch is a multimodal benchmark for short-video frame search in the Chinese gaming domain. It contains 5,000 test examples and 4,198 training examples based on real game scenes. Evaluation compares direct QA, RAG, Plan-Act-Replan agents, and learned search models: the best open-source model reaches 66.4%, the best practical agent 79.1%, and the oracle 95.4%.

Benchmarks · AI Agents · RAG
SIG 78 · HYP 15

arXiv cs.AI

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

SCICONVBENCH benchmarks LLMs on multi-turn clarification of ill-posed scientific problems across fluid mechanics, solid mechanics, materials science, and PDEs. The best models resolve only 52.7% of disambiguation cases in fluid mechanics, but perform better on inconsistency detection. The benchmark evaluates clarification behavior, conversational grounding, and specification fidelity.

Benchmarks · Reasoning · Code generation
SIG 78 · HYP 15

arXiv cs.AI

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer embeds Port-Hamiltonian physical principles into world models to improve latent imagination. The framework models energy evolution, estimates the Hamiltonian from proprioceptive observations, and uses an energy-guided Actor-Critic. Results: 4.18-8.41% phase space volume reduction, up to 7.80% energy consumption decrease, up to 9.38% jerk reduction.

Reasoning · Reinforcement learning · Papers
SIG 78 · HYP 15

arXiv cs.AI

Membership Inference Attacks on Discrete Diffusion Language Models

A study of membership inference attacks (MIA) on masked diffusion language models (MDLM). The researchers extract 46-dimensional feature vectors from the reconstruction loss at different masking ratios and train XGBoost and MLP classifiers. On the MIMIR benchmark, XGBoost achieves an AUC of 0.878 (peak 0.930), outperforming the SAMA baseline by 0.062 AUC. The ELBO trajectory alone drives most of the signal.
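
For readers unfamiliar with the AUC figures quoted above, here is a minimal sketch of how an attack's AUC can be computed from raw scores. It uses the standard pairwise (Mann-Whitney) definition; the function name and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def roc_auc(member_scores, nonmember_scores):
    """AUC as the probability that a randomly chosen member receives a
    higher attack score than a randomly chosen non-member (ties count 0.5)."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy scores (made up): higher means "more likely a training member".
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

An AUC of 0.5 corresponds to chance-level guessing, and 1.0 to perfect separation of members from non-members.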

AI safety · Benchmarks · Papers
SIG 78 · HYP 15

arXiv cs.AI

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

MINE (Mechanistically Interpretable Neural Encoding) applies mechanistic interpretability to neural encoding models to identify visual features driving activation in individual voxels of human visual cortex. Using language-aligned image representations and counterfactual editing, the approach causally validates fine-grained selectivity in category-selective brain regions.

Vision · Papers
SIG 78 · HYP 15

arXiv cs.LG

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

PIMSM embeds physical constraints into a multi-scale Mamba architecture to improve representation stability under distribution shift. The model aligns discretization parameters to characteristic frequencies in time series (fMRI, weather). Results: improved robustness on the Human Connectome Project and Weather-5K datasets, with minimal MAE in out-of-distribution forecasting.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 18

arXiv cs.AI

Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

Alignment evaluation across 75 LLMs benchmarked against 376 humans. Qualitative analysis derives 6 themes of optimal AI functioning (Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, Agency). Models reproduce human value ordering but systematically exaggerate differences. Profile fidelity does not correlate with model size or recency.

Alignment · Evals · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

BPO, a three-stage framework (bootstrapping, extrapolation, refinement), creates a self-improving data flywheel to train robust reasoning models for long-horizon sparse-reward planning. It uses planning quaternions, long-short chain-of-thought fusion, and complexity-stratified curriculum learning. It reports state-of-the-art results on ALFWorld, ScienceWorld, and WebShop with significant token efficiency.

Reasoning · AI Agents · Reinforcement learning
SIG 78 · HYP 25

arXiv cs.AI

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

R-AIRL (Reasoning Adversarial Inverse Reinforcement Learning) infers process-level reward functions from expert chain-of-thought demonstrations without explicit reward definitions. Tested on GSM8K, MMLU-Pro, and MedReason, it improves pass@1 by 17.4 points via inference-time reranking, outperforms SFT in post-training, and localizes reasoning failures with 86.1% accuracy.
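
As a rough illustration of inference-time reranking with a process-level reward, the sketch below picks the candidate reasoning chain with the highest mean per-step reward. The `step_reward` callable stands in for a learned reward model, and averaging over steps is an assumption made for this example; the summary does not say how R-AIRL aggregates step rewards.

```python
from typing import Callable, List, Sequence


def rerank(candidates: Sequence[List[str]],
           step_reward: Callable[[str], float]) -> List[str]:
    """Return the candidate chain with the highest mean per-step reward.

    `step_reward` is a hypothetical stand-in for a learned process reward.
    Empty chains are scored as -inf so they are never preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate chain")

    def score(chain: List[str]) -> float:
        if not chain:
            return float("-inf")
        return sum(step_reward(step) for step in chain) / len(chain)

    return max(candidates, key=score)


# Toy reward (illustrative only): favor steps that state an equation.
def toy_reward(step: str) -> float:
    return 1.0 if "=" in step else 0.0


best = rerank([["just guess 12"], ["2 + 2 = 4", "4 * 3 = 12"]], toy_reward)
```

Because the reward is applied per step, the same scores could in principle also be inspected step by step to see where a chain goes wrong.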

Reinforcement learning · Reasoning · Evals
SIG 78 · HYP 25

arXiv cs.AI

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

GCPO (Group Cooperative Policy Optimization) replaces competitive rollout optimization with team-level credit assignment. Rollouts are rewarded by contribution to valid solution coverage (determinant volume over semantic embeddings), not individual accuracy. Results: improved reasoning accuracy and solution diversity across benchmarks.
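
One way to picture a determinant-volume coverage score is sketched below. It scores a set of rollout embeddings by det(I + G), where G is their Gram matrix, and credits each rollout with the drop in that score when it is removed. The identity regularization, the leave-one-out credit, and all function names are assumptions for illustration; the summary does not give GCPO's exact formula.

```python
from typing import List

Vector = List[float]


def gram_plus_identity(vectors: List[Vector]) -> List[List[float]]:
    """Build I + G, where G[i][j] is the dot product of vectors i and j."""
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j]))
             + (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def det(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def coverage(embeddings: List[Vector]) -> float:
    """Regularized volume det(I + G); equals 1.0 for an empty set."""
    return det(gram_plus_identity(embeddings))


def leave_one_out_credit(embeddings: List[Vector]) -> List[float]:
    """Credit each rollout by how much coverage is lost without it."""
    total = coverage(embeddings)
    return [total - coverage(embeddings[:i] + embeddings[i + 1:])
            for i in range(len(embeddings))]


# Two near-duplicate solutions and one distinct solution (toy embeddings).
credits = leave_one_out_credit([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy case the two identical rollouts each receive less credit than the distinct one, which is the qualitative behavior a coverage-based reward is meant to produce.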

Reinforcement learning · Reasoning · Benchmarks
SIG 78 · HYP 25

arXiv cs.AI

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

LLMs exhibit a knowing-doing gap in tool use: recognizing that a tool is necessary and actually invoking it diverge. Tests of 4 models on arithmetic and factual QA reveal mismatch rates of 26.5-54%. Hidden-state probing shows that cognition and action signals become nearly orthogonal in late layers, with most failures occurring at the cognition-to-action transition rather than in recognition itself.
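
"Nearly orthogonal" can be made concrete with cosine similarity: two directions are orthogonal when their cosine is 0. The sketch below assumes each signal is summarized by a single direction vector (for example, the weight vector of a linear probe); the vectors shown are made up and the paper's probing setup may differ.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical probe directions for "tool is needed" and "tool is called".
cognition_direction = [1.0, 0.0, 0.0]
action_direction = [0.0, 1.0, 0.0]
similarity = cosine_similarity(cognition_direction, action_direction)
```

A value near 0 means that knowing the model's position along one direction says little about its position along the other.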

AI Agents · Tools · Reasoning
SIG 78 · HYP 15

arXiv cs.AI

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0 is an open-source clinical deep learning toolkit reducing barriers to medical AI research. It unifies 15+ datasets, 20+ clinical tasks, 25+ models, and 5+ interpretability methods in a single framework supporting signals, imaging, and electronic health records. Delivers 39x speedup and 20x memory reduction, with 400+ community members.

Open source · Code generation · Evals
SIG 78 · HYP 25