arXiv cs.CL·

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

Critical study of LLM-based trading agents (FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, FLAG-Trader). Reported Sharpe ratios do not constitute deployment evidence: temporal contamination, unmodeled frictions, and insufficient predictive calibration invalidate the results. The paper proposes a P1-P6 reporting protocol and a modular architecture that uses LLMs as auditable information interfaces.

AI Agents · Benchmarks · Papers
SIG 78 · HYP 15
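
To make the "unmodeled frictions" point in The Alpha Illusion entry concrete, here is a minimal sketch comparing a Sharpe ratio computed on gross returns with one computed after a linear transaction cost. The return series, turnover figures, 10 bps cost, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period excess returns."""
    sd = stdev(returns)
    if sd == 0:
        raise ValueError("returns have zero volatility")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def net_returns(gross_returns, turnover, cost_per_unit_turnover):
    """Subtract a linear trading cost (assumed model) from each period's return."""
    return [r - t * cost_per_unit_turnover
            for r, t in zip(gross_returns, turnover)]


# Hypothetical daily excess returns and portfolio turnover per day.
gross = [0.004, -0.002, 0.003, 0.001, -0.001, 0.002]
turnover = [1.0, 0.5, 1.0, 0.2, 0.8, 0.6]
cost = 0.001  # assumed 10 bps per unit of turnover

net = net_returns(gross, turnover, cost)
print("gross Sharpe:", sharpe_ratio(gross))
print("net Sharpe:  ", sharpe_ratio(net))
```

With these made-up numbers the net figure comes out well below the gross one, which is the kind of gap a backtest hides when it reports only frictionless returns.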
arXiv cs.AI·

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

First systematic study of source attribution for generated 3D assets. Researchers build a benchmark covering 22 3D generators and propose a hierarchical multi-view multi-modal Transformer detecting fingerprints (cross-view inconsistencies, geometric artifacts, frequency-domain signatures). Results: 97.22% accuracy under full supervision, 77.17% with only 1% training data.

Vision · Benchmarks · AI safety
SIG 78 · HYP 25
arXiv cs.CL·

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), an RL method embedding this inversion signature into reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning · Reinforcement learning · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent is a closed-loop agentic framework for iterative endoscopic diagnosis. It couples fine-grained visual evidence acquisition and multi-step reasoning via self-consistency validation (knowledge and temporal consistency). On EndoAgentBench (6,132 QA pairs from 11 datasets), the system achieves 85.23% accuracy on perception and 71.13% clinical acceptance on reasoning tasks.

AI Agents · Reasoning · Vision
SIG 78 · HYP 25
arXiv cs.AI·

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN is a vision-language framework to infer precise accident coordinates from Bangla news reports and map-based cues. Using an agentic architecture combining OCR, LLM, and vision-language models, the system reduces localization error from 10.9 km to 0.593 km on validation data and 0.465 km on official Dhaka Metropolitan Police records.

Vision · AI Agents · Multi-agent
SIG 78 · HYP 25
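
The ALIGN entry reports localization error in kilometres. One common way to measure such an error is the great-circle (haversine) distance between predicted and reference coordinates. The sketch below assumes that metric, which the summary does not confirm. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_error_km(predicted, reference):
    """Average haversine distance over paired (lat, lon) tuples."""
    dists = [haversine_km(p[0], p[1], r[0], r[1])
             for p, r in zip(predicted, reference)]
    return sum(dists) / len(dists)


# Made-up predicted and reference points, for illustration only.
predicted = [(23.8103, 90.4125), (23.7500, 90.3900)]
reference = [(23.8150, 90.4100), (23.7520, 90.3950)]
print(mean_error_km(predicted, reference))
```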
arXiv cs.AI·

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP fuses physical and language feedback to learn robot reward functions in real time using a Bayesian framework. LLMs extract reward-feature attention masks and preference shifts from free-form utterances, which are integrated with physical corrections via a closed-form update rule. It achieves a 70% error reduction versus physical-only and heuristic multimodal baselines in a semi-autonomous driving simulator.

AI Agents · Reinforcement learning · Reasoning
SIG 78 · HYP 25
arXiv cs.CL·

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD introduces regional-to-global self-distillation to improve fine-grained visual understanding in MLLMs. The framework transfers the model's privileged perception on evidence-centered crops to its full-image policy via token-level KL divergence minimization on on-policy rollouts. It reports competitive results on fine-grained visual understanding benchmarks without external models or ground-truth labels.

Vision · Reinforcement learning · Papers
SIG 78 · HYP 25
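
The Vision-OPD entry mentions token-level KL divergence between the crop-conditioned distribution and the full-image policy. As a rough illustration of that quantity, the sketch below computes a per-token KL from raw logits and averages it over a sequence. The direction KL(teacher || student), the averaging, and all names and numbers are assumptions, not details taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def sequence_kl(teacher_seq, student_seq):
    """Mean token-level KL over aligned positions of one rollout."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)


# Toy logits over a 3-word vocabulary for a 2-token rollout (hypothetical).
teacher_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # crop-conditioned view
student_seq = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # full-image view
print(sequence_kl(teacher_seq, student_seq))
```

In a training setup this value would serve as a loss minimized with respect to the student's parameters; the sketch only evaluates it.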
arXiv cs.AI·

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

RL-trained Lean theorem provers suffer mode collapse at inference: doubling the sample budget from k=32 to k=64 on miniF2F-test with DeepSeek-Prover-V1.5-RL solves zero additional theorems (42/244). Imposing fixed structural diversity through 15 tactic skeletons yields a 45% relative improvement at k=16 (+12.3±4.2 theorems). The phenomenon is RL-specific and orthogonal to scaling.

Reasoning · Reinforcement learning · Benchmarks
SIG 78 · HYP 15
arXiv cs.AI·

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Study reveals a safety vulnerability in personalized dialogue agents: long-term memory biases intent inference and legitimizes harmful queries. PS-Bench benchmark shows personalization increases attack success rates by 15.8%–243.7% versus stateless baselines. A lightweight detection-reflection method is proposed to mitigate this safety degradation.

AI safety · AI Agents · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

A neurosymbolic architecture that constrains enterprise LLM agents with ontologies (Role, Domain, Interaction). In a controlled experiment (1,800 runs; Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B), ontology-constrained agents outperform ungrounded agents on metric accuracy and role consistency (p < .001). The lift is 2x greater in localized domains (Vietnam), where LLM training coverage is weak.

AI Agents · Claude · Reasoning
SIG 78 · HYP 25
arXiv cs.AI·

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw is an autonomous agent with a three-tier hierarchical architecture (70 skills) for drug molecule evaluation, screening, and optimization. It integrates 30+ specialized resources and achieves state-of-the-art performance on MolBench, a benchmark spanning 8 to 50+ sequential tool calls. Gains concentrate on structured workflow orchestration rather than ad hoc scripting.

AI Agents · Multi-agent · Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM is an evolving benchmark with 300 multi-agent embodied tasks in 3D household environments with partial observability. It tests functional Theory of Mind—acting optimally on implicit beliefs—rather than literal belief questions. All seven frontier models score 0.0% on hard task completion, with 93% of failures traced to epistemic coordination breakdowns.

Multi-agent · Reasoning · Benchmarks
SIG 78 · HYP 25