5480 articles
arXiv cs.AI·

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

COSMO-Agent, a tool-augmented RL framework, trains LLMs to orchestrate iterative CAD-CAE processes. The system learns to generate parametric geometry, run simulations, and revise designs under multiple constraints. The work includes an industry-aligned dataset covering 25 component categories. Small LLMs trained this way outperform large open-source and closed-source models in feasibility and stability.

AI Agents · Reinforcement learning · Tools
SIG 78 · HYP 25
arXiv cs.LG·

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

GOEN (Geometry-Optimised Epistemic Network) combines multi-scale features, L2 normalisation, Mahalanobis distance, and calibration to detect out-of-distribution inputs. Key finding: CenterLoss degrades OOD detection (AUROC 0.9366 vs 0.9483 without), despite improving classification accuracy. GOEN-NoCenterLoss achieves 0.9483 AUROC on CIFAR-10, outperforming deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870).

AI safety · Evals · Benchmarks
SIG 78 · HYP 25
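
The Mahalanobis scoring mentioned in the GOEN summary above can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a single feature scale rather than multi-scale features, assumes a covariance shared across classes, and omits calibration. The function names `fit_class_gaussians` and `mahalanobis_ood_score` and the `1e-6` regulariser are hypothetical choices made for illustration.

```python
import numpy as np


def fit_class_gaussians(features, labels):
    """Fit per-class means and one shared precision matrix (an assumption)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise each feature vector, as the summary describes.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Centre every sample on the mean of its own class.
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    # Small ridge term keeps the covariance invertible (illustrative value).
    cov += 1e-6 * np.eye(cov.shape[0])
    return means, np.linalg.inv(cov)


def mahalanobis_ood_score(x, means, precision):
    """Squared Mahalanobis distance to the nearest class mean.

    Larger values suggest the input is further from the training data.
    """
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)
    diffs = means - x
    d2 = np.einsum("ij,jk,ik->i", diffs, precision, diffs)
    return float(d2.min())
```

In this sketch an input is flagged as out-of-distribution when its score exceeds a threshold chosen on held-out data; how GOEN sets or calibrates that threshold is not stated in the summary.
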
arXiv cs.LG·

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

A real-world deployment of a federated learning pipeline for predicting iron deficiency from full blood count data. It uses DeepCBC (a frozen haematology foundation model) with FedMAP (personalised aggregation). Tested across two clinical sites (AUMC, NHSBT) with non-IID data. Compared with local-only training, FedMAP improves ROC-AUC from 0.947 to 0.959 (AUMC) and from 0.856 to 0.867 (NHSBT).

Embeddings · Benchmarks
SIG 78 · HYP 15
arXiv cs.AI·

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot generates critical scenarios for autonomous driving testing via multi-objective reinforcement learning. The framework combines RSS-derived physical feasibility with an AV-risk predictor to target boundary-band scenarios: those that are physically solvable yet still cause failures. Results: the collision rate on SafeBench rises by 6.2 percentage points while physical validity is preserved.

Reinforcement learning · AI safety · Evals
SIG 78 · HYP 15
arXiv cs.AI·

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a deep research benchmark evaluating 9 frontier models on tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. Errors stem primarily from derivation and calibration (>70%), not retrieval (12-14%). Strong and weak models fail differently: incomplete derivation vs hallucinated precision.

Benchmarks · Reasoning · AI Agents
SIG 78 · HYP 25
arXiv cs.LG·

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

The authors show that teacher-token reliability in reasoning self-distillation depends on position within the trajectory, not on local entropy. They propose Position-Weighted OPSD (PW-OPSD), which applies increasing position weights to token supervision. On Qwen3-4B, AIME 2024/2025 scores improve by +1.0/+1.1 points; validation on DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think confirms the gains.

Reasoning · Fine-tuning · Benchmarks
SIG 78 · HYP 15
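
The idea of weighting token supervision by position, as in the PW-OPSD summary above, can be sketched as follows. The summary says only that the weights increase with position, so the linear schedule, the `w_min` and `w_max` defaults, and both function names are assumptions made for illustration, not the paper's method.

```python
def position_weights(length, w_min=0.5, w_max=1.5):
    """Linearly increasing weights from w_min (first token) to w_max (last).

    The linear shape and the default bounds are illustrative assumptions.
    """
    if length <= 0:
        return []
    if length == 1:
        return [w_max]
    step = (w_max - w_min) / (length - 1)
    return [w_min + i * step for i in range(length)]


def position_weighted_loss(token_losses, w_min=0.5, w_max=1.5):
    """Weighted mean of per-token losses, with later tokens counting more."""
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    weights = position_weights(len(token_losses), w_min, w_max)
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / sum(weights)
```

Under this schedule a mismatch with the teacher near the end of a trajectory contributes more to the loss than the same mismatch near the start, and a trajectory with uniform per-token loss keeps that value unchanged.
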
arXiv cs.LG·

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel is a trajectory selection method for offline training of web agents. It optimizes a balance between importance and diversity across states, websites, and interaction patterns, with target-centered AXTree pruning. On WebArena, WorkArena, and MiniWob, it improves out-of-domain generalization with 9.7-12.5× training speedups over standard fine-tuning on Qwen2.5-7B, Gemma3-4B, and Qwen3-8B.

AI Agents · Fine-tuning · Benchmarks
SIG 78 · HYP 18