arXiv cs.AI

CEO-Bench evaluates agents' ability to handle complex long-horizon tasks by simulating a 500-day startup operation. The agent manages pricing, marketing, and budgeting through a Python interface. Only Claude Opus 4.8 and GPT-5.5 exceed the $1M starting balance, and neither is consistently profitable.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 18

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. Models receive a structured game state and predict hidden future states; the benchmark continues the simulation to score forecasts. Enables questions at arbitrary time horizons, counterfactual worlds, and rare events.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 18

Searching for Synergy in Shared Workspace Human-AI Collaboration

Study of human-AI team collaboration in a shared workspace using Collaborative Gym and DiscoveryBench. Adding collaborators improves performance only when a coordination structure is in place. Scaffolding that combines shared memory and human-in-the-loop gates increases performance, especially in three-person teams, by clarifying responsibilities and routing expertise.

AI Agents Multi-agent Evals

arXiv cs.AI·Jun 18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb is a benchmark of 372,648+ instances for evaluating defeasible abduction reasoning in language models. Best frontier models reach 65% under standard conditions but drop to 23.5% under rendering-robust evaluation, versus 100% for symbolic logic solvers. The benchmark includes three difficulty levels with polynomial-time verifiable gold standards.

Benchmarks Reasoning Evals

arXiv cs.AI·Jun 18

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

A POMDP framework optimizes lithium production decisions by incorporating geological, pricing, and demand uncertainties. POMDP solvers outperform human-inspired heuristics by dynamically adapting to price regimes (static, linear, exponential, stochastic) and optimally sequencing exploration, production, and technology choices.

Reasoning Reinforcement learning

arXiv cs.AI·Jun 18

What Must Generalist Agents Remember?

Theoretical paper on memory requirements for generalist agents. Proves that agents performing near-optimally across multiple domains must maintain distinct memory distributions at observational bottlenecks. Memory enables domain disambiguation, transition-model reconstruction, and planning.

AI Agents Reasoning Papers

arXiv cs.AI·Jun 18

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM is an agentic LLM pipeline deployed at DiDi to extract semantic user profiles from massive behavioral logs. The system uses 27 analytical tools to mine platform-scale data and generates utility-aligned profiles, achieving +6.14% AUC improvement and +0.47% GMV gain in A/B testing.

AI Agents Llama RAG

arXiv cs.AI·Jun 18

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines is a long-horizon embodied agent benchmark testing memory in dynamic household environments. The dataset includes temporally extended traces with dialogues, actions, and object/device state changes. ObsMem, an observer-grounded memory framework, maintains visibility-aware memories and action-native state trails for state-informed decisions.

AI Agents Benchmarks Reasoning

arXiv cs.AI·Jun 18

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

BeliefDiffusion combines diffusion models and Model Predictive Control for navigation in partially observable environments. The framework generates multimodal belief distributions and plans efficient navigation strategies. In experiments on synthetic maps, it outperforms RL and other generative approaches in success rate and path efficiency.

Reasoning Reinforcement learning Papers

arXiv cs.AI·Jun 18

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench is a safety evaluation benchmark for LLMs in AI4Science workflows. It covers 7 disciplines, 31 sub-disciplines, and 10 risk dimensions. The authors evaluate mainstream and science-oriented LLMs to diagnose safety gaps across risk categories.

Benchmarks AI safety Evals

arXiv cs.AI·Jun 18

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a benchmark to evaluate strategic reasoning in Vision-Language Models (VLMs) using real-time strategy games. Built on Beyond All Reason, it offers multi-scenario evaluations, diagnostic mini-games targeting specific competencies, and a self-evolving generation framework. Current state-of-the-art VLMs fail at multi-agent coordination and complex task scaling.

Vision Reasoning Multi-agent

arXiv cs.AI·Jun 18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception introduces a progressive reinforcement learning framework for interpretable multimodal deception detection. Using MLLMs, it converts binary classification into explicit reasoning via Chain of Thought. VAC-GRPO with curriculum learning stratified into 4 difficulty tiers achieves SOTA on mainstream benchmarks.

Reasoning Reinforcement learning Vision

arXiv cs.AI·Jun 18

Towards an Agent-First Web: Redesigning the Web for AI Agents

Paper proposing web redesign to integrate AI agents as first-class citizens across three layers: access (HTTP headers, dual human/agent content), economics (token-based model, intent-based tiers), content (ATML, cryptographic provenance chain against epistemic recursion). Ten design principles for an agent-first internet.

AI Agents Infrastructure Regulation

arXiv cs.AI·Jun 18

Analysing drivers and interdependencies in European electricity markets using XAI

Study combining deep neural networks with XAI (SHAP, SSHAP) to analyse 39 European electricity bidding zones. Identifies solar energy as a disproportionate price driver, gas prices as the dominant factor, and interconnections that reveal the interdependence of electricity markets.

Evals Papers

arXiv cs.AI·Jun 18

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

New formal theory (HACD-H) modeling emergence of social intelligence in long-term human-AI interaction. Unified framework integrating emotional adaptation, social memory, and personality consistency. Study on 14,700 conversation turns reveals negative correlation between social intelligence and social cognitive energy (r=-0.391, p<0.001), with developmental phase-transition patterns.

Reasoning AI Agents Papers

arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning

arXiv cs.AI·Jun 18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP is a verified benchmark evaluating AI agents on small-molecule preclinical pharmacology. 100 evaluations span mechanism-of-action, pharmacodynamics, compound-target engagement, and safety. Across 16 configurations (11 models, 4,800 trajectories), Claude Opus 4.8 achieves 59.3% success rate, GPT-5.5 55.3%. No system reliably masters these decisions.

AI Agents Benchmarks Claude

arXiv cs.AI·Jun 18

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

NeSyCat Torch unifies neurosymbolic semantics (classical, fuzzy, probabilistic, neural) under a single truth definition parametrized by monads. Implemented in PyTorch, JAX, and HaskTorch, the framework interprets computational symbols via neural networks. On MNIST addition, outperforms LTN and DeepProbLog in speed and accuracy.

Reasoning Reinforcement learning Papers

arXiv cs.AI·Jun 18

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT is a modular agentic-RAG framework reducing VLM hallucinations through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier). Ungrounded claims trigger targeted re-retrieval. 23 component-wise metrics and CaVeScore measure citation faithfulness and cross-modal grounding. Results: 87.1% accuracy on ScienceQA, 55.2% on MMMU.

Vision RAG AI Agents

arXiv cs.AI·Jun 18

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL bridges RoboCup 2D Soccer Simulator (RCSS2D) to Python MARL workflows via shared-memory communication. The environment supports full-field and scenario-based training with discrete/hybrid action spaces, action masks, EPV-based reward shaping, and parallel execution. Includes 11-vs-11 full-field benchmarks and baseline results.

Multi-agent Reinforcement learning Benchmarks

arXiv cs.AI·Jun 18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Xcientist is a research harness that externalizes research synthesis and experimental validation for AI scientists into inspectable, contract-governed processes. It organizes literature evidence, idea states, implementation plans, and repair traces as persistent research artifacts, eliminating claim drift where runnable artifacts no longer support the originally claimed mechanism.

AI Agents Reasoning Evals

arXiv cs.AI·Jun 18

Skill-Guided Continuation Distillation for GUI Agents

SGCD, an iterative self-improvement framework, addresses off-trajectory states in GUI agents. The system first runs a plain policy, then uses a skill-guided policy to generate successful continuations. On OSWorld-Verified, SGCD improves success rates of three base models from ~30% to over 50%.

AI Agents Reinforcement learning Papers

arXiv cs.AI·Jun 18

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Decoupled Search Grounding (DSG) decouples search from reasoning via an MCP-compatible gateway. On SimpleQA, FreshQA, and HotpotQA, DSG achieves 86.1% accuracy (vs 87.7% native) with 91% lower search cost and 68% lower latency. In production e-commerce workload, DSG cuts search cost by 98% while maintaining accuracy.

AI Agents MCP RAG

arXiv cs.AI·Jun 18

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

ARIADNE is a training-free framework for dynamic adapter selection at inference time. It represents each adapter through centroids computed from embeddings of its training set. Tested on Llama 3.2 1B across 23 NLP tasks, it recovers 97.44% of upper-bound performance and achieves 89.7% average selection accuracy on 44 tasks.
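
The centroid-based routing idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single centroid per adapter, cosine similarity as the matching score, and made-up 2-D embeddings, and the function names (`centroid`, `build_index`, `route`) are hypothetical.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_index(train_embeddings):
    """Map each adapter name to the centroid of its training-set embeddings."""
    return {name: centroid(vecs) for name, vecs in train_embeddings.items()}

def route(query_embedding, index):
    """Pick the adapter whose centroid is most similar to the query (assumed rule)."""
    return max(index, key=lambda name: cosine(query_embedding, index[name]))

# Toy embeddings, invented for illustration only.
index = build_index({
    "sentiment": [[1.0, 0.1], [0.9, 0.0]],
    "translation": [[0.0, 1.0], [0.2, 0.8]],
})
chosen = route([0.8, 0.2], index)
print(chosen)
```

No training is involved in this sketch: adding an adapter only requires computing one more centroid, which is what makes the routing step cheap at inference time.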

Fine-tuning Llama Benchmarks

arXiv cs.AI·Jun 18

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Novel LLM personalization: store user facts as surgical edits in a hash-keyed memory table (Engram) instead of global LoRA. Reduces memory footprint by 33,000x, improves indirect-reasoning accuracy by 5.6x on average, and enables stacking multiple users without cross-contamination.

Fine-tuning Reasoning Papers

arXiv cs.AI·Jun 18

X+Slides: Benchmarking Audience-Conditioned Slide Generation

X+Slides is a benchmark for evaluating audience-conditioned slide generation. Built on 113 topics and 8,133 probes, it measures four metrics: Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Tests on DeepPresenter, SlideTailor, and NotebookLM show Audience Coverage scores between 0.594 and 0.853.

Benchmarks Code generation

arXiv cs.AI·Jun 17

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero, an AlphaZero-based agent, masters WallGo, a strategic board game popularized by Netflix's The Devil's Plan (2025). On a 7×7 board, the agent defeats professional Go players with 1.98x more territory on average. Authors analyze game fairness and identify key strategies.

Reinforcement learning Benchmarks Papers

arXiv cs.AI·Jun 17

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

An LLM-based self-evolving agent iteratively generates query rewriting rules to enhance BM25 for legal case retrieval. Tested on LeCaRD-v2 (Chinese benchmark), the framework outperforms baselines without parameter training by leveraging automatic evaluation and eliminating ineffective rules.

AI Agents Reasoning Benchmarks

arXiv cs.AI·Jun 17

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

Academic paper on model predictive control for manufacturing supply chains with skill constraints. Evaluates an MPC controller solving mixed-integer programs (production, inventory, training) on synthetic SkillChain-Gym scenarios. Result: no universal dominance; predictive control helps when bottlenecks are forecastable early enough for training completion.

Reasoning Benchmarks

arXiv cs.AI·Jun 17

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

MemTrace is a benchmark evaluating long-term memory in LLM agents across three dimensions: memory age, question type (current state, earlier state, trajectory), and evidence conditions. Testing 13 configurations, the study finds that the primary bottleneck is using the evidence, not retrieving it: the relevant evidence is retrievable 10× more often than it is missing.

AI Agents Evals Benchmarks

arXiv cs.AI·Jun 17

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Clinical decision support AI system using Digital Twins, Treatment Effect estimation, and Reinforcement Learning for adaptive real-time treatment recommendations. Validated on a synthetic simulator and the TCGA ovarian cancer dataset. Includes a safety module with rule-based vital sign monitoring and clinician escalation for high-uncertainty cases.

Reinforcement learning Reasoning AI safety

arXiv cs.AI·Jun 17

A Machine-Learned Comorbidity Index

Machine-Learned Comorbidity Index (MLCI) maps diagnosis codes to a single scalar by maximizing normalized Hilbert-Schmidt Independence Criterion across multiple clinical outcomes. Unlike traditional indices (Charlson, Elixhauser), MLCI captures nonlinear risk-outcome relationships and outperforms baselines on multiple EHR datasets.

Benchmarks Papers

arXiv cs.AI·Jun 17

Dissecting model behavior through agent trajectories

Study of harness-model alignment via 138k agent trajectories. Authors introduce Simple Strands Agent (SSA), a generic harness tested on Claude, Gemini, GPT, Grok, Qwen across SWE-Pro, SWE-Verified, and Terminal-Bench-2. Beyond pass@1 scores, analysis reveals fine-grained behavioral differences: edit frequency, testing activity, phase transitions.

AI Agents Benchmarks Code generation

arXiv cs.AI·Jun 17

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Curriculum-grounded automated marking pipeline using LLMs to assess exam responses. Grounds model outputs in official curriculum artefacts (syllabus, performance descriptors, marking guidelines). Delivers marking outcomes comparable to human tutors with improved traceability to authorised standards.

Evals Prompt engineering Reasoning

arXiv cs.AI·Jun 17

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

DeepInsight is a unified evaluation infrastructure for Physical AI stacks, spanning three orders of magnitude from foundation-model decoding to full-body physics simulation. It uses three invariant abstractions (task, resource, result) to preserve regime heterogeneity while enabling cross-layer regression diagnostics impossible with federated per-segment harnesses.

Reasoning Evals Robotics

arXiv cs.AI·Jun 17

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen is a financial multimodal reasoning agent that accumulates experience from prior trajectories in persistent memory. The system improves a frozen 8B vision-language model across four financial benchmarks using selective experience activation and a deterministic tool environment for numerical computation and verification.

AI Agents Multi-agent Vision

arXiv cs.AI·Jun 17

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

SkillMigrator is an LLM agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Induced skills are stored as transferable interaction patterns (TIPs). On WebArena and Mind2Web, SkillMigrator reduces average LLM-action count by 8-10% at matched success rate.

AI Agents Code generation Benchmarks

arXiv cs.AI·Jun 17

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne is a multimodal CAD dataset of 100,000 models generated by executable Python programs in Flluma (OpenCASCADE-based CAD system). Each sample aligns the program with a feature tree, STEP representation, point cloud, and natural-language descriptions. A Qwen2.5-Coder-1.5B baseline achieves 99.98% Python syntax validity and 99.14% STEP-export validity.

Code generation Benchmarks Vision

arXiv cs.AI·Jun 17

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

LongWebBench is a benchmark evaluating long-horizon webpage generation by vision-language models. It contains 490 real-world pages for structural evaluation and 507 goal-oriented interaction tasks over 129 pages. Experiments show structural fidelity degrades with webpage length, and visually plausible generations often fail to support multi-step executable interactions.

Vision Benchmarks AI Agents

arXiv cs.AI·Jun 17

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

E³RL, a reinforcement learning method, addresses error propagation in long-horizon reasoning of LLMs. Using autoregressive cross-entropy as an epistemic uncertainty signal, the model can locally correct logical defects and reuse KV cache. On AIME, 4B and 8B models outperform SOTA by 5.349% and 6.514%.

Reinforcement learning Reasoning Benchmarks

arXiv cs.AI·Jun 17

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG improves graph-based retrieval-augmented generation through a multi-granularity heterogeneous graph (passages, summaries, sentences, entities) and frequency-aware weighted flow module. This enhances semantic recall and explicit reasoning for complex multi-hop tasks.

RAG Reasoning Benchmarks

arXiv cs.AI·Jun 17

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

Theoretical paper establishing correspondences between GNN classes and fragments of graded modal logic. Authors characterize the logical expressiveness of GNNs preserving structural properties (embeddings, injective homomorphisms, homomorphisms) through specific fragments of existential modal logic.

Papers Reasoning

arXiv cs.AI·Jun 17

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

Method to quantify social interactions among pedestrians in long-term trajectory forecasting. Label-free probabilistic approach learning directly from trajectory observations and integrating into prediction models. Evaluated on trajectory prediction benchmarks.

Reasoning Benchmarks

arXiv cs.AI·Jun 17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

PreAct compiles successful runs of computer-using agents into small state-machine programs, replayed 8.5-13x faster with no per-step LLM calls. An independent evaluator validates each program before storage. Across three benchmarks (mobile, desktop, web), this verification prevents faulty program accumulation (+1.75-2.6 tasks).

AI Agents Code generation Benchmarks

arXiv cs.AI·Jun 17

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

DivInit improves test-time scaling for agentic search by diversifying initial queries. Instead of sampling k independent queries in parallel, the method generates n candidates then selects k diverse seeds. Gains of 5-7 points on multi-hop QA at matched compute, validated across 5 open-weight models and 8 benchmarks.
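
The "generate n candidates, then select k diverse seeds" step can be sketched as follows. The summary does not say how diversity is measured or how seeds are chosen, so this sketch assumes greedy max-min (farthest-point) selection over a token-level Jaccard distance. Both choices, the function names, and the example queries are illustrative assumptions, not the paper's method.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def select_diverse(candidates, k):
    """Greedily pick k seeds, each maximizing its minimum distance to those already picked."""
    if k <= 0 or not candidates:
        return []
    selected = [candidates[0]]          # start from the first candidate
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# n = 4 hypothetical candidate queries, k = 2 seeds.
candidates = [
    "capital of france",
    "capital city of france",
    "population of paris",
    "french president birthplace",
]
seeds = select_diverse(candidates, 2)
print(seeds)
```

In this toy run the near-duplicate "capital city of france" is skipped in favour of a query with no token overlap, which is the behaviour diverse initialization is meant to encourage.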

AI Agents Reasoning Benchmarks

arXiv cs.AI·Jun 17

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

SkillChain-Gym is a benchmark for reskilling-aware production-inventory control. The environment models skill decay, certification lapses, training actions, and capacity constraints. Evaluation of production-only, reactive adaptive, and static-insurance policies over 60-shift horizons with operational and resilience metrics.

Benchmarks Reinforcement learning AI Agents

arXiv cs.AI·Jun 17

Nothing from Something: Can a Language Model Discover 0?

Study on language models' ability to discover the mathematical concept of zero. GPT-2-sized models fail without additional training, but improve substantially after exposure to tens or hundreds of examples. Language pretraining reduces required examples by ~50%.

Reasoning Papers Benchmarks

arXiv cs.AI·Jun 17

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a multi-task benchmark for clinical speech AI covering 12 datasets and 27 tasks across diverse health conditions. Tasks are structured by speech production stages (conceptualization, formulation, articulation). Evaluation of 12 audio encoders shows large-scale speech models outperform domain-specific ones, but none generalize reliably across clinical speech.

Benchmarks Voice Evals

arXiv cs.AI·Jun 17

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

arXiv paper proposing architecture for distributed peer-to-peer autonomous agent networks. Authors identify three core mechanisms: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation (MG-EigenTrust), and mechanism design for open task execution. Prototypes and simulations presented.

AI Agents Multi-agent Papers

arXiv cs.AI·Jun 17

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench is a benchmark for evaluating LLM agents integrated into map services. It measures their ability to identify and satisfy implicit user needs (unspoken decision factors) from real-world behavioral data. Experiments show current agents perform well on explicit task completion but struggle to proactively address implicit factors.

AI Agents Benchmarks Evals

arXiv cs.AI

CEO-Bench: Can Agents Play the Long Game?
