arXiv cs.AI · 19 May 2026

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution


In three lines: LongAct is a benchmark for evaluating autonomous planning in long-horizon household tasks specified via free-form instructions. HoloMind, a VLM-driven agent with a DAG-based hierarchical planner, Multimodal Spatial Memory, and Episodic Memory, achieves 59% goal completion and 16% full-task success with GPT-5 and Qwen3-VL models.
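To make "DAG-based planner" concrete, here is a minimal sketch of the general idea only: subgoals form a directed acyclic graph, and a valid execution order is any topological ordering of it. This is not HoloMind's implementation; the task, the subgoal names, and the `execution_order` helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical subgoals for a "make tea" instruction (illustrative only).
# Each key maps to the set of subgoals that must finish before it can start.
plan = {
    "fill_kettle": set(),
    "boil_water": {"fill_kettle"},
    "get_mug": set(),
    "add_tea_bag": {"get_mug"},
    "pour_water": {"boil_water", "add_tea_bag"},
}


def execution_order(dag):
    """Return one valid ordering in which every dependency precedes its dependents."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

Independent branches (here, the kettle and the mug) may appear in either relative order, which is why a DAG suits household tasks better than a fixed linear script.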


Benchmarks AI Agents Reasoning Vision

Summary generated by Claude — human-verified
