arXiv cs.AI

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Signal: 75
Hype: 25
In three lines: LongAct is a benchmark for evaluating autonomous planning in long-horizon household tasks specified through free-form instructions. HoloMind, a VLM-driven agent with a DAG-based hierarchical planner, Multimodal Spatial Memory, and Episodic Memory, achieves 59% goal completion and 16% full-task success with GPT-5 and Qwen3-VL models.
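To make the two reported numbers easier to read, here is a minimal sketch of how a DAG-structured plan and the two metrics might relate. Everything in it is an illustrative assumption rather than the paper's implementation: the subgoal names, the `plan` dictionary, and the metric definitions (goal completion as the fraction of subgoals achieved, full-task success as all subgoals achieved) are hypothetical. Under those assumptions, an agent can earn substantial partial credit while still failing most tasks outright, which would be consistent with a goal-completion rate well above the full-task success rate.

```python
from graphlib import TopologicalSorter

# Hypothetical household task: each subgoal maps to the subgoals it depends on.
plan = {
    "find_mug": set(),
    "pick_up_mug": {"find_mug"},
    "go_to_sink": {"pick_up_mug"},
    "wash_mug": {"go_to_sink"},
    "find_shelf": set(),
    "place_mug_on_shelf": {"wash_mug", "find_shelf"},
}


def execution_order(dag):
    """Return one valid order in which every subgoal follows its prerequisites."""
    return list(TopologicalSorter(dag).static_order())


def goal_completion(completed, dag):
    """Fraction of the plan's subgoals that were achieved (assumed definition)."""
    return len(set(completed) & set(dag)) / len(dag)


def full_task_success(completed, dag):
    """True only if every subgoal in the plan was achieved (assumed definition)."""
    return set(dag) <= set(completed)


# An episode that stalls halfway: partial credit, but no full-task success.
done = ["find_mug", "pick_up_mug", "go_to_sink"]
print(execution_order(plan))
print(goal_completion(done, plan), full_task_success(done, plan))
```

The dependency map is what allows independent subgoals (here, `find_shelf`) to be scheduled at any point before the step that needs them, instead of being locked into a fixed linear script.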
Benchmarks · AI Agents · Reasoning · Vision

Summary generated by Claude — human-verified