arXiv cs.AI

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Signal: 78
Hype: 15
In three lines

LongMINT is a benchmark that evaluates how agents manage memory over long contexts (up to 1.8M tokens) under multi-target interference.
It contains 15.6k QA pairs across four domains: state tracking, dialogue, Wikipedia revisions, and GitHub commits.
The seven systems tested (LLMs, RAG, and agents) reach 27.9% average accuracy, limited by retrieval and memory construction.
Tags: AI Agents, Benchmarks, RAG, Reasoning

Summary generated by Claude — human-verified