arXiv cs.CL

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Signal: 78
Hype: 15
In three lines
LongMINT is a benchmark that evaluates how agents manage memory under multi-target interference in long contexts of up to 1.8M tokens.
It contains 15.6k QA pairs across four domains: state tracking, dialogue, Wikipedia revisions, and GitHub commits.
The seven systems tested (long-context LLMs, RAG, and agent frameworks) achieve an average accuracy of 27.9%, bottlenecked by retrieval and memory construction.
AI Agents, Benchmarks, RAG, Reasoning

Summary generated by Claude — human-verified