arXiv cs.AI

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Signal: 75
Hype: 15
In three lines: MirrorBench is a benchmarking framework for evaluating user-proxy agents in conversational systems. It combines six metrics (MATTR, Yule's K, HD-D, GTEval, Pairwise Indistinguishability, and Rubric-and-Reason) to measure how realistic LLM-generated user utterances are across four public datasets. The code is released as open source.
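
For readers unfamiliar with the first two metrics named above, here is a minimal sketch of the standard textbook definitions of MATTR (moving-average type-token ratio) and Yule's K, both lexical-diversity measures. This is not MirrorBench's implementation: the function names, the whitespace tokenization, and the window size of 50 are assumptions made for illustration.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio.

    Averages the type-token ratio over every sliding window of
    `window` consecutive tokens. The default of 50 is an assumed,
    illustrative value. Texts no longer than one window fall back
    to the plain type-token ratio.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K: 10^4 * (sum of squared type frequencies - N) / N^2.

    Higher values mean more repetition, i.e. lower lexical diversity.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


# Hypothetical user utterance, tokenized naively on whitespace.
utterance = "can you help me reset my password i forgot my password"
tokens = utterance.lower().split()
print(mattr(tokens, window=5), yules_k(tokens))
```

Both functions take a list of tokens, so a real tokenizer can be swapped in for the whitespace split. HD-D and the three remaining metrics are left out of this sketch.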
AI Agents · Evals · Benchmarks · Prompt engineering

Summary generated by Claude — human-verified