arXiv cs.AI·19 May 2026

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

In three lines: MirrorBench is a benchmarking framework for evaluating user-proxy agents in conversational systems. It combines six metrics (MATTR, Yule's K, HD-D, GTEval, Pairwise Indistinguishability, Rubric-and-Reason) to measure how realistic LLM-generated user utterances are across four public datasets. The code is released as open source.
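The first three metrics named above (MATTR, Yule's K, HD-D) are standard lexical-diversity measures, and a minimal sketch of their textbook definitions may help make them concrete. This is not MirrorBench's own implementation: the function names, the whitespace tokenization, the window size of 50, and the sample size of 42 are all assumptions chosen for illustration, and the benchmark may tokenize or parameterize differently.

```python
from collections import Counter
from math import comb


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    The window size is an assumed default, not taken from the paper.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K = 1e4 * (sum_i i^2 * V_i - N) / N^2.

    V_i is the number of word types occurring exactly i times and N is
    the token count. Higher K means more repetition (less diversity).
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    spectrum = Counter(Counter(tokens).values())  # i -> V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def hdd(tokens, sample_size=42):
    """HD-D: summed probability that each type appears at least once in a
    random sample of `sample_size` tokens, divided by the sample size.

    Capping the sample at the text length is an assumption of this sketch.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    draws = min(sample_size, n)
    total = 0.0
    for count in Counter(tokens).values():
        # Hypergeometric probability of drawing zero tokens of this type.
        p_absent = comb(n - count, draws) / comb(n, draws)
        total += (1.0 - p_absent) / draws
    return total


# Hypothetical user utterance, tokenized naively on whitespace.
utterance = "i need to change my flight to next tuesday please".split()
print(mattr(utterance, window=5), yules_k(utterance), hdd(utterance))
```

In a human-likeness setting, scores like these would be computed for both the proxy's utterances and real user utterances from the same dataset and then compared, since the absolute values depend heavily on text length and tokenization.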

AI Agents Evals Benchmarks Prompt engineering

Summary generated by Claude — human-verified
