arXiv cs.AI

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Signal: 72 · Hype: 25
In three lines: ConsumerSimBench, a benchmark built from 1,553 Chinese social-media topics and 23,122 reaction criteria, evaluates whether LLMs can reconstruct real consumer reaction patterns. Gemini-3.1-Pro covers only 47.8% of criteria, revealing a major gap between technical performance and consumer intuition. A multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6%, a gain of 4.7 percentage points.
Benchmarks · Evals · Multi-agent · Reasoning

Summary generated by Claude — human-verified