arXiv cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Signal 78 · Hype 15
In three lines: LLMEval-Logic is a Chinese logical reasoning benchmark with 246 base items and 190 hard items, verified with the Z3 solver and audited by experts. In an evaluation of 14 frontier LLMs, the best score is 37.5% on hard items and 60.16% on Z3+rubric formalization.
Benchmarks · Reasoning · Evals

Summary generated by Claude — human-verified