arXiv cs.CL

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Signal: 72
Hype: 18
In three lines: Conv-to-Bench automatically converts multi-turn user-assistant dialogues into structured evaluation checklists for code tasks. The framework achieves a Spearman correlation of ρ=1.000 with BigCodeBench, with human agreement of κ=0.705 for LLM-as-a-judge evaluation.
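For readers unfamiliar with the two statistics quoted above, here is a minimal Python sketch of how they are typically computed. Everything in it is an assumption for illustration: the scores and labels are made-up toy data, not results from the paper, and κ is taken to be Cohen's kappa, which the summary does not specify.

```python
from collections import Counter


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical per-model scores (toy numbers, not from the paper).
checklist_scores = [0.41, 0.55, 0.62, 0.70]   # checklist-based benchmark
reference_scores = [30.1, 42.0, 47.5, 51.2]   # reference benchmark

# Hypothetical pass/fail verdicts on ten checklist items.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # LLM judge
human_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]  # human annotator

rho = spearman(checklist_scores, reference_scores)
kappa = cohen_kappa(judge_labels, human_labels)
print(f"rho={rho:.3f} kappa={kappa:.3f}")
```

A ρ of exactly 1 means only that the two benchmarks order the compared models identically; it says nothing about how close the raw scores are, and with few models a perfect rank match is easier to reach.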
Benchmarks · Code generation · Evals

Summary generated by Claude — human-verified