Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks
Signal: 72
Hype: 18
In three lines
Conv-to-Bench automatically converts multi-turn user-assistant dialogues about code tasks into structured evaluation checklists. The framework reaches a Spearman correlation of ρ = 1.000 with BigCodeBench, and its LLM-as-a-judge evaluation reaches human agreement of κ = 0.705.
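For readers less familiar with the two statistics quoted here, the following is a minimal sketch of how each can be computed. It assumes κ refers to Cohen's kappa between one LLM judge and one human rater, which the summary does not specify. The function names, scores, and labels are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical per-model scores: checklist-based vs. reference benchmark.
checklist_scores = [0.61, 0.48, 0.35, 0.22]
benchmark_scores = [0.55, 0.51, 0.40, 0.30]
rho = spearman(checklist_scores, benchmark_scores)

# Hypothetical pass/fail verdicts on eight checklist items.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(judge_labels, human_labels)

print(rho, kappa)
```

In this toy data both score lists order the four models identically, so ρ comes out as 1.0 even though the raw scores differ. Spearman correlation compares only rankings, which is why a value of 1.000 indicates identical model orderings rather than identical scores.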
Summary generated by Claude — human-verified