arXiv cs.CL · 28 May 2026

Disentangling Language Roles in Multilingual LLM Task Execution

In three lines: MTM-Bench, a controlled benchmark for multilingual task execution, evaluates 20 LLMs across 27 language triplets (instruction/content/response) in English, Spanish, and Chinese. Results show degradation is organized by language role in task structure, with response language as the dominant axis of variation.
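
The count of 27 matches every assignment of three languages to three roles (3 x 3 x 3). A minimal sketch of that enumeration follows, assuming the triplets are the full Cartesian product. The language codes, role names as dictionary keys, and the function name are illustrative and not taken from the paper.

```python
from itertools import product

# Assumption: short codes for English, Spanish, and Chinese.
LANGUAGES = ("en", "es", "zh")
# The three roles named in the summary.
ROLES = ("instruction", "content", "response")


def language_triplets():
    """Return every assignment of a language to each role (hypothetical helper)."""
    return [dict(zip(ROLES, combo)) for combo in product(LANGUAGES, repeat=len(ROLES))]


triplets = language_triplets()
print(len(triplets))  # 27
```

Holding two roles fixed and varying the third is what lets an effect be attributed to a single role, such as the response language.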

Benchmarks Evals

Summary generated by Claude — human-verified
