Edition of 2026-06-15

Broken evals day: systematic gender bias, coin-flip LLM judges, and one attempt at global standardization

Three papers published the same day attack the same problem from different angles: reliable LLM evaluation remains unsolved. GAMA-Bench (arXiv:2606.14068; 1,298 paired scenarios, 10 models) documents a persistent asymmetry in moral judgment: male actors consistently receive more punitive framing than female actors for identical behavior. This is not a prompt artifact, since the pattern holds across all scenario types tested. Meanwhile, the LLM-as-a-Judge study (arXiv:2606.13685) finds that GPT-4o-mini and GPT-4.1-mini flip their preference in 13.6% of cases on average, with 28% of questions exceeding a 20% flip rate and a position bias of 72% A-majority. The practical implication: 11 repeated trials are required to reach 95% confidence on a single judgment. Anyone using these models as sole judges in an eval or RLHF pipeline is introducing unquantified structural noise.
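
To get a feel for why repeated trials matter, here is a minimal sketch of majority voting over a noisy judge. It is not the paper's method. It assumes each trial independently returns the judge's modal preference with probability `p_agree` (a hypothetical parameter, for example 1 minus a per-question flip rate), so it will not necessarily reproduce the 11-trial figure, which depends on the per-question flip-rate distribution the study measured.

```python
from math import comb
from typing import Optional


def majority_correct_prob(n_trials: int, p_agree: float) -> float:
    """Probability that a strict majority of n independent trials
    returns the judge's modal preference (simple binomial model)."""
    needed = n_trials // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_trials, k) * p_agree**k * (1 - p_agree) ** (n_trials - k)
        for k in range(needed, n_trials + 1)
    )


def min_trials(p_agree: float, target: float = 0.95,
               max_trials: int = 201) -> Optional[int]:
    """Smallest odd trial count whose majority vote reaches `target`,
    or None if no count up to `max_trials` is enough."""
    for n in range(1, max_trials + 1, 2):  # odd counts avoid ties
        if majority_correct_prob(n, p_agree) >= target:
            return n
    return None


if __name__ == "__main__":
    # Assumed per-trial agreement rates, chosen only for illustration.
    for p in (0.864, 0.80, 0.70, 0.60):
        print(f"p_agree={p:.3f} -> trials needed: {min_trials(p)}")
```

The required trial count climbs quickly as per-trial agreement falls toward 50%, so questions with high flip rates dominate the cost of a reliable judgment. Position bias is a separate effect and would need its own control, such as swapping the order of the two candidates.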

Every Eval Ever (arXiv:2606.14516) offers an infrastructural response: a unified JSON schema ingesting 22,235 models and 2,273 benchmarks, with converters from existing harnesses. Its utility depends entirely on community adoption: the Hugging Face repository is open, but eval format fragmentation is a coordination problem as much as a technical one. It is worth watching whether major labs contribute to it or ignore it.

On the agents side, CacheRL (arXiv:2606.14179) is the most actionable result of the day: Qwen3-4B-Thinking trained with SFT + GRPO reaches 92% on multi-step tool-calling tasks versus 94% for GPT-5, with 100× less compute. Its three-level fuzzy cache eliminates live executions during training, making the pipeline reproducible without expensive sandbox environments. Validation reward rises from 0.43 to 0.78. For teams fine-tuning agents on specific tool-use tasks, this is a concrete architectural reference.
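
The summary above does not say what the three cache levels are, so the following is only an illustrative sketch of how a tiered lookup could replace live tool calls during training. The level names (exact, normalized, fuzzy), the class `ThreeLevelToolCache`, and the `fuzzy_threshold` parameter are all assumptions made for this example, not CacheRL's actual design.

```python
import json
from difflib import SequenceMatcher
from typing import Any, Dict, Optional, Tuple


class ThreeLevelToolCache:
    """Hypothetical tiered cache: exact match, then normalized match,
    then string-similarity match over recorded tool calls."""

    def __init__(self, fuzzy_threshold: float = 0.9) -> None:
        self.fuzzy_threshold = fuzzy_threshold
        self._exact: Dict[str, Any] = {}
        self._normalized: Dict[str, Any] = {}

    @staticmethod
    def _exact_key(tool: str, args: Dict[str, Any]) -> str:
        return json.dumps([tool, args], sort_keys=True)

    @staticmethod
    def _normalized_key(tool: str, args: Dict[str, Any]) -> str:
        # Lowercase everything and collapse whitespace in argument values.
        norm = {k.lower(): " ".join(str(v).lower().split())
                for k, v in args.items()}
        return json.dumps([tool.lower(), norm], sort_keys=True)

    def put(self, tool: str, args: Dict[str, Any], result: Any) -> None:
        """Record the result of one (previously executed) tool call."""
        self._exact[self._exact_key(tool, args)] = result
        self._normalized[self._normalized_key(tool, args)] = result

    def get(self, tool: str, args: Dict[str, Any]) -> Optional[Tuple[str, Any]]:
        """Return (level, result) for the first level that hits, else None."""
        key = self._exact_key(tool, args)
        if key in self._exact:
            return "exact", self._exact[key]

        nkey = self._normalized_key(tool, args)
        if nkey in self._normalized:
            return "normalized", self._normalized[nkey]

        # Level 3: nearest recorded call by string similarity (linear scan).
        best_key, best_score = None, 0.0
        for candidate in self._normalized:
            score = SequenceMatcher(None, nkey, candidate).ratio()
            if score > best_score:
                best_key, best_score = candidate, score
        if best_key is not None and best_score >= self.fuzzy_threshold:
            return "fuzzy", self._normalized[best_key]
        return None
```

A training loop would call `get` before executing any tool and fall back to a miss policy (for example, a fixed error observation) when it returns `None`. The linear scan at the fuzzy level is fine for a sketch but would need an index for large caches.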
