Edition of 2026-06-15

Broken evals day: systematic gender bias, coin-flip LLM judges, and one attempt at global standardization

By the editorial team

Today's 5 picks

Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

GAMA-Bench, a benchmark of 1,298 paired scenarios, reveals systematic asymmetry: LLMs apply harsher response standards to male actors than female actors for identical misconduct. Male actors receive more punitive and blame-centered framing, while female actors receive therapeutic and empathy-oriented responses. The pattern persists across 10 models and all scenario types.

Evals AI safety Alignment

arXiv cs.CL·SIG 82

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

A reliability study of LLM-as-a-Judge: GPT-4o-mini and GPT-4.1-mini show significant instability, with preference flips averaging 13.6% and 28% of questions exceeding a 20% flip rate. The judges also show position bias (72% A-majority). Cross-judge agreement is 76% (κ=0.51), and 11 repeated trials are needed for 95% confidence.

Evals GPT OpenAI

arXiv cs.AI·SIG 82
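
For readers who want to run the same kind of check on their own judge outputs, here is a minimal Python sketch of the two statistics quoted above. The function names are illustrative, and the flip-rate definition (share of repeated verdicts that disagree with the majority verdict) is an assumption; the paper may define a flip differently.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of repeated verdicts that disagree with the majority verdict.

    Assumed definition for illustration; not taken from the paper.
    """
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa: agreement between two judges, corrected for chance."""
    n = len(judge_a)
    # Observed agreement: fraction of items where both judges match.
    p_observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    count_a, count_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    p_chance = sum(count_a[label] * count_b[label] for label in labels) / n**2
    return (p_observed - p_chance) / (1 - p_chance)


# Ten repeated trials on one question: 8 prefer A, 2 prefer B.
print(flip_rate(["A"] * 8 + ["B"] * 2))

# Two judges on four questions.
print(cohen_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

Rearranging the kappa formula with the reported figures (76% agreement, κ=0.51) gives an implied chance agreement of about 0.51, which is why 76% raw agreement translates into only moderate kappa.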

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Every Eval Ever introduces a unified schema and community repository to standardize AI evaluation results. The system ingests results for 22,235 models and 2,273 benchmarks through a single JSON format, with automatic converters from popular harnesses and leaderboards. It targets the fragmentation of results scattered across incompatible formats.

Evals Benchmarks Open source

arXiv cs.CL·SIG 82

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

CacheRL trains a small agent model (Qwen3-4B-Thinking) to 92% accuracy on multi-step tool-calling tasks, close to GPT-5's 94%, with 100× less compute than GPT-5. Three innovations: a hybrid thinking trajectory pipeline with LLM-generated reasoning, a three-tier fuzzy cache that eliminates live execution costs, and cache-tier-aware rewards. SFT + GRPO improve validation reward from 0.43 to 0.78.

AI Agents Reinforcement learning Reasoning

arXiv cs.LG·SIG 82

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

Causal study on grokking: the delay before generalization depends on weight norm. Under free weight decay, networks grok at a stable critical norm Wc (CV 1–2%). When norm is clamped to ρ×Wc, delay follows T_grok ∝ exp(α·ρ) with α≈7.5 (R²=0.996 across 4 moduli). Norm controls delay 19× more than learning rate.

Reasoning Papers Benchmarks
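
To get a feel for how steep the reported delay law is, the short Python sketch below computes the ratio of grokking delays between two clamped norm ratios. It assumes only the summary's form T_grok ∝ exp(α·ρ) with α≈7.5; the function name is illustrative, and the proportionality constant cancels in the ratio.

```python
import math


def delay_ratio(rho_from, rho_to, alpha=7.5):
    """Ratio T_grok(rho_to) / T_grok(rho_from) under T_grok ∝ exp(alpha * rho).

    alpha=7.5 is the fitted value quoted in the summary above.
    """
    return math.exp(alpha * (rho_to - rho_from))


# Clamping the norm 10% above the critical norm (rho = 1.1 vs. rho = 1.0).
print(round(delay_ratio(1.0, 1.1), 2))  # 2.12
```

Under these assumptions, a 10% increase in the clamped norm roughly doubles the delay before generalization.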