Three papers published the same day attack the same problem from different angles: reliable LLM evaluation remains unsolved. GAMA-Bench (arXiv:2606.14068; 1,298 paired scenarios, 10 models) documents a persistent asymmetry in moral judgment: male actors consistently receive more punitive framing than female actors for identical behavior. This is not a prompt artifact: the pattern holds across all scenario types tested. Meanwhile, the LLM-as-a-Judge study (arXiv:2606.13685) finds that GPT-4o-mini and GPT-4.1-mini flip their preference in 13.6% of cases on average, that 28% of questions exceed a 20% flip rate, and that the judges show position bias (72% A-majority). The practical implication: 11 repeated trials are required to reach 95% confidence on a single judgment. Anyone using these models as sole judges in an eval or RLHF pipeline is introducing unquantified structural noise.
Every Eval Ever (arXiv:2606.14516) offers an infrastructural response: a unified JSON schema ingesting 22,235 models and 2,273 benchmarks, with converters from existing harnesses. Its utility depends entirely on community adoption: the Hugging Face repository is open, but eval format fragmentation is a coordination problem as much as a technical one. It is worth watching whether major labs contribute to it or ignore it.
On the agents side, CacheRL (arXiv:2606.14179) is the most actionable result of the day: Qwen3-4B-Thinking trained with SFT + GRPO reaches 92% on multi-step tool-calling tasks versus GPT-5's 94%, with 100× less compute. The three-tier fuzzy cache eliminates live executions during training, making the pipeline reproducible without expensive sandbox environments. Validation reward rises from 0.43 to 0.78. For teams fine-tuning agents on specific tool-use tasks, this is a concrete architectural reference.
GAMA-Bench, a benchmark of 1,298 paired scenarios, reveals a systematic asymmetry: LLMs apply harsher response standards to male actors than to female actors for identical misconduct. Male actors receive more punitive and blame-centered framing, while female actors receive therapeutic and empathy-oriented responses. The pattern persists across 10 models and all scenario types.
A reliability study of LLM-as-a-Judge finds that GPT-4o-mini and GPT-4.1-mini are significantly unstable: preferences flip in 13.6% of cases on average, and 28% of questions exceed a 20% flip rate. The judges also show position bias (72% A-majority), and cross-judge agreement is 76% (κ=0.51). Reaching 95% confidence on a single judgment takes 11 repeated trials.
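To make these quantities concrete, here is a minimal Python sketch of how a per-question flip rate, a position-swap check, and a repeated-trial budget could be computed. It is an illustration, not the paper's method: the function names are hypothetical, and `majority_reliability` assumes independent trials with a fixed per-trial probability `p` of returning the judge's modal verdict. That assumption is ours, so the sketch is not expected to reproduce the paper's figure of 11 trials.

```python
from collections import Counter
from math import comb

def flip_rate(votes):
    """Fraction of repeated judgments that disagree with the majority verdict."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1 - majority_count / len(votes)

def swap_consistent(judge, a, b):
    """Ask the judge in both orders; return the preferred answer, or None
    if the verdict changes with position (a sign of position bias).
    `judge(x, y)` is a hypothetical callable returning 0 if it prefers x, else 1."""
    pick_forward = a if judge(a, b) == 0 else b
    pick_reverse = b if judge(b, a) == 0 else a
    return pick_forward if pick_forward == pick_reverse else None

def majority_reliability(p, n):
    """Probability that a majority of n independent trials (n odd) returns the
    modal verdict, when each trial returns it with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

def trials_needed(p, target=0.95):
    """Smallest odd number of trials whose majority vote reaches `target`."""
    if p <= 0.5:
        raise ValueError("majority voting cannot help when p <= 0.5")
    n = 1
    while majority_reliability(p, n) < target:
        n += 2
    return n

print(flip_rate(["A", "A", "B", "A"]))  # one of four votes disagrees
print(trials_needed(0.8))               # budget under the independence assumption
```

The closer `p` is to 0.5, the faster the required number of trials grows, which is why unstable questions dominate the cost of a judging pipeline.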
Every Eval Ever introduces a unified schema and community repository to standardize AI evaluation results. The system ingests 22,235 models and 2,273 benchmarks through a single JSON format, with automatic converters from popular harnesses and leaderboards. It targets the fragmentation of results scattered across incompatible formats.
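The converter idea can be sketched in a few lines of Python. Everything below is hypothetical: the `EvalRecord` fields, the two input shapes, and the function names are invented for illustration and are not the actual Every Eval Ever schema or its converters. The point is only the pattern of mapping differently shaped results onto one record type.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalRecord:
    # Hypothetical fields, NOT the real Every Eval Ever schema.
    model: str
    benchmark: str
    metric: str
    value: float   # always stored as a fraction in [0, 1]
    source: str

def from_harness_a(raw):
    """Hypothetical harness shape: {'model': ..., 'results': {benchmark: {metric: value}}}."""
    return [EvalRecord(raw["model"], bench, metric, float(value), "harness_a")
            for bench, metrics in raw["results"].items()
            for metric, value in metrics.items()]

def from_leaderboard_b(rows):
    """Hypothetical leaderboard shape: rows of {'name', 'task', 'score'}, score in percent."""
    return [EvalRecord(row["name"], row["task"], "accuracy",
                       row["score"] / 100.0, "leaderboard_b")
            for row in rows]

# Made-up inputs in the two hypothetical shapes.
harness_output = {"model": "model-x", "results": {"bench-1": {"accuracy": 0.71}}}
leaderboard_rows = [{"name": "model-x", "task": "bench-2", "score": 64.0}]

records = from_harness_a(harness_output) + from_leaderboard_b(leaderboard_rows)
print(json.dumps([asdict(r) for r in records], indent=2))
```

Normalizing units at conversion time (percent versus fraction here) is the kind of detail a shared schema has to pin down; otherwise the merged records are comparable in shape only.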
CacheRL trains small agent models (Qwen3-4B-Thinking) that reach 92% accuracy on multi-step tool-calling tasks with 100× less compute than GPT-5 (94%). It combines three innovations: a hybrid thinking-trajectory pipeline with LLM-generated reasoning, a three-tier fuzzy cache that eliminates live execution costs, and cache-tier-aware rewards. SFT + GRPO improve validation reward from 0.43 to 0.78.
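The summary does not spell out how the three tiers are defined, so the following Python sketch is one plausible reading, not CacheRL's implementation. It assumes the tiers are exact match, match after normalization, and fuzzy string similarity. The class name, the similarity threshold, and the per-tier reward weights are all hypothetical.

```python
import json
from difflib import SequenceMatcher

# Hypothetical reward weights per cache tier (assumed, not from the paper).
TIER_WEIGHT = {"exact": 1.0, "normalized": 0.9, "fuzzy": 0.6, "miss": 0.0}

def canonical(tool, args):
    """Stable string key for a tool call."""
    return json.dumps({"tool": tool, "args": args}, sort_keys=True)

def normalized(tool, args):
    """Key that ignores case and extra whitespace in the tool name and argument values."""
    norm_args = {k: " ".join(str(v).lower().split()) for k, v in args.items()}
    return canonical(tool.lower(), norm_args)

class TieredCache:
    def __init__(self, fuzzy_threshold=0.85):
        self.exact = {}
        self.norm = {}
        self.fuzzy_threshold = fuzzy_threshold

    def put(self, tool, args, result):
        self.exact[canonical(tool, args)] = result
        self.norm[normalized(tool, args)] = result

    def get(self, tool, args):
        """Return (tier, result), trying the cheapest and strictest tier first."""
        key = canonical(tool, args)
        if key in self.exact:
            return "exact", self.exact[key]
        nkey = normalized(tool, args)
        if nkey in self.norm:
            return "normalized", self.norm[nkey]
        # Tier 3: closest stored call by string similarity.
        best_key, best_score = None, 0.0
        for candidate in self.norm:
            score = SequenceMatcher(None, nkey, candidate).ratio()
            if score > best_score:
                best_key, best_score = candidate, score
        if best_key is not None and best_score >= self.fuzzy_threshold:
            return "fuzzy", self.norm[best_key]
        return "miss", None

def tier_aware_reward(task_score, tier):
    """Discount the reward when the observation came from a looser cache tier."""
    return task_score * TIER_WEIGHT[tier]

cache = TieredCache()
cache.put("search", {"q": "weather in Paris"}, "sunny")
for call in [("search", {"q": "weather in Paris"}),
             ("Search", {"q": " Weather  in paris "}),
             ("search", {"q": "weather in paris today"}),
             ("calculator", {"expr": "2+2"})]:
    tier, result = cache.get(*call)
    print(call, "->", tier, result, tier_aware_reward(1.0, tier))
```

Discounting by tier reflects the intuition that a fuzzy hit may return an observation that does not exactly correspond to the call the agent made, so the policy should not be rewarded as if it had executed live.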
A causal study of grokking finds that the delay before generalization depends on the weight norm. Under free weight decay, networks grok at a stable critical norm Wc (CV 1–2%). When the norm is clamped to ρ×Wc, the delay follows T_grok ∝ exp(α·ρ) with α≈7.5 (R²=0.996 across 4 moduli). The norm controls the delay 19× more strongly than the learning rate does.
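As a quick worked example of what the exponential law implies, the Python sketch below computes the ratio of grokking delays between two clamp values and shows how α could be recovered from a log-linear fit. The function names are ours, and the delays in the fit are synthetic values generated from the law itself, not data from the paper.

```python
import math

ALPHA = 7.5  # exponent reported in the summary above

def delay_ratio(rho_a, rho_b, alpha=ALPHA):
    """T_grok(rho_b) / T_grok(rho_a) under T_grok ∝ exp(alpha * rho).
    The unknown proportionality constant cancels in the ratio."""
    return math.exp(alpha * (rho_b - rho_a))

def fit_alpha(rhos, delays):
    """Least-squares slope of log(delay) against rho."""
    logs = [math.log(t) for t in delays]
    mean_r = sum(rhos) / len(rhos)
    mean_l = sum(logs) / len(logs)
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(rhos, logs))
    var = sum((r - mean_r) ** 2 for r in rhos)
    return cov / var

# Clamping the norm 10% higher (rho 1.0 -> 1.1) multiplies the delay by exp(0.75).
print(round(delay_ratio(1.0, 1.1), 2))  # 2.12

# Synthetic delays generated from the law with an arbitrary prefactor.
rhos = [0.8, 0.9, 1.0, 1.1, 1.2]
delays = [100.0 * math.exp(ALPHA * r) for r in rhos]
print(round(fit_alpha(rhos, delays), 3))  # 7.5
```

With α≈7.5, a 10% increase in the clamped norm roughly doubles the delay, and a 50% increase multiplies it by exp(3.75), or about 42.5.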