arXiv cs.CL·15 June 2026

CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Signal

Hype

In three linesCacheRL trains small agent models (Qwen3-4B-Thinking) achieving 92% accuracy on multi-step tool-calling tasks with 100× less compute than GPT-5 (94%). Three innovations: hybrid thinking trajectory pipeline with LLM-generated reasoning, three-tier fuzzy cache eliminating live execution costs, cache-tier-aware rewards. SFT + GRPO improve validation reward from 0.43 to 0.78.

Read source

Your take?

AI Agents Reinforcement learning Reasoning Fine-tuning Benchmarks

Summary generated by Claude — human-verified

CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Other angles on this story