arXiv cs.CL

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Signal: 82
Hype: 25
In three lines
CacheRL trains a small agent model (Qwen3-4B-Thinking) that reaches 92% accuracy on multi-step tool-calling tasks, compared with 94% for GPT-5, while using 100× less compute. It rests on three innovations: a hybrid thinking-trajectory pipeline with LLM-generated reasoning, a three-tier fuzzy cache that eliminates live execution costs, and cache-tier-aware rewards. SFT combined with GRPO improves validation reward from 0.43 to 0.78.
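To make the cache and reward ideas concrete, here is a minimal sketch of how a tiered tool-call cache and a tier-aware reward could fit together. The summary does not say what the three tiers are or how rewards depend on them, so everything specific here is an assumption for illustration: the tier names (`exact`, `normalized`, `fuzzy`), the `TIER_WEIGHT` values, the `fuzzy_threshold`, and the `TieredCache` and `tier_aware_reward` names are all hypothetical and not taken from the paper.

```python
import difflib
import json

# Hypothetical weights: less reliable cache tiers earn a smaller reward.
TIER_WEIGHT = {"exact": 1.0, "normalized": 0.9, "fuzzy": 0.7, "miss": 0.0}


def normalize(call: dict) -> str:
    """Canonical key: lowercased tool name, trimmed and lowercased string args."""
    args = {
        k: v.strip().lower() if isinstance(v, str) else v
        for k, v in call["args"].items()
    }
    return json.dumps({"tool": call["tool"].lower(), "args": args}, sort_keys=True)


class TieredCache:
    """Looks up a tool call by exact key, then normalized key, then fuzzy match."""

    def __init__(self, fuzzy_threshold: float = 0.85):
        self.exact = {}
        self.normalized = {}
        self.fuzzy_threshold = fuzzy_threshold  # assumed value, not from the paper

    def put(self, call: dict, result):
        self.exact[json.dumps(call, sort_keys=True)] = result
        self.normalized[normalize(call)] = result

    def get(self, call: dict):
        """Return (tier, result); tier is 'miss' when nothing is close enough."""
        key = json.dumps(call, sort_keys=True)
        if key in self.exact:
            return "exact", self.exact[key]
        norm = normalize(call)
        if norm in self.normalized:
            return "normalized", self.normalized[norm]
        # Fuzzy tier: pick the most similar cached key by string similarity.
        best_key, best_score = None, 0.0
        for candidate in self.normalized:
            score = difflib.SequenceMatcher(None, norm, candidate).ratio()
            if score > best_score:
                best_key, best_score = candidate, score
        if best_key is not None and best_score >= self.fuzzy_threshold:
            return "fuzzy", self.normalized[best_key]
        return "miss", None


def tier_aware_reward(task_score: float, tier: str) -> float:
    """Scale a task-level score by how trustworthy the cache hit was."""
    return task_score * TIER_WEIGHT[tier]
```

In a rollout, each tool call issued by the policy would be answered by `cache.get(call)` rather than a live execution, and the returned tier would feed into `tier_aware_reward` so that trajectories relying on looser matches are credited less. A real system would also need a policy for cache misses, which this sketch leaves out.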
AI Agents · Reinforcement learning · Reasoning · Fine-tuning · Benchmarks

Summary generated by Claude — human-verified