Tongyi DeepResearch Technical Report
In three lines: Tongyi DeepResearch is an agentic LLM with 30.5 billion parameters (3.3 billion activated per token) designed for long-horizon deep research tasks. Trained via agentic mid-training and post-training with automatic data synthesis, it achieves state-of-the-art on 7 benchmarks including Humanity's Last Exam and BrowseComp. Model and framework are open-sourced.
## Tongyi DeepResearch: anatomy of a 30.5B-parameter research agent
### 1. What's being announced
Alibaba Cloud releases the technical report for Tongyi DeepResearch, an agentic mixture-of-experts (MoE) LLM with **30.5 billion total parameters, only 3.3 billion activated per token**. Because each token is routed through only a fraction of the experts, per-token compute is close to that of a dense 3B model, although all 30.5B parameters must still be held in memory. The model, training framework, and complete benchmark solutions are open-sourced, which is uncommon for a system reaching this performance level on agentic tasks.
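The gap between total and activated parameters can be made concrete with a little arithmetic. The sketch below uses only the two parameter counts quoted above; the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, assumed here for illustration and not taken from the report.

```python
# Parameter counts as quoted in this summary.
TOTAL_PARAMS = 30.5e9
ACTIVE_PARAMS = 3.3e9


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def forward_flops_per_token(active: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter (assumed rule of thumb)."""
    return 2.0 * active


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {frac:.1%}")
print(f"dense 30.5B vs MoE compute per token: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions, roughly 11% of the parameters are touched per token, and a dense model of the same total size would need about nine times the per-token compute. Real throughput also depends on routing overhead, attention cost, and memory bandwidth, which this estimate ignores.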
Training follows a two-phase pipeline: **agentic mid-training** (adapting the base model to agentic behaviors) followed by **agentic post-training** (alignment and reinforcement on long-horizon research tasks). Data synthesis is fully automatic, with no human annotation, making the pipeline scalable and reproducible.
### 2. The numbers that matter
Tongyi DeepResearch claims state-of-the-art on **7 benchmarks simultaneously**:
- **Humanity's Last Exam (HLE)**: one of the community's most discriminating benchmarks, designed to resist current LLMs with expert-level questions across dozens of domains. Claiming SOTA here implies outperforming previously reported systems on tasks requiring multi-step reasoning and information retrieval.
- **BrowseComp and BrowseComp-ZH**: BrowseComp is OpenAI's benchmark measuring the ability to navigate the web to answer complex questions; BrowseComp-ZH is its Chinese-language counterpart, indicating multilingual coverage that is more than cosmetic.
- **WebWalkerQA**: multi-hop web navigation with structured information extraction.
- **FRAMES**: factual reasoning with multi-document retrieval.
- **xbench-DeepSearch and xbench-DeepSearch-2510**: two versions of the xbench deep-search benchmark, one dated October 2025, suggesting continuous evaluation on recent data.
The absence of absolute figures in the abstract (exact scores not disclosed here) is standard for arXiv technical reports — full tables are in the paper body.
### 3. Why the MoE architecture changes the calculus
Before this announcement, the open-source research agent landscape was dominated by dense models (Llama 3.1 70B, Qwen2.5 72B) or proprietary systems (Perplexity, You.com, Gemini and ChatGPT Deep Research modes). The 30.5B total / 3.3B active ratio puts Tongyi DeepResearch in a distinct category: **small-model inference cost, large-model capability**.
In practice, for self-hosted deployment, the full 30.5B parameters must still reside in memory, so the hardware footprint matches that of a dense 30B model (for example, a server with 2× A100 80GB). The gain is in per-token compute and latency, since only 3.3B parameters are active for each token. For teams building automated research pipelines, this represents a meaningful operational cost shift.
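To see why the hardware footprint is set by total rather than active parameters, here is a rough estimate of weight memory at a few precisions against the 2× A100 80GB configuration mentioned above. The precisions listed are illustrative assumptions, not formats the report is known to ship, and the estimate deliberately ignores KV cache, activations, and framework overhead.

```python
TOTAL_PARAMS = 30.5e9   # all experts must be resident, not just the active ones
GPU_MEMORY_GB = 80      # per-GPU memory of an A100 80GB
NUM_GPUS = 2

# Bytes per parameter for a few common storage precisions (illustrative).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


available_gb = GPU_MEMORY_GB * NUM_GPUS
for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    headroom_gb = available_gb - weights_gb
    print(f"{precision}: weights {weights_gb:.1f} GB, headroom {headroom_gb:.1f} GB")
```

At bf16 the weights alone come to about 61 GB, which is why a multi-GPU server leaves comfortable room for the long contexts that research sessions tend to accumulate, while the active-parameter count has no effect on this figure.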
### 4. Potential losers and limitations
**Perplexity and proprietary AI search engines** are the most exposed. If an open-source model of this class genuinely achieves SOTA on BrowseComp, the benchmark designed precisely to evaluate deep web research, the value proposition of proprietary subscriptions at around $20/month erodes.
**Teams using classical RAG pipelines** (retrieve-then-read with non-specialized models) will need to justify their architecture against a model trained end-to-end for information seeking.
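The difference between single-pass retrieve-then-read and an iterative search loop can be shown with a toy example. Everything in this sketch is invented for illustration: the three-sentence corpus, the `search` helper, and the regular expressions that stand in for a model's reading step. It is not Tongyi DeepResearch's method, only a minimal picture of why a multi-hop question can defeat a single retrieval pass.

```python
import re
from typing import Optional

# Hypothetical corpus: answering "where was the first Nile Prize winner born?"
# requires chaining two documents.
CORPUS = [
    "The Nile Prize was first awarded to Ada Moreau.",
    "Ada Moreau was born in Lyria.",
    "Lyria is known for its glass industry.",
]

BORN = re.compile(r"born in (\w+)")
AWARDED = re.compile(r"awarded to ([A-Z]\w+ [A-Z]\w+)")


def search(term: str) -> list[str]:
    """Toy search tool: documents that mention the term (case-insensitive)."""
    return [doc for doc in CORPUS if term.lower() in doc.lower()]


def read_birthplace(docs: list[str]) -> Optional[str]:
    """Toy reader: extract a birthplace if any retrieved document states one."""
    for doc in docs:
        match = BORN.search(doc)
        if match:
            return match.group(1)
    return None


def retrieve_then_read(term: str) -> Optional[str]:
    """Classical single pass: one retrieval, one read."""
    return read_birthplace(search(term))


def iterative_search(term: str, max_steps: int = 3) -> Optional[str]:
    """Multi-step loop: if the answer is missing, follow a lead and search again."""
    for _ in range(max_steps):
        docs = search(term)
        answer = read_birthplace(docs)
        if answer is not None:
            return answer
        follow_up = None
        for doc in docs:
            match = AWARDED.search(doc)
            if match:
                follow_up = match.group(1)
                break
        if follow_up is None:
            return None
        term = follow_up  # reformulate the query around the newly found entity
    return None


print(retrieve_then_read("Nile Prize"))  # None: the first hop has no birthplace
print(iterative_search("Nile Prize"))    # Lyria: found on the second hop
```

The single pass stops at the document that names the winner, while the loop reformulates the query and reaches the second document. A trained research agent replaces the hard-coded patterns with learned decisions about what to search for next and when to stop.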
**Limitations to watch**: the report mentions customized environments for each training stage, which implies non-trivial infrastructure complexity for anyone trying to reproduce training. Open-sourcing the weights does not guarantee that the full pipeline is reproducible. Additionally, the two xbench benchmarks are less widely reported than HLE and BrowseComp, so results on them are harder to cross-check. Performance on HLE and BrowseComp is more convincing, but absolute scores need verification in the full paper.
Because the model is optimized for **long-horizon** tasks (multi-step research sessions, not one-shot queries), short benchmarks do not necessarily capture its real value; conversely, its performance on short tasks may be suboptimal compared to generalist models.
Summary generated by Claude — human-verified