Overview
Alibaba’s Tongyi Lab has released Tongyi-DeepResearch-30B-A3B, an open-source large language model (LLM) built for agent-driven, in-depth information retrieval and analysis using web-based tools. The model employs a mixture-of-experts (MoE) architecture with approximately 30.5 billion total parameters, of which about 3.3 billion are active per token, giving it the inference cost of a much smaller dense model while retaining robust reasoning capability. It is optimized for complex, multi-turn research tasks such as searching, browsing, extracting, cross-verifying, and synthesizing information, through ReAct-style tool integration and enhanced test-time scaling. The release includes model weights under the Apache-2.0 license, inference scripts, and evaluation tools.
Performance Benchmarks and Evaluation
Tongyi DeepResearch demonstrates leading-edge performance on several agentic search benchmarks commonly used to evaluate deep research capabilities:
- Humanity’s Last Exam (HLE): 32.9
- BrowseComp: 43.4 (English) and 46.7 (Chinese)
- xbench-DeepSearch: 75
Additionally, it shows competitive results on WebWalkerQA, GAIA, FRAMES, and SimpleQA datasets. The developers report that Tongyi-DeepResearch matches or surpasses the performance of proprietary and open-source agents, including those developed by OpenAI, across these challenging tasks.
Model Architecture and Inference Capabilities
- MoE Routing Inspired by Qwen3-MoE: Incorporates roughly 30.5 billion parameters with about 3.3 billion active per token, delivering the efficiency of a smaller dense model while leveraging expert specialization.
- Extended Context Window: Supports up to 128,000 tokens, enabling prolonged browsing sessions and iterative information synthesis.
- Two Inference Modes:
  - ReAct Mode: the native mode, used to evaluate intrinsic reasoning and tool interaction.
  - IterResearch “Heavy” Mode: a test-time scaling approach that structures multi-round synthesis and context reconstruction to minimize error propagation and noise accumulation.
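The ReAct pattern underlying the native mode is a plain thought/action/observation loop. The sketch below illustrates that loop with a mocked search tool and a stubbed policy; the tool names, the `tool[argument]` action syntax, and the stop marker are illustrative assumptions, not the actual Tongyi DeepResearch interface.

```python
# Minimal ReAct-style agent loop. The tool registry, action syntax,
# and policy stub are illustrative assumptions, not the real model API.

def mock_search(query: str) -> str:
    """Stand-in web-search tool returning a canned snippet."""
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "No results found.")

TOOLS = {"search": mock_search}

def mock_policy(transcript: str) -> str:
    """Stand-in for the LLM: emits one thought/action per call."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: The evidence is sufficient.\nFinal Answer: Paris"

def react_loop(question: str, max_turns: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = mock_policy(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        # Parse "Action: tool[argument]" and execute the tool.
        action = step.split("Action:")[1].strip()
        tool, arg = action.split("[", 1)
        observation = TOOLS[tool.strip()](arg.rstrip("]"))
        transcript += f"\nObservation: {observation}"
    return "No answer within turn budget."

print(react_loop("What is the capital of France?"))  # → Paris
```

In the real system the stubbed policy is the model itself and the tools are live search and browsing endpoints; the loop structure is what the two inference modes share and extend.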
Training Methodology: Synthetic Data and Reinforcement Learning
Tongyi DeepResearch is developed as a fully autonomous agent rather than a conventional chat model, utilizing a scalable, automated data generation pipeline:
- Agentic Continual Pre-Training (CPT): Utilizes large-scale synthetic trajectories derived from curated datasets, historical tool usage logs, and graph-based knowledge to enhance retrieval, browsing, and multi-source data fusion skills.
- Agentic Supervised Fine-Tuning (SFT): Employs ReAct and IterResearch formatted trajectories to instill consistent planning and tool utilization schemas.
- On-Policy Reinforcement Learning: Implements Group Relative Policy Optimization (GRPO) with token-level policy gradients, leave-one-out advantage estimation, and negative-sample filtering to stabilize learning in dynamic web environments.
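The leave-one-out advantage estimate in this GRPO variant can be sketched numerically: each rollout in a group is scored against the mean reward of its siblings, and a portion of the negative samples is dropped before the gradient step. The rewards, group size, and `keep_ratio` threshold below are invented for illustration; this is not Tongyi's training code.

```python
# Sketch of GRPO-style leave-one-out advantages with negative-sample
# filtering. All numbers and hyperparameters are illustrative.

def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """A_i = r_i - mean(r_j for j != i): each rollout is judged
    against the average of the other rollouts in its group."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def filter_negatives(rollouts, advantages, keep_ratio=0.5):
    """Drop the most negative rollouts so that noisy failures from a
    live web environment do not dominate the policy gradient.
    keep_ratio is an assumed hyperparameter, not a published value."""
    negatives = [i for i, a in enumerate(advantages) if a < 0]
    negatives.sort(key=lambda i: advantages[i])  # most negative first
    drop = set(negatives[: int(len(negatives) * (1 - keep_ratio))])
    return [(rollouts[i], advantages[i])
            for i in range(len(rollouts)) if i not in drop]

# Four rollouts of the same research task with scalar outcome rewards:
rewards = [1.0, 0.0, 0.5, 0.0]
adv = leave_one_out_advantages(rewards)
kept = filter_negatives(["t0", "t1", "t2", "t3"], adv)
```

Because the baseline for each rollout excludes its own reward, a single successful trajectory still receives a clearly positive advantage even when most of its group fails, which helps stabilize on-policy updates.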
Application in Document and Web-Based Research
Deep research tasks demand four critical competencies: (1) long-term strategic planning, (2) iterative retrieval and cross-verification from multiple sources, (3) meticulous evidence tracking with minimal hallucinations, and (4) synthesis over extensive contexts. The IterResearch rollout refines context after each iteration by preserving only essential information, effectively reducing context overload and error accumulation. Meanwhile, the ReAct baseline confirms that these behaviors are learned intrinsically rather than engineered through prompts. The model’s strong performance on HLE and BrowseComp benchmarks indicates enhanced robustness in handling multi-hop, tool-assisted queries, overcoming limitations seen in previous agents that often overfit prompt patterns or plateau at shallow reasoning depths.
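The round-by-round workspace reconstruction described above can be sketched as follows. The compression heuristic here (fold each round's first key sentence into an evolving report and discard older raw observations) is an assumption about the general shape of IterResearch, not the published algorithm; in the real system the model itself performs the compression.

```python
# Sketch of IterResearch-style context reconstruction: each round's
# workspace is rebuilt from a compact evolving report plus only the
# newest evidence, rather than an ever-growing append-only history.
# The summarizer and round data are illustrative stand-ins.

def summarize(report: str, new_evidence: str) -> str:
    """Stand-in for the model folding new findings into the report:
    here we naively keep only the first sentence of the evidence."""
    key_fact = new_evidence.split(".")[0]
    return (report + " | " + key_fact).strip(" |")

def iter_research(question: str, rounds: list[str]):
    report, workspace_sizes = "", []
    for evidence in rounds:
        # Rebuild the workspace each round: question + compact report +
        # newest evidence only. Older raw observations are discarded,
        # which bounds context growth and limits noise accumulation.
        workspace = f"Q: {question}\nReport: {report}\nEvidence: {evidence}"
        workspace_sizes.append(len(workspace))
        report = summarize(report, evidence)
    return report, workspace_sizes

report, sizes = iter_research(
    "Who founded X?",
    ["Alice founded X in 1999. Page boilerplate and navigation text...",
     "X was later acquired in 2005. More boilerplate..."],
)
```

The design point is that the workspace size tracks the (small) report rather than the full trajectory, which is what keeps long-horizon runs inside the context window.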
Distinctive Attributes of Tongyi DeepResearch-30B-A3B
- Efficient MoE Architecture: Approximately 30.5 billion parameters with about 3.3 billion active per token, combining the inference cost of smaller models with the capacity of larger ones.
- Massive 128K Token Context Window: Facilitates extended, multi-step web research with comprehensive evidence accumulation.
- Dual Inference Frameworks: Native ReAct for direct tool-use evaluation and IterResearch “Heavy” mode for deeper, multi-round synthesis during test time.
- Automated Agentic Data Pipeline: Fully automated system supporting continual pre-training, supervised fine-tuning, and reinforcement learning.
- Advanced On-Policy RL with GRPO: Incorporates token-level policy gradients, leave-one-out advantage estimation, and selective negative-sample filtering to ensure stable learning.
- State-of-the-Art Benchmark Scores: HLE 32.9, BrowseComp 43.4 (English) / 46.7 (Chinese), xbench-DeepSearch 75, alongside strong results on WebWalkerQA, GAIA, FRAMES, and SimpleQA.
Conclusion
Tongyi DeepResearch-30B-A3B integrates a sophisticated MoE architecture (~30.5 billion parameters total, ~3.3 billion active per token), an expansive 128K-token context window, dual inference modes (ReAct and IterResearch), and a comprehensive automated training pipeline featuring GRPO-based reinforcement learning. This open-source release offers a practical and cost-effective option for teams building long-horizon research agents, delivering competitive performance on demanding deep-research benchmarks.
