
Tongyi DeepResearch is a 30B MoE model that is open-source and rivals OpenAI DeepResearch.


Advancing from Chatbots to Fully Autonomous AI Agents

The Tongyi DeepResearch project marks a significant milestone as the first open-source web agent capable of matching the performance of leading proprietary systems such as OpenAI DeepResearch. Our work introduces a comprehensive, rigorously tested framework for developing sophisticated autonomous agents. Central to this framework are innovations such as Agentic Continual Pre-training (CPT) and supervised fine-tuning, which together enable the creation of highly capable AI agents.

Leveraging Synthetic Data for Continuous Learning and Post-Training Enhancement

Continuous Pre-training with Diverse Data Sources

Our approach to Agentic CPT involves the systematic aggregation of data from a variety of sources, including extensive document corpora, publicly crawled datasets, and structured knowledge graphs. We also incorporate historical interaction trajectories to enrich the training material.

Data Reorganization and Question Generation: We continuously refine and restructure this data, generating complex question sets that challenge the agent’s reasoning abilities. This dynamic data curation ensures the model is exposed to a broad spectrum of scenarios.

Action Synthesis for Enhanced Decision-Making: By synthesizing both first-order and higher-order action sequences derived from diverse problem sets and past trajectories, we enable the agent to explore a vast reasoning-action space offline. This eliminates reliance on costly external API calls. Higher-order synthesis reframes trajectories as multi-step decision processes, significantly boosting the agent’s strategic planning capabilities.
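The distinction between first-order and higher-order synthesis can be illustrated with a minimal sketch. This is not the project's actual pipeline; the `Step` record and the history-encoding format are hypothetical, chosen only to show how the same trajectory yields single-step pairs versus history-conditioned, multi-step decision examples:

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: str
    action: str

def first_order_examples(trajectory):
    """Each step becomes an independent (observation -> action) pair."""
    return [(s.observation, s.action) for s in trajectory]

def higher_order_examples(trajectory):
    """Reframe the trajectory as a multi-step decision process:
    the target action is conditioned on the full history so far."""
    examples = []
    history = []
    for s in trajectory:
        context = " | ".join(history + [s.observation])
        examples.append((context, s.action))
        history.append(f"{s.observation} -> {s.action}")
    return examples

traj = [Step("page A", "click link"), Step("page B", "extract fact")]
pairs = first_order_examples(traj)        # two independent examples
chains = higher_order_examples(traj)      # second example sees step one
```

Because both views are derived offline from stored trajectories, no external API calls are needed to expand the reasoning-action space.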

Post-Training with High-Quality Synthetic QA Pairs

To push the boundaries of agent performance, we developed an end-to-end pipeline for generating synthetic question-answer pairs. This pipeline evolved from early techniques like reverse-engineering QA pairs from clickstream data to advanced graph-based synthesis methods, culminating in formal task modeling frameworks. Our current system guarantees exceptional data quality and scalability.

We construct a richly interconnected knowledge graph and apply random walks and isomorphic transformations of tabular data, which reduces inconsistencies in both information and reasoning structures. This formalism allows rigorous verification of QA correctness and efficient validation of synthetic datasets.
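The core idea of walk-based QA synthesis, and of why it is verifiable, can be sketched in a few lines. The toy graph, relation names, and question template below are illustrative assumptions, not the production system; the point is that a question generated from a recorded walk can be checked mechanically by re-traversing the same edges:

```python
import random

# toy knowledge graph: entity -> {relation: entity}
graph = {
    "Marie Curie": {"born_in": "Warsaw", "field": "physics"},
    "Warsaw": {"capital_of": "Poland"},
    "Poland": {"continent": "Europe"},
}

def random_walk(graph, start, length, rng):
    """Follow random outgoing edges, recording the relations taken."""
    node, relations = start, []
    for _ in range(length):
        edges = graph.get(node)
        if not edges:
            break
        rel = rng.choice(sorted(edges))
        relations.append(rel)
        node = edges[rel]
    return relations, node

def synthesize_qa(graph, start, hops, seed=0):
    """Turn a walk into a multi-hop question whose answer is the endpoint."""
    rels, answer = random_walk(graph, start, hops, random.Random(seed))
    question = (f"Starting from {start}, follow "
                + ", then ".join(rels) + ". Where do you end up?")
    return question, rels, answer

def verify(graph, start, rels, answer):
    """QA correctness check: replay the walk deterministically."""
    node = start
    for rel in rels:
        node = graph[node][rel]
    return node == answer
```

Because every synthetic pair carries its generating walk, validation reduces to a cheap deterministic traversal rather than human review.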

Additionally, an automated data generator produces PhD-level research questions by leveraging a multidisciplinary knowledge base, thereby enriching the agent’s reasoning diversity.

Bootstrapping Reasoning with Structured Frameworks: Initial model capabilities are enhanced using multi-turn reasoning frameworks like ReAct, which enforce adherence to structured Thought-Action-Observation cycles, reinforcing disciplined problem-solving.
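The Thought-Action-Observation cycle that ReAct enforces can be sketched as a simple control loop. The `llm` and `tools` interfaces and the `Action: tool[argument]` / `Final Answer:` line format below are assumptions for illustration (one common ReAct convention), not the model's actual prompt protocol:

```python
def react_loop(llm, tools, task, max_turns=8):
    """Minimal Thought-Action-Observation cycle.
    Assumed interfaces: llm(transcript) -> text with 'Action: name[arg]'
    or 'Final Answer: ...' lines; tools maps name -> callable."""
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        reply = llm(transcript)          # Thought + Action (or Final Answer)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        line = next(l for l in reply.splitlines() if l.startswith("Action:"))
        name, arg = line[len("Action:"):].strip().rstrip("]").split("[", 1)
        observation = tools[name.strip()](arg)   # execute the chosen tool
        transcript += f"Observation: {observation}\n"
    return None

# usage with a scripted stand-in for the model
script = iter([
    "Thought: I should search.\nAction: search[capital of France]",
    "Thought: I have the answer.\nFinal Answer: Paris",
])
answer = react_loop(lambda _: next(script),
                    {"search": lambda q: "Paris is the capital."},
                    "What is the capital of France?")
```

The loop's rigidity is the point: the model may only act through explicitly declared tool calls, which makes each decision auditable during training.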

Introducing IterResearch: This novel agent paradigm dynamically constructs streamlined workspaces at each interaction step, maximizing reasoning potential. IterResearch integrates reasoning, planning, and tool use into cohesive trajectories, enabling sustained and adaptive planning in complex environments.

Flexible Interaction Modes for Enhanced Agent Performance

Native ReAct Mode

Our model excels in the native ReAct reasoning paradigm without requiring prompt engineering. It follows the Thought-Action-Observation loop through multiple iterations to resolve complex tasks. With an extended context window of up to 128K tokens, it supports extensive interaction rounds, demonstrating scalable and robust environmental engagement. This mode serves as a clear benchmark for evaluating intrinsic model capabilities and training efficacy.

Heavy Mode: Tackling Cognitive Saturation

Heavy Mode, built on the IterResearch framework, addresses challenges like cognitive overload and noise accumulation that arise when agents process all information simultaneously. In each iteration, the agent reconstructs a focused workspace containing only essential outputs from prior steps. Within this workspace, it synthesizes findings into a central evolving report and decides subsequent actions, whether gathering more data or delivering final conclusions.
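The workspace-reconstruction idea can be made concrete with a small sketch. The `agent` return signature and the workspace template below are hypothetical; what the code demonstrates is that each round sees only the evolving report plus the newest observation, never the full interaction history:

```python
def iter_research(agent, tools, task, max_rounds=5):
    """IterResearch-style loop (sketch). Assumed interface:
    agent(workspace) -> (updated_report, action), where action is either
    {"type": "finish"} or {"type": "tool", "tool": name, "arg": value}."""
    report, observation = "", f"Task: {task}"
    for _ in range(max_rounds):
        # rebuild a focused workspace instead of appending to a transcript
        workspace = (f"Report so far:\n{report}\n\n"
                     f"New observation:\n{observation}")
        report, action = agent(workspace)
        if action["type"] == "finish":
            return report
        observation = tools[action["tool"]](action["arg"])
    return report
```

Discarding stale observations each round bounds the context the agent must attend to, which is what keeps long investigations from saturating.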

Expanding on this, the Research-Synthesis framework employs multiple Research Agents working in parallel via IterResearch, whose outputs are then integrated by a Synthesis Agent. This parallelism broadens the exploration of research avenues within limited context windows, pushing agent performance to new heights.
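The fan-out/fan-in shape of Research-Synthesis is straightforward to sketch; the agent callables below are placeholders for full IterResearch instances, and thread-based parallelism is an illustrative choice rather than the project's actual scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def research_synthesis(research_agents, synthesize, task):
    """Run independent Research Agents in parallel, then merge their
    reports with a Synthesis Agent. Each agent gets its own context,
    so total exploration is not bounded by a single context window."""
    with ThreadPoolExecutor(max_workers=len(research_agents)) as pool:
        reports = list(pool.map(lambda agent: agent(task), research_agents))
    return synthesize(task, reports)

# usage with stub agents
merged = research_synthesis(
    [lambda t: f"report on {t} #1", lambda t: f"report on {t} #2"],
    lambda task, reports: " | ".join(reports),
    "X",
)
```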

Comprehensive End-to-End Training Pipeline for Agentic Models

Developing an agentic AI model necessitated a fundamental redesign of the training pipeline, spanning pre-training, supervised fine-tuning, and reinforcement learning (RL). We established a seamless training loop connecting Agentic CPT, Agentic Supervised Fine-Tuning (SFT), and Agentic RL, enabling continuous self-improvement aligned with complex objectives.

Innovations in Reinforcement Learning

Our RL strategy incorporates a customized on-policy Group Relative Policy Optimization algorithm, utilizing token-level gradient loss to ensure learning signals accurately reflect model capabilities. Training dynamics reveal steady reward improvements alongside sustained policy entropy, indicating ongoing exploration and avoidance of premature convergence. This robustness is attributed to the inherently dynamic web environment, which naturally fosters adaptive policies without explicit entropy regularization.
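The token-level variant of the group-relative loss can be sketched as follows. This is a simplified reading of the description above, not the project's training code: clipping, KL terms, and importance ratios are omitted, and the function names are our own. The key detail shown is that the loss averages over all tokens in the group, so long rollouts are not down-weighted relative to short ones:

```python
import numpy as np

def grpo_token_loss(logps, rewards, eps=1e-8):
    """Group Relative Policy Optimization, token-level variant (sketch).
    logps: per-token log-prob arrays, one per sampled rollout;
    rewards: one scalar reward per rollout in the same group."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)      # group-relative advantage
    # every token of a rollout shares that rollout's advantage
    per_token = [-a * lp for a, lp in zip(adv, logps)]
    total_tokens = sum(len(lp) for lp in logps)
    # token-level averaging: divide by ALL tokens in the group
    return sum(t.sum() for t in per_token) / total_tokens
```

Normalizing within the sampled group removes the need for a separate value network, which is what makes the scheme practical for long agentic rollouts.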

While algorithmic advances are important, our experiments highlight that data quality and training environment stability are even more critical for successful RL. Notably, training directly on human-annotated datasets like BrowseComp yields inferior results compared to synthetic data, likely due to the latter’s consistent distribution and scalability, which better supports model generalization.

Robust Infrastructure for Scalable Agent Training

  • Simulated Training Environment: To overcome the limitations of live web API dependencies, we built a simulated environment based on an offline Wikipedia database and a custom tool suite. This setup enables rapid, cost-effective, and controlled experimentation with complex tasks.
  • Reliable Tool Sandbox: A unified sandbox manages tool interactions with features like result caching, retry mechanisms, and fallback providers, ensuring deterministic and error-resilient agent experiences critical for stable learning.
  • Automated Data Curation: Real-time data optimization guided by training feedback dynamically adjusts the training set through automated synthesis and filtering, enhancing both stability and performance.
  • Asynchronous On-Policy Training Framework: Built atop the rLLM framework, multiple agent instances interact concurrently with the environment, generating diverse trajectories that enrich learning.
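The sandbox bullet above combines three standard reliability patterns, which a minimal sketch makes concrete. The class name, provider interface, and exponential backoff schedule are illustrative assumptions, not the production sandbox:

```python
import time

class ToolSandbox:
    """Unified tool gateway (sketch): caches results, retries transient
    failures, and falls back to alternate providers, so the agent sees a
    deterministic, error-resilient interface during training."""

    def __init__(self, providers, retries=2, backoff=0.1):
        self.providers = providers   # callables, primary first
        self.retries = retries
        self.backoff = backoff
        self.cache = {}

    def call(self, query):
        if query in self.cache:             # identical replays later
            return self.cache[query]
        for provider in self.providers:     # fallback chain
            for attempt in range(self.retries + 1):
                try:
                    result = provider(query)
                    self.cache[query] = result
                    return result
                except Exception:           # transient failure: back off
                    time.sleep(self.backoff * (2 ** attempt))
        raise RuntimeError(f"all providers failed for {query!r}")
```

Caching matters beyond cost: replaying a trajectory against cached results keeps the environment deterministic, which stabilizes on-policy learning.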

This integrated approach, from raw model initialization through Agentic CPT, supervised fine-tuning, and on-policy RL, establishes a new standard for training AI agents capable of tackling complex, dynamic tasks with resilience and adaptability.

Practical Deployments and Real-World Impact

  • Gaode Mate (Map & Navigation Agent): Developed in partnership with the Amap (Gaode) team, “Xiao Gao” serves as an AI copilot that leverages the app’s extensive toolset to generate detailed, optimized itineraries, surpassing conventional navigation solutions.
  • Tongyi FaRui (Legal Research Agent): Powered by the DeepResearch architecture, FaRui autonomously conducts intricate multi-step legal research akin to a junior attorney’s workflow. It systematically retrieves case law, cross-references statutes, and synthesizes analyses, all grounded in verifiable judicial sources with precise citations, ensuring professional-grade accuracy and reliability.

Current Challenges and Future Directions

Looking ahead, we aim to address several key limitations. Expanding the agent’s context window and enhancing information management strategies will be priorities. Additionally, we plan to explore partial rollout techniques, which involve overcoming challenges related to off-policy learning and distributional shifts. These advancements will further refine the agent’s reasoning depth and adaptability.

Ongoing Research and Community Contributions

The Tongyi DeepResearch initiative encompasses a broad family of deep research agents, with continuous contributions advancing the field. Over the past six months, our team has released monthly technical reports detailing innovations such as:

  • WebWalker: Benchmarking large language models for web traversal
  • WebDancer: Towards autonomous information-seeking agents
  • WebSailor: Enabling superhuman reasoning for web agents
  • WebShaper: Agentic data synthesis via formalized information-seeking
  • WebWatcher: Pioneering vision-language deep research agents
  • WebResearch: Unlocking reasoning in long-horizon agents
  • ReSum: Enhancing long-horizon search intelligence through context summarization
  • WebWeaver: Structuring web-scale evidence with dynamic outlines
  • WebSailor V2: Bridging proprietary agents with synthetic data and scalable RL
  • Scaling Agents through Continual Pre-training and Environment Scaling

We remain committed to advancing open-source AI research and invite the community to explore our Tongyi DeepResearch-30B A3B model, heralding the next generation of agentic intelligence.
