The Sequence Radar #723: Alibaba’s Agentic Leap: Why Tongyi DeepResearch Matters
Another Chinese lab releasing impressive models.

📝 Editorial: Alibaba’s Agentic Leap: Why Tongyi DeepResearch Matters

Tongyi DeepResearch matters because it is the first fully open-source “deep research” web agent to publicly claim parity with top closed systems across a broad suite of agentic browsing benchmarks, while shipping under a permissive license with reproducible code and weights. For teams that need verifiable pipelines and on-prem deployment, this flips the script: the research loop (data → training → evaluation → inference) is documented end to end and legally usable in products, not just demos. The release also raises the bar on what “agentic” means in practice: robust long-horizon browsing, test-time scaling, and an RL-trained policy rather than fragile prompt glue.

Quick background on Tongyi: the project comes from Alibaba’s Tongyi Lab (adjacent to the Qwen stack) and is derived from a 30.5B Mixture-of-Experts architecture with ~3.3B parameters active per token (“A3B”), giving it the efficiency profile of a small model while keeping a larger expert pool for reasoning. Context length is listed at 128k. The weights and training/inference code are available under Apache-2.0 with ready-to-run scripts. If you’ve used Qwen3-30B-A3B models, the ergonomics will feel familiar; this is a specialized agentic fork aimed at long-horizon information seeking.

Technically, the standout is the fully automated synthetic-data flywheel that spans continual pre-training (Agentic CPT), supervised fine-tuning, and strictly on-policy reinforcement learning. The team describes a Group-Relative Policy Optimization (GRPO) variant with token-level policy gradients and leave-one-out advantages to stabilize training in a non-stationary web environment, paired with automated negative-sample filtering.
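The leave-one-out advantage idea is easy to sketch. Below is a minimal, illustrative version (not the team’s actual implementation; function names and the loss shape are assumptions): each rollout in a group is scored against the mean reward of the *other* rollouts, and that group-relative advantage scales every token’s log-probability in the policy-gradient loss.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout = its reward minus the mean reward of
    the other rollouts in the group (leave-one-out baseline)."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def grpo_token_loss(token_logprobs, rewards):
    """Token-level policy gradient: every token's log-prob in rollout i
    is weighted by that rollout's group-relative advantage.
    `token_logprobs[i]` is the list of per-token log-probs of rollout i."""
    advs = leave_one_out_advantages(rewards)
    loss, n_tokens = 0.0, 0
    for logprobs, adv in zip(token_logprobs, advs):
        for lp in logprobs:
            loss -= adv * lp  # maximize log-prob of high-advantage rollouts
            n_tokens += 1
    return loss / n_tokens
```

The leave-one-out baseline keeps each rollout’s advantage independent of its own reward, which reduces bias relative to including itself in the group mean.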
Inference supports two regimes: a vanilla ReAct path to audit core capabilities, and a “Heavy” mode (test-time scaling) that layers iterative planning to push performance ceilings. This combination of purpose-built synthetic data, on-policy RL, and selectable inference regimes is the core engineering contribution.

On empirical results, Tongyi DeepResearch reports state-of-the-art or parity scores on major agentic browsing suites: 32.9 on Humanity’s Last Exam (HLE), 43.4 on BrowseComp, 46.7 on BrowseComp-ZH, and 75 on xBench-DeepSearch, with additional wins across WebWalkerQA and related sets. The claim is that it systematically outperforms existing proprietary and open-source “deep research” agents in the reported settings. As always, caveats apply (benchmarks vary in tooling, retries, and orchestration), but the breadth of public numbers and open weights makes third-party replication feasible.

Design-wise, the rollout loop emphasizes “synthesis and reconstruction”: after each browsing cycle the agent distills essential artifacts into a compact workspace and a continually evolving central report before deciding whether to gather more evidence or finalize an answer. Beyond benchmarks, Alibaba lists live deployments, e.g., “Xiao Gao” in Amap (Gaode) for multi-step travel planning and a legal research agent (FaRui) that grounds outputs in verifiable citations: use cases that stress tool orchestration and citation hygiene, not just token-level reasoning.

Why this release is significant: it operationalizes an open, reproducible recipe for long-horizon agents (data generation → training → RL → inference) that enterprises can inspect, fork, and harden, rather than treating agents as a prompt template on top of a closed API.
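For readers unfamiliar with the vanilla ReAct path, here is a minimal sketch of that loop. Everything in it (the `llm` callable, the transcript format, the `Action: tool[arg]` syntax) is a hypothetical stand-in for illustration, not Tongyi’s actual interface:

```python
import re

def parse_action(step):
    """Extract `Action: tool[argument]` from a model step (illustrative format)."""
    m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
    if m is None:
        raise ValueError(f"no action in step: {step!r}")
    return m.group(1), m.group(2)

def react_agent(question, llm, tools, max_steps=20):
    """Vanilla ReAct loop: Thought -> Action -> Observation, repeated
    until the model emits `Final Answer:` or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)          # model produces the next Thought/Action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        name, arg = parse_action(step)  # dispatch to the named tool
        transcript += f"Observation: {tools[name](arg)}\n"
    return None                          # budget exhausted without an answer
```

“Heavy” mode can be thought of as wrapping a loop like this in extra planning and synthesis passes at test time, trading additional compute for accuracy.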
The Apache-2.0 licensing and full-stack availability lower adoption friction; the MoE-A3B efficiency makes serious research loops economically plausible; and the explicit limitations (context still capped, scaling to larger backbones pending, RL efficiency to improve) give a credible roadmap for community contributions. In short, Tongyi DeepResearch resets expectations for what a “serious” open agent looks like, and gives practitioners something they can run, measure, and ship today.

🔎 AI Research

Title: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
AI Lab: University of Cambridge; Institute for AI, University of Stuttgart; Max Planck Institute for Intelligent Systems; ELLIS Institute; University of Southampton; Tübingen AI Center
Summary: The paper shows that small gains in single-step accuracy compound into large, even faster-than-exponential, improvements in the task length models can execute, and identifies “self-conditioning” (models amplifying their own past mistakes) as a key failure mode in long-horizon execution. Thinking models and test-time sequential compute mitigate self-conditioning and dramatically extend single-turn execution length, with frontier reasoning models outperforming non-thinking counterparts by large margins.

Title: Virtual Agent Economies
AI Lab: Google DeepMind (with contributors from University of Toronto)
Summary: The authors propose “sandbox economies” for AI agents: intentional, steerable markets with controllable permeability to the human economy, designed to harness coordination benefits while managing systemic risk. They outline design tools such as auctions for fair allocation, mission economies, and trust infrastructure (e.g., verifiable credentials) to build safe, accountable, and socially aligned agent markets.

Title: Towards General Agentic Intelligence via Environment Scaling
AI Lab: Tongyi Lab, Alibaba Group
Summary: The paper introduces a scalable pipeline that programmatically builds fully simulated tool-use environments and then trains agents via a two-stage experience-learning regimen (general tool use → domain specialization), yielding verifiable trajectories. Experiments on τ-bench, τ²-Bench, and ACEBench show that the AgentScaler models significantly improve function-calling capability and reach parity with much larger or closed models in several cases.

Title: WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
AI Lab: Tongyi Lab, Alibaba Group
Summary: WebSailor-V2 pairs a new SailorFog-QA-2 dataset (dense graph-based uncertainties) with a dual-environment RL setup (a fast simulator plus a robust managed real web) to train a 30B-A3B MoE agent. The system achieves SOTA on BrowseComp-EN/ZH and HLE, surpassing prior open-source agents and rivaling proprietary systems.

Title: Scaling Laws for Differentially Private Language Models
AI Lab: Google Research & Google DeepMind
Summary: This work derives compute–privacy–utility scaling laws for DP LLM training, providing prescriptions for how to allocate compute among model size, batch size, and iterations under fixed privacy and data budgets. A key finding is that DP-optimal configurations favor much smaller models and very large batches, with additional compute yielding little benefit unless accompanied by more privacy budget or data.

Title: Tool-space interference in the MCP era: Designing for agent compatibility at scale
AI Lab: Microsoft Research
Summary: The blog highlights how the rapid adoption of the Model Context Protocol (MCP) has created a thriving ecosystem of interoperable tools, but has also introduced “tool-space interference,” where multiple agents and tools working together can inadvertently reduce one another’s effectiveness. Microsoft researchers propose early design strategies to mitigate these issues, enabling heterogeneous agents to cooperate at scale rather than hinder one another.
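The compounding claim in the first paper above has simple arithmetic behind it. Under a toy independence assumption (each step succeeds with probability p, and the task requires every step), the executable horizon at a target success rate s is H = ln(s)/ln(p), so small gains in per-step accuracy stretch the horizon dramatically. A sketch of this baseline model (the paper’s actual analysis is richer, including self-conditioning effects):

```python
import math

def horizon(step_accuracy, target_success=0.5):
    """Longest task length (in steps) completed with probability >= target_success,
    assuming independent steps that each succeed with `step_accuracy`.
    Solves step_accuracy ** H = target_success for H."""
    return math.log(target_success) / math.log(step_accuracy)
```

Moving per-step accuracy from 99% to 99.9%, a gain of under one point, lengthens the 50%-success horizon roughly tenfold (from about 69 steps to about 693), which is why small single-step gains look like diminishing returns on static benchmarks but are not.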
🤖 AI Tech Releases

Tongyi DeepResearch
Alibaba Tongyi open sourced a new autonomous research agent.

IBM Granite Docling
IBM released Granite Docling, a foundation model optimized for document understanding.

📡 AI Radar