The Sequence Chat #814: Z.ai's Zixuan Li Talks About GLM
We are back with our interview series and a very special guest today! Zixuan Li has been at the center of China's AI open source revolution. We discuss the GLM models, Chinese open source, and more.

Background
I’m Zixuan Li, Head of Z.ai Global Ecosystem, responsible for Z.ai Chat, Z.ai API services, global partnerships, and global branding. It was an easy choice for me. I’m not a pure academic person. I’ve done entrepreneurship before, worked at large tech firms, and hold CFA and PMP certifications. I love challenges. When I joined Z.ai, I saw fascinating challenges that I had a chance to conquer: product development, partnership building, commercialization, and establishing a global brand. The opportunity to build something from the ground up on a global scale was too compelling to pass up.

The Main Sequence
The original hypothesis behind GLM was that the dichotomy between autoencoding (BERT-style) and autoregressive (GPT-style) models was a false choice. We believed a unified framework could capture the best of both worlds: strong bidirectional understanding and powerful generation capabilities. Back then, the landscape was fragmented. You had to choose between models good at understanding versus models good at generation. GLM’s autoregressive blank infilling objective was designed to bridge that gap. This DNA still influences our models today. We continue to prioritize versatility and multi-task capability rather than optimizing for a single benchmark or use case.
The GLM model architecture today has evolved significantly from its previous versions. We cannot comment extensively on Western models since most are closed-source. We simply cannot see their architectures. Throughout our development process, we continuously absorb industry best practices while making our own innovations. The key point is to keep identifying new critical problems and solving them. For us, there is no unified “best practice.” The field is moving too fast, and what works best depends heavily on the specific challenges you’re trying to solve.
We see MoE as a superior path to reasoning, not merely an efficiency optimization. The sparse activation pattern in MoE architectures allows different experts to specialize in different types of knowledge and reasoning patterns. This specialization leads to more nuanced and accurate responses across diverse domains.
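The sparse-activation pattern described above can be sketched in a few lines of NumPy: a router scores every expert for each token, but only the top-k experts actually run, so most parameters stay idle on any given forward pass. The expert count, dimensions, and top-2 gating below are illustrative assumptions, not GLM's actual configuration.

```python
import numpy as np

# Toy sketch of sparse Mixture-of-Experts routing with top-2 gating.
# All sizes are illustrative, not GLM's real configuration.
rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D_MODEL, D_FF = 8, 2, 16, 32

# Each expert is a small two-layer ReLU MLP.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.1,
     rng.standard_normal((D_FF, D_MODEL)) * 0.1)
    for _ in range(N_EXPERTS)
]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(x):
    """Route token x to its top-k experts; only those experts execute."""
    logits = x @ router                    # one score per expert
    top = np.argsort(logits)[-TOP_K:]      # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)
    return out, top

x = rng.standard_normal(D_MODEL)
y, used = moe_layer(x)
print(f"ran {len(used)} of {N_EXPERTS} experts; output dim {y.shape[0]}")
```

Only 2 of the 8 expert MLPs execute per token here, which is the mechanism that lets total capacity grow while per-token compute stays roughly flat; it also gives each expert room to specialize.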
We want to expand the cake before taking a bite of it. With open source, we aim to achieve three things:

- Improve accessibility: You can download the model yourself or use it from various inference providers. This lowers barriers for developers and researchers worldwide.
- Enable ecosystem innovation: You can build your own models on top of the GLM series. Quantize, finetune, extend. We’ve seen remarkable work extending our original capabilities, like Intellect-3. The community often takes our models in directions we hadn’t anticipated.
- Shape standards: If we’re fortunate enough, we might help set some norms and standards for open models. That’s something we simply cannot achieve with closed-source models alone.

Critically, open-sourcing does not cannibalize our business. Demand for GLM has exceeded supply. The whole world now lacks enough compute to run all the GLM deployments people want. Open source builds trust, expands the ecosystem, and ultimately drives more enterprise customers to our managed services.
To assess the product-market fit of device agents, the critical point is not to compare local versus cloud solutions. It’s to compare the agent’s operation with human operation. Currently, we see three main obstacles:

- Speed of operations: Agents must match or exceed the speed at which humans navigate interfaces. A 2-second delay per action compounds quickly into an unusable experience.
- Error recovery and robustness: Humans are incredibly good at recovering from small mistakes. We barely notice when we misclick and correct. Agents need this same resilience. A single error that derails an entire workflow breaks user trust immediately.
- Context persistence across sessions: Humans remember what we were doing yesterday. We pick up tasks seamlessly. Agents need similar long-term context awareness to feel truly integrated into daily life.

Latency is important, but I’d argue error correction and graceful degradation are the bigger blockers today. Users can tolerate a slightly slower agent that reliably completes tasks. They cannot tolerate a fast agent that fails unpredictably.
Vision capabilities are still essential in many scenarios. For example:

- Physical world understanding: Text can describe that “water flows downhill,” but video data showing fluid dynamics teaches intuitive physics in ways text descriptions cannot fully capture.
- Real-world grounding: Many real-world applications, such as medical imaging, autonomous driving, and industrial inspection, simply cannot function without visual input.

That said, I don’t think the question is binary. Text contains vast amounts of implicit physical knowledge encoded in how humans describe the world. The most capable systems will likely combine both, using text to provide abstract reasoning frameworks and visual data to ground those frameworks in physical reality. The question isn’t “text or vision” but “how do we best integrate multiple modalities for richer understanding?”
Z.ai is relatively more mature in “service” and more committed to the model-as-a-service (MaaS) philosophy. We don’t just provide models. We’ve developed a mature enterprise and individual service system. This means we not only serve customers on our own API platform but also assist other providers in deploying GLM models correctly and efficiently. We help with optimization, integration, and ongoing support. That comprehensive service approach has driven our commercial success across both B2C and B2B segments. Our soul, if I had to distill it: we’re builders who serve builders. We care deeply about developer experience, enterprise reliability, and the entire journey from model to production application.
GLM-5 scales from the 355B parameters of GLM-4.5 to 744B total parameters, with about 40B active per forward pass (roughly 5% of the total) through a Mixture of Experts architecture. To keep deployment practical at this size, it integrates DeepSeek Sparse Attention for the first time, maintaining long-context capacity while significantly reducing memory and compute costs. The result is a model designed around intelligence efficiency, not raw parameter count.

Miscellaneous
I won’t comment on other labs in China. But for Z.ai, we don’t really face this “China vs. Silicon Valley” framing. Most of our researchers are Z.ai-originated. They started their research journey here, growing with the company from early days. What attracts talent to us? I’d say it’s the combination of cutting-edge research opportunity, the pace of iteration, and the chance to see your work deployed at massive scale very quickly. At Z.ai, the distance from research idea to production deployment is remarkably short. For researchers who want to see their work matter in the real world, that’s compelling.
It’s hard to give a precise definition of AGI. The goalposts keep moving as capabilities advance. But I believe the current Transformer architecture has a very high ceiling, higher than most people expect. Here’s a perspective that’s often overlooked: most of the data patterns we’ll see in 2027 and 2028 haven’t even been created yet. Human knowledge, content, and interaction patterns are continuously expanding. The models of tomorrow will be trained on data that doesn’t exist today. That’s a powerful tailwind for continued scaling. Will we eventually need architectural breakthroughs? Probably. But I suspect we’re nowhere near the ceiling of what’s possible with Transformers. We’re still in the early innings of understanding how to fully leverage this architecture.
Pricing by tokens might no longer be the mainstream business model. LLMs will be charged by the value they create. Think about it: we don’t pay for electricity based on how many electrons flow through our devices. We pay for what those electrons enable us to do. Similarly, as AI becomes more agentic and outcome-oriented, pricing will shift from input metrics (tokens) to output metrics (tasks completed, value generated, problems solved). This is already beginning with subscription models and outcome-based enterprise contracts.