The Sequence Radar #795: The New Inference Kids
📝 Editorial: The New Inference Kids

Last week was all about inference in AI, with new players emerging as forces to be reckoned with in the space. For the last few years, the entire industry has been obsessed with training: stacking thousands of H100s to teach a ghost how to speak. But this week, the vibe shifted. We are moving from a world where we spend billions to create intelligence to one where we spend billions to serve it. The race is no longer just about who has the smartest model; it’s about who can actually run the thing without bankrupting the company. The “Inference Race” just went vertical.

Leading the charge is Baseten, which just announced a monster $300M round at a ~$5B valuation. Interestingly, NVIDIA is writing the check. Baseten isn’t trying to build the model; it is building the plumbing. Its bet is that inference is the new “cloud computing”: a utility that needs to be boring, reliable, and infinitely scalable. The pitch is effectively: “You bring the weights, we’ll handle the nightmare of GPU orchestration.”

While Baseten handles the macro-infrastructure, two other players emerged this week to handle the micro-optimization. First up is RadixArk. If you’ve been hacking on SGLang, you know it’s magic for complex workflows. The Berkeley team behind it just spun the project out at a $400M valuation. Its secret sauce is RadixAttention. In a standard inference engine, when a user sends a prompt, the Key-Value (KV) cache is computed from scratch. RadixArk changes the game by treating the KV cache like a classic LRU cache: it automatically reuses KV blocks from previous requests when they share a prefix. This is massive for agentic workflows where you have a long system prompt or few-shot examples that never change. You aren’t recomputing the same tokens over and over; you’re just mapping to existing memory.

Then there is Inferact, the new commercial face of vLLM, which just raised $150M at an $800M valuation.
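The prefix-reuse idea can be illustrated with a toy sketch: a trie keyed by token IDs, where each node points at a cached KV block, so a new request only computes KV entries past its longest cached prefix. This is a minimal illustration in plain Python; the class and method names are invented for this sketch, and the real SGLang implementation manages actual GPU memory with radix-tree path compression and LRU eviction.

```python
# Toy sketch of RadixAttention-style prefix caching (illustrative only;
# names like PrefixCache are hypothetical, not the real SGLang API).

class RadixNode:
    def __init__(self):
        self.children = {}    # token id -> RadixNode
        self.kv_block = None  # stand-in for a cached KV block id on GPU


class PrefixCache:
    def __init__(self):
        self.root = RadixNode()
        self.next_block = 0   # fake allocator for KV block ids

    def match_prefix(self, tokens):
        """Return (number of cached tokens, reusable KV block ids)."""
        node, blocks = self.root, []
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                return i, blocks  # cache miss past position i
            node = node.children[tok]
            blocks.append(node.kv_block)
        return len(tokens), blocks

    def insert(self, tokens):
        """Cache KV blocks for a sequence, reusing any shared prefix."""
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                child = RadixNode()
                child.kv_block = self.next_block  # "compute" a new KV block
                self.next_block += 1
                node.children[tok] = child
            node = node.children[tok]
```

With a shared system prompt, every request after the first walks the trie instead of recomputing those tokens, which is where the agentic-workflow savings come from: only the suffix that diverges pays for KV computation.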
If RadixArk is about smart caching, Inferact is about brute-force memory efficiency through PagedAttention. Before vLLM, serving a model meant pre-allocating a massive contiguous block of VRAM per request. It was wasteful: you’d end up with internal fragmentation, with gigabytes of precious GPU memory sitting reserved but empty. Inferact applies the operating-system concept of paging to GPU memory, breaking the KV cache into non-contiguous blocks that can be scattered anywhere in VRAM. This lets it batch far more requests together because it is no longer bottlenecked by memory fragmentation.

Why is this happening now? Because inference quality is product quality. In the training era, latency didn’t matter; you could wait weeks for a run to finish. In the inference era, latency is everything. If your agent takes 5 seconds to think, it feels broken. If it costs $0.10 per turn, your business model is dead.

We are seeing a bifurcation in the stack. On one end, Baseten manages the hardware abstraction. On the other, RadixArk and Inferact are down in the kernels, squeezing every last FLOP out of the silicon. Strap in. We spent a decade teaching computers to think. Now we have to figure out how to make them think fast.

🔎 AI Research

Agentic Reasoning for Large Language Models
AI Lab: University of Illinois Urbana-Champaign, Meta, Amazon, Google DeepMind, UC San Diego, Yale University
Summary: This survey formalizes "Agentic Reasoning" as a paradigm shift that transforms Large Language Models from static processors into autonomous agents capable of planning, acting, and self-evolving through interaction. The authors structure the field into foundational, self-evolving, and collective reasoning layers, providing a unified roadmap for optimizing agentic systems via both in-context orchestration and post-training reinforcement learning across domains like science and robotics.
Building Production-Ready Probes For Gemini
AI Lab: Google DeepMind
Summary: This paper introduces the MultiMax probe architecture and utilizes automated architecture search to address the failure of existing activation probes on long-context inputs. The authors demonstrate that these improved probes, especially when paired with cascading classifiers, provide a robust and efficient misuse mitigation system for the Gemini model.

Reasoning Models Generate Societies of Thought
AI Lab: Google, University of Chicago, Santa Fe Institute
Summary: This research argues that advanced reasoning in models like DeepSeek-R1 emerges from the implicit simulation of a multi-agent "society of thought" containing diverse perspectives and conversational dynamics. The authors show that reinforcement learning naturally encourages these social behaviors, and that explicitly fine-tuning for conversational structure further accelerates reasoning performance.

Multimodal Reinforcement Learning with Agentic Verifier for AI Agents
AI Lab: Microsoft Research
Summary: This paper introduces Argos, a verifier that provides granular, multi-objective rewards for training multimodal agents by assessing spatial grounding, reasoning quality, and final accuracy. This approach enables models to achieve state-of-the-art performance on spatial and embodied AI tasks while significantly reducing visual hallucinations through verifiable reinforcement learning.

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
AI Lab: University of Pennsylvania, Microsoft Research
Summary: This work presents a “Multiplex Thinking” mechanism that aggregates multiple sampled tokens into a single continuous representation at each step to enable efficient exploration of reasoning paths. This approach facilitates on-policy reinforcement learning without the high cost of long discrete rollouts, leading to superior performance on complex math reasoning benchmarks.
ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
AI Lab: Arizona State University, Intuit AI Research
Summary: This paper introduces ToolPRMBench, a specialized benchmark for evaluating Process Reward Models (PRMs) in the context of long-horizon tool-use tasks by focusing on step-level correctness. The study highlights that specialized PRMs and reinforcement learning significantly enhance the ability to detect intermediate errors in agent trajectories compared to general-purpose models.

🤖 AI Tech Releases

GLM 4.7-Flash
Z.ai open sourced GLM-4.7-Flash, a new coding and agentic assistant.

LFM2.5 Thinking
Liquid AI released a new model for on-device reasoning.

📡 AI Radar

Inference & Infrastructure
Funding & Ventures
Big Tech & Policy
You’re on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities.