This Week in Turing Post:
- Wednesday / AI 101 series: Hybrid AI
- Friday / Open Source AI series: The Real Math: when open source saves money, when it doesn't + an interview with MiniMax!
From our partners: AI for When It Is Rocket Science. Agent Composer is now available in the Contextual AI platform! It helps teams tackle expert-level engineering tasks in high-stakes environments, compressing hours of routine (but complex) work into minutes.

What makes it different:
- Unified context layer: Agents operate with full task, data, and workflow context.
- Flexible, controlled agents: Combine dynamic intelligence with structured workflows for mission-critical reliability.
- Intuitive, no-code build: Create and optimize agents in minutes with pre-built templates and natural language prompts; no rocket science required.
To the main topic: Agentic realities, or why agent progress is becoming a systems problem
I don't know how it happens, but every week reading research papers gives me an idea for an editorial. There is always a combination of papers that were clearly not coordinated, yet end up answering each other. They expose blind spots when read alone and make sense when read together. And that, I believe, is the value proposition of Turing Post's digests: noticing things you might otherwise have missed.
This week, my insights came from two surveys on agents: one about agentic reasoning as a paradigm, the other about efficiency and cost in agent systems. Surprisingly, they both describe the same bottleneck, just from two sides.
Quite boldly, the Agentic Reasoning for Large Language Models survey (Tianxin Wei et al.) argues that reasoning is not a one-shot model call. Instead, it defines reasoning as something that happens across interaction steps: planning, tool use, search, memory updates, feedback, and revision.
In this framing, reasoning is no longer equivalent to producing an internal chain-of-thought. It is closer to a control process over time. The agent maintains state, interacts with an environment, updates internal representations, and decides what to do next. The unit of analysis shifts from an answer to a trajectory.

This shift explains why so many recent agent papers focus on memory, reflection, and multi-step workflows rather than prompt tricks. Once reasoning is distributed across time, earlier decisions do not disappear. They influence later behavior whether or not they remain valid.

The survey makes this concrete in how it treats core components of agent behavior. Memory is described as an active element in the reasoning process, shaping future decisions rather than merely storing past information. Feedback appears as a mechanism for updating behavior over time, not simply for scoring outputs. Multi-agent configurations are framed in operational terms as well, focusing on role separation and coordination as ways to maintain consistency across long horizons rather than as mechanisms for improving pointwise accuracy.

What the paper implicitly acknowledges is that once reasoning becomes interactive, coherence becomes fragile.
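To make the "trajectory, not answer" framing concrete, here is a minimal control-loop sketch. All names (`AgentState`, `plan_next_action`, `execute`) are illustrative assumptions, not APIs from the survey:

```python
# Minimal sketch of reasoning as a control process over time:
# the unit of analysis is the trajectory, not a single model call.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    memory: list = field(default_factory=list)      # active: shapes future decisions
    trajectory: list = field(default_factory=list)  # the real unit of analysis

def plan_next_action(state):
    # Stand-in for a model call: choose an action given goal + memory.
    return {"tool": "finish"} if state.memory else {"tool": "search", "query": state.goal}

def execute(action):
    # Stand-in for tool use / environment interaction.
    return f"observation for {action['tool']}"

def run(goal, max_steps=8):
    state = AgentState(goal=goal)
    for _ in range(max_steps):          # explicit termination condition
        action = plan_next_action(state)
        if action["tool"] == "finish":
            break
        obs = execute(action)
        state.memory.append(obs)        # memory update is part of reasoning
        state.trajectory.append((action, obs))
    return state.trajectory
```

The point of the sketch is only structural: state persists across steps, each decision conditions on what earlier steps left behind, and the loop, not any single call, is where reasoning lives.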
The Toward Efficient Agents survey (Xiaofang Yang et al.) picks up exactly at that fragility point. Instead of asking how to design better reasoning mechanisms, it asks what happens when those mechanisms are deployed repeatedly.

The answer is that agent systems accumulate cost and state in ways that are not linear. Token usage compounds across steps. Memory grows faster than relevance. Tool calls introduce latency and retries. Planning depth increases even when marginal gains drop.

The survey is concrete about this. It decomposes agent cost into components: generation, memory access, tool invocation, retries. Efficiency is not treated as a single metric, but as a system-level tradeoff between effectiveness and resource use.

The paper is also clear about where it locates the source of the problem. The focus is not on reducing model size or changing model capacity, but on how agent behavior is organized at the system level. Memory is discussed in terms of ongoing compression and filtering. Tool use is treated as something that requires selectivity. Planning is described with an emphasis on explicit termination conditions. Without these mechanisms, performance can deteriorate over longer runs even when individual steps appear correct.

These concerns are framed in terms of maintaining stable behavior over time rather than improving performance on isolated tasks.
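A back-of-the-envelope sketch shows why token usage compounds rather than adds up. If each step re-processes the full accumulated context, total processed tokens grow roughly quadratically with step count. The numbers below are illustrative assumptions, not figures from the survey:

```python
# Back-of-the-envelope: why agent cost compounds across steps.
# Token counts here are illustrative, not from the survey.

def trajectory_tokens(steps, tokens_per_step=500, base_context=1000):
    """Each step re-reads the full accumulated context before generating,
    so total processed tokens grow quadratically, not linearly."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context + tokens_per_step  # read context, then generate
        context += tokens_per_step          # output is appended to context
    return total

print(trajectory_tokens(10))  # → 37500
print(trajectory_tokens(20))  # → 125000: doubling steps more than triples cost
```

This is the non-linearity the survey's cost decomposition is responding to: generation cost alone compounds with trajectory length, before memory access, tool invocation, and retries are even counted.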
Reading the two surveys together, we can see that they describe the same phenomenon from different angles:
- Both papers treat memory as an active component that shapes future behavior.
- Both treat tool use as an action with consequences, not a free capability.
- Both treat planning depth as something that must be regulated.
Neither paper claims that agents fail because models cannot reason. The failure mode is structural. Once reasoning persists over time, it requires mechanisms to prevent accumulation from overwhelming the system.

This is why many agent failures look familiar to anyone with systems experience. Old state interferes with new decisions. Failed paths continue to influence behavior. The system technically works, but its internal structure degrades.

So what's the takeaway?

These papers suggest that as agents run longer, the main constraint shifts. Progress depends less on how strong individual reasoning steps are and more on whether the system can stay coherent over time. Memory, reasoning, and action all need to be managed, not simply expanded.

Taken together, the surveys point toward a move from model-centric to systems-level thinking about agents.
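One way to read "managed, not simply expanded" in code: keep memory relevance-filtered and size-bounded instead of appending forever. This is a hypothetical policy sketch; neither survey prescribes this exact mechanism:

```python
# Hypothetical sketch of managed memory: compress and filter
# instead of letting state grow without bound.

def update_memory(memory, new_item, scorer, capacity=5):
    """Keep only the `capacity` most relevant items.
    `scorer` maps an item to a relevance score (higher = keep)."""
    memory = memory + [new_item]
    memory.sort(key=scorer, reverse=True)
    return memory[:capacity]

# Toy usage: relevance scores are synthetic here.
mem = []
for step in range(10):
    mem = update_memory(mem, {"step": step, "score": step % 7},
                        scorer=lambda m: m["score"])
print(len(mem))  # → 5: bounded regardless of how long the agent runs
```

The design choice is the point: memory size stays constant as the run grows, so old, low-relevance state cannot silently accumulate and interfere with new decisions.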
Our news digest is always free. Click on the partner's link above to support us, or upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, a16z, and more, plus AI labs and institutions such as Ai2, MIT, Berkeley, and .gov domains, and thousands of others to really understand what's going on with AI.
We are watching/reading
I've read the new opus by Dario Amodei and here is my honest take. Please watch and let me know what you think. This episode is not about denying AI risk. The risks are real. The question is whether this essay helps us reason about them, or whether it mainly reveals how Silicon Valley talks to itself when stakes feel existential.
What Dario Amodei Gets Wrong About AI
News from the usual suspects

NVIDIA
- NVIDIA introduces Earth-2, a suite of open-source AI models designed to accelerate weather and climate forecasting. Covering everything from global 15-day forecasts to minute-level storm nowcasting, the platform aims to democratize high-resolution prediction tools and reduce reliance on traditional supercomputing. A notable step toward scalable, accessible weather intelligence.
- NVIDIA is investing $2B in CoreWeave and expanding their partnership to build over 5 gigawatts of "AI factories" by 2030. CoreWeave will deploy multiple generations of NVIDIA hardware, including Rubin GPUs and Vera CPUs, while offering its AI-native software stack to cloud providers and enterprises. It's a tight alignment aimed at scaling infrastructure for the next wave of AI adoption.

Microsoft
- Maia: Inference is forever. Microsoft unveils Maia 200, its first custom AI inference chip, as part of its heterogeneous AI infrastructure and a strategic move into the heart of where AI economics play out. Built on TSMC's 3nm process with custom FP8/FP4 cores and 216GB of HBM3e, Maia 200 is optimized for the endless grind of inference. But beyond specs, it's about integration: by aligning silicon, models, and apps across workloads like GPT-5.2 and Copilot, Microsoft gains a tight feedback loop and a durable advantage.
- QDK gets sharper qubits. Microsoft expands its Quantum Development Kit with powerful new tools for chemistry and error correction, signaling readiness for the logical qubit era. Fully integrated with VS Code and GitHub Copilot, the QDK simplifies quantum programming and supports major frameworks like Qiskit and Cirq. It's a move toward practical applications, built on Microsoft's vision of a unified platform combining quantum hardware, software, and AI, now tightly looped into Azure.
Survey highlight
Models (if 🔓, it's open sourced)
- 🔓 Waypoint-1: Real-time interactive video diffusion from Overworld. Researchers from Overworld introduced Waypoint-1, a real-time, text-and-input-controllable video diffusion model trained on 10,000 hours of labeled gameplay. Using a frame-causal rectified flow transformer, it generates each frame conditioned on text, mouse, and keyboard with zero latency. The model supports 30 FPS at 4 steps or 60 FPS at 2 steps on consumer GPUs via the WorldEngine library. Training used diffusion forcing and self-forcing to minimize inference mismatch and error accumulation in autoregressive rollouts → read the tech overview
- 🔓 Kimi K2.5. Researchers from Moonshot AI released Kimi K2.5, a 1-trillion-parameter native multimodal LLM with 32B active parameters, trained on 15 trillion visual-text tokens. It integrates vision-language reasoning, tool use, and swarm-like agent execution. The model features 256K context length, a MoonViT vision encoder (400M params), and Mixture-of-Experts with 384 experts. It outperforms rivals on benchmarks like MathVista (90.1), OCRBench (92.3), and VideoMMU (86.6), and supports native INT4 quantization and long-context reasoning → read their blog
- 🔓 LongCat-Flash-Thinking-2601 technical report. Researchers from Meituan introduced LongCat-Flash-Thinking-2601, a 560B-parameter open-source MoE reasoning model with 27B activated parameters and SOTA agentic reasoning performance. It achieves 88.2 on τ²-Bench, 29.3 on VitaBench, and 73.1 on BrowseComp. Trained across 32,000 environments in 20+ domains using the DORA framework, it integrates real-world noise and a Heavy Thinking Mode for test-time scaling. Its Zigzag attention variant supports 1M-token context and yields 1.5× inference speedup with minimal performance trade-off → read the paper
- Pushing Qwen3-Max-Thinking beyond its limits. Researchers from Qwen introduced Qwen3-Max-Thinking, a flagship reasoning LLM with advanced tool use and test-time scaling. It achieved top-tier performance on 19 benchmarks, including C-Eval (93.7), HMMT Feb (98.0), and Arena-Hard v2 (90.2). The model autonomously selects tools like Search, Memory, and Code Interpreter. Its multi-round scaling boosts reasoning accuracy (e.g., GPQA from 90.3 to 92.8) with efficient context usage. APIs support OpenAI and Anthropic compatibility via Alibaba Cloud → read the blog
- RoboBrain 2.5: Depth in sight, time in mind. Researchers from BAAI introduced RoboBrain 2.5, an 8B-parameter embodied AI model with two major upgrades: Precise 3D Spatial Reasoning and Dense Temporal Value Estimation. It predicts collision-free 3D keypoint traces using (u, v, d) coordinates from monocular RGB inputs and delivers dense, step-aware progress feedback via hop-based value estimation. Trained on 12.4M samples, it achieves SOTA on benchmarks like TraceSpatial (83/63/44 success), MSMU (64.17), and VABench-V (0.1189 error), surpassing prior models in real-world manipulation → read the paper
Research this week
(as always, 🌟 indicates papers that we recommend paying attention to)
Agent training, mid-training, and experience scaling
- daVinci-Dev: Agent-native Mid-training for Software Engineering. Establish agent-native mid-training with executable, feedback-rich trajectories to instill foundational software-engineering behaviors more efficiently than post-training alone → read the paper
- 🌟 Endless Terminals: Scaling RL Environments for Terminal Agents. Scale reinforcement learning by procedurally generating executable terminal environments so simple PPO agents improve when environments, not scaffolds, scale → read the paper
- EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience. Evolve native computer-use agents through a self-sustaining loop of task synthesis, sandbox rollouts, and iterative policy refinement → read the paper
- LLM-in-Sandbox Elicits General Agentic Intelligence. Unlock general agentic behavior by letting models explore a code sandbox and optionally reinforce those behaviors without agent-specific training data → read the paper
Reinforcement learning systems and optimization theory
- 🌟 Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Precision Flow. Stabilize and accelerate large-scale RL by unifying FP8 precision across rollout and training to eliminate numerical mismatch → read the paper
- Your Group-Relative Advantage Is Biased. Expose systematic bias in group-relative advantage estimators and correct it with difficulty-aware reweighting for more robust RLVR training → read the paper
- Behavior Knowledge Merge in Reinforced Agentic Models. Merge multiple RL-trained agents by disentangling shared and task-specific updates instead of naively averaging sparse RL task vectors → read the paper
Test-time learning, adaptation, and discovery
- 🌟 Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers. Adapt attention sparsity at inference time by routing heads dynamically to balance efficiency and performance on long-context inputs → read the paper
- 🌟 Learning to Discover at Test Time. Perform reinforcement learning at test time to search for one exceptional solution rather than optimizing average performance across tasks → read the paper
Strategic reasoning, persuasion, and dialogue
- Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind. Model reviewer mental states and persuasion strategies to generate rebuttals grounded in theory-of-mind reasoning rather than surface imitation → read the paper
- 🌟 GameTalk: Training LLMs for Strategic Conversation. Optimize long-horizon objectives across full dialogues by training models with conversation-level rewards in multi-agent games → read the paper
- Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance. Reframe rebuttal writing as evidence-centric planning with inspectable reasoning and on-demand external search → read the paper
Safety, calibration, and reliability of agents
- 🌟 Building Production-Ready Probes for Gemini. Design activation probes that generalize under long-context and multi-turn distribution shifts for real-world misuse mitigation → read the paper
- Agentic Confidence Calibration. Calibrate agent confidence at the trajectory level by extracting process-level signals that explain and predict failure → read the paper
- Agentic Uncertainty Quantification. Turn verbalized uncertainty into active control signals that dynamically balance fast execution and targeted reflection → read the paper
- Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces. Estimate domain-level accuracy under drift using decoding-time entropy statistics as a scalable monitoring signal → read the paper
- Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy. Reveal how standard fine-tuning silently degrades contextual privacy reasoning while leaving benchmark performance intact → read the paper
Architecture limits, attention, and prompt mechanics
Mechanistic interpretability and actionable control
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability. Reframe mechanistic interpretability as an intervention pipeline that enables diagnosis, steering, and measurable model improvement → read the paper
- A BERTology View of LLM Orchestrations. Reuse hidden states from serving LLMs to perform classification in-pass, reducing latency and guard-model overhead → read the paper
Systems, organizations, and socio-technical limits

Automated systems and low-level optimization
That's all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How did you like it?