This Week in Turing Post: | | | From our partners: Vault-Free Privileged Access for Modern Engineering Teams | | As AI and cloud infrastructure scale, managing privileged access with static credentials and vaults becomes both a bottleneck and a risk. Teleport replaces rotated credentials and vaulted secrets with real Zero Trust, issuing short-lived, cryptographic certificates at runtime for every human, machine, and AI agent. | Discover how vault-free PAM reduces risk and accelerates engineering. | | Our news digest is always free. Click on the partner's link above to support us, or Upgrade to receive our deep dives in full, directly into your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, a16z, etc., plus AI labs and institutions such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what's going on with AI → | |
|
| | Last week at CES: Robots! More Robots! And Jensen Huang says they will have human-level capabilities THIS year. We went to see if the robots were aware of that. Watch the video :) |  | Jensen Huang says robots will have human capabilities this year! The Robots at CES Had... Other Plans |
|
| Also last week: Why OpenAI and Anthropic Chose Healthcare at the Same Time | Right after the holidays, both OpenAI and Anthropic announced healthcare-focused initiatives within days of each other. For the first time, I don't think of it as a competition; what I like about it is that it signals healthcare has crossed a threshold where staying out is no longer the cautious choice. | For several years, healthcare was treated as a deferred domain by the leading AI labs. Understandably so: the sector is heavily regulated, operationally fragmented, and unforgiving of confident mistakes. Earlier generations of models were difficult to bound, difficult to audit, and prone to failure modes that could not be cleanly isolated from their successes. In low-stakes domains, this was acceptable. In healthcare, not at all. | The decision by both labs to move now implies a shared conclusion that something fundamental has changed. The models are certainly more capable now, but most importantly, they are more governable. | Healthcare is therefore better understood as a systems test than as a market opportunity. This is a hugely important step in AI adoption. | Another point worth mentioning: doctors should not be worried. What AI is being applied to is coordination. It is an old problem in healthcare that no one is structurally positioned to assemble full context under time pressure: information is distributed across multiple systems, and signals from medications, labs, imaging, wearables, genetics, and prior history are rarely considered together when decisions are made, leaving patients to play detective and piece everything together on their own. In this framing, LLMs are not making medical judgments. They mainly help bring existing information together so it can be reviewed more easily. | Both labs appear to believe this coordination role is now stable enough to turn into a product. | Where the two labs differ is in how they approach this coordination role. | OpenAI is extending its general assistant into healthcare, treating health data as another high-value context that can sit alongside documents, calendars, and enterprise tools, with additional privacy and access controls layered on top. The underlying assumption is that a single, familiar interface can serve patients, clinicians, and administrative workflows, as long as the boundaries around data use are clearly defined. | Anthropic is taking a narrower approach. Its healthcare effort is oriented less toward a patient-facing assistant and more toward embedding Claude inside existing institutional workflows. The emphasis is on predictable behavior, limited scope, and alignment with how healthcare organizations already operate. Rather than broad continuity across use cases, the focus is on fitting cleanly into specific professional contexts. | These choices of focus reflect different theories of how trust is built in regulated systems: one assumes trust emerges from continuity and widespread use, the other from constraint and institutional alignment. It is not yet clear which approach will prove more durable, and it is possible that both will coexist in different parts of the system. What matters is that both labs are now willing to test their models in an environment where responsibility cannot remain abstract. I'm very excited about this new development. | | | |
| | | We are reading | | News from the usual suspects | Gmail Gets Gemini-fied Gmail is stepping into 2026 with Gemini AI at the helm. Google's flagship inbox now offers AI Overviews to summarize email threads, answer natural language queries, and filter clutter with the upcoming "AI Inbox." Help Me Write and Suggested Replies get smarter, while proofreading goes premium. It's no longer just email; it's your AI-powered executive assistant. Apple + Google: The Gemini Marriage Apple has picked Google's Gemini to power the long-delayed AI upgrade to Siri, marking a rare alliance between rivals. The multiyear partnership puts Gemini models at the core of Apple's upcoming "Foundation Models," keeping compute mostly on-device and in Apple's private cloud. Apple remains mum on the $1B/year price tag, but this move signals Cupertino is finally showing up to the AI arms race. Fashionably late, of course. Musk's Macrohard Moment xAI, Elon Musk's AI venture, torched $7.8 billion in just nine months, chasing its dream of powering humanoid robots like Optimus. Despite swelling quarterly losses, revenue doubled to $107 million, and a $20B cash injection (featuring Nvidia) suggests the spending spree is far from over. "Macrohard" may be a pun on Microsoft, but the burn rate is no joke.
| | Research highlight | | Researchers from MIT CSAIL present Recursive Language Models (RLMs), a novel inference-time architecture enabling LLMs to process arbitrarily long prompts, scaling beyond 10 million tokens, over 100× typical context windows. Instead of consuming the prompt directly, RLMs offload it into a Python REPL as a variable (context), allowing the LLM to interact with the prompt symbolically via code. The model can read, transform, and decompose the context and recursively call sub-LLMs through a built-in llm_query() function. This enables dynamic task decomposition, selective context access, and unbounded reasoning. RLMs require no retraining and work with existing models (GPT-5, Qwen3-Coder), achieving up to 2× higher accuracy than base LLMs and long-context agents on benchmarks like BrowseComp+, OOLONG, and OOLONG-Pairs, while keeping inference cost comparable or lower. Ablation studies confirm the critical role of both the REPL environment and recursive sub-calls in solving complex, information-dense tasks. This is a significant step forward because RLMs break the fundamental context-window barrier of LLMs, enabling scalable, symbolic, and recursive reasoning over massive inputs without retraining or architectural changes (a toy sketch of the loop appears after the Models list below) → read the paper | Models | Liquid: LFM2.5 - The Next Generation of On-Device AI Release an open-weight 1.2B-class model family optimized for edge agents by extending pretraining to 28T tokens, scaling post-training with multi-stage reinforcement learning, and shipping text, Japanese, vision-language, and native audio variants with day-zero runtime support across common inference stacks and NPUs → read the paper MiMo-V2-Flash Technical Report Deliver fast, strong reasoning and agentic performance by combining a large MoE backbone with hybrid attention, multi-token prediction, and multi-teacher on-policy distillation to push decoding speed and parameter efficiency → read the paper K-EXAONE Technical Report Provide a multilingual MoE foundation model with long-context support that targets balanced reasoning, agentic, and industrial capabilities across multiple major languages → read the paper LTX-2: Efficient Joint Audio-Visual Foundation Model Generate temporally synchronized video and audio in a single unified model by coupling asymmetric modality-specific transformers through cross-attention for efficient, controllable audiovisual synthesis → read the paper
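To make the RLM mechanism above concrete, here is a minimal sketch of the REPL-offloading loop under stated assumptions: the long input never enters the root model's window directly, it lives in a Python namespace as the variable context, the root model emits code against it, and llm_query() spawns recursive sub-calls. The call_model function is a hypothetical stand-in for whatever chat-completion client you use; the paper's actual interface may differ.

```python
# Minimal sketch of the Recursive Language Model (RLM) loop described above.
# `call_model` is a placeholder for a real LLM client, not the authors' code.

def call_model(prompt: str) -> str:
    """Stand-in for a real chat-completion call (plug in your own client)."""
    raise NotImplementedError("connect an LLM client here")

def llm_query(prompt: str) -> str:
    """Recursive sub-call: a fresh model invocation on a (usually small) prompt."""
    return call_model(prompt)

def rlm_answer(question: str, long_context: str, max_steps: int = 8) -> str:
    # The long prompt is offloaded into the REPL namespace as `context`;
    # the root model only ever sees the question and its own intermediate results.
    namespace = {"context": long_context, "llm_query": llm_query}
    results = []
    for _ in range(max_steps):
        code = call_model(
            "You can run Python against a variable `context` holding the full input, "
            "and call llm_query(prompt) on pieces of it.\n"
            f"Question: {question}\n"
            f"Previous results: {results}\n"
            "Reply with Python that assigns your next result to `result`, "
            "or assigns the final answer to `final_answer`."
        )
        exec(code, namespace)  # the model reads/transforms `context` symbolically
        if "final_answer" in namespace:
            return namespace["final_answer"]
        results.append(namespace.get("result"))
    return str(results[-1])
```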
| | Research this week | (★ indicates papers we recommend paying attention to) | World models, environments, and embodied learning | Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models Unify how AI augments digital twins across modeling, mirroring, intervention, and autonomous management stages → read the paper ★ WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks (Microsoft) Provide a large-scale, non-stationary web environment with rubric-based rewards to train and evaluate visual web agents → read the paper Scaling Behavior Cloning Improves Causal Reasoning Show that scaling data and depth in behavior cloning improves causal policies in real-time video game agents → read the paper Evolving Programmatic Skill Networks Grow a compositional network of executable skills that reflect, refactor, and stabilize over time in open-ended environments → read the paper
| Agents, tools, and orchestration | Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning Route across models and tools using training-free priors and reinforcement learning to exploit heterogeneity in complex reasoning tasks → read the paper MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning Interleave multimodal chain-of-thought reasoning with autonomous tool invocation to solve open-ended, real-world problems → read the paper RelayLLM: Efficient Reasoning via Collaborative Decoding Coordinate small and large models at the token level so lightweight models request help only when needed to cut inference cost (a toy relay-decoding loop is sketched after this list) → read the paper ★ Over-Searching in Search-Augmented Large Language Models (Apple) Diagnose when retrieval harms efficiency and truthfulness and propose metrics and mitigations for search overuse → read the paper Can We Predict Before Executing Machine Learning Agents? Replace costly execution with predictive reasoning by internalizing execution priors and using a predict-then-verify loop → read the paper GenCtrl: A Formal Controllability Toolkit for Generative Models Formalize controllability as a control problem and estimate controllable sets to expose the limits of human influence over generation → read the paper
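The RelayLLM entry above describes token-level cooperation between a small and a large model. As a rough illustration of that idea (not the paper's actual handoff policy), here is a toy relay loop in which a cheap model proposes each token and a larger one is consulted only when the proposal's confidence falls below a threshold; small_step and large_step are hypothetical callables returning a (token, confidence) pair.

```python
# Toy token-level relay between a small and a large model (illustrative only).
from typing import Callable, Tuple

def relay_decode(
    prompt: list,
    small_step: Callable[[list], Tuple[str, float]],
    large_step: Callable[[list], Tuple[str, float]],
    conf_threshold: float = 0.7,
    max_new_tokens: int = 128,
    eos: str = "<eos>",
) -> list:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token, conf = small_step(tokens)   # cheap model proposes the next token
        if conf < conf_threshold:          # low confidence -> ask the big model
            token, _ = large_step(tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens
```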
| Agent memory, long-horizon reasoning, and experience compression | SimpleMem: Efficient Lifelong Memory for LLM Agents Compress interaction histories into high-density semantic memory units, consolidate them asynchronously into abstractions, and retrieve them adaptively to reduce token cost while preserving long-term performance (a toy compress-and-retrieve store is sketched after this list) → read the paper MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents Represent memories across semantic, temporal, causal, and entity graphs and retrieve them via policy-guided traversal to enable interpretable, query-aligned long-horizon reasoning → read the paper Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning Organize experiences into an event graph with explicit logical relations to support structured navigation over memory instead of shallow similarity search → read the paper Distilling Feedback into Memory-as-a-Tool Amortize inference-time critique by storing feedback as retrievable guidelines that agents can reuse as a tool to reduce reasoning cost → read the paper
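As a loose illustration of the compress, consolidate, and retrieve pattern the memory papers above share (not any single paper's method), here is a toy store: each interaction is compressed into a short memory unit, units are periodically folded into an abstraction, and retrieval uses naive keyword overlap. The summarize function is a placeholder where a real system would call an LLM.

```python
# Toy compress -> consolidate -> retrieve memory store (illustrative only).
from dataclasses import dataclass, field

def summarize(text: str, max_words: int = 25) -> str:
    """Placeholder compressor; a real system would call an LLM here."""
    return " ".join(text.split()[:max_words])

@dataclass
class MemoryStore:
    units: list = field(default_factory=list)

    def write(self, interaction: str) -> None:
        self.units.append(summarize(interaction))   # compress into a memory unit

    def consolidate(self) -> None:
        if len(self.units) > 50:                    # periodically abstract old units
            self.units = [summarize(" ".join(self.units), max_words=200)]

    def retrieve(self, query: str, k: int = 3) -> list:
        q = set(query.lower().split())
        scored = sorted(self.units, key=lambda u: -len(q & set(u.lower().split())))
        return scored[:k]
```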
| Agent evaluation, verification, and confidence | Agent-as-a-Judge Evolve evaluation from single-pass model judging to agentic judges with planning, tools, collaboration, and memory to enable verifiable multi-step assessment → read the paper Agentic Rubrics as Contextual Verifiers for SWE Agents Generate repository-specific rubric checklists via agent interaction to verify code patches without executing tests while remaining grounded and interpretable (a toy rubric check is sketched after this list) → read the paper Confidence Estimation for LLMs in Multi-turn Interactions Measure and improve confidence calibration across turns by formalizing monotonicity and per-turn reliability as context accumulates → read the paper Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency Evaluate belief robustness by probing consistency across contextual neighborhoods rather than relying on point-wise self-consistency → read the paper
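To show the general shape of rubric-based patch verification mentioned above, here is a hedged sketch: each checklist item is posed to a judge model as a yes/no question about the patch, with no tests executed. The judge function is a placeholder for an LLM judge, and the example rubric is purely illustrative, not one generated by the paper's agent.

```python
# Toy rubric-style patch verification without test execution (illustrative only).
def judge(question: str) -> bool:
    """Stand-in for an LLM judge returning a yes/no verdict."""
    raise NotImplementedError("connect an LLM judge here")

def verify_patch(patch: str, rubric: list) -> dict:
    results = {
        item: judge(f"Patch:\n{patch}\n\nDoes the patch satisfy: {item}? Answer yes or no.")
        for item in rubric
    }
    return {"passed": all(results.values()), "per_item": results}

# Hypothetical rubric an agent might generate for a bug-fix patch:
example_rubric = [
    "The patch modifies only the module named in the issue",
    "A regression test or assertion covers the reported failure case",
    "No public API signatures change",
]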
| Reasoning dynamics, structure, and control | DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs Reformulate chain-of-thought generation as an iterative denoising process to enable retrospective correction of reasoning steps → read the paper The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Analyze long reasoning traces as structured interaction patterns and guide the synthesis of stable reasoning trajectories → read the paper Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy Decompose large counting tasks into reliable subproblems and trace how intermediate counts are represented and aggregated inside the model → read the paper Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners Probe how latent reasoning forms across languages and show that internal reasoning dynamics largely follow an English-centered pathway → read the paper Parallel Latent Reasoning for Sequential Recommendation Scale reasoning width by exploring multiple latent reasoning trajectories in parallel to improve generalization under real-time constraints → read the paper
| Training efficiency, data efficiency, and optimization | SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving Push lightweight supervised fine-tuning to state-of-the-art SWE performance through curated datasets, curriculum design, and verifier-based test-time scaling → read the paper One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling Demonstrate that a single, carefully engineered training sample can unlock broad reasoning gains across domains via reinforcement learning → read the paper Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting Suppress destructive gradients on confident-but-conflicting tokens by gating updates with entropy to reduce catastrophic forgetting during fine-tuning (a toy entropy-gated loss is sketched after this list) → read the paper Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers Replace fixed norm equilibria with learnable scaling factors to adapt weight magnitudes to data and improve downstream performance → read the paper ★ GDPO: Group reward-Decoupled Normalization Policy Optimization (Nvidia) Decouple reward normalization in multi-reward reinforcement learning to preserve signal resolution and improve training stability → read the paper
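The entropy-gating idea above lends itself to a small illustration: a per-token loss where tokens the model already predicts with low entropy, but whose prediction conflicts with the target, get their gradient contribution down-weighted. This is a toy PyTorch sketch with assumed thresholds, not the paper's exact rule.

```python
# Toy entropy-gated token loss, loosely inspired by the idea above (illustrative only).
import torch
import torch.nn.functional as F

def entropy_gated_loss(logits: torch.Tensor, targets: torch.Tensor,
                       min_weight: float = 0.1) -> torch.Tensor:
    # logits: [batch, seq, vocab]; targets: [batch, seq] of token ids
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                      # per-token entropy
    norm_entropy = entropy / torch.log(torch.tensor(float(logits.size(-1))))
    token_loss = F.nll_loss(
        log_probs.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets)
    pred = probs.argmax(-1)
    conflicting = (pred != targets) & (norm_entropy < 0.2)       # confident but wrong
    weight = torch.where(conflicting,
                         torch.full_like(norm_entropy, min_weight),
                         torch.ones_like(norm_entropy))
    return (weight * token_loss).mean()
```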
| | That's all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. | How did you like it? | | |
|