If Turing Post is part of your weekly routine, please share it with one smart friend. It's the simplest way to keep the Monday digests free.
This Week in Turing Post:
Wednesday / AI 101 series: Beyond RL: The New Fine-Tuning Stack for LLMs
Friday / Interview: Michael Bolin, Codex and Open-Source
From our partners: AI Agents Introduce Risk. Modern Infrastructure Reduces It.

AI agents operate autonomously across infrastructure, but legacy identity systems were built for humans. Static credentials and excessive privilege create unnecessary risk. Teleport provides identity-based access, short-lived credentials, and policy enforcement designed to securely deploy and run AI agents in production.

Topic 1: Betting on World Models

AGI, ASI, SCAI, AMI, HAI, etc. If you're confused by all the AI abbreviations, here's one more. Not from us, actually, but from Yann LeCun, who, in his effort to distance himself fully from Meta, has now introduced a new term: SAI, for Superhuman Adaptable Intelligence. For clarity, let's quickly note the timeline of the terms he has used before.

In 2022 Yann LeCun published A Path Towards Autonomous Machine Intelligence (AMI). In a 2024 Columbia talk he said he hated the term AGI and preferred "Advanced Machine Intelligence" (another AMI), adding that Meta had adopted it. Meta's own research messaging kept using AMI through 2025, including V-JEPA 2 as a step toward that goal, and Reuters later described LeCun's 2025 startup in the same language. Now, in early 2026, a new paper co-authored by Yann LeCun introduces SAI, Superhuman Adaptable Intelligence.

What is SAI?

Official definition: Superhuman Adaptable Intelligence (SAI) is capable of adapting to exceed humans at any task humans can do, while also being able to adapt to tasks outside the human domain that have utility.

To be fair, this looks like a rhetorical pivot rather than a technical one. The core program is familiar, but the framing has moved from autonomy, to advancement, to adaptability. My read is that, through this new term, LeCun and coauthors are trying to show that the field is drifting away from one grand, mystical notion of generality and toward layered systems built around specialization, adaptation, and composition.
RL is still important, SSL is still foundational, and world models are still a serious bet, but none of them looks sufficient on its own. The direction now seems more practical: learn broad structure through self-supervision, sharpen behavior through reinforcement, use world models for planning, bring in memory for long-horizon adaptation, causal learning for interventions rather than mere correlations, and symbolic methods wherever correctness and exactness still matter (this is not in the paper; that's how I see it). The result is not one universal recipe but a stack of methods that specialize, transfer, and recombine. Which is probably for the best, because, as the paper notes, "the AI that folds our proteins should not be the AI that folds our laundry."

And then there's topic number two: Autoresearch by Andrej Karpathy (who totally bets on Transformers).

Karpathy's autoresearch: the lab that works while you sleep
Our news digest is always free. Click on the partner's link above to support us, or upgrade to receive our deep dives in full, directly in your inbox. Join Premium members from top companies like Nvidia, Hugging Face, Microsoft, Google, and a16z, plus AI labs and institutions such as Ai2, MIT, Berkeley, and .gov, and thousands of others to really understand what's going on with AI.
We are reading/watching:

Arvind Narayanan (@aisnakeoil):
I find Anthropic's behavior perplexing. Anyone who does serious research with these models knows that they don't have stable desires or preferences. Tweak the question slightly and get a different answer. Note that this is a simple empirical observation about model behavior, completely separate from the question of whether models are moral agents with preferences worth respecting. Surely people at Anthropic know this. Why do they persist with this wacky stuff?
substack.com/@aisnakeoil/note/c-220523879
News from the usual suspects

Jetson gets a butler
NVIDIA's Jetson AI Lab shows how to turn an AGX Thor or AGX Orin into a fully local OpenClaw assistant that chats through WhatsApp and runs without cloud APIs. The recipe is simple but ambitious: serve a tool-calling model with vLLM, install OpenClaw, link WhatsApp, and let the agent handle files, apps, and web tasks on-device. Handy and private.

OpenAI: The Workhorse Gets an Upgrade
OpenAI has unveiled GPT-5.4, positioning it as its most capable model for professional work. It merges strong reasoning, coding (thanks to GPT-5.3-Codex), and agent-style computer use into one system that can operate software, handle documents, and execute complex workflows. With up to 1M tokens of context and improved tool use, the model aims to spend less time chatting and more time actually getting the job done.

Apple: A Design Reset in Cupertino
Apple has quietly reshuffled its leadership, elevating designers Molly Anderson (industrial design) and Steve Lemay (human interface) to the executive ranks. After years of criticism, from Vision Pro doubts to software missteps, the move signals a renewed emphasis on design. With John Ternus widely seen as Tim Cook's eventual successor, a trio of hardware, software, and design leadership may redefine Apple's identity. The $599 MacBook Neo already feels like the opening act.

Microsoft: Claude Enters the Cubicle
Microsoft is folding Anthropic's Claude Cowork into Microsoft 365 Copilot under the name Copilot Cowork, giving enterprise users an agent that can build decks, wrangle spreadsheets, and send the meeting email nobody wanted to write. It's a shrewd move: Cowork rattled the SaaS establishment when it launched, and Microsoft has decided that if a wave is big enough, it's better to surf it than watch from shore.

Alibaba: A Sudden Exit in the Qwen Lab
Alibaba's Qwen AI project just lost a key architect. Junyang Lin, one of the most visible technical leaders behind the model family, stepped down days after the company launched its new Qwen 3.5 small multimodal models. The timing raised eyebrows across the AI community. With China racing to rival OpenAI, Google, and Anthropic, losing a central figure mid-momentum is... less than ideal.
💡 Benchmark Highlight

Image Credit: The original paper
Researchers from Princeton University introduce Interactive Benchmarks, evaluating LLM reasoning via budgeted multi-turn interaction instead of static datasets. The framework models agent-environment exchanges with query costs or discounted rewards. Two domains: Interactive Proofs, using a 46-instance Situation Puzzle set, and Interactive Games, including Texas Hold'em and the Trust Game. On a 52-problem HLE math subset, interactive accuracy reaches 76.9%, versus pass@k drops of 20-50%. Gemini-3-flash leads with 30.4% accuracy; GPT-5-mini follows next →read the paper

📦 Models Highlight

Phi-4-reasoning-vision-15B technical report
Researchers from Microsoft Research developed Phi-4-reasoning-vision-15B, a compact open-weight multimodal model optimized for vision-language tasks, scientific reasoning, mathematics, and user-interface understanding. Careful architecture design and rigorous data curation allow competitive performance with substantially less training and inference compute. Systematic filtering, error correction, and synthetic augmentation improve data quality. High-resolution dynamic-resolution vision encoders boost perception, while mixed reasoning and non-reasoning datasets with explicit mode tokens enable fast answers or chain-of-thought reasoning →read the paper

Olmo Hybrid: Combining transformers and linear RNNs for superior scaling
Researchers from Ai2 introduced Olmo Hybrid, a 7B hybrid LLM combining transformer attention with Gated DeltaNet linear RNN layers in a 3:1 pattern (75% DeltaNet). Pretrained on 6T tokens using 512 GPUs (H100 → B200), it achieves ~2× token efficiency: matching Olmo-3 accuracy on MMLU with 49% fewer tokens and parity on Common Crawl with 35% fewer. With DRoPE at 64k context, it scores 85.0 on RULER vs 70.9 for Olmo-3 →read the paper

Dynamic chunking diffusion transformer
Researchers from AMD developed DC-DiT, a Diffusion Transformer that adaptively compresses image tokens using a learned encoder-router-decoder chunking system. Uniform regions receive fewer tokens while detail-rich areas receive more, with compression varying across diffusion timesteps. On class-conditional ImageNet 256×256, the model improves FID and Inception Score versus parameter- and FLOP-matched DiT baselines at 4× and 16× compression, enabling checkpoint upcycling and fewer training steps with lower compute cost →read the paper

Helios: Real real-time long video generation model
Researchers from Peking University and ByteDance introduced Helios, a 14B autoregressive diffusion video generator producing 19.5 FPS on a single NVIDIA H100 while supporting minute-scale videos. A unified representation enables T2V, I2V, and V2V tasks. Training simulates drifting failures to prevent long-video degradation and repetitive motion. Heavy compression of historical and noisy context plus fewer sampling steps yields compute comparable to 1.3B models while outperforming prior short- and long-video systems →read the paper
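To make the Olmo Hybrid layering concrete, here is a minimal sketch of a 3:1 DeltaNet-to-attention layer schedule. The function name and the exact placement of the attention layers (one after every three linear-RNN layers) are my assumptions for illustration, not Ai2's implementation:

```python
# Illustrative sketch of a 3:1 linear-RNN : attention layer schedule, as
# described for Olmo Hybrid. Placement and naming are assumptions, not
# Ai2's actual code.
def hybrid_layer_pattern(n_layers: int, rnn_per_attn: int = 3) -> list[str]:
    """One full-attention layer after every `rnn_per_attn` DeltaNet layers."""
    block = rnn_per_attn + 1
    return ["attention" if (i + 1) % block == 0 else "deltanet"
            for i in range(n_layers)]

pattern = hybrid_layer_pattern(32)
# 24 DeltaNet layers and 8 attention layers: 75% DeltaNet, i.e. the 3:1 ratio
```

The appeal of this layout is that most layers get the O(1)-per-token cost of a linear RNN while the periodic attention layers retain exact long-range recall.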
Research this week
(as always, ⭐ indicates papers that we recommend to pay attention to)

This week is about building the scaffolding that makes models usable, reliable, and durable in the wild:
Verification is becoming central
Agents are getting longer-horizon
Reinforcement learning is spreading everywhere
Synthetic data is becoming core infrastructure
Memory is returning as a major research frontier
World models and embodied prediction keep expanding
Multimodality is becoming more unified
Efficiency remains a primary battleground
Benchmarks are getting more realistic
Safety is shifting toward agentic settings
Agents, memory, tool use, and grounded action

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification
Builds agent-training data around explicit constraints that also double as verifiers, which is interesting because interactive agents usually fail where ambiguity meets deterministic action. →read the paper

⭐ Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory
Replaces lossy summarization with indexed external memory that can be dereferenced later, which is interesting because long-horizon agency probably needs retrieval over experience, not just compressed chat history. →read the paper

⭐ KARL: Knowledge Agents via Reinforcement Learning
Combines synthetic data generation, multi-task search training, and off-policy RL for enterprise search agents, which is interesting because it treats grounded knowledge work as an agent training problem rather than a pure retrieval problem. →read the paper

⭐ Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Makes refusal and safety checks explicit inside the action loop, which is interesting because agent safety breaks when it is bolted on after planning instead of embedded into planning. →read the paper
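The indexed-memory idea behind Memex(RL) can be sketched in a few lines: store episodes whole and keep a separate index that can be dereferenced back to the full record, instead of compressing history into a summary. Everything here (class name, tag-based index) is illustrative, not the paper's API:

```python
# Hypothetical sketch of indexed experience memory: full episodes are kept
# lossless, and a lightweight index points back to them. Names are
# illustrative, not from the Memex(RL) paper.
from dataclasses import dataclass, field

@dataclass
class ExperienceMemory:
    episodes: list = field(default_factory=list)  # full, uncompressed records
    index: dict = field(default_factory=dict)     # tag -> list of episode ids

    def store(self, text: str, tags: list[str]) -> int:
        eid = len(self.episodes)
        self.episodes.append(text)
        for tag in tags:
            self.index.setdefault(tag, []).append(eid)
        return eid

    def recall(self, tag: str) -> list[str]:
        # Dereference the index to the full episodes; nothing is summarized away
        return [self.episodes[i] for i in self.index.get(tag, [])]

mem = ExperienceMemory()
mem.store("fixed flaky CI by pinning numpy", ["ci", "numpy"])
mem.store("numpy 2.0 broke dtype promotion", ["numpy"])
# mem.recall("numpy") returns both full episodes, not a compressed digest
```

A real system would index with embeddings rather than tags, but the contrast with summarization is the same: the agent can always get the original experience back.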
Code agents and software reasoning

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
Expands code-agent evaluation beyond neat repo-local bug fixes, which is interesting because real software work is messy, cross-repository, dependency-heavy, and rarely benchmark-friendly. →read the paper

⭐ Agentic Code Reasoning
Introduces semi-formal reasoning as a certificate for code understanding without execution, which is interesting because it points toward static semantic verification that could plug directly into agent training and review loops. →read the paper
Truthfulness, interpretability, and behavioral control

⭐ Reasoning Models Struggle to Control their Chains of Thought
Tests whether reasoning models can deliberately shape what appears in their chain of thought, which is interesting because it bears directly on whether CoT monitoring is about to become useless or remains informative. →read the paper

⭐ Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Uses politically censored models as a naturally occurring dishonesty benchmark, which is interesting because most lie-detection setups are artificial and therefore too clean. →read the paper

⭐ Spilled Energy in Large Language Models
Recasts decoding through an energy-based lens and uses energy inconsistencies to detect hallucinations, which is interesting because it offers a training-free route to error detection straight from logits. →read the paper

Spectral Attention Steering for Prompt Highlighting
Steers attention by editing key embeddings before attention is computed, which is interesting because it makes prompt highlighting compatible with efficient attention implementations instead of fighting them. →read the paper
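For intuition on the energy view of decoding: a standard energy score over next-token logits is E = -logsumexp(logits), where flatter, less confident distributions get higher energy. The sketch below uses that textbook score as an illustration; it is not necessarily the exact formulation in the Spilled Energy paper:

```python
# Illustrative energy score over next-token logits, E = -logsumexp(logits).
# This is the standard energy-based score, used here only to show the idea
# of reading an error signal straight off the logits, training-free.
import math

def energy(logits: list[float]) -> float:
    m = max(logits)  # subtract the max to keep the logsumexp numerically stable
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))

confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates
flat = [1.0, 1.0, 1.0, 1.0]        # no token stands out
# Flatter logits yield higher energy, a possible flag for unreliable steps
assert energy(flat) > energy(confident)
```

Because the score needs only the logits the model already produces, it adds essentially no inference cost, which is what makes training-free detectors of this kind attractive.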
World models, embodied dynamics, and scientific simulation

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Compresses world-model observations down to a tiny token budget, which matters because planning is often bottlenecked by representation overhead rather than the planner itself. →read the paper

Chain of World: World Model Thinking in Latent Motion
Connects world modeling with visuomotor learning by separating motion from scene structure, which is interesting for robotics because it pushes prediction toward what actually changes. →read the paper

⭐ Operator Learning Using Weak Supervision from Walk-on-Spheres
Reframes neural PDE learning around cheap stochastic supervision instead of expensive datasets or fragile PINN objectives, which makes it a practical bridge between scientific computing and learned operators. →read the paper

SciDER: Scientific Data-centric End-to-end Researcher
Extends research agents from literature-and-code loops into raw scientific data handling, which is interesting because it shifts autonomous science closer to actual experimental workflows. →read the paper
Reasoning improvement through data, RL, and verification

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Shows that a relatively compact synthetic dataset can still move reasoning performance meaningfully when it is broad, structured, and automatically validated. →read the paper

Learn Hard Problems During RL with Reference Guided Fine-tuning
Uses partial reference solutions to help models enter the reward-yielding region on hard problems, which is interesting because it tackles reward sparsity without forcing the model to imitate alien human trajectories. →read the paper

Tool Verification for Test-Time Reinforcement Learning
Replaces shaky majority-vote pseudo-labels with tool-grounded verification, which is interesting because online self-improvement usually falls apart when the reward signal starts hallucinating confidence. →read the paper

⭐ V1: Unifying Generation and Self-Verification for Parallel Reasoners
Treats verification as pairwise comparison instead of isolated scoring, which is interesting because models often judge relative correctness better than absolute correctness. →read the paper

Heterogeneous Agent Collaborative Reinforcement Learning
Lets different agents share useful rollouts while still acting independently at inference time, which is interesting because it turns heterogeneity into training signal instead of treating it as noise. →read the paper

Surgical Post-Training: Cutting Errors, Keeping Knowledge
Focuses post-training on minimally corrected trajectories, which is interesting because it aims to improve reasoning without paying the usual catastrophic-forgetting tax. →read the paper
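The pairwise-verification idea in V1 can be illustrated with a toy round-robin selector: instead of scoring each candidate in isolation, compare candidates head-to-head and keep the one that wins most comparisons. The judge below is a stand-in comparator, not the paper's verifier:

```python
# Toy round-robin selection by pairwise comparison, illustrating the
# contrast with isolated absolute scoring. `judge(a, b)` returns True
# when `a` beats `b`; in V1 that role would be played by the model itself.
from itertools import combinations

def pick_best(candidates, judge):
    """Return the candidate winning the most pairwise comparisons."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[a if judge(a, b) else b] += 1
    return max(candidates, key=wins.get)

# Stand-in judge for demonstration: prefer the longer answer
best = pick_best(["7", "42", "123"], lambda a, b: len(a) > len(b))
# best == "123": it wins both of its pairwise matchups
```

Relative judgments like this sidestep the calibration problem of absolute scores, which is exactly the observation the paper leans on.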
Production tuning and real-world model behavior

⭐ CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Documents a full production flywheel for improving social-chat models with live traffic, which is interesting because it exposes how model behavior is actually tuned when the benchmark is user interaction rather than lab elegance. →read the paper
Training efficiency, optimization, and model compression

Progressive Residual Warmup for Language Model Pretraining
Stabilizes pretraining by letting earlier layers settle before deeper ones fully engage, which is interesting because it treats depth as an optimization schedule rather than a fixed stack. →read the paper

SAGEBWD: A Trainable Low-Bit Attention
Pushes low-bit attention from fine-tuning territory into pretraining territory, which matters because efficient attention only really changes the game when it survives full training. →read the paper

On-Policy Self-Distillation for Reasoning Compression
Shrinks reasoning traces without relying on external labels or fixed token budgets, which is interesting because it treats verbose reasoning as something to optimize away rather than simply tolerate. →read the paper
That's all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

How did you like it?