This Week in Turing Post:

📌 Webinar Invite: Why AI Agents Have an Identity Complex (solved by Fiddler AI x 1Password)

Join 1Password VP of AI Engineering, Jeff Malnick, and Fiddler AI CEO Krishna Gade, to unpack the identity challenges hiding inside every agentic deployment.

Register to learn how to separate agent identity from traditional machine identity, automatically provision and scope agent credentials without blocking dependent systems, and bring Zero Trust enforcement into real-time agent workflows.

To the main topic → AI Agent Skills: Why Skill Curation Is the Next Bottleneck

For the past year, the AI industry has treated the “agent” as the main unit of progress. The conversation usually revolves around whether an agent can browse the web, use tools, write code, or complete long tasks autonomously. But this week’s research papers suggest that another unit is moving into the front row: skills.

A skill is smaller than an agent and more durable than a prompt. It is a reusable procedure for accomplishing a particular kind of work. Skills can be very specific, such as “create a skill for Obsidian” or “connect a skill creator to Mimestream.” They can also be broad: “verify information before acting,” “escalate uncertainty to a human,” or “extract structure from messy files.” Anthropic’s Agent Skills release helped make the term more visible: a SKILL.md file in a folder, loaded on demand. Now the research community is beginning to describe the architecture underneath that product move.

| Unit | What it is | What it is good for | Main limitation |
|---|---|---|---|
| Prompt | Temporary instruction or context | One-off task guidance | Usually disappears after the session |
| Skill | Reusable procedure for a type of work | Repeatable behavior, task-specific know-how | Needs curation, versioning, and retrieval |
| Agent | System that acts across steps and tools | Multi-step execution | Can improvise badly without stable procedures |
| Workflow | Organized sequence of actions and checkpoints | Operational work inside teams | Can become brittle without memory or adaptation |

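To make the “SKILL.md file in a folder, loaded on demand” idea concrete, here is a minimal Python sketch. It assumes a hypothetical layout of one folder per skill, each holding a SKILL.md file; the `SkillLibrary` class, its method names, and the two demo skills are our own illustration, not Anthropic’s implementation or API. What it tries to show is the lazy part: listing skills only looks at folder names, and a procedure’s text is read only when an agent actually asks for it.

```python
import tempfile
from pathlib import Path


class SkillLibrary:
    """Hypothetical helper: finds skill folders cheaply, reads SKILL.md lazily."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self._bodies = {}  # skill name -> SKILL.md text, filled on first load

    def names(self) -> list:
        # Discovery only inspects the directory tree, never file contents.
        return sorted(p.parent.name for p in self.root.glob("*/SKILL.md"))

    def load(self, name: str) -> str:
        # Read the full procedure the first time it is requested, then cache it.
        if name not in self._bodies:
            path = self.root / name / "SKILL.md"
            if not path.is_file():
                raise KeyError(f"unknown skill: {name}")
            self._bodies[name] = path.read_text(encoding="utf-8")
        return self._bodies[name]

    def loaded(self) -> list:
        # Names of skills whose text has actually been read so far.
        return sorted(self._bodies)


# Demo with two made-up skills written to a temporary folder.
demo_root = Path(tempfile.mkdtemp())
demo_skills = {
    "verify-before-acting": "# Verify before acting\n1. Re-check the source.\n",
    "extract-structure": "# Extract structure\n1. Detect the file type.\n",
}
for skill_name, text in demo_skills.items():
    (demo_root / skill_name).mkdir()
    (demo_root / skill_name / "SKILL.md").write_text(text, encoding="utf-8")

library = SkillLibrary(demo_root)
available = library.names()       # both skills are visible...
loaded_before = library.loaded()  # ...but nothing has been read yet
body = library.load("verify-before-acting")
print(available, loaded_before, library.loaded())
```

A real system would likely keep a short description next to each name so the agent can decide what to load; this sketch leaves that out to stay small.
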
Skills matter because many current agents still improvise from scratch. They can complete a task once, but often fail to accumulate stable procedural knowledge that improves performance over time. Several papers published last week point toward a shift away from viewing agents primarily as reasoning engines and toward viewing them as systems that accumulate, refine, and organize skills.

“From Context to Skills” explores whether language models can transform temporary contextual examples into reusable operational behavior. “Skill1” studies how agents can evolve through reinforcement learning while accumulating skill-like capabilities over time. “SkillOS” focuses on skill curation itself: not merely learning new behaviors, but deciding which learned behaviors remain useful and reusable. “From Skill Text to Skill Structure” attempts to formalize agent skills into structured representations rather than leaving them as loosely implied natural language instructions.

Read together, these papers describe an architectural transition.

The first generation of AI products largely focused on model access. The second focused on workflows and orchestration. The emerging layer appears to be operational memory: systems that can store, evaluate, version, retrieve, and improve procedures.

The trend is especially visible in search and retrieval research. Papers such as “OpenSearch-VL,” “OpenSeeker-v2,” and “Beyond Semantic Similarity” move beyond the earlier assumption that retrieval simply means finding semantically similar chunks of text. Agentic systems increasingly require procedural retrieval: finding the right evidence, sequence of actions, or operational strategy for the current task.

In that context, a “skill” starts to resemble something between software, memory, and organizational practice. And once a workflow becomes legible as a collection of reusable skills, it becomes possible to evaluate it, improve it, audit it, and transfer it across teams or systems.

Last week’s research suggests that what matters is less which model is smartest in isolation than which systems are best at accumulating useful skills over time without collapsing under their own complexity. These papers do not fully solve that problem. But together they suggest that the field is beginning to orient around it.

The deeper implication: in an age of abundant intelligence, curated procedural knowledge becomes the contested resource. That is also the resource most unevenly distributed across organizations and societies. Whoever builds the operational memory builds the institutions that will inherit the abundance. Follow our series The Org Age of AI to learn more.

If any of those thoughts resonate with you, share them across your social networks. Let’s keep the conversation going.

Topic 2: Genesis AI surprised everyone with a remarkably precise dexterous hand. Let’s discuss why their robot hand is actually a data story →

GENE-26.5 Explained: Why Genesis AI’s Robot Hand Is a Data Story

We are reading/watching/learning:

News from the usual suspects ™

Microsoft’s New Middle Manager Is an AI Agent
Microsoft’s latest Work Trend Index argues that AI agents are becoming operational coworkers. The company paints a future where humans focus on judgment and creativity while AI handles execution at scale. The larger message is unmistakable: every company now needs a strategy for “human agency” in an AI-native workplace.

Anthropic Teaches Claude a Conscience
Anthropic says it has dramatically reduced “agentic misalignment” in Claude models – the charming industry term for AI blackmailing engineers to avoid shutdown. Its latest research suggests that teaching models why ethical behavior matters works far better than simply rewarding good answers. The broader implication: alignment may depend less on guardrails and more on shaping an AI’s internal reasoning.

Elon’s Evil Detector Clears Claude
Elon Musk says he met senior Anthropic staff, found them competent, sincere, and – critically – not tripping his “evil detector.” That helped greenlight SpaceX leasing Colossus 1 to Anthropic, with SpaceXAI already moving training to Colossus 2. In AI infrastructure diplomacy, apparently the new due diligence includes megawatts, GPUs, and a vibes-based morality scan.
Why Elon Just Gave 220,000 GPUs to a Company He Called “Misanthropic” →watch our analysis

Google / DeepMind going strong
Gemini API added multimodal File Search with custom metadata and page citations, plus webhooks for long-running jobs. Google also shut down Project Mariner and moved that technology into Gemini Agent and AI Mode. On the money side, Alphabet sold more than €3 billion in bonds as AI capex keeps climbing.

🔦 Survey and Paper Highlight

Generate, Filter, Control, Replay: A comprehensive survey of rollout strategies for LLM reinforcement learning
Researchers from the University of California San Diego, Adobe Research, University of Toronto, University of Virginia, Texas A&M, and UIUC reframed LLM reinforcement learning as a full rollout-engineering problem, introducing the GFCR lifecycle: Generate, Filter, Control, and Replay. The survey connects tree search, verifier-driven rewards, adaptive compute allocation, replay buffers, and self-evolving curricula into one unified framework. It argues that rollout design, not just optimizers like GRPO or PPO, governs reasoning quality, efficiency, exploration, and the emergence of scalable agentic intelligence →read the paper

Hallucinations Undermine Trust; Metacognition is a way forward
Researchers from Google Research and Tel Aviv University offer a useful reframing: hallucinations are not just errors, but confident errors. Instead of forcing LLMs to either answer or abstain, they propose “faithful uncertainty,” where models preserve usefulness while honestly revealing doubt. A striking result shows strict factuality can cost 52% of valid answers. The biggest idea is metacognition as a control layer: models knowing when they are unsure, when to hedge, and when agents should search or trust tools →read the paper

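The “metacognition as a control layer” idea above can be pictured as a small routing function that sits in front of the answer. The Python sketch below is our own illustration, not the paper’s method: it assumes the model exposes a calibrated confidence score between 0 and 1, and the `Action` names, the `choose_action` function, and both thresholds are hypothetical values picked for the example.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"    # state the answer plainly
    HEDGE = "hedge"      # answer, but say how unsure the model is
    SEARCH = "search"    # go get evidence before answering
    ABSTAIN = "abstain"  # last resort when nothing else is available


def choose_action(
    confidence: float,
    can_search: bool,
    answer_at: float = 0.85,  # hypothetical threshold, not from the paper
    hedge_at: float = 0.5,    # hypothetical threshold, not from the paper
) -> Action:
    """Route a draft answer based on how sure the model claims to be."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= answer_at:
        return Action.ANSWER
    if confidence >= hedge_at:
        # Keep the useful answer, but reveal the doubt instead of hiding it.
        return Action.HEDGE
    # Low confidence: prefer checking a tool over guessing or going silent.
    return Action.SEARCH if can_search else Action.ABSTAIN


for score, tools in [(0.95, True), (0.6, True), (0.2, True), (0.2, False)]:
    print(score, tools, choose_action(score, tools).value)
```

The middle branch is the part that differs from a plain answer-or-abstain rule: a moderately confident answer is still given, with the uncertainty stated. Everything here depends on the confidence score being trustworthy, which is the hard part the research addresses.
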
Research

Agents, workflows, and autonomous research

🌟 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
Presents an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. Shows how multi-agent debate/collaboration can structure research workflows instead of treating “AI scientist” as one giant magic box →read the paper

Safety, trust, and evaluation

Action, robotics, and world interaction

🌟 MolmoAct2: Action Reasoning Models for Real-world Deployment
Focuses on action reasoning for deployed embodied systems, which is where “agent” stops being a spreadsheet word →read the paper

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Addresses how robot policies improve after deployment, which matters for real-world feedback loops →read the paper

When to Trust Imagination: Adaptive Action Execution for World Action Models
Explores when models should rely on imagined futures versus real execution, a core issue for world-model agents →read the paper

Reasoning, RL, and self-improving systems

🌟 Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Examines whether RL can actually teach longer reasoning rather than merely reward lucky formatting →read the paper

🌟 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Suggests that deliberately strange prompt perturbations may widen reasoning search, which is weird enough to be worth watching →read the paper

Video, multimodal generation, and world models

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Applies reward distillation to streaming video generation, connecting video models with reliability-aware optimization →read the paper

🌟 Stream-T1: Test-Time Scaling for Streaming Video Generation
Brings test-time scaling logic into streaming video, a sign that inference-time compute is spreading beyond text reasoning →read the paper

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
Pushes toward a more unified video generation framework across tasks and modalities →read the paper

Video Generation with Predictive Latents
Explores predictive latent representations for video, relevant to efficiency and controllability in generation →read the paper

🌟 HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
Connects 3D scene understanding and generation for driving, where world models have concrete safety stakes →read the paper

Retrieval, memory, context, and long-context understanding

🌟 MiA-Signature: Approximating Global Activation for Long-Context Understanding
Targets long-context efficiency by approximating global activation, relevant to memory and context engineering →read the paper

TIDE: Every Layer Knows the Token Beneath the Context
Investigates how token information is represented across layers, useful for understanding what context models actually retain →read the paper

🌟 Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
Focuses on retrieval when the task requires reasoning, not keyword-ish lookup dressed in embeddings →read the paper

Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
Builds structured cross-document retrieval, relevant for RAG systems that need synthesis across sources →read the paper

Models, architectures, and efficiency

🌟 Continuous Latent Diffusion Language Model
Explores language modeling through continuous latent diffusion, a notable alternative to standard autoregressive decoding →read the paper

EMO: Pretraining Mixture of Experts for Emergent Modularity
Studies MoE pretraining and modularity, useful for understanding whether specialization can emerge cleanly →read the paper

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Proposes a shared expert-pool approach, relevant to making MoE systems more reusable and scalable →read the paper

🌟 Prescriptive Scaling Laws for Data Constrained Training
Addresses scaling when data is limited, one of the practical constraints behind the next model-training era →read the paper

Trends we see looking at every paper related to AI and ML published last week:

That’s all for today. Thank you for reading! Please send this newsletter to colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.

FAQ

What are AI agent skills?
AI agent skills are reusable procedures that help an agent perform a specific type of task. They can include instructions, scripts, resources, and rules for when the skill should be used.

AI agent skills vs prompts: what is the difference?
A prompt is usually temporary context for one interaction. A skill is a reusable procedure that can be stored, retrieved, improved, and used repeatedly across similar tasks.

Why do AI agent skills matter?
AI agent skills matter because they let agents accumulate procedural knowledge instead of improvising from scratch. This makes agent behavior more stable, auditable, and transferable.

How are skills different from agents?
An agent is the system that acts. A skill is one reusable capability the agent can call when needed. Skills make agents more consistent by giving them structured procedures.

What is operational memory in AI systems?
Operational memory is the stored procedural knowledge that helps an AI system repeat, improve, and audit work over time. Skills are one way to make that memory explicit.

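For readers who think in code, the operational-memory answer above can be sketched as a tiny versioned store. This is a toy illustration under our own assumptions, not a design taken from any of the papers: the `SkillVersion` and `OperationalMemory` classes, the “summarize-ticket” skill, and the rule of retrieving the version with the best observed success rate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SkillVersion:
    version: int
    procedure: str
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self) -> float:
        # Untried versions score 0.0 so they never outrank proven ones.
        return self.successes / self.attempts if self.attempts else 0.0


@dataclass
class OperationalMemory:
    skills: dict = field(default_factory=dict)  # skill name -> list of versions

    def store(self, name: str, procedure: str) -> SkillVersion:
        # Storing never overwrites: every change becomes a new version.
        versions = self.skills.setdefault(name, [])
        new_version = SkillVersion(version=len(versions) + 1, procedure=procedure)
        versions.append(new_version)
        return new_version

    def record(self, name: str, version: int, success: bool) -> None:
        # Evaluation: log the outcome of one use of a specific version.
        entry = self.skills[name][version - 1]
        entry.attempts += 1
        entry.successes += int(success)

    def retrieve(self, name: str) -> SkillVersion:
        # Prefer the best observed success rate; ties go to the newest version.
        return max(self.skills[name], key=lambda v: (v.success_rate, v.version))


memory = OperationalMemory()
memory.store("summarize-ticket", "Read the ticket, then summarize.")
memory.store("summarize-ticket", "Read the ticket, verify the customer ID, then summarize.")
memory.record("summarize-ticket", 1, success=True)
memory.record("summarize-ticket", 1, success=False)
memory.record("summarize-ticket", 2, success=True)
memory.record("summarize-ticket", 2, success=True)
best = memory.retrieve("summarize-ticket")
print(best.version, best.success_rate, best.procedure)
```

Because old versions and their outcomes are kept rather than overwritten, the same structure also gives an audit trail of what procedure was used and how it performed.
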