This Week in Turing Post:
|
Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. Join Premium members from top companies like Hugging Face, Microsoft, Google, a16z, and Datadog, plus AI labs and institutions such as Ai2, MIT, Berkeley, and .gov, and thousands of others to really understand what's going on with AI →
|
|
Now, to the main topic: The Benchmarking Season
The past week in AI was quieter on the model front. The most notable launch was Gemini 2.5 Flash Image (aka Nano Banana), and credit where it's due: the Gemini marketing team finally nailed the name. Microsoft AI also introduced its first in-house models, MAI-Voice-1 and MAI-1-preview: ultra-fast, natural speech, efficient training at scale, strong early benchmarks, and a clear signal of strategic independence from OpenAI. Beyond that, little new appeared at model scale.
What stood out instead was an abundance of benchmarks and evaluation systems.
It's easy to underestimate these. Benchmarks may look like neutral scoreboards. They are not. Each one encodes a philosophy: what kind of labor matters, what counts as success, what can safely be ignored. A benchmark can elevate a field, as ImageNet did for vision. It can distort it, as SQuAD once did when models learned to guess answers without understanding. And it can collapse under its own weight, as GLUE did once saturated. Designing a good benchmark is as difficult, and as consequential, as designing the model itself.
The week of many rulers
Seven explicit benchmarks appeared in one week, with another half-dozen evaluations that function the same way. Together they illustrate the new directions.
Agentic work: MCP-Bench tests whether agents can use servers and tools across multi-step tasks. ReportBench evaluates research agents on survey writing: not trivia, but the labor of scholarship itself.
Domain specificity: CMPhysBench asks if models know condensed matter physics. AetherCode scores them on competitive programming. MovieCORE pushes into cognitive reasoning about film.
Reasoning across modalities: T2I-ReasonBench looks at reasoning in text-to-image generation. SEAM checks semantic equivalence across language and vision. SpotEdit stresses precision in visual editing.
Safety and adaptivity: Mind the Third Eye! measures privacy awareness in smartphone agents. InMind tests whether models can adapt to individual reasoning styles.
Harder frontiers: UQ shifts the field from memorized test sets to unsolved questions, where there are no easy shortcuts.
Scientific reasoning disentangled: SCIREAS (Demystifying Scientific Problem-Solving in LLMs) separates domain knowledge from reasoning ability, probing whether models can truly "think scientifically" rather than just recall facts.
|
This is a long way from leaderboards like MMLU or GSM8K. Instead of "who scores best on fixed questions," the benchmarks now ask: can agents navigate workflows, respect privacy, master specialized fields, and show reasoning across modalities?
On the surface, these look like just benchmarks. In reality, they are competing claims about what counts as competence, and they set the frame for progress. The choice of rulers may prove as influential as the systems themselves. And this season, we'll see more interesting benchmarks and evaluations emerge.
|
From our partners: ✨ Phoenix.new → The fastest way to build Elixir apps in-browser
|
Phoenix.new spins up real Elixir apps right in the browser: no setup, no yak-shaving. The agent has root access, runs real tests, interacts with the UI in a headless browser, and pushes to GitHub. You get live previews, a dev loop that just works, and one-click deploys to Fly. GitHub included. Local optional.
|
|
Our 3 WOWs and 1 Promise: Watch it! I share my honest opinion about using Tesla's full self-driving beta after more than two years →
I've been using Tesla's FSD for over two years: here is my honest opinion (plus 3 Wows of the Week)
|
|
|
|
Reading List / papers from the editorial:
Microsoft AI's MAI-Voice-1 and MAI-1-preview → read their blog
Gemini Nano Banana → read their blog
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers → read the paper
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks → read the paper
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics → read the paper
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions → read the paper
MovieCORE: COgnitive REasoning in Movies → read the paper
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation → read the paper
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models → read the paper
SpotEdit: Evaluating Visually-Guided Image Editing Methods → read the paper
Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents → read the paper
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles → read the paper
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning (SCIREAS) → read the paper
UQ: Assessing Language Models on Unsolved Questions → read the paper
|
Also reading:
|
|
Curated Collections → 11 Powerful Image Models
|
Models to pay attention to:
Vaibhav (VB) Srivastav (@reach_vb), Aug 29, 2025:
"🚨 Apple just released FastVLM on Hugging Face - 0.5, 1.5 and 7B real-time VLMs with WebGPU support 🤯
> 85x faster and 3.4x smaller than comparable sized VLMs
> 7.9x faster TTFT for larger models
> designed to output fewer output tokens and reduce encoding time for high…"
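If you want to poke at these checkpoints yourself, here is a minimal loading sketch using the standard transformers API. The repo id apple/FastVLM-0.5B and the trust_remote_code flag are assumptions about how the release is published; follow the model card for the exact image-and-prompt pipeline.

```python
# Hedged sketch: load a FastVLM checkpoint from Hugging Face.
# The repo id below is an assumption; consult the model card for real usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "apple/FastVLM-0.5B"  # assumed repo id; check the Hugging Face hub

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,   # half precision keeps the 0.5B variant light
    device_map="auto",
    trust_remote_code=True,      # the release likely ships custom modeling code
)
print(f"Loaded {repo_id}: {sum(p.numel() for p in model.parameters())/1e6:.0f}M parameters")
# Image + prompt preprocessing is model-specific; follow the FastVLM model card
# for building the multimodal input before calling model.generate(...).
```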
|
|
OLMoASR: A series of open speech recognition models
These six fully open ASR models (39M–1.5B parameters) were trained on curated datasets of up to 680K hours. Benchmarked on 21 unseen test sets, OLMoASR-medium.en achieved 12.8%/11.0% WER (short/long-form), matching Whisper-medium.en. The largest model cut the WER gap with Whisper-large to 0.4% when trained on equal data. Built from a 3M-hour pool filtered to 1M hours, OLMoASR emphasizes reproducibility, rigorous data curation, and transparency → read their blog
gpt-realtime and Realtime API updates for production voice agents
This speech-to-speech model achieves 82.8% accuracy on Big Bench Audio and 30.5% on MultiChallenge, surpassing previous versions. It supports image inputs, SIP phone calling, and remote MCP servers. Function calling accuracy improved to 66.5%. Two new voices, Marin and Cedar, enhance naturalness. Unlike traditional pipelines, it processes audio in one step, reducing latency. The API now offers EU data residency, reusable prompts, and 20% lower pricing than gpt-4o-realtime-preview (a connection sketch follows this list) → read their blog
InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency
This LLM-based multimodal model family features Cascade Reinforcement Learning (offline + online RL) to enhance reasoning, achieving a +16.0% gain on tasks like MMMU and MathVista. The Visual Resolution Router (ViR) dynamically adjusts visual token resolution, and Decoupled Vision-Language Deployment (DvD) balances GPU load. InternVL3.5-241B-A28B achieves 4.05× faster inference and state-of-the-art performance across general multimodal and agentic tasks among open-source models → read the paper
Hermes 4 technical report
It's a hybrid reasoning LLM family built using 5M post-training samples (19B tokens), including 3.5M reasoning-heavy examples with sequences up to 16K tokens. They used DataForge for structured synthetic data generation and Atropos for rejection sampling across task-specific RL environments. Models (14B/70B/405B) achieved 81.9% on AIME'24 and 61.3% on LiveCodeBench, outperforming DeepSeek-R1 while reducing overlong outputs by 78%. All weights and evaluations are public → read the paper
USO: Unified style and subject-driven generation via disentangled and reward learning
This one uses a triplet dataset (content, style, stylized image) and trains via style-alignment and content-style disentanglement objectives. A Style Reward Learning (SRL) module further enhances generation quality. USO outperforms open-source models on USO-Bench, a benchmark jointly evaluating style similarity and subject fidelity, achieving state-of-the-art results in both style consistency and subject preservation → read the paper
rStar2-Agent: Agentic reasoning technical report
This is a 14B-parameter math reasoning model trained with agentic RL. It uses GRPO-RoC, an RL strategy that handles noisy code environments, and is trained efficiently using only 64 MI300X GPUs. In just 510 RL steps, it achieves 80.6% on AIME24 and 69.8% on AIME25, outperforming DeepSeek-R1 (671B). The model also generalizes to alignment, scientific reasoning, and agentic tool-use tasks → read the paper
VibeVoice technical report
This is a long-form speech synthesis model using next-token diffusion for continuous data generation. A novel tokenizer compresses speech data by 80× compared to Encodec without quality loss. VibeVoice can generate up to 90 minutes of speech involving four speakers in a 64K-token window, delivering high-fidelity, multi-speaker dialogue synthesis that surpasses both open-source and proprietary systems in maintaining conversational coherence and naturalness → read the paper
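As promised above, here is a minimal sketch of what a gpt-realtime session can look like over a raw WebSocket. It leans on the preview API's event names (session.update, response.create) and assumes the model id from the announcement and the websockets Python package; check OpenAI's documentation for the current GA event schema before relying on it.

```python
# Hedged sketch, not official sample code: open a Realtime session, pick a
# voice, request one response, and print the streamed event types.
import asyncio, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"  # model id per the announcement

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # required by the preview API; GA may drop it
    }
    # websockets >= 14 uses additional_headers; older versions call it extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure the session, e.g. one of the new voices mentioned above.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "marin", "modalities": ["audio", "text"]},
        }))
        # Ask for a single response; audio/text deltas stream back as JSON events.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Say hello in one short sentence."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))
            if event.get("type") in ("response.done", "error"):
                break

asyncio.run(main())
```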
|
Interesting surveys
Last week we discussed AGS (Artificial General Science) and a wave of papers related to that topic. Here's another one worth paying attention to: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
The freshest research papers, categorized for your convenience
We organize research papers by goal-oriented or functional categories to make it easier to explore related developments and compare approaches. As always, papers we particularly recommend are marked with ★
Efficiency and Acceleration
★ Diffusion Language Models Know the Answer Before Decoding – accelerate diffusion language model inference by detecting early convergence and committing tokens before full refinement (a conceptual sketch follows this list) → read the paper
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning – redesign memory-layer architectures to rival MoE efficiency with better long-context performance and lower memory access → read the paper
★ Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference – optimize large-scale LLM serving with HeteroScale, a coordinated autoscaling framework that balances prefill and decode stages across heterogeneous GPUs, improving utilization by 26.6% and saving hundreds of thousands of GPU-hours daily → read the paper
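To make the early-commit idea concrete, here is a conceptual sketch (ours, not the paper's code) of freezing positions whose prediction has stopped changing across refinement steps; denoise_step is a hypothetical stand-in for one pass of a diffusion or masked language model.

```python
# Conceptual sketch: commit tokens whose argmax prediction has been stable and
# confident for a few refinement steps, instead of refining them every step.
import torch

def early_commit_decode(denoise_step, tokens, num_steps=32, patience=3, conf_threshold=0.9):
    """denoise_step(tokens, committed) -> [seq_len, vocab] logits; a stand-in
    for one refinement pass of a diffusion/masked language model."""
    committed = torch.zeros_like(tokens, dtype=torch.bool)
    stable_for = torch.zeros_like(tokens)
    prev_pred = torch.full_like(tokens, -1)

    for _ in range(num_steps):
        probs = denoise_step(tokens, committed).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)

        # How long has each still-open position kept the same argmax?
        same = (pred == prev_pred) & ~committed
        stable_for = torch.where(same, stable_for + 1, torch.zeros_like(stable_for))
        prev_pred = pred

        # Refresh open positions with the current prediction, then freeze the
        # ones that have been stable and confident long enough.
        tokens = torch.where(committed, tokens, pred)
        committed |= (stable_for >= patience) & (conf >= conf_threshold)

        if committed.all():  # everything frozen: exit before the step budget
            break
    return tokens
```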
|
Reasoning Supervision and Control
★ StepWiser: Stepwise Generative Judges for Wiser Reasoning – train generative reward models that "meta-reason" about intermediate steps, improving judgment accuracy and inference search → read the paper
★ ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models – implement discrete reasoning modes (high, medium, low) to balance computation cost and performance (a toy sketch follows this list) → read the paper
Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation? – examine the faithfulness of chain-of-thought reasoning in soft-reasoning tasks, showing influence and reliability can diverge → read the paper
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency – model sequence generation as tree search to reduce RL training cost while preserving exploration → read the paper
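For intuition, here is a toy sketch (not ThinkDial's actual recipe) of what discrete effort modes can look like at the request level: each mode maps to a thinking-token budget that is injected into the prompt and mirrored in the generation cap. The budget numbers and prompt wording are purely illustrative.

```python
# Toy sketch of discrete reasoning-effort modes; budgets are illustrative only.
EFFORT_MODES = {"low": 256, "medium": 1024, "high": 4096}

def build_request(question: str, mode: str = "medium") -> dict:
    budget = EFFORT_MODES[mode]
    system = (
        "Reason step by step inside <think>...</think>, "
        f"using at most roughly {budget} tokens of thinking, then answer."
    )
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        # Leave head-room for the final answer on top of the thinking budget.
        "max_tokens": budget + 512,
    }

print(build_request("Is 2^31 - 1 prime?", mode="low"))
```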
|
Tool Use and Augmented Learning
★ Provable Benefits of In-Tool Learning for Large Language Models – prove that tool-augmented models scale factual recall beyond parameter limits, outperforming in-weight memorization (a minimal tool-calling loop is sketched after this list) → read the paper
★ Understanding Tool-Integrated Reasoning – provide the first theoretical proof of tool-augmented reasoning's benefits and propose ASPO for better tool usage → read the paper
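To illustrate what "in-tool" factual recall means in practice, here is a minimal tool-calling loop using the OpenAI Chat Completions tools interface. This is our illustration of the general pattern, not the papers' experimental setup; the lookup_fact store and the model id are placeholders.

```python
# Hedged illustration: the model answers a factual question by calling a lookup
# tool instead of relying on what is memorized in its weights.
import json
from openai import OpenAI

client = OpenAI()                  # reads OPENAI_API_KEY from the environment
FACTS = {"Kigali": "Rwanda"}       # placeholder external store

def lookup_fact(city: str) -> str:
    return FACTS.get(city, "unknown")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_fact",
        "description": "Return the country a city is in, from an external database.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Which country is Kigali in? Use the tool."}]
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]            # the model's tool request
args = json.loads(call.function.arguments)

messages.append(first.choices[0].message)                # keep the assistant turn
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": lookup_fact(**args)})        # feed the tool result back

final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```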
|
Evaluation and Judging
|
Interpretability and Cognitive Analysis
★ Unraveling the cognitive patterns of Large Language Models through module communities – analyze emergent module communities in LLMs via network methods inspired by biology, revealing distributed skill patterns → read the paper
Beyond Transcription: Mechanistic Interpretability in ASR – apply interpretability tools like logit lens and activation patching to speech recognition, uncovering hidden acoustic-semantic dynamics (a logit-lens sketch follows this list) → read the paper
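If logit lens is new to you, here is a minimal sketch of the idea applied to an off-the-shelf Whisper checkpoint via transformers: project each intermediate decoder state through the final layer norm and output head to see which token the model favors at each depth. This is our illustration of the technique, not the paper's tooling, and the sine-wave input is only a placeholder for real speech.

```python
# Minimal logit-lens probe on Whisper (illustrative, not the paper's code).
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
model.eval()

# One second of 16 kHz dummy audio stands in for real speech.
audio = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000)).astype(np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    out = model(
        input_features=inputs.input_features,
        decoder_input_ids=decoder_input_ids,
        output_hidden_states=True,
    )

final_ln = model.model.decoder.layer_norm   # Whisper's final decoder LayerNorm
unembed = model.get_output_embeddings()     # output projection to the vocabulary

# Project every intermediate decoder state through the output head to see what
# token the model currently favors at each depth (the classic logit-lens trick).
for layer, h in enumerate(out.decoder_hidden_states):
    logits = unembed(final_ln(h[:, -1]))    # last position only
    token = processor.tokenizer.decode(logits.argmax(-1))
    print(f"layer {layer:2d}: {token!r}")
```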
|
Code, Video, and Multimodal Systems
Efficient Code Embeddings from Code Generation Models – build compact autoregressive code embedding models for retrieval, Q&A, and cross-language similarity → read the paper
Autoregressive Universal Video Segmentation Model – unify prompted and unprompted video segmentation into one autoregressive architecture for streaming video → read the paper
★ Mixture of Contexts for Long Video Generation – introduce sparse attention routing for diffusion transformers to preserve consistency in long video synthesis → read the paper
Self-Rewarding Vision-Language Model via Reasoning Decomposition – strengthen visual reasoning in VLMs by decomposing perception and reasoning, rewarding self-contained perceptions → read the paper
★ Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning – stabilize text-to-image reinforcement learning with pairwise preference rewards and a unified benchmark → read the paper
OmniHuman-1.5: Instilling an active mind in avatars via cognitive simulation – generate semantically expressive avatar animations by using LLM-structured conditions and a Multimodal DiT with Pseudo Last Frame for lip-sync, motion naturalness, and semantic alignment across single/multi-person and non-human scenes → read the paper
|
Scientific Discovery
|
Agent Training
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent – combine a generalist planner and specialist executor with decoupled RL for scientific computing GUIs → read the paper
AWorld: Orchestrating the Training Recipe for Agentic AI – scale reinforcement learning for agentic AI with distributed interaction environments, enabling faster experience generation → read the paper
UItron: Foundational GUI agent with advanced perception and planning – train a large-scale mobile/PC GUI agent with SFT + curriculum RL over 1M+ steps to improve perception, grounding, and task planning for Chinese apps → read the paper
|
Privacy, Safety, and Security of Agentic Systems
★ Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills – expose vulnerabilities in Model Context Protocol (MCP) agents, showing how benign tasks can chain into adversarial attack sequences that bypass service isolation and compromise security → read the paper
|
That's all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
How was today's FOD? Please give us some constructive feedback.
|
|