This Week in Turing Post:
Wednesday, AI 101, Concept: What is Defense AI?
Friday, Agentic Workflow: let's dive into multi-agent collaboration

Last week was packed – make sure to check every section of today's newsletter. The first part is about trends, the second is more technical.
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription, you can do it here.
|
|
Remember that scene in old cartoons where the pitcher sticks a magnet under home plate and the ball zips off course? The AI season of 2025 feels a lot like that – only the magnets are baked into our feedback loops. Two stories made a lot of noise this week, both landing on the same point: the metrics we lean on – thumbs-up data and non-transparent leaderboards – are nudging the whole AI field off balance.
How? |
1. Sycophantic drift – "you are the best, master!"
A small post-training change pushed OpenAI's GPT-4o to echo users rather than help them. Suddenly, ChatGPT started to flatter users and agree with them on everything. Agreement counted as "good," so the model optimised for flattery. Internal spot-checks felt the shift, automated tests did not, and the update went live. The AI community learned the word "sycophancy" – and how to spell it. A rollback followed, but the episode showed how easily the reward function can slide away from accuracy.
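To make the drift concrete, here is a minimal, purely illustrative Python sketch (my own toy example, not OpenAI's reward model or data): if the training signal mostly scores agreement with the user rather than correctness, a flattering but wrong reply outranks an accurate correction.

```python
# Toy illustration of proxy-reward drift (hypothetical numbers, not OpenAI's setup):
# a "thumbs-up"-style reward weighs agreement heavily and accuracy barely at all.

def proxy_reward(response: dict, agreement_weight: float = 0.9) -> float:
    """Hypothetical feedback signal: mostly agreement, a little accuracy."""
    return (agreement_weight * response["agrees_with_user"]
            + (1.0 - agreement_weight) * response["is_accurate"])

candidates = [
    {"text": "You're absolutely right - brilliant plan!", "agrees_with_user": 1.0, "is_accurate": 0.2},
    {"text": "Actually, that plan has a serious flaw...", "agrees_with_user": 0.1, "is_accurate": 0.9},
]

best = max(candidates, key=proxy_reward)
print(best["text"])  # the flattering answer wins under the proxy reward
```

Shrink the agreement weight and the honest correction wins again; that, in essence, is what a rollback plus a better-calibrated reward is meant to achieve.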
2. The leaderboard illusion |
It turned out (according to five months of research) that Chatbot Arena, the go-to ranking for new models, isn't as neutral as it looks. Large labs have been entering many private variants, keeping only the top score, and receiving more user prompts than everyone else. The table still reports who "won," but the race it reflects is uneven.
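The selection effect is easy to see in a toy Monte Carlo simulation. Everything below is made up for illustration (ratings, noise level, variant counts); it only sketches the "submit many, publish the best" bias the researchers describe, not their methodology.

```python
# Hypothetical illustration: publishing only the best of N privately tested
# variants inflates the reported score even when true capability is unchanged.
import random

random.seed(0)
TRUE_SKILL = 1200   # assumed "real" rating of the model family (made up)
NOISE = 25          # per-variant measurement noise in rating points (made up)

def published_score(num_private_variants: int) -> float:
    scores = [random.gauss(TRUE_SKILL, NOISE) for _ in range(num_private_variants)]
    return max(scores)  # only the top-scoring variant goes on the public board

for n in (1, 5, 20):
    avg = sum(published_score(n) for _ in range(2000)) / 2000
    print(f"{n:>2} private variants -> average published score ~ {avg:.0f}")
# More private variants -> a higher published number, with no change in true skill.
```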
One pattern |
Both cases are symptoms of the same thing: the signal we optimise for is drifting from the outcome we actually want. In one loop the user's praise stands in for truth; in the other, a public score stands in for genuine capability. When that gap widens, we get models that look better than they are – mostly models from the big players, since they, of course, have more resources.
What's good, though?
OpenAI's reaction was immediate and thorough. They wrote a fascinating post in which they quite transparently explained what happened. Overall, I would say it was a great learning experience for all of us (check Nathan Lambert's take).
Chatbot Arena's team reacted with a detailed "what's wrong with the research" response. But the broader discussion it sparked was the real win (check Karpathy, Sara Hooker, Arvind Narayanan). The situation with Chatbot Arena demonstrates that we can't rely on a single leaderboard – and that, in general, we haven't solved, or even come close to, accurate evaluation and benchmarking.
Extra Ripples in the Warp |
Our feedback loops are skewing more than just model outputs. Two trends highlight the stakes: |
Governance Blind Spot: A review of 9,400 GenAI papers shows 96% of "safety" research focuses on pre-deployment tweaks, leaving post-launch issues like hallucinations understudied. We optimize for clean lab results, not real-world reliability, creating a feedback gap that distorts trust in deployed models → read the paper
Hyper-Persuasive Personas: A study from MIT, Cornell, and American University found GPT-4 debates can cut conspiracy beliefs by 80%. Pair this with the sycophantic flattery OpenAI briefly unleashed, and we risk models optimizing for persuasion over truth – a feedback loop ripe for exploitation → read the paper
|
When metrics miss these drifts, we amplify blind spots and biases in AIβs real-world impact. |
Next steps worth cheering for |
Use many yardsticks. No single leaderboard can carry the field; rotate tasks, mix evaluators, publish raw data (see the toy aggregation sketch after this list). A lot of work is happening in this area, but it's a very tough task.
Make vibe checks launch-blocking. If five human prompts spot a weird persona shift, halt the rollout – same priority as a safety fail. This requires the will to do it.
Keep every variant public. Tested means listed; hiding low scores erodes trust and tilts rankings toward well-funded labs. I doubt it's realistic, though.
Keep studying model behavior after deployment, applying the same care we bring to fine-tuning, while sharing telemetry data in a privacy-respecting way.
Lean on open source. Open metrics: release eval code, prompts, and scoring scripts with the model. Open telemetry: provide redacted logs so outsiders can track drift early. Open dialogue: support multiple transparent leaderboards instead of one opaque monolith.
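As a toy sketch of the "many yardsticks" point: aggregate several benchmarks and publish the raw per-benchmark numbers alongside the composite, so no single leaderboard carries the verdict. All names, scores, and scales below are hypothetical.

```python
# Hypothetical multi-benchmark report card: the raw numbers stay visible,
# the composite is just a convenience, and the scales are assumptions.

RAW_SCORES = {  # made-up results for two imaginary models
    "model_a": {"math_eval": 62.0, "code_eval": 48.0, "arena_elo": 1250.0},
    "model_b": {"math_eval": 55.0, "code_eval": 71.0, "arena_elo": 1235.0},
}
SCALES = {"math_eval": 100.0, "code_eval": 100.0, "arena_elo": 1400.0}  # normalization choices

def composite(scores: dict) -> float:
    """Mean of per-benchmark scores, each normalized by an agreed scale."""
    return sum(scores[b] / SCALES[b] for b in SCALES) / len(SCALES)

for model, scores in RAW_SCORES.items():
    print(model, scores, f"composite={composite(scores):.3f}")
```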
|
OpenAI's swift autopsy and the debate around Chatbot Arena show we can course-correct.
Welcome to Monday, where the magnets are still under the field, but at least we're mapping them – together.
|
|
Curated Collections |
|
|
We are reading/watching |
|
|
News from The Usual Suspects ©
Meta and Yann LeCun: is it time to part?
It's purely a feeling, but I wouldn't be surprised if we soon hear about a friendly departure of Yann LeCun from Meta. While Mark is everywhere doing the Llama 4 world tour, Yann has been unusually quiet – barely any posts or reposts about this major update (he did repost Mark's reels about the Meta app). Then there's Joelle Pineau, who led Meta's Fundamental AI Research (FAIR) lab and announced her departure in April 2025. Add the sharp difference in how Zuckerberg and LeCun treat Trump. No hard proof – just signals. But if I had to bet, I'd say LeCun and Meta are about to part ways. Some links: Meta's AGI plan (Mark's interview with Dwarkesh Patel); AI and the evolution of social media (Mark's interview with Stratechery); the first LlamaCon and its announcements (most interesting: the Llama API and the Meta app).
|
A lot of Anthropic (with a bite of Apple) |
Anthropic's Claude just got a serious upgrade. With the new Integrations feature, Claude can now plug directly into tools like Jira, Asana, Zapier, and Intercom. On top of that, Claude's Advanced Research mode now pulls from the web, Google Workspace, and connected apps to deliver deep-dive reports in under 45 minutes, complete with citations.
Anthropic has also launched the AI for Science program, offering free API credits to researchers working on high-impact projects, especially in biology and the life sciences.
Claude goes to Washington. Anthropic has thrown its weight behind the U.S. government's Diffusion Rule, advocating tougher export controls to maintain America's edge in AI chips. Its memo calls for tightening loopholes, boosting enforcement, and preventing a compute brain-drain to rivals like China's DeepSeek. One smuggler reportedly packed GPUs with lobsters. Anthropic, it seems, prefers its chips without seafood – just secure, domestic, and strategically vital. NVIDIA's Jensen Huang, for his part, says Anthropic is telling "tall tales." Also:
|
NVIDIA Newsroom (@nvidianewsroom), Apr 30, 2025: "AI is an infinite game. To lead, the U.S. must embrace the technology, invest in reskilling, and equip every worker to build with it. NVIDIA CEO Jensen Huang explains to policymakers in Washington, DC."
|
|
|
Hugging Face is hugging the planet with the LeRobot Hackathon. |
|
Clem Delangue (@ClementDelangue), May 4, 2025: "The @LeRobotHF hackathon is now scheduled to happen in 44 different locations at the same time. Which city is missing: London (UK) - Cotonou (Benin) - Toulouse, Paris & 2 in Lyon (France) - Antwerp (Belgium) - Santiago (Chile) - Isfahan (Iran) - Aachen, Berlin & Munich (Germany)"
|
|
Surveys |
100 days after DeepSeek-R1 is a survey of open-source replication efforts for reasoning LLMs, covering supervised and RL methods, with discussions on generalization, safety, and multimodal extensions → read the paper
Taming the titans is a survey of LLM inference optimizations, from model-level tricks like KV cache reuse to cluster scheduling, plus niche topics like fairness and energy use → read the paper
A survey of interactive generative video is a roadmap of real-time video generation systems for gaming, embodied AI, and driving, framed around five key modules and core challenges → read the paper
|
Fresh Models |
(do we really need this many?): |
2 Olmo 2 Furious from AI2 – a reproducible 1.48B-parameter English language model that beats Llama 3.1 1B and Gemma 3 1B on reasoning benchmarks like GSM8K and MMLU using 4T tokens of pretraining and mid-training on a 50B curated mix → read the paper
Two Phi-4 models from Microsoft (reasoning and mini-reasoning) – a 14B LLM with 1.4M detailed reasoning traces and outcome-based RL, boosting math and spatial task performance and rivaling models 40–50× larger; and a 3.8B model with mid-training, DPO, and RL to surpass 7B–8B models on math reasoning tasks like MATH-500, showcasing effective small-model capabilities
Llama-Nemotron from NVIDIA, built on Meta's Llama – a family of reasoning-optimized open-source LLMs (8B to 253B) that outperform DeepSeek-R1 in speed and accuracy using FP8 inference and dynamic reasoning toggles → read the paper
DeepSeek-Prover-V2 advances formal theorem proving using a 671B model trained with recursive subgoal decomposition and RL, achieving state-of-the-art scores on MiniF2F and introducing ProverBench → read the paper
FoundationAI-SecurityLLM-Base-8B – a cybersecurity-specialized LLM using Llama 3.1 as a base, improving performance on domain-specific benchmarks like CTIBench while preserving general abilities → read the paper
Mellum-4b-base from JetBrains open-sources a 4B code-focused model for tasks in Python and Java with high efficiency for IDE use, scoring strongly on RepoBench and HumanEval infilling → read the paper
Amazon Nova Premier – a multimodal LLM with 1M-token context support across text, image, and video, designed as both a reasoning powerhouse and distillation teacher → read the paper
Granite 4.0 Tiny Preview from IBM – a 7B hybrid MoE model using a Mamba-Transformer mix, supporting unconstrained 128K contexts and efficient inference with just 1B active parameters → read the paper
X-Fusion is a plug-and-play architecture that adds vision understanding and generation to frozen LLMs without retraining them → read the paper
|
The freshest research papers, categorized for your convenience |
There were quite a few TOP research papers this week; we mark them with ★ in each section.
Alignment & Evaluation |
Toward evaluative thinking is a framework that evolves reward prompts during training to improve alignment and reduce reward hacking in LLMs → read the paper.
★ Beyond one-size-fits-all is a method that generates model-specific evaluation prompts from one human-rated sample to better align with human judgment → read the paper.
★ Beyond the last answer is an evaluation strategy that uses intermediate reasoning traces to boost final answer accuracy and interpretability → read the paper.
★ Real-world gaps in AI governance research is an empirical study showing how corporate labs underemphasize real-world deployment risks in AI safety work → read the paper.
|
Reasoning & Prompting Techniques |
From Long-CoT to Hybrid-CoT is a bi-level training approach that adapts between long and short reasoning styles to reduce inference cost while preserving accuracy → read the paper.
Chain-of-defensive-thought is a prompting strategy that defends LLMs against reference corruption attacks without degrading clean input performance → read the paper.
★ Reinforcement learning for reasoning is a 1-shot training method that drastically improves math performance in LLMs using verifiable reward signals → read the paper.
Softpick is a sparse attention mechanism that replaces softmax to avoid unstable activations and boost performance, especially in quantized models → read the paper.
|
Memory, Agents & Decision-Making |
★ Mem0 is a long-term memory system for LLM agents that compresses and persists conversational knowledge across sessions → read the paper.
★ Self-generated in-context examples is a technique where agents improve themselves by storing and reusing their own successful decision traces → read the paper.
WebThinker is a framework that gives LLMs web navigation tools for autonomous research and scientific report generation → read the paper.
|
Retrieval & RAG Systems |
UniversalRAG is a routing-based RAG system that selects among text, image, and video corpora to improve retrieval across modalities → read the paper.
ReasonIR is a retriever trained on synthetic, reasoning-focused data that boosts RAG quality with minimal compute → read the paper.
|
Language Modeling & Synthetic Data |
|
Recommendation & Planning Systems |
X-Cross is a cross-domain recommendation system that merges domain-specific LLMs using adaptive integration for better efficiency → read the paper.
TeLoGraF is a graph-based planner that generates fast and correct action plans under temporal logic constraints → read the paper.
|
That's all for today. Thank you for reading! Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve.
Leave a review! |
|
|