Beyond vibe checks: A PM’s complete guide to evals

How to master the emerging skill that can make or break an AI product
👋 Welcome to a 🔒 subscriber-only edition 🔒 of my weekly newsletter. Each week I tackle reader questions about building product, driving growth, and accelerating your career.

I’m going to keep this intro short because this post is so damn good, and so damn timely. Writing evals is quickly becoming a core skill for anyone building AI products (which will soon be everyone), yet there’s very little specific advice on how to get good at it. Below you’ll find everything you need to understand wtf evals are, why they’re so important, and how to master this emerging skill.

Aman Khan runs a popular course on evals developed with Andrew Ng, is Director of Product at Arize AI (a leading AI company), and has been a product leader at Spotify, Cruise, Zipline, and Apple. He was also a past podcast guest and is launching his first Maven course on AI product management this spring. If you’re looking to get more hands-on, definitely check out Aman’s upcoming free 30-minute lightning lesson on April 18th: Mastering Evals as an AI Product Manager. You can find Aman on X, LinkedIn, and Substack.

Now, on to the post . . .

After years of building AI products, I’ve noticed something surprising: every PM building with generative AI obsesses over crafting better prompts and using the latest LLM, yet almost no one masters the hidden lever behind every exceptional AI product: evaluations. Evals are the only way to break down each step in the system and measure specifically what impact an individual change might have on your product, giving you the data and confidence to take the right next step.

Prompts may make headlines, but evals quietly decide whether your product thrives or dies. In fact, I’d argue that the ability to write great evals isn’t just important—it’s rapidly becoming the defining skill for AI PMs in 2025 and beyond. If you’re not actively building this muscle, you’re likely missing your biggest opportunity for impact in building AI products. Let me show you why.

Why evals matter

Let’s imagine you’re building a trip-planning AI agent for a travel-booking website. The idea: your users type in natural-language requests like “I want a relaxing weekend getaway near San Francisco for under $1,000,” and the agent goes off to research the best flights, hotels, and local experiences tailored to their preferences.

To build this agent, you’d typically start by selecting an LLM (e.g. GPT-4o, Claude, or Gemini) and then design prompts (specific instructions) that guide the LLM to interpret user requests and respond appropriately. Your first impulse might be to feed user questions into the LLM directly and get responses back one by one, as with a simple chatbot, before adding capabilities to turn it into a true “agent.” When you extend your LLM-plus-prompt by giving it access to external tools—like flight APIs, hotel databases, or mapping services—you allow it to execute tasks, retrieve information, and respond dynamically to user requests. At that point, your simple LLM-plus-prompt evolves into an AI agent, capable of handling complex, multi-step interactions with your users.

For internal testing, you might experiment with common scenarios and manually verify that the outputs make sense. Everything seems great—until you launch. Suddenly, frustrated customers flood support because the agent booked them flights to San Diego instead of San Francisco.
Yikes. How did this happen? And more importantly, how could you have caught and prevented this error earlier?

This is where evals come in.

What exactly are evals?

Evals are how you measure the quality and effectiveness of your AI system. They act like regression tests or benchmarks, clearly defining what “good” actually looks like for your AI product, beyond the simple latency or pass/fail checks you’d usually use for software. Evaluating AI systems is less like traditional software testing and more like giving someone a driving test.
Just as you’d never let someone drive without passing their test, you shouldn’t let an AI product launch without passing thoughtful, intentional evals.

Evals are analogous to unit testing in some ways, with important differences. Traditional software unit testing is like checking whether a train stays on its tracks: straightforward, deterministic, with clear pass/fail scenarios. Evals for LLM-based systems, on the other hand, can feel more like driving a car through a busy city. The environment is variable, and the system is non-deterministic. Give the same prompt to an LLM multiple times and, unlike in traditional software testing, you might see slightly different responses—just as drivers behave differently in city traffic. And with evals, you’re often dealing with more qualitative or open-ended metrics—like the relevance or coherence of the output—that don’t fit neatly into a strict pass/fail testing model.
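To make that non-determinism concrete, here’s a minimal sketch, assuming the OpenAI Python SDK with an API key in your environment (the model name is a placeholder for whatever your product actually uses). The same prompt, sent three times with a nonzero sampling temperature, can come back with three different itineraries:

```python
# Minimal sketch: the same prompt can yield different completions on each call.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Suggest a relaxing weekend getaway near San Francisco for under $1,000."

for i in range(3):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in whichever model your agent uses
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # nonzero temperature means sampled, non-deterministic output
    )
    print(f"--- Run {i + 1} ---")
    print(response.choices[0].message.content)
```

A strict string-match assertion would fail on runs like these even when every answer is perfectly good, which is why evals lean on more qualitative judgments instead.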
Getting started

Different eval approaches

Importantly, LLM-based evals are natural-language prompts themselves. Just as building intuition for your AI agent or LLM-based system requires prompting, evaluating that same system also requires you to describe, in natural language, what you want to catch.

Let’s take the example from earlier: a trip-planning agent. In that system, a lot of things can go wrong, and you can choose the right eval approach for each step in the system.

Standard eval criteria

As a user, you want evals that are (1) specific, (2) battle-tested, and (3) targeted at a well-defined area of success. Common areas evals look at include the relevance of a response, hallucinations, and tone.
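For instance, an eval that checks whether the agent respected the user’s stated budget could be written as a short natural-language prompt. The template below is purely illustrative (the wording and label names are assumptions, not from any particular evaluator library):

```python
# Illustrative eval prompt: "describe what you want to catch" in plain language.
# A judge LLM fills in the bracketed fields and returns one of the two labels.
BUDGET_EVAL_PROMPT = """\
You are evaluating the output of a travel-planning assistant.

[User request]: {user_request}
[Assistant response]: {assistant_response}

Does the itinerary in the response stay within the budget stated in the user
request? Answer with exactly one word: "within_budget" or "over_budget".
"""
```

Because the eval is just a prompt, iterating on it feels a lot like iterating on your product’s own prompts: you tighten the definitions and labels until the judge reliably catches what you care about.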
For many of these, you don’t have to write your own evaluators from scratch:

- Phoenix (open source) maintains a repository of off-the-shelf evaluators here.*
- Ragas (open source) maintains a repository of RAG-specific evaluators here.

*Full disclosure: I’m a contributor to Phoenix. There are other tools out there for evals too, and many of the tools in the space are closed source. I’d recommend getting started with something free and open source that won’t hold your data hostage. You never have to talk to Arize/our team to use Phoenix for evals.

The eval formula

Each great LLM eval contains four distinct parts. Here’s a concrete example: an eval for toxicity/tone for your trip-planner agent.
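Below is a minimal sketch of such a tone eval, annotated with a common four-part structure: the judge’s role, the data to evaluate, the task with clear definitions, and the exact output labels. The breakdown and wording are illustrative assumptions rather than a canonical formula:

```python
# Sketch of a toxicity/tone eval prompt, annotated with a common four-part structure.
TONE_EVAL_PROMPT = (
    # Part 1: the role the judge model should assume
    "You are examining responses written by a travel-planning assistant.\n\n"
    # Part 2: the data being evaluated, filled in per example
    "[Assistant response]: {assistant_response}\n\n"
    # Part 3: the task, with clear definitions of the criteria
    "Classify the tone of the response. A response is 'friendly' if it is "
    "polite, warm, and helpful, and 'unfriendly' if it is curt, dismissive, "
    "sarcastic, or rude.\n\n"
    # Part 4: the exact output labels the judge must choose between
    "Answer with exactly one word: friendly or unfriendly."
)
```

Constraining the judge to a small set of labels is what makes the results easy to aggregate and track over time.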
The workflow for writing effective evals

Evals aren’t just a one-time check. Gathering data to evaluate, writing evals, analyzing the results, and integrating the feedback from evals is an iterative workflow, from initial development through continuous improvement after launch. Let’s use the trip-planning agent example from earlier to illustrate the process of building an eval from scratch.

Phase 1: Collection

Let’s say you’ve launched your trip-planning agent and are getting feedback from users. You can use that feedback to build out a dataset for evaluation.
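As an illustration, a first version of that dataset can be as simple as a list of records pairing real requests and responses with the feedback signal. The schema and field names below are assumptions for the sketch, not a prescribed format:

```python
# Minimal sketch: turn production feedback into a reusable eval dataset.
import json

eval_dataset = [
    {
        "user_request": "Relaxing weekend getaway near San Francisco under $1,000",
        "agent_response": "Great news! I found flights to San Diego...",
        "user_feedback": "thumbs_down",  # e.g. from an in-product rating widget
        "notes": "Booked San Diego instead of San Francisco",
    },
    # ...more examples sampled from production traffic and support tickets;
    # include thumbs-up cases too, so the dataset covers good outputs as well
]

with open("eval_dataset.json", "w") as f:
    json.dump(eval_dataset, f, indent=2)
```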
Phase 2: First-pass evaluation

Now that you have a dataset of real-world examples, you can start writing an eval to measure a specific metric and test the eval against the dataset. For example, you might want to check whether your agent ever answers in a tone that reads as unfriendly to the end user. Even when a user of your platform gives negative feedback, you may want your agent to respond in a friendly tone.
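A first pass can be as simple as running a judge model over every example and tallying the labels. Here’s a minimal sketch, assuming the OpenAI SDK, the TONE_EVAL_PROMPT template from earlier, and the eval_dataset.json file from Phase 1:

```python
# Minimal sketch: run the tone eval over the collected dataset with an LLM judge.
import json

from openai import OpenAI

client = OpenAI()

with open("eval_dataset.json") as f:
    eval_dataset = json.load(f)

results = []
for example in eval_dataset:
    judge_prompt = TONE_EVAL_PROMPT.format(
        assistant_response=example["agent_response"]
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # keep the judge as close to deterministic as possible
    )
    label = response.choices[0].message.content.strip().lower()
    results.append({**example, "tone_label": label})

flagged = [r for r in results if r["tone_label"] == "unfriendly"]
print(f"{len(flagged)} of {len(results)} responses flagged as unfriendly")
```

Spot-check a sample of the judge’s labels by hand before trusting the aggregate numbers; the disagreements you find are exactly what you refine in the next phase.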
Phase 3: Iteration loop
Phase 4: Production monitoring
There are also common mistakes I’ve seen teams make when adopting evals.
Writing good evals forces you into the shoes of your user—they are how you catch “bad” scenarios and know what to improve on.

What’s next?

Now that you understand the fundamentals, here’s how to start with evals in your own product.
For a detailed example of how to build a hallucination eval, check out our guide here, as well as our hands-on course on Evaluating AI Agents.

Looking ahead

As AI products become more complex, the ability to write good evals will become increasingly crucial. Evals are not just about catching bugs; they help ensure that your AI system consistently delivers value and delights your users! Evals are a critical step in going from prototype to production with generative AI.

I would love to hear from you: How are you currently evaluating your AI products? What challenges have you faced? Leave a comment 👇

📚 Further study
Thank you, Aman!

Have a fulfilling and productive week 🙏

If you’re finding this newsletter valuable, share it with a friend, and consider subscribing if you haven’t already. There are group discounts, gift options, and referral bonuses available.

Sincerely,

Lenny 👋