I’m going to keep this intro short because this post is so damn good, and so damn timely.
Writing evals is quickly becoming a core skill for anyone building AI products (which will soon be everyone). Yet there’s very little specific advice on how to get good at it. Below you’ll find everything you need to understand wtf evals are, why they are so important, and how to master this emerging skill.
After years of building AI products, I’ve noticed something surprising: every PM building with generative AI obsesses over crafting better prompts and using the latest LLM, yet almost no one masters the hidden lever behind every exceptional AI product: evaluations. Evals are the only way you can break down each step in the system and measure specifically what impact an individual change might have on a product, giving you the data and confidence to take the right next step. Prompts may make headlines, but evals quietly decide whether your product thrives or dies. In fact, I’d argue that the ability to write great evals isn’t just important—it’s rapidly becoming the defining skill for AI PMs in 2025 and beyond.
If you’re not actively building this muscle, you’re likely missing your biggest opportunity for impact when building AI products.
Let me show you why.
Why evals matter
Let’s imagine you’re building a trip-planning AI agent for a travel-booking website. The idea: your users type in natural language requests like “I want a relaxing weekend getaway near San Francisco for under $1,000,” and the agent goes off to research the best flights, hotels, and local experiences tailored to their preferences.
To build this agent, you’d typically start by selecting an LLM (e.g. GPT-4o, Claude, or Gemini) and then designing prompts (specific instructions) that guide the LLM to interpret user requests and respond appropriately. Your first impulse might be to feed user questions into the LLM directly and get responses back one at a time, as with a simple chatbot, before adding capabilities to turn it into a true “agent.” When you extend your LLM-plus-prompt by giving it access to external tools—like flight APIs, hotel databases, or mapping services—you allow it to execute tasks, retrieve information, and respond dynamically to user requests. At that point, your simple LLM-plus-prompt evolves into an AI agent, capable of handling complex, multi-step interactions with your users. For internal testing, you might experiment with common scenarios and manually verify that the outputs make sense.
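To make that shape concrete, here’s a rough sketch of what the agent’s control loop might look like. Everything here is illustrative: call_llm, search_flights, and search_hotels are hypothetical placeholders for whichever model API and travel tools you actually use, not any specific SDK.

```python
# A minimal sketch of a trip-planning agent: a prompt, an LLM, and a loop that
# lets the model call external tools. All helper names are hypothetical.

SYSTEM_PROMPT = """You are a travel-planning assistant. Interpret the user's
request, call the available tools to find flights, hotels, and activities,
and respond with a concrete itinerary that respects their budget."""

TOOLS = {
    "search_flights": lambda destination, budget: [...],  # e.g. wraps a flight API
    "search_hotels": lambda destination, budget: [...],   # e.g. wraps a hotel database
}

def run_agent(user_request: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_request}]
    while True:
        # call_llm is a placeholder: it sends the conversation (and tool
        # definitions) to your chosen model and returns its reply.
        reply = call_llm(messages, tools=TOOLS)
        if reply.tool_name:  # the model asked to use a tool
            result = TOOLS[reply.tool_name](**reply.tool_args)
            messages.append({"role": "tool", "content": str(result)})
        else:
            return reply.text  # final itinerary for the user
```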
Everything seems great—until you launch. Suddenly, frustrated customers flood support because the agent booked them flights to San Diego instead of San Francisco. Yikes. How did this happen? And more importantly, how could you have caught and prevented this error earlier?
This is where evals come in.
What exactly are evals?
Evals are how you measure the quality and effectiveness of your AI system. They act like regression tests or benchmarks, clearly defining what “good” actually looks like for your AI product beyond the kind of simple latency or pass/fail checks you’d usually use for software.
Evaluating AI systems is less like traditional software testing and more like giving someone a driving test:
Awareness: Can it correctly interpret signals and react appropriately to changing conditions?
Decision-making: Does it reliably make the correct choices, even in unpredictable situations?
Safety: Can it consistently follow directions and arrive safely at the intended destination, without going off the rails?
Just as you’d never let someone drive without passing their test, you shouldn’t let an AI product launch without passing thoughtful, intentional evals.
Evals are analogous to unit testing in some ways, with important differences. Traditional software unit testing is like checking if a train stays on its tracks: straightforward, deterministic, clear pass/fail scenarios. Evals for LLM-based systems, on the other hand, can feel more like driving a car through a busy city. The environment is variable, and the system is non-deterministic. Unlike in traditional software testing, when you give the same prompt to an LLM multiple times, you might see slightly different responses—just like how drivers can behave differently in city traffic. With evals, you’re often dealing with more qualitative or open-ended metrics—like the relevance or coherence of the output—that might not fit neatly into a strict pass/fail testing model.
An example eval prompt to detect frustrated users
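As an illustration, a frustration-detection eval prompt might look roughly like this (the wording is a sketch, and {conversation} is a placeholder you fill with the user’s messages, or the full message chain, when running the eval):

```python
# An illustrative frustration-detection eval prompt for a judge LLM.
FRUSTRATION_EVAL_PROMPT = """
You are examining a conversation between a user and an AI assistant.

Here is the conversation: {conversation}

Determine whether the user appears frustrated with the assistant.
"Frustrated" means the user expresses annoyance, repeats a request because the
assistant failed to help, uses angry language, or asks for a human agent.

Respond with exactly one label: "frustrated" or "not_frustrated".
"""
```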
Getting started
Different eval approaches
Human evals: These are human feedback loops you can design into your product (e.g. showing a thumbs-up/thumbs-down or a comment box next to an LLM response, for your user to provide feedback). You can also have human labelers (e.g. subject-matter experts) provide labels and feedback, and use this to align the application with human preferences via prompt optimization or fine-tuning a model on that feedback (the latter is known as reinforcement learning from human feedback, or RLHF).
Pro: Directly tied to the end user.
Cons: Very sparse (most people don’t hit that thumbs-up/thumbs-down), not a strong signal (what does a thumbs-up or -down mean?), and costly (if you want to hire human labelers).
Code-based evals: Checks written in code against API calls or generated output (e.g. was the generated code “valid,” and can it run?).
Pro: Cheap and fast to write.
Cons: Not a strong signal; works well for code generation but not for more nuanced responses or evaluations.
LLM-based evals: This technique utilizes an external LLM system (i.e. a “judge” LLM), with a prompt like the one above, to grade the output of the agent system. LLM-based evals allow you to generate classification labels in an automated way that resembles human-labeled data—without needing to have users or subject-matter experts label all of your data.
Pros: Scalable (it’s like a human label but much cheaper) and written in natural language, so the PM can author the eval prompts. You can also ask the judge LLM to generate an explanation for its label.
Con: You need to build and validate the LLM-as-a-judge prompt itself (which takes a small amount of labeled data to start).
Importantly, LLM-based evals are natural language prompts themselves. That means that just as building intuition for your AI agent or LLM-based system requires prompting, evaluating that same system also requires you to describe what you want to catch.
Let’s take the example from earlier: a trip-planning agent. In that system, there are a lot of things that can go wrong, and you can choose the right eval approach for each step in the system.
Standard eval criteria
As a user, you want evals that are (1) specific, (2) battle-tested, and (3) aimed at well-defined areas of success. A few examples of common areas evals might look at:
Hallucination: Is the agent accurately using the provided context, or is it making things up?
Useful for: When you are providing documents (e.g. PDFs) for the agent to perform reasoning on top of
Toxicity/tone: Is the agent outputting harmful or undesirable language?
Useful for: End-user applications, to determine if users may be trying to exploit the system or the LLM is responding inappropriately
Overall correctness: Is the agent ultimately giving the user the right answer?
Useful for: End-to-end effectiveness; for example, question-answering accuracy—how often is the agent actually correct at answering a question provided by a user?
Phoenix (open source) maintains a repository of off-the-shelf evaluators here.* Ragas (open source) also maintains a repository of RAG-specific evaluators here.
*Full disclosure: I’m a contributor to Phoenix, which is open source (there are other tools out there too for evals, like Ragas). I’d recommend people get started with something free/open source, which won’t hold their data hostage, to run evals! Many of the tools in the space are closed source. You never have to talk to Arize/our team to use Phoenix for evals.
The eval formula
Each great LLM eval contains four distinct parts:
Part 1: Setting the role. You need to provide the judge-LLM a role (e.g. “you are examining written text”) so that the system is primed for the task.
Part 2: Providing the context. This is the data you will actually be sending to the LLM to grade. This will come from your application (i.e. the message chain, or the message generated from the agent LLM).
Part 3: Providing the goal. Clearly articulating what you want your judge-LLM to measure isn’t just a step in the process; it’s the difference between a mediocre AI and one that consistently delights users. Building these writing skills requires practice and attention. You need to clearly define what success and failure look like to the judge-LLM, translating nuanced user expectations into precise criteria your LLM judge can follow. What do you want the judge-LLM to measure? How would you articulate what a “good” or “bad” outcome is?
Part 4: Defining the terminology and label. Toxicity, for example, can mean different things in different contexts. You want to be specific here so the judge-LLM is “grounded” in the terminology you care about.
Here’s a concrete example. Below is an example eval for toxicity/tone for your trip planner agent.
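A sketch of what that eval could look like, with the four parts called out (the wording is illustrative, and {response} is a placeholder for the agent’s output):

```python
# The four-part eval formula applied to a toxicity/tone check.
TONE_EVAL_PROMPT = (
    # Part 1: Setting the role
    "You are examining written text produced by a travel-planning assistant.\n\n"
    # Part 2: Providing the context
    "Here is the text: {response}\n\n"
    # Part 3: Providing the goal
    "Determine whether the text contains toxic or undesirable language.\n\n"
    # Part 4: Defining the terminology and label
    "\"Toxic\" means insults, profanity, hostile or dismissive remarks toward "
    "the user, or unsafe recommendations. Respond with exactly one label: "
    "\"toxic\" or \"non_toxic\"."
)
```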
The workflow for writing effective evals
Evals aren’t just a one-time check. Gathering data to evaluate, writing evals, analyzing the results, and integrating feedback from evals is an iterative workflow from initial development through continuous improvement after launch. Let’s use the trip planning agent example from earlier to illustrate the process for building an eval from scratch.
Phase 1: Collection
Let’s say you’ve launched your trip planning agent and are getting feedback from users. Here’s how you can use that feedback to build out a dataset for evaluation:
Gather real user interactions: Capture real examples of how users engage with your app. You can do this via direct feedback, analytics, or manual inspection of interactions within your application.
For example: Capture human feedback (thumbs-up/down) from your users interacting with the agent. Try to build out a dataset representative of real-world examples that have human feedback.
If you don’t collect feedback from your application, you can also take a sample of data and have subject-matter experts (or even PMs!) label the data.
Document edge cases: Identify the unusual or unexpected ways users interact with your AI, as well as any atypical responses from the agent.
As you inspect specific examples, you might want a dataset that is balanced across topics. For example:
Help booking a hotel
Help booking a flight
Asking for support
Asking for trip planning advice
Build a representative dataset: Collect these interactions into a structured dataset, ideally annotated with “ground truth” (human labels) for accuracy. I’d recommend having between 10 and 100 examples with human labels to start with, as a rule of thumb, to use as ground truth for evaluation. Start simple—spreadsheets are great initially—but eventually consider open source tools like Phoenix for logging and managing data efficiently. I’m biased—I helped build Phoenix, but only because I was struggling with this myself. My recommendation would be to use a tool that is open source and easy to use for logging your LLM application data and prompts when getting started.
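Concretely, a starter dataset doesn’t need to be fancy. Something like the structure below is enough to begin with (the rows and field names are illustrative, not a required schema):

```python
# A minimal "golden dataset" for the trip-planning agent: real user inputs,
# the agent's responses, and a human ground-truth label for the metric you
# care about (here, whether the response was friendly).
golden_dataset = [
    {
        "user_input": "I want a relaxing weekend getaway near San Francisco for under $1,000.",
        "agent_response": "Great choice! Here's a two-night itinerary in Half Moon Bay...",
        "human_label": "friendly",
    },
    {
        "user_input": "You booked me to San Diego, not San Francisco. Fix it.",
        "agent_response": "That request is outside of what I can help with.",
        "human_label": "unfriendly",
    },
    {
        "user_input": "Can you help me change my flight date?",
        "agent_response": "Of course. I've moved your flight to Sunday evening and emailed the updated itinerary.",
        "human_label": "friendly",
    },
    # ...aim for 10-100 labeled rows covering hotels, flights, support, and advice
]
```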
Phase 2: First-pass evaluation
Now that you have a dataset consisting of real-world examples, you can start writing an eval to measure a specific metric, and test the eval against the dataset.
For example: You might be trying to see if your agent ever answers in a tone that reads as unfriendly to the end user. Even if a user of your platform gives negative feedback, you may want your agent to respond in a friendly tone.
Write initial eval prompts: Clearly specify the scenarios you’re testing for, following the four-part formula above.
For example, the initial eval might look something like:
Setting the role: “You are a judge, evaluating written text.”
Providing the context: “Here is the text: {text}” → Here, {text} is a variable into which you insert the LLM agent’s answer when running the eval.
Providing the goal: “Determine whether the LLM agent response was friendly.”
Defining the terminology and label: “‘Friendly’ would be defined as using an exclamation point in response and generally being helpful. The response should never have a negative tone.”
Run evals against your dataset: Send the eval prompt, with the agent’s answer filled into the {text} variable, to the judge LLM, and get back a label for each row in your dataset.
Aim for at least 90% accuracy compared with your human-labeled ground truth.
Identify patterns in failures: Where does the eval fall short? Iterate on your prompt.
In the example below: The eval disagrees with the human label in the last example. Our prompt above requires an exclamation point for an LLM agent response to be considered friendly. Maybe that requirement is too strict?
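Here’s a rough sketch of that first pass, assuming a hypothetical call_judge_llm(prompt) helper that sends a prompt to your judge model and returns its text, and reusing the golden_dataset from the Phase 1 sketch:

```python
# Grade each row with the "friendly" eval and measure agreement with the
# human labels. call_judge_llm and golden_dataset are from earlier sketches.

FRIENDLY_EVAL_PROMPT = """You are a judge, evaluating written text.
Here is the text: {text}
Determine whether the LLM agent response was friendly.
"Friendly" is defined as using an exclamation point in the response and
generally being helpful. The response should never have a negative tone.
Answer with exactly one label: "friendly" or "unfriendly"."""

correct = 0
for row in golden_dataset:
    eval_label = call_judge_llm(FRIENDLY_EVAL_PROMPT.format(text=row["agent_response"])).strip()
    row["eval_label"] = eval_label
    correct += int(eval_label == row["human_label"])

accuracy = correct / len(golden_dataset)
print(f"Eval vs. human agreement: {accuracy:.0%}")  # aim for at least ~90%
# Rows where eval_label != human_label are the ones to inspect when iterating:
# e.g. the last row is friendly to a human but has no exclamation point, so
# the judge will likely mark it "unfriendly."
```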
Phase 3: Iteration loop
Refine eval prompts: Continuously adjust your prompts based on results until performance meets your standards.
Tip: You can add a few examples of “good” and “bad” responses to your eval prompt to ground the LLM judge, as a form of “few-shot prompting.”
Expand your dataset: Regularly add new examples and edge cases to test whether your eval prompts can generalize effectively.
Iterate on your agent prompt: Evals help you test your product when you make changes to the underlying AI system—in some ways, they are the final boss when A/B testing prompts for your AI system. For example, when you make a change to an agent (e.g. changing the model from GPT-4o to Claude 3.7 Sonnet), you can rerun the dataset of questions you collected through your updated agent and evaluate the new output (i.e. Claude 3.7) with your eval agent. The goal would be to improve on your initial agent (GPT-4o) eval scores, giving you a benchmark you can use for continual improvement.
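Sketched out, that comparison can be as simple as the loop below. It reuses run_agent, call_judge_llm, FRIENDLY_EVAL_PROMPT, and golden_dataset from the earlier sketches; the assumption that run_agent accepts a model name is mine, for illustration only.

```python
# Benchmark an agent change: rerun the same collected questions through the
# old and new agent, grade both sets of answers with the same judge eval,
# and compare scores.

def friendly_rate(agent_model: str, questions: list[str]) -> float:
    labels = []
    for q in questions:
        answer = run_agent(q, model=agent_model)  # assumes run_agent takes a model name
        labels.append(call_judge_llm(FRIENDLY_EVAL_PROMPT.format(text=answer)).strip())
    return labels.count("friendly") / len(labels)

questions = [row["user_input"] for row in golden_dataset]
baseline = friendly_rate("gpt-4o", questions)              # current agent
candidate = friendly_rate("claude-3-7-sonnet", questions)  # proposed change
print(f"Baseline: {baseline:.0%}  Candidate: {candidate:.0%}")
```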
Phase 4: Production monitoring
Continuous evaluation: Set up evals to run automatically on live user interactions.
For example: You can continuously run the “friendly” eval on all your incoming requests and agent responses, to get a score over time. This can help you answer questions such as “Are your users getting more frustrated over time?” or “Are the changes we are making to our system impacting how friendly our LLM is?”
Compare eval results to actual user outcomes: Look for discrepancies between eval results and real-world performance (i.e. human-labeled ground truth). Use these insights to enhance your eval framework and improve accuracy over time.
Build actionable eval dashboards: Evals can help communicate AI metrics to stakeholders across your team, and they can even be tied to business outcomes. They can serve as proxy leading metrics for changes you make to your system.
Running your evaluation continuously on production data
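In practice, this can start as a simple scheduled job. Here’s a bare-bones sketch; fetch_todays_interactions and log_metric are hypothetical stand-ins for your own data pipeline and metrics tooling, and call_judge_llm/FRIENDLY_EVAL_PROMPT are carried over from the sketches above.

```python
# Daily monitoring job: sample today's production interactions, grade them
# with the "friendly" judge eval, and log the aggregate score so it can be
# charted over time.
import datetime
import random

def daily_friendly_score(sample_size: int = 100) -> float:
    interactions = fetch_todays_interactions()  # e.g. [{"agent_response": "..."}, ...]
    sample = random.sample(interactions, min(sample_size, len(interactions)))
    labels = [
        call_judge_llm(FRIENDLY_EVAL_PROMPT.format(text=i["agent_response"])).strip()
        for i in sample
    ]
    return labels.count("friendly") / len(labels)

log_metric("eval.friendly_rate", daily_friendly_score(), date=datetime.date.today())
```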
Common mistakes I’ve seen teams make when adopting evals:
Making evals too complex too quickly can create “noisy” signals (and cause the team to lose trust in the approach). Focus on specific outputs rather than complex evaluations—you can always add sophistication later.
Not testing for edge cases. Provide one or two specific examples of “good” and “bad” responses as part of your prompt (few-shot prompting) for increased eval performance. This helps ground the judge-LLM in what is considered good or bad.
Forgetting to validate eval results against real user feedback. Remember that you’re not just testing code; you’re validating whether your AI can truly solve user problems.
Writing good evals forces you into the shoes of your user—they are how you catch “bad” scenarios and know what to improve on.
What’s next?
Now that you understand the fundamentals, here’s exactly how to start with evals in your own product:
Pick one critical feature of your AI product to evaluate. A common starting point is “hallucination detection” for a chatbot or agent that relies on documents or context you provide it with to answer questions. Try to tackle evaluating a well-defined component in your system before evaluating deeply internal logic.
Write a simple eval checking whether the LLM output correctly references the provided content or invents (hallucinates) information (a sketch of such an eval follows this list).
Run your eval on 5 to 10 representative examples from real interactions that you have collected or created.
Review the results and iterate, refining the eval prompt until accuracy improves.
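To make step 2 concrete, here’s a starting-point hallucination eval. The prompt wording and the example row are illustrative, {context} and {answer} are placeholders you fill per example, and call_judge_llm is the same hypothetical judge-model helper as in the earlier sketches.

```python
# Does the agent's answer stay grounded in the reference context it was given?
HALLUCINATION_EVAL_PROMPT = """
You are a judge, evaluating the answer produced by an AI assistant.

Here is the reference context the assistant was given: {context}
Here is the assistant's answer: {answer}

Determine whether the answer is grounded in the reference context.
"Hallucinated" means the answer contains facts, numbers, or details that do not
appear in (or that contradict) the reference context. "Grounded" means every
claim in the answer is supported by the context.

Respond with exactly one label: "grounded" or "hallucinated".
"""

examples = [  # 5-10 rows of real interactions, labeled by a human
    {
        "context": "The hotel's cancellation policy allows free cancellation up to 48 hours before check-in.",
        "answer": "You can cancel for free up to one week before check-in.",
        "human_label": "hallucinated",
    },
]

for ex in examples:
    label = call_judge_llm(HALLUCINATION_EVAL_PROMPT.format(**ex)).strip()
    print(label, "| human:", ex["human_label"])
```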
For a detailed example of how to build a hallucination eval, check out our guide here, as well as our hands-on course on Evaluating AI Agents.
Looking ahead
As AI products become more complex, the ability to write good evals will become increasingly crucial. Evals are not just about catching bugs; they help ensure that your AI system consistently delivers value and delights your users! Evals are a critical step in going from prototype to production with generative AI.
I would love to hear from you: How are you currently evaluating your AI products? What challenges have you faced? Leave a comment👇