Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 ev…
Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product. What you’ll learn:
Where to find Shreya Shankar
• LinkedIn: https://www.linkedin.com/in/shrshnk/
• Website: https://www.sh-reya.com/
• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain
• X: https://x.com/HamelHusain
• LinkedIn: https://www.linkedin.com/in/hamelhusain/
• Website: https://hamel.dev/
• Maven course: https://bit.ly/4myp27m

In this episode, we cover:
(00:00) Introduction to Hamel and Shreya
(04:57) What are evals?
(09:56) Demo: Examining real traces from a property management AI assistant
(16:51) Writing notes on errors
(23:54) Why LLMs can’t replace humans in the initial error analysis
(25:16) The concept of a “benevolent dictator” in the eval process
(28:07) Theoretical saturation: when to stop
(31:39) Using axial codes to help categorize and synthesize error notes
(44:39) The results
(46:06) Building an LLM-as-judge to evaluate specific failure modes
(48:31) The difference between code-based evals and LLM-as-judge
(52:10) Example: LLM-as-judge
(54:45) Testing your LLM judge against human judgment
(01:00:51) Why evals are the new PRDs for AI products
(01:05:09) How many evals you actually need
(01:07:41) What comes after evals
(01:09:57) The great evals debate
(01:15:15) Why dogfooding isn’t enough for most AI products
(01:18:23) OpenAI’s Statsig acquisition
(01:23:02) The Claude Code controversy and the importance of context
(01:24:13) Common misconceptions around evals
(01:22:28) Tips and tricks for implementing evals effectively
(01:30:37) The time investment
(01:33:38) Overview of their comprehensive evals course
(01:37:57) Lightning round and final thoughts

LLM Log Open Codes Analysis Prompt:
Referenced:
• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• Mercor: https://mercor.com/
• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b
• Nurture Boss: https://nurtureboss.io/
• Braintrust: https://www.braintrust.dev/
• Andrew Ng on X: https://x.com/andrewyng
• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w
• Julius AI: https://julius.ai/
• Brendan Foody on X, “evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948
• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450
• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729
• Statsig: https://statsig.com/
• Claude Code: https://www.anthropic.com/claude-code
• Cursor: https://cursor.com/
• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor
• Frozen: https://www.imdb.com/title/tt2294629/
• The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire

Recommended books:
• Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935
• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/
• Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955
• Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.