Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)
- Lenny's Newsletter <lenny+how-i-ai@substack.com>
🎙️ How to build better AI products through data-driven error analysis, evaluation frameworks, and systematic quality improvement
Why is this in your inbox? Because How I AI, hosted by Claire Vo, is part of the Lenny's Podcast Network. Every Monday, we share a 30- to 45-minute episode with a new guest demoing a practical, impactful way they've learned to use AI in their work or life. No pontificating, just specific and actionable advice. Prefer to skip future episode drops? Unsubscribe from How I AI podcast notifications here.

Brought to you by:
• GoFundMe Giving Funds: One account. Zero hassle.
• Persona: Trusted identity verification for any use case

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond "vibe checking" their AI systems to implement data-driven quality-improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.

What you'll learn:
Where to find Hamel Husain:
• Website: https://hamel.dev/
• Twitter: https://twitter.com/HamelHusain
• Course: https://maven.com/parlance-labs/evals
• GitHub: https://github.com/hamelsmu

Where to find Claire Vo:
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/

In this episode, we cover:
(00:00) Introduction to Hamel Husain
(03:05) The fundamentals: why data analysis is critical for AI products
(06:58) Understanding traces and examining real user interactions
(13:35) Error analysis: a systematic approach to finding AI failures
(17:40) Creating custom annotation systems for faster review
(22:23) The impact of this process
(25:15) Different types of evaluations
(29:30) LLM-as-a-Judge
(33:58) Improving prompts and system instructions
(38:15) Analyzing agent workflows
(40:38) Hamel's personal AI tools and workflows
(48:02) Lightning round and final thoughts

Tools referenced:
• Claude: https://claude.ai/
• Braintrust: https://www.braintrust.dev/docs/start
• Phoenix: https://phoenix.arize.com/
• AI Studio: https://aistudio.google.com/
• ChatGPT: https://chat.openai.com/
• Gemini: https://gemini.google.com/

Other references:
• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450
• Nurture Boss: https://nurtureboss.io
• Rechat: https://rechat.com/
• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/
• Lenny's List on Maven: https://maven.com/lenny

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.