Week 3: Rigorous Measurement of Output Success and Failure · Lesson 3.4

Similarity metrics and retrieval-specific metrics

When can similarity approximate quality, and how do we isolate retrieval failures?

Retired course. Due to the fast pace of AI, this course was retired before full release. Exercises, datasets, and videos referenced in this lesson are not available. The slide content and frameworks remain free to study.

Reader Notes

This is Week 3, Lesson 4. The focus shifts from end-to-end quality metrics to component-level diagnostics: how to measure whether the retrieval system is surfacing the right context before generation even starts. This matters because a broken retrieval layer corrupts the input to every downstream component, and the failure needs to be caught before users complain.

Consider the cascade. The AI Data Analyst retrieves schema context, feeds it to SQL generation, executes the query, and produces a narrative. If retrieval hands over the wrong documents, every later stage faithfully turns incorrect context into a polished but wrong answer. End-to-end metrics might still look fine: the SQL executes and the narrative is well written. But the user gets wrong numbers. That is the silent failure retrieval metrics exist to catch.

By the end of this lesson, it should be clear which retrieval metric fits a given product (precision, recall, MRR, or NDCG), how to compute each one, and how to connect retrieval failures to downstream quality drops. The right metric depends on what the product does with the retrieved documents, which makes it a product decision, not a technical default.

This lesson sits between L3.3 (signal categories) and L3.5 (semantic metrics with judges).
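To make the four metrics named above concrete, here is a minimal sketch of how each one is commonly computed for a single query. It is not the course's own implementation. The function names, the document IDs (`orders_schema` and so on), and the graded relevance values are all hypothetical, chosen only to resemble the schema-retrieval example. Precision@k here divides by `k` even when fewer than `k` documents come back, and NDCG uses the linear-gain form of DCG; other conventions exist.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with linear gain; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: the retriever ranks a stale table first.
retrieved = ["legacy_orders", "orders_schema", "users_schema", "payments_schema"]
relevant = {"orders_schema", "payments_schema"}
gains = {"orders_schema": 2, "payments_schema": 1}  # assumed relevance grades

p3 = precision_at_k(retrieved, relevant, 3)
r3 = recall_at_k(retrieved, relevant, 3)
rr = reciprocal_rank(retrieved, relevant)
n3 = ndcg_at_k(retrieved, gains, 3)
```

In this toy case, one of the two relevant documents falls outside the top 3, so recall@3 is lower than a pipeline that passes only three documents to SQL generation can afford. The reciprocal rank is penalized because the first relevant document sits in position 2, not position 1. Which of these numbers matters most depends on how the product consumes the retrieved list.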
