
Week 3: Rigorous Measurement of Output Success and Failure · Lesson 3.5

Semantic metrics with human and model judges

When similarity is not enough, how do we measure whether the output helped the user?

Retired course. Due to the fast pace of AI, this course was retired before full release. Exercises, datasets, and videos referenced in this lesson are not available. The slide content and frameworks remain free to study.


Reader Notes

This is Week 3, Lesson 5. Up to now, retrieval quality has been the evaluation focus: whether the right documents came back. But retrieval is only half the story, and the easier half. This lesson evaluates the narrative generated from those documents, which requires semantic evaluation: using an LLM as a judge.

This is one of the most important lessons in the course, because judge design is where most teams go wrong. They build a judge on day one, assume it works, and ship it. Six months later they realize the judge was grading everything wrong. This lesson shows how to do it right.

Retrieval metrics tell you whether the system found the right documents. But the user never sees the documents; they see the narrative. If the narrative hallucinates or omits key findings, retrieval metrics won't catch it. That's why semantic evaluation is critical.
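To make the LLM-as-judge idea concrete, here is a minimal sketch of a faithfulness-and-completeness judge. It is an illustrative assumption, not the judge design taught in this lesson: the function names (`judge_narrative`, `call_llm`), the rubric wording, and the yes/no output format are all placeholders you would replace with your own prompt and provider client.

```python
# Minimal LLM-as-judge sketch: grade a generated narrative against its
# source documents. All names and the rubric below are illustrative
# assumptions, not the course's actual judge design.

from typing import Callable

JUDGE_PROMPT = """You are grading a generated narrative against its source documents.

Source documents:
{documents}

Generated narrative:
{narrative}

Answer two questions:
1. FAITHFUL: Does every claim in the narrative appear in the source documents? (yes/no)
2. COMPLETE: Does the narrative include the key findings from the documents? (yes/no)

Respond with exactly two lines, for example:
FAITHFUL: yes
COMPLETE: no
"""


def judge_narrative(
    documents: str,
    narrative: str,
    call_llm: Callable[[str], str],
) -> dict[str, bool]:
    """Ask a judge model whether the narrative is faithful and complete.

    `call_llm` is any function that takes a prompt string and returns the
    model's text response; plug in your own provider's client here.
    """
    prompt = JUDGE_PROMPT.format(documents=documents, narrative=narrative)
    response = call_llm(prompt)

    # Parse the two expected verdict lines; anything unparseable counts as a fail.
    verdict = {"faithful": False, "complete": False}
    for line in response.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in verdict:
            verdict[key] = value.strip().lower().startswith("yes")
    return verdict


if __name__ == "__main__":
    # Stubbed judge response for demonstration; replace with a real model call.
    fake_llm = lambda prompt: "FAITHFUL: yes\nCOMPLETE: no"
    print(judge_narrative("Doc: revenue grew 12%.", "Revenue grew 12%.", fake_llm))
```

Note that a judge like this still has to be validated against human labels before you trust its scores; that validation step is exactly where the "build it on day one and ship it" failure mode comes from.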
