
Week 5: Pipelines, Experiments, and Continuous Validation · Lesson 5.2

Test set strategy and dataset lifecycle

How do we iterate fast without overfitting our evaluation?

Retired course. Due to the fast pace of AI, this course was retired before full release. Exercises, datasets, and videos referenced in this lesson are not available. The slide content and frameworks remain free to study.


Reader Notes

This is Lesson 5.2, Dataset Lifecycle. In the previous lesson, we built the evaluation pipeline: six stages and a warehouse that stores every run with full metadata. That is a real achievement. But this lesson asks a harder question: what happens when the data flowing through that pipeline stops representing reality?

All of that infrastructure assumes the evaluation dataset is valid, meaning it reflects the kinds of queries users actually ask. If the dataset drifts away from production, every result in the warehouse is measured against the wrong baseline. The metrics look great; the users have a different experience.

This lesson covers the four-stage lifecycle framework that keeps evaluation datasets valid as the product and its users evolve. The stages are creation, development, test set management, and retirement, and each has explicit transition gates. The lesson explains why splitting data before any metric work is non-negotiable, why versioning matters as much for datasets as it does for code, and how to detect when a test set has drifted too far from production.

Five governing rules exist to prevent two failure modes: teaching to the test, where optimizing the judge against specific examples costs you generalizability; and staleness, where the test set no longer matches what users actually do.
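To make the three mechanics concrete (splitting before metric work, versioning the dataset like code, and checking for drift against production), here is a minimal sketch. This is not the course's actual tooling; the function names (`split_dataset`, `dataset_version`, `drift_score`) and the total-variation-distance drift measure are illustrative choices, not something the lesson prescribes.

```python
import hashlib
import json
import random
from collections import Counter

def split_dataset(examples, test_fraction=0.2, seed=13):
    """Split BEFORE any metric work, so the test set is never used for tuning.

    Returns (dev_set, test_set). A fixed seed keeps the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def dataset_version(examples):
    """Content hash of the dataset, so every eval run in the warehouse can
    record exactly which test set version it was measured against."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def drift_score(test_labels, production_labels):
    """Total variation distance between two category distributions,
    e.g. query topics in the test set vs. a recent production sample.
    0.0 = identical distributions, 1.0 = completely disjoint."""
    test_dist = Counter(test_labels)
    prod_dist = Counter(production_labels)
    n_test, n_prod = len(test_labels), len(production_labels)
    categories = set(test_dist) | set(prod_dist)
    return 0.5 * sum(
        abs(test_dist[c] / n_test - prod_dist[c] / n_prod) for c in categories
    )
```

In this sketch, a drift score above some agreed threshold (say 0.2) would trigger the retirement gate: the test set is refreshed from production traffic, re-split, and re-versioned, and the new version hash is logged alongside subsequent runs.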
