
Week 5: Pipelines, Experiments, and Continuous Validation · Lesson 5.3

Experiment design for stochastic systems

How do we run online tests when outcomes are noisy, long-tailed, and mediated through user behavior?

Retired course. Due to the fast pace of AI, this course was retired before full release. Exercises, datasets, and videos referenced in this lesson are not available. The slide content and frameworks remain free to study.


Reader Notes

The previous lessons covered building offline evaluation metrics and curating test datasets. Now the question is: can this change be shipped to real users?

Here is the gap. Offline metrics describe what happened on a test set. They do not describe what happens when real users query the system in unpredictable ways. Real users ask follow-ups. They reformulate when confused. They abandon when trust breaks. They paste unexpected data into the input field. None of that shows up in offline metrics. Consider the implication: a team can build and run a beautiful evaluation pipeline, see great numbers, and still learn nothing about what actually happens in production.

This lesson teaches how to design experiments for AI systems: systems where running the same query five times produces five different results. That randomness is what "stochastic" means, and the experiment design must account for it, so that ship decisions are backed by evidence, not hope.
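To make the variability concrete, here is a minimal Python sketch, not from the lesson: `run_system` and `judge_score` are hypothetical stand-ins for a model call and an evaluation metric. It scores the same query twenty times and reports a bootstrap confidence interval, because with a stochastic system a single run is a sample, not a measurement.

```python
# A minimal sketch, not from the lesson. All function names are
# hypothetical stand-ins for your real model call and metric.
import random
import statistics

def run_system(query: str, rng: random.Random) -> str:
    # Stand-in for a nondeterministic model call: the same query
    # yields a different response on every run.
    return f"{query} -> variant {rng.randint(0, 99)}"

def judge_score(response: str, rng: random.Random) -> float:
    # Stand-in for a quality metric, simulated as noisy and
    # long-tailed: mostly good scores, occasional large drops.
    score = rng.uniform(0.6, 0.9)
    if rng.random() < 0.1:           # rare failure mode
        score -= rng.expovariate(5)  # heavy-tailed penalty
    return max(0.0, min(1.0, score))

def bootstrap_ci(scores: list[float], n_boot: int = 2000, alpha: float = 0.05):
    # Percentile bootstrap on the mean: resample with replacement,
    # then read off the empirical quantiles.
    rng = random.Random(0)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

rng = random.Random()  # unseeded on purpose: every execution differs
query = "summarize this support ticket"
scores = [judge_score(run_system(query, rng), rng) for _ in range(20)]

lo, hi = bootstrap_ci(scores)
print(f"first run alone:  {scores[0]:.3f}")  # what a single run reports
print(f"mean of 20 runs:  {statistics.mean(scores):.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

The percentile bootstrap is one reasonable choice here because it makes no normality assumption, which a long-tailed score distribution would violate.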
