Free Course + Certification
AI Evals for Product Development
36 lessons across 6 weeks. Learn to evaluate, measure, and ship AI-powered product features with confidence. Earn a certificate for each weekly quiz and take a final certification exam. Free and open.
Start Lesson 1.1
About this course: Due to the fast pace of AI, this course was retired before full release. Slides and speaker notes are complete for all 36 lessons, but exercises, datasets, and video walkthroughs referenced in some lessons are not available. If you're looking for the future of AI evals with live instruction, check out Automate AI Evals with Claude Code.
What AI evaluation is and why it requires a different approach
Why don't the evaluation methods I already know work for AI-powered product features?
Start lesson →
Product evaluation framework for AI systems
What is the full evaluation surface for an AI feature, from inputs to user outcomes?
Start lesson →
Failure surfaces and annotation-based analysis
Where do AI systems break in practice, and how do we turn failures into structured evidence?
Start lesson →
Distributional thinking for AI quality
How do we reason about AI behavior when outputs vary across inputs, contexts, and users?
Start lesson →
The cost-latency-quality frontier
How do we reason about quality improvements that change cost and latency?
Start lesson →
Week 1 Quiz
5 questions · Earn your weekly certificate
What to log and why it matters for AI evaluation
What evidence do I need to capture so evaluation is possible and decisions are auditable?
Start lesson →
How little instrumentation is enough?
When is input/output logging enough, and when do I need deeper traces?
Start lesson →
Trace design and reproducibility
If we need to explain or reproduce a behavior, what must be true about our trace design?
Start lesson →
The regression safety net — evaluation in CI/CD
How do we prevent regressions before a deploy when outputs are non-deterministic?
Start lesson →
How instrumentation requirements differ by system type
What must be captured for each system type so we can diagnose the failures that matter?
Start lesson →
Designing instrumentation that scales with the product
How do we keep observability useful at scale without drowning in data?
Start lesson →
Week 2 Quiz
5 questions · Earn your weekly certificate
Grounding evaluation in user value
What does "good" mean for this feature from the user's perspective, not the model's?
Start lesson →
Ground truth sources, regression suites, and synthetic data
What are we comparing against, and how do we cover the long tail?
Start lesson →
Deriving evaluation signals from available ground truth
Given imperfect ground truth, what signals can we compute and what are they for?
Start lesson →
Similarity metrics and retrieval-specific metrics
When can similarity approximate quality, and how do we isolate retrieval failures?
Start lesson →
Semantic metrics with human and model judges
When similarity is not enough, how do we measure whether the output helped the user?
Start lesson →
Scaling semantic evaluation with statistical confidence
How do we scale judgment while preserving measurement credibility?
Start lesson →
Week 3 Quiz
5 questions · Earn your weekly certificate
Metric strategy — blocking metrics vs optimization metrics
How do we build a metric system that supports decisions without breaking the link to user value?
Start lesson →
Metric design patterns for AI features
What metric archetype fits this feature and user workflow?
Start lesson →
Metric validation — correlation to outcomes and sensitivity to change
How do we know this metric is meaningful, not just convenient?
Start lesson →
Segmentation strategy for AI systems
Where does the system work well, where does it fail, and how do we structure segments to see it?
Start lesson →
Driver analysis — explaining variance and choosing what to change first
What is driving performance differences, and what should we change first?
Start lesson →
Metric specifications, thresholds, baselines, and release criteria
What does "good enough to ship" mean in metric terms, and how do we prevent regressions?
Start lesson →
Week 4 Quiz
5 questions · Earn your weekly certificate
Evaluation pipeline architecture and environments
How do we operationalize evaluation so it is repeatable, queryable, and tied to outcomes?
Start lesson →
Test set strategy and dataset lifecycle
How do we iterate fast without overfitting our evaluation?
Start lesson →
Experiment design for stochastic systems
How do we run online tests when outcomes are noisy, long-tailed, and mediated through user behavior?
Start lesson →
Launch readiness and rollout gates
What must be true before exposure, and what do we do if quality degrades?
Start lesson →
Monitoring for drift and regressions
How do we detect silent failures in production without evaluating everything?
Start lesson →
Building evaluation automation end-to-end
How do we connect offline evaluation, CI/CD checks, and production monitoring into one system?
Start lesson →
Capstone lab — evaluation pipeline build
Can you build a working evaluation pipeline that runs sampling, judging, aggregation, and reporting?
Start lesson →
Week 5 Quiz
5 questions · Earn your weekly certificate
Decision-making under uncertainty
Given conflicting evidence, what decision is justified?
Start lesson →
Translating evaluation signals to product actions
Given what we observed, what should we change next, and how will we know it helped?
Start lesson →
Prioritization and iteration using evaluation evidence
What should we fix first, and how do we learn fast?
Start lesson →
Ownership model — the AI Reliability Lead
Who owns the data, the metric, and the ship decision, and how do we prevent ambiguity?
Start lesson →
Evaluation cadence and governance
How do we run evaluation as a system that stays tied to development and production?
Start lesson →
Communicating AI product impact
How do I report impact in a way that earns trust and drives alignment?
Start lesson →
Week 6 Quiz
5 questions · Earn your weekly certificate
Ready to go deeper?
Explore our paid courses for live cohorts, hands-on labs, and portfolio-ready artifacts with instructor feedback.