Experiment Design for Stochastic Systems

Week 5 Lesson 3 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What is the key risk when your evaluation dataset drifts from the production distribution, and how would you detect it?

Test Set Distribution
Queries from curated historical logs
Precision: 0.82
Production Distribution
Real user queries in production
Precision: 0.67
drift →

Offline evaluation improved the metric — can we ship to all users?

v1 Baseline
Original retrieval system
Pass Rate: 0.72
v2 Improved
Enhanced retrieval quality
Pass Rate: 0.79 ↑ +0.07
Does this mean we should ship v2 to everyone?
Not yet.

AI systems are unpredictable: same query, different results on different runs

Query: "Show sales by region"
Run 1: SELECT region, SUM(sales) FROM...
Run 2: SELECT region, total_sales FROM... GROUP BY 1
Run 3: SELECT r.name, SUM(s.amount) FROM... JOIN...
Same input → different outputs → more noise in your experiment

Define what you're measuring and at what level before collecting data

What Are You Measuring?
(the estimand)
Average treatment effect on sql_success_rate
At What Level?
(unit of randomization)
User-level (each user sees v1 or v2 for entire experiment)
User-level randomization avoids carryover effects but requires more users than query-level.
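User-level assignment is often implemented with deterministic hashing, so every query from a user lands in the same arm for the whole experiment. A minimal sketch (the experiment name `retrieval_v2` and the 50/50 split are assumptions, not from the lesson):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "retrieval_v2") -> str:
    """Deterministically assign a user to v1 or v2 for the whole experiment.

    Hashing (experiment, user_id) keeps assignment stable across queries,
    which avoids carryover between variants for the same user.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "v2" if bucket < 50 else "v1"  # assumed 50/50 split

# Same user, same arm, every time:
assert assign_variant("user_42") == assign_variant("user_42")
```

Because assignment is a pure function of the IDs, no assignment table is needed and any service can recompute a user's arm.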

Primary, secondary, and guardrail metrics — all defined before the experiment

1
Primary: sql_success_rate
What you're trying to improve
2
Secondary: task_completion_rate
Supporting evidence that SQL improvement translates to user value
3
Guardrails: avg_latency_ms, avg_cost_usd
Must not degrade — blocking metrics from L4.6 (must pass to ship)

Which guardrail metric is most likely to regress when retrieval improves?

Two options:
avg_latency_ms
avg_cost_usd
Write your prediction before continuing.

Sample size calculation accounts for random variation in AI systems

Baseline rate: 0.65
Current sql_success_rate in production
Smallest improvement worth detecting (MDE): 0.03
You want to detect at least a 3 percentage point improvement
How noisy are your measurements?
AI systems have higher variance — more noise from random outputs
Required sample size per group: ~8,000 users
Higher variance → larger sample size
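The ~8,000 figure can be sanity-checked with a standard two-proportion sample-size calculation. The sketch below uses only the stdlib; the `inflation` factor standing in for the extra run-to-run noise of stochastic AI outputs is a hypothetical knob, not a method from the lesson:

```python
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80, inflation=1.0):
    """Per-group sample size for detecting a difference in two proportions,
    via the normal approximation. `inflation` is a hypothetical
    variance-inflation factor for noisy AI-system measurements."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_new = p_base + mde
    p_bar = (p_base + p_new) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_base * (1 - p_base) + p_new * (1 - p_new)) ** 0.5) ** 2
    return int(inflation * numerator / mde ** 2) + 1

# ~3,900 users/group under textbook i.i.d. assumptions; roughly doubles
# once extra output variance is folded in (toward the slide's ~8,000).
print(sample_size_per_group(0.65, 0.03))
print(sample_size_per_group(0.65, 0.03, inflation=2.0))
```

Note how the required n scales with 1/MDE²: halving the detectable effect quadruples the sample size.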

Decision rules written before seeing results prevent p-hacking

Outcome | Rule
Ship | Primary improves by MDE or more with p<0.05 AND all guardrails pass AND no segment regression
Ramp | Primary directionally positive but CI too wide AND all guardrails pass
Hold | CI too wide (underpowered) or mixed signals
Rollback | Any guardrail fails OR primary degrades significantly
Write these BEFORE looking at results (prevents p-hacking — cherry-picking thresholds).
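One way to make these rules binding is to pre-register them as code. The sketch below is one possible encoding of the table, not the lesson's exact procedure; in particular, the ramp/hold boundary (taken here as whether the point estimate clears the MDE) is a judgment call:

```python
def decide(effect, ci_low, p_value, mde, guardrails_pass, segment_regression):
    """Pre-registered ship/ramp/hold/rollback decision, written before results."""
    # Rollback: any guardrail fails, or the primary degrades significantly.
    if not guardrails_pass or (effect < 0 and p_value < 0.05):
        return "rollback"
    # Ship: primary clears the MDE, significant, no segment regression.
    if effect >= mde and p_value < 0.05 and not segment_regression:
        return "ship"
    # Ramp: point estimate clears the MDE but the CI still crosses zero.
    if effect >= mde and ci_low < 0 and not segment_regression:
        return "ramp"
    # Everything else: hold (underpowered or mixed signals).
    return "hold"

# This lesson's experiment: +2.9pp, CI [-0.3pp, +6.1pp], p=0.07, guardrails green
print(decide(0.029, -0.003, 0.07, 0.03, True, False))  # -> hold
```

Checking rules in a fixed order (rollback first, ship last resortable to hold) removes the temptation to re-rank criteria after seeing results.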

Primary metric improved +2.9pp but confidence interval crosses zero — not significant

Metric | Control | Treatment | Effect [95% CI] | p-value
sql_success_rate (primary) | 0.65 | 0.679 | +0.029 [-0.003, +0.061] | 0.07
task_completion_rate (secondary) | 0.58 | 0.60 | +0.020 [-0.005, +0.045] | 0.12
Directionally positive but CI crosses zero — cannot rule out no effect
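Numbers like these come from a difference-in-proportions test. A minimal stdlib sketch; the per-group counts below are hypothetical, chosen only to be consistent with the slide's rates (0.65 vs. 0.679), since the actual counts aren't given:

```python
from statistics import NormalDist

def diff_in_proportions(x_c, n_c, x_t, n_t, alpha=0.05):
    """Treatment effect on a rate, with a normal-approximation
    confidence interval and two-sided p-value."""
    p_c, p_t = x_c / n_c, x_t / n_t
    effect = p_t - p_c
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(effect) / se))
    return effect, (effect - z * se, effect + z * se), p_value

# Hypothetical counts matching the slide's rates:
effect, (lo, hi), p = diff_in_proportions(1086, 1670, 1134, 1670)
print(f"{effect:+.3f} [{lo:+.3f}, {hi:+.3f}] p={p:.2f}")
# CI crosses zero -> cannot rule out no effect
```

The point estimate being positive is not enough; the decision rules key on whether the interval excludes zero.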

Guardrails passed — but the primary metric didn't clear the bar

Guardrail | Threshold | Observed Change | Status
avg_latency_ms | Max +10% | +7% | PASS
avg_cost_usd | Max +15% | +12% | PASS
Guardrails green, primary ambiguous
Both guardrails passed, but the primary metric isn't significant. Decision: hold for more data.
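A guardrail check reduces to comparing each observed relative change against its pre-set ceiling. A minimal sketch (metric names from the slide; representing changes as fractions, e.g. +7% as 0.07, is an assumption):

```python
def check_guardrails(observed, thresholds):
    """PASS each guardrail whose observed relative change stays at or
    below its pre-registered maximum; FAIL otherwise."""
    return {metric: ("PASS" if observed[metric] <= ceiling else "FAIL")
            for metric, ceiling in thresholds.items()}

status = check_guardrails(
    observed={"avg_latency_ms": 0.07, "avg_cost_usd": 0.12},
    thresholds={"avg_latency_ms": 0.10, "avg_cost_usd": 0.15},
)
print(status)  # both PASS here, yet the primary metric still didn't clear the bar
```

Because any single FAIL maps to rollback in the decision rules, guardrails act as blocking checks, independent of how good the primary looks.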

Design an experiment and analyze pre-computed results

Base Exercise
All students | 20-25 min

Run power analysis, compute treatment effects, check guardrails, fill in Experiment Design One-Pager
Extend Exercise
DS/Eng | +10-15 min

Segment-level analysis by user_role, check guardrails per segment, update one-pager with segment findings

Experiment Design One-Pager with evidence-based hold decision

Primary Metric: sql_success_rate +2.9pp [-0.3pp, +6.1pp]
Directionally positive but CI crosses zero — not statistically significant (p=0.07)
Guardrails: all PASS (latency +7%, cost +12%)
Both guardrails within acceptable thresholds
Decision
Hold — not significant
Next Action
Collect more data or investigate interference

Hypothetical: what if primary was significant but a guardrail failed?

What the team sees
+4pp ✓
sql_success_rate improved (hypothetical)
What the team ignores
+18% ✗
avg_cost_usd exceeded threshold
Guardrails exist to prevent tunnel vision.

Segment shows +5% for PMs but -3% for Engineers — do you ship?

Segment | Treatment Effect
PM users | +5%
Engineering users | -3%
Overall effect is +2%. Do you ship? Why or why not?
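The +2% overall figure is just a traffic-weighted average of the segment effects, which is exactly how a positive headline number can hide a regressing segment. The split below (625 PM vs. 375 engineering users) is hypothetical, chosen to reproduce the slide's numbers:

```python
def overall_effect(segments):
    """Pooled effect as the traffic-weighted average of per-segment effects.
    segments: {name: (effect, n_users)}"""
    total = sum(n for _, n in segments.values())
    return sum(eff * n for eff, n in segments.values()) / total

# Hypothetical traffic split consistent with the slide: +5% PM, -3% Eng -> +2%
segments = {"pm": (0.05, 625), "eng": (-0.03, 375)}
print(round(overall_effect(segments), 3))
```

This is why the decision rules include "no segment regression" as its own shipping criterion rather than trusting the pooled effect alone.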

Phase 1: Design — before touching data

Phase 1: Design
What + Who
Estimand + Unit
Metrics
Primary, secondary, guardrails
Sample Size
Power analysis
Stop Conditions
When to end
Decision Rules
Ship/ramp/hold/rollback
Write all of this BEFORE looking at results

Phase 2: Analysis — compute effects, check guardrails

Phase 2: Analysis
Treatment Effect + CI
Magnitude and confidence interval
Guardrail Check
PASS → continue | FAIL → STOP
Segment Analysis
Check for segment regressions
If any guardrail fails, STOP → consider rollback

Phase 3: Decision — ship, ramp, hold, or rollback

Phase 3: Decision
Ship
Primary ↑, guardrails pass, no segment regression
Ramp
Directionally positive, CI wide, guardrails pass
Hold
Underpowered or mixed signals
Rollback
Any guardrail fails OR primary degrades

Next: Interference-aware design

When one user's treatment affects another user's outcome (interference)
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 3 | aianalystlab.ai