Launch Readiness

Week 5 Lesson 5 — AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What did the experiment show about the treatment effect on SQL success rate, and did the guardrail metrics pass or fail?

Treatment effect
 
Guardrail status
 

Most AI launches fail not because the model is bad, but because the infrastructure isn't ready

Experiment passed
+4.8pp SQL success (switchback), guardrails green
Engineering asks 3 questions that stop the conversation
 

Is monitoring infrastructure ready to detect silent failures? What if quality degrades 5% overnight — do you have a runbook? Have you validated that offline metrics predict what users experience?

Launch readiness is a sequence of gates with entry and exit criteria at each stage

Shadow (0% users) → Canary (1-5%) → Ring 1 (10-25%) → Full Rollout (100%)
Entry criteria
What must be true to proceed (e.g., monitoring is live, shadow passed)
Exit criteria
What must hold to advance out of the stage (e.g., all blocking metrics pass)
Rollback triggers
What forces a hold or rollback (e.g., cost spikes 20%, latency exceeds threshold)
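The gates above can be encoded as explicit checks rather than tribal knowledge. A minimal sketch: stage names follow the slides, but the specific criterion flags are illustrative assumptions.

```python
# Sketch: each rollout stage's entry criteria as named boolean checks.
# Stage names follow the slides; the criterion names are illustrative.

STAGES = {
    "shadow": {"traffic": "0%",     "entry": ["offline_eval_passed"]},
    "canary": {"traffic": "1-5%",   "entry": ["shadow_passed", "monitoring_live"]},
    "ring1":  {"traffic": "10-25%", "entry": ["canary_passed_3_days"]},
    "full":   {"traffic": "100%",   "entry": ["ring1_passed_1_week", "drift_monitoring_active"]},
}

def can_enter(stage: str, state: dict) -> bool:
    """True only if every entry criterion for the stage is satisfied."""
    return all(state.get(criterion, False) for criterion in STAGES[stage]["entry"])

state = {"offline_eval_passed": True, "shadow_passed": True, "monitoring_live": False}
print(can_enter("shadow", state))  # True
print(can_enter("canary", state))  # False: monitoring not live yet
```

Writing the criteria down as data (not prose) is what makes the later gate dashboard and decision document mechanical to produce.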

Shadow deployment validates offline-online agreement without user exposure

Offline Eval (Judge Pipeline)
Judge-based SQL success rate
Shadow (Production)
Offline SQL (judge) · Oracle correctness (execution) · Online task completion (behavior)
Gap between offline metrics and real user behavior
The difference between what your eval pipeline measures (offline) and what users actually experience (online)
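Offline-online agreement can be measured directly once you have paired per-query outcomes. A minimal sketch with hypothetical data: each pair is (judge says correct, user completed task).

```python
# Sketch: offline-online agreement on paired per-query outcomes.
# `results` is hypothetical: (judge_says_correct, user_completed_task).

results = [
    (True, True), (True, True), (True, False),
    (True, False), (False, False), (True, True),
]

agreement = sum(o == u for o, u in results) / len(results)
offline_rate = sum(o for o, _ in results) / len(results)
online_rate = sum(u for _, u in results) / len(results)

print(f"agreement={agreement:.2f}")  # how often judge and behavior agree
print(f"offline={offline_rate:.2f} online={online_rate:.2f} "
      f"gap={offline_rate - online_rate:.2f}")  # judge overestimates here
```

Note the two failure directions are not symmetric: a judge that passes queries users abandon inflates offline numbers, which is exactly the gap shadow deployment exists to surface.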

Three dimensions — technical validation, operational readiness, decision governance

1. Technical Validation: offline metrics, experiment results, shadow agreement, canary validation
2. Operational Readiness: monitoring alerts, on-call rotation, runbooks, cost budget, compliance
3. Decision Governance: who approves, what evidence is required, what triggers rollback
Rollout Decision Document: synthesizes all three dimensions into a single artifact

What is the most likely way shadow results will differ from your offline evaluation results?

Write your prediction and explain what causes the gap between offline eval and production behavior.

Offline SQL success is 79%, online task completion is 71% — 8-point gap means offline metrics overestimate quality

Metric                                  Value  95% CI
Offline SQL success (judge pipeline)    0.79   [0.74, 0.84]
Online task completion (user behavior)  0.71   [0.65, 0.77]
Gap between offline and online quality  0.08   [0.02, 0.14]
The CI excludes zero — the gap is real, not noise.
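Whether the CI excludes zero can be checked with a standard two-proportion normal approximation. A sketch using the rates above; the sample size of 500 queries per arm is an assumption for illustration only, so the interval will not match the table exactly.

```python
import math

# Sketch: normal-approximation 95% CI for the difference of two proportions.
# Rates come from the slide (0.79 vs 0.71); n=500 per arm is assumed.

def gap_ci(p_off, n_off, p_on, n_on, z=1.96):
    gap = p_off - p_on
    se = math.sqrt(p_off * (1 - p_off) / n_off + p_on * (1 - p_on) / n_on)
    return gap, gap - z * se, gap + z * se

gap, lo, hi = gap_ci(0.79, 500, 0.71, 500)
print(f"gap={gap:.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
print("CI excludes zero: gap is real" if lo > 0 else "CI includes zero: could be noise")
```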

Oracle validation (74%) falls between judge (79%) and online proxy (71%) — three metrics, three values

Judge-based (LLM scores SQL correctness): 79%
Execution-based (Oracle queries validate results): 74%
Behavior-based (user task completion in production): 71%
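The three values come from scoring the same queries three ways. A sketch with hypothetical triples, (judge pass, oracle pass, task completed), showing the typical ordering where the judge is most lenient and user behavior is strictest:

```python
# Sketch: triangulating quality with three measurements per query.
# `queries` is hypothetical: (judge_pass, oracle_pass, task_completed).

queries = [
    (True,  True,  True),
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, False),
    (False, False, False),
]

n = len(queries)
judge_rate  = sum(j for j, _, _ in queries) / n
oracle_rate = sum(o for _, o, _ in queries) / n
online_rate = sum(u for _, _, u in queries) / n
print(f"judge={judge_rate:.2f} oracle={oracle_rate:.2f} online={online_rate:.2f}")
```

When the three disagree per query (not just in aggregate), the disagreeing rows are the highest-value items to inspect by hand.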

Canary gate check — four blocking metrics with thresholds and pass/warn/fail status

Metric                 Observed  95% CI           Threshold  Status
SQL success rate       0.76      [0.71, 0.81]     ≥0.75      PASS
Avg latency (ms)       1850      [1720, 1980]     ≤2000      PASS
Avg cost (USD)         $0.14     [$0.12, $0.16]   ≤$0.15     PASS
Safety violation rate  0.008     [0.004, 0.012]   ≤0.01      PASS
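A sketch of the gate check itself, graded on point estimates as in the table; thresholds and directions are taken from the Threshold column.

```python
# Sketch: canary blocking-metric gate, graded on point estimates.
# Thresholds and directions follow the table above.

THRESHOLDS = {
    "sql_success":      (0.75, "higher"),  # >= 0.75
    "latency_ms":       (2000, "lower"),   # <= 2000 ms
    "cost_usd":         (0.15, "lower"),   # <= $0.15
    "safety_violation": (0.01, "lower"),   # <= 0.01
}

def gate_check(observed: dict) -> dict:
    statuses = {}
    for metric, (threshold, direction) in THRESHOLDS.items():
        value = observed[metric]
        ok = value >= threshold if direction == "higher" else value <= threshold
        statuses[metric] = "PASS" if ok else "FAIL"
    return statuses

observed = {"sql_success": 0.76, "latency_ms": 1850,
            "cost_usd": 0.14, "safety_violation": 0.008}
print(gate_check(observed))  # all four blocking metrics PASS
```

A stricter variant grades on the confidence interval instead: require the whole CI to clear the threshold and mark WARN when it straddles, which the SQL-success, cost, and safety rows above all would.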

Build the complete Rollout Decision Document with entry criteria, rollback triggers, and operational readiness checklist

Base version (20-25 min)
  • Compute canary blocking metrics with CIs
  • Run canary gate check (pass/warn/fail)
  • Compare canary to experiment results
  • Produce gate status dashboard
  • Define entry criteria for all 4 stages
  • Define rollback triggers
  • Build operational readiness checklist
  • Write Rollout Decision Document (ship/ramp/hold recommendation with supporting evidence)
Extend version (DS/Eng, +10-15 min)
  • Compute offline-online agreement
  • Investigate which queries drive disagreement
  • Sensitivity analysis on rollback triggers
  • Propose 3 additional monitoring signals
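For the extend version's disagreement investigation, one useful first cut is grouping disagreeing queries by segment. A sketch with hypothetical rows: (query id, user segment, judge verdict, user task completion).

```python
from collections import Counter

# Sketch: locate which queries drive offline-online disagreement.
# `rows` is hypothetical: (query_id, segment, judge_pass, task_completed).

rows = [
    ("q1", "Executive", True,  False),
    ("q2", "Executive", True,  False),
    ("q3", "Analyst",   True,  True),
    ("q4", "Analyst",   False, False),
    ("q5", "Executive", True,  True),
]

disagreeing = [(qid, seg) for qid, seg, judge, user in rows if judge != user]
by_segment = Counter(seg for _, seg in disagreeing)
print(by_segment)  # disagreement concentrated in one segment
```

If disagreement concentrates in one segment, that is the segment-level validation the gate dashboard's recommendation calls for.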

Gate status dashboard showing canary blocking metrics with color-coded pass/warn/fail and recommendation

Canary Gate Status — v2 AI Data Analyst
SQL success 0.76 · Gate: PASS · Recommendation: RAMP
Proceed to Ring 1 (25% traffic) with enhanced monitoring. Offline-online gap of 7% warrants segment-level validation.
Portfolio-ready artifact

Ignoring offline-online gaps, skipping rollback triggers, treating shadow as launch approval, operational readiness as afterthought

Ignore offline-online agreement gaps
Offline says 79%, online is 71% — ship based on inflated confidence
Skip rollback trigger definition
Cost spikes 20% in week 2 — PM, eng lead, and finance argue while cost accumulates
Treat shadow success as launch approval
Shadow looks good, skip canary — miss verbose narratives confusing Executive users
Operational readiness as afterthought
2am latency spike — on-call doesn't know how to diagnose or roll back

Shadow shows 82% offline but 74% online — proceed or hold? Canary cost spikes 18% — what do you do?

Scenario 1
Shadow stage results: offline SQL success rate (judge pipeline) is 82%, online task completion rate (user behavior) is 74%. What does this 8-point gap tell you about your offline metric? Should you proceed to canary, or hold and investigate the gap first?
Scenario 2
During canary (5% traffic), avg_cost_usd increases by 18% compared to baseline. Your L4.6 metric spec defined cost as an optimization metric (not blocking), but your CFO notices the spike. Do you proceed to Ring 1, ramp more slowly, or roll back?

Progressive rollout: Shadow → Canary → Ring 1 → Full Rollout with entry criteria and rollback path

Shadow (0%)
Run v2 in prod, don't serve outputs
Entry: offline eval passed
Exit: metrics match offline
Canary (1-5%)
Expose to small segment
Entry: shadow passed, monitoring live
Exit: all blocking metrics pass
Ring 1 (10-25%)
Expand to larger segment
Entry: canary passed 3+ days
Exit: all segments validated
Full Rollout (100%)
Complete migration
Entry: Ring 1 passed 1+ week, drift monitoring active
ROLLBACK path from any stage
Blocking metric fails OR cost overrun OR user complaints spike
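The rollback path reduces to an OR over explicit triggers, so it can be a single function that any on-call engineer (or an alert) can run. The specific thresholds below are illustrative assumptions, not the course's numbers.

```python
# Sketch: rollback path as an OR over named triggers (blocking metric fails,
# cost overrun, complaints spike, per the slide). Thresholds are illustrative.

def should_rollback(snapshot: dict) -> bool:
    triggers = {
        "blocking_metric_fails": snapshot["sql_success"] < 0.75,
        "cost_overrun":          snapshot["cost_vs_baseline"] > 1.20,  # +20%
        "complaint_spike":       snapshot["complaints_per_1k"] > 5.0,
    }
    return any(triggers.values())

healthy = {"sql_success": 0.77, "cost_vs_baseline": 1.10, "complaints_per_1k": 1.5}
overrun = {"sql_success": 0.77, "cost_vs_baseline": 1.25, "complaints_per_1k": 1.5}
print(should_rollback(healthy))  # False
print(should_rollback(overrun))  # True: cost overrun fires
```

Defining this before launch is the point of the anti-pattern slide: when cost spikes in week 2, the function answers instead of a three-way argument.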

Next: Monitoring for Drift

You shipped. Now what? Lesson 5.6 shows you how to monitor production quality without evaluating every query, detect silent failures before users complain, and build drift alerts that fire only when quality actually degrades.
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 5 | aianalystlab.ai