Monitoring for Drift
Week 5 Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab
How do you know whether your launch decision was correct?
Launch Readiness (L5.5)
Evidence Sufficiency
Ship Decision Memo
You shipped v2 three weeks ago — offline metrics said it was better, but users are complaining
Pre-Launch Evidence
✓ Must-pass metrics cleared
✓ Experiment confirmed improvement
✓ Staged rollout: no regressions
✓ Ship decision memo: approved
Week 3 Post-Launch
⚠ User complaints
⚠ CS escalations: "used to work"
⚠ Unusual SQL patterns detected
⚠ "Are we sure v2 is still working?"
The Monitoring Problem
50,000 queries/day · Evaluating all = $2,000/day + 800ms latency · You sample 100 traces (0.2%) — and miss the silent failures.
Signal-based filtering allocates your finite evaluation budget where it matters most
Random Sampling vs Signal-Based Filtering
Selection: 100 traces uniformly selected vs 100 traces weighted by risk signals
Weights: equal probability for every trace vs negative feedback 3x · high uncertainty 2x · failure-prone segment 2x
Passing rate: 85% vs 78%
Problem detection: low (misses edge cases) vs high (oversamples problems)
Signal-based filtering = prioritize traces with warning signs (negative feedback, low confidence scores, historically tricky query types). Same budget, smarter allocation.
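The weighting scheme can be sketched in a few lines. The trace fields, the uncertainty cutoff, and the failure-prone segment name below are illustrative assumptions, not a prescribed API; only the 3x/2x/2x weights come from the lesson.

```python
import random

def trace_weight(trace):
    """Multiply risk-signal weights (3x / 2x / 2x, per the lesson)."""
    w = 1.0
    if trace["negative_feedback"]:
        w *= 3.0
    if trace["uncertainty"] > 0.7:          # "low confidence" cutoff (assumed)
        w *= 2.0
    if trace["segment"] == "comparison":    # historically failure-prone (assumed)
        w *= 2.0
    return w

def sample_traces(traces, budget=100, seed=0):
    """Weighted sampling with replacement -- a sketch, not production sampling."""
    rng = random.Random(seed)
    weights = [trace_weight(t) for t in traces]
    return rng.choices(traces, weights=weights, k=budget)

# Illustrative population: 10% of traces carry negative feedback.
population = [
    {"id": i, "negative_feedback": i % 10 == 0, "uncertainty": 0.2, "segment": "lookup"}
    for i in range(1000)
]
sample = sample_traces(population, budget=100)
frac_negative = sum(t["negative_feedback"] for t in sample) / len(sample)
```

With a 3x weight on a 10% base rate, negative-feedback traces make up roughly a quarter of the sample instead of a tenth: same 100-trace budget, more of it spent where failures are likely.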
Input drift: the distribution of user queries changes over time
Baseline (Days 1-14): lookup 25% · trend 30% · compare 25% · agg 20%
Monitoring (Days 22-28): lookup 22% · trend 24% · compare 35% · agg 19%
PSI (Population Stability Index) = 0.32 — major drift
PSI measures how much a distribution shifted: < 0.1 stable | 0.1-0.25 minor | > 0.25 major drift · +10 percentage point shift in comparison queries
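PSI is straightforward to compute from two binned distributions. A minimal sketch using the standard formula with natural log; note that PSI values are sensitive to bin granularity, so compare against the bands (< 0.1 stable, 0.1-0.25 minor, > 0.25 major) using one fixed binning scheme rather than across schemes.

```python
import math

def psi(baseline, current, eps=1e-6):
    """Population Stability Index: sum((cur - base) * ln(cur / base)) over bins."""
    total = 0.0
    for b, c in zip(baseline, current):
        b = max(b, eps)  # guard against empty bins
        c = max(c, eps)
        total += (c - b) * math.log(c / b)
    return total

def classify(value):
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "minor drift"
    return "major drift"

# Intent shares from the lesson: lookup, trend, comparison, aggregation
baseline = [0.25, 0.30, 0.25, 0.20]
current = [0.22, 0.24, 0.35, 0.19]
value = psi(baseline, current)
```

Over only four coarse intent buckets the formula gives a small number; production PSI is usually computed over finer-grained bins, which is why dashboard values can be larger for the same underlying shift.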
Output drift: model response patterns change even when inputs stay the same
Baseline (Days 1-14), SQL clause count: mean 2.3, std 1.1
Monitoring (Days 22-28), SQL clause count: mean 2.8, std 1.4
KS test (compares two distributions): p = 0.003 — significant output drift
At alpha = 0.01 (stricter than the common 0.05), p < 0.01 means the output distribution shifted in a way that's very unlikely due to random chance. System generating more complex SQL.
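A two-sample KS test is one scipy call. The clause-count samples below are synthetic (seeded to roughly match the lesson's means and stds), since the real traces aren't available here.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Synthetic clause-count samples, matched to the lesson's summary stats
baseline = rng.normal(loc=2.3, scale=1.1, size=1000)
monitoring = rng.normal(loc=2.8, scale=1.4, size=1000)

result = ks_2samp(baseline, monitoring)
drifted = result.pvalue < 0.01  # alpha = 0.01, per the lesson
```

The KS statistic is the maximum gap between the two empirical CDFs, so it picks up shifts in shape (variance, tails) as well as in the mean.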
Concept drift: the relationship between inputs and correct outputs changes
User query: "What was revenue last quarter?" (unchanged)
Database schema: revenue_usd → total_revenue (CHANGED)
System SQL generation: ✗ sentinel set detects 22 percentage point correctness drop
Sentinel set = a small set of test queries with known correct answers, used to continuously check if the system still works correctly (like a canary in a coal mine).
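Operationally, a sentinel check is just a fixed set of known-answer queries replayed on a schedule. In the sketch below, `run_system` is a hypothetical stand-in for the real SQL-generation system, wired to show what a schema rename does to the score.

```python
# Sentinel set: queries with known correct answers (tiny illustrative set).
SENTINELS = [
    {"query": "What was revenue last quarter?",
     "expected": "SELECT SUM(revenue_usd) FROM sales"},
    {"query": "Top customer by spend",
     "expected": "SELECT customer FROM orders ORDER BY spend DESC LIMIT 1"},
]

def run_system(query):
    """Stand-in for the real system (hypothetical); returns generated SQL."""
    canned = {
        # After the schema change, the model emits the new column name:
        "What was revenue last quarter?": "SELECT SUM(total_revenue) FROM sales",
        "Top customer by spend": "SELECT customer FROM orders ORDER BY spend DESC LIMIT 1",
    }
    return canned[query]

def sentinel_correctness(sentinels):
    """Fraction of sentinel queries still answered correctly."""
    passed = sum(run_system(s["query"]) == s["expected"] for s in sentinels)
    return passed / len(sentinels)

score = sentinel_correctness(SENTINELS)  # the schema rename broke one of two queries
```

Note the failure mode this reveals cuts both ways: when the schema legitimately changes, the sentinel expectations themselves must be updated, or the canary reports a false alarm.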
Judge drift: the LLM judge's scoring behavior changes over time
Cohen's Kappa — measures how consistently the judge agrees with humans (0 = random, 1 = perfect; ≥ 0.7 = acceptable)
Weekly Kappa: Week 1 = 0.78 · Week 2 = 0.76 · Week 3 = 0.74 · Week 4 = 0.72 (threshold: 0.70)
No judge drift detected
Kappa ≥ 0.7 all weeks. Quality changes are real system changes, not scoring artifacts.
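For a binary pass/fail judge, Cohen's Kappa can be computed directly from a weekly human audit. A minimal sketch; the label arrays are illustrative, not real audit data.

```python
def cohens_kappa(human, judge):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal pass rates
    p_h = sum(human) / n
    p_j = sum(judge) / n
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)
    return (p_o - p_e) / (1 - p_e)

# Illustrative weekly audit of 20 traces: 1 = pass, 0 = fail
human = [1] * 12 + [0] * 8
judge = [0] + [1] * 11 + [1] + [0] * 7   # disagrees on two traces
kappa = cohens_kappa(human, judge)
acceptable = kappa >= 0.7
```

Kappa corrects raw agreement for chance: two raters who both pass ~60% of traces would agree often by luck alone, and p_e subtracts that out.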
Monitoring validates your ship decision continuously — rollback triggers operationalize decision authority
Ship Decision (L5.5): evidence from experiment · must-pass metrics passed · assumption: Week 1 query mix
→ Monitoring (L5.6): continuously test "v2 still better than v1" · detect drift (input/output/concept/judge)
→ Rollback Decision: IF drift exceeds thresholds THEN execute response plan (investigate, ramp down, or rollback) · Owner: Eng Manager · SLA: 1-4hr
Monitoring as ship decision validation
Without monitoring, ship decision is one-time judgment. With monitoring, it's a continuously validated hypothesis.
Prediction: will signal-based sampling show a higher, lower, or same passing rate?
Signal-based sampling intentionally selects more traces with negative feedback (3x more likely) and high uncertainty (2x more likely). Random sampling showed 85% passing. What will the weighted sample show?
HIGHER (> 85%) · LOWER (< 85%) · SAME (≈ 85%)
Choose one before continuing. Write your prediction and reasoning.
PSI = 0.32 from comparison query surge — input drift detected
Intent         Baseline (Days 1-14)   Monitoring (Days 22-28)   Change
lookup         25%                    22%                       -3pp
trend          30%                    24%                       -6pp
comparison     25%                    35%                       +10pp
aggregation    20%                    19%                       -1pp
PSI = 0.32 > 0.25 major drift threshold
Not a system failure — a change in how users use the system. Likely seasonal business reporting cycle (end of quarter).
Decision: Investigate retrieval and SQL generation quality for comparison queries. Do not roll back system-wide.
Signal-based sampling produces 78% passing rate vs 85% random — this is the feature, not the bug
Random Sampling: 100 traces uniformly selected → 85% passing (reflects population average)
Signal-Based Sampling: 100 traces weighted by risk (negative feedback 3x · high uncertainty 2x) → 78% passing (focused on high-risk cases)
Lower rate = higher problem detection
Weighted sample is intentionally focused on problems. Lower passing rate means you're finding failures that matter. Track trends over time, not absolute values.
KS test detects SQL complexity shift — mean clause count increased from 2.3 to 2.8
Baseline (Days 1-14): mean 2.3, std 1.1 → Monitoring (Days 22-28): mean 2.8, std 1.4
KS statistic = 0.18 (higher = more different) · p-value = 0.003
At alpha = 0.01, p < 0.01 → significant output drift
System generating more complex SQL. Could be input-driven (more comparison queries need multi-table JOINs) or a model behavior change.
Sentinel correctness dropped 22 percentage points — database schema change broke SQL generation
Period        Correctness    95% Confidence Interval   Status
Days 1-7      94% (47/50)    [86%, 98%]                ✓ Baseline
Days 8-14     92% (46/50)    [84%, 97%]                ✓ Stable
Days 15-21    84% (42/50)    [73%, 92%]                ⚠ Declining
Days 22-28    72% (36/50)    [59%, 83%]                ✗ Alert
Root cause: Schema update revenue_usd → total_revenue
Sentinel queries reference old column name. Expected breakage from known change.
Action: Update sentinel queries, coordinate with the schema team, re-run the check.
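Intervals like those in the table can be computed with a standard binomial CI; a sketch using scipy's Wilson method (rounding may differ by a point from the table depending on the method used).

```python
from scipy.stats import binomtest

def correctness_ci(passed, total, confidence=0.95):
    """Wilson score interval for a sentinel pass rate."""
    ci = binomtest(passed, total).proportion_ci(
        confidence_level=confidence, method="wilson"
    )
    return ci.low, ci.high

low, high = correctness_ci(36, 50)  # Days 22-28: 72% correct
```

Alerting can key off the interval rather than the point estimate: when the upper bound of the current window falls below the baseline point estimate, the drop is unambiguous rather than plausibly sampling noise.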
Build a complete monitoring plan with thresholds, owners, and rollback triggers
1. Sampling strategy (budget, signal weights, coverage)
2. Drift detection methods (PSI for input, KS test for output, sentinel for concept, Kappa for judge)
3. Alert thresholds (PSI > 0.25, p < 0.01, sentinel drop > 10pp, Kappa < 0.7)
4. Rollback decision tree (conditions → actions → owners → SLA)
5. Response workflow (investigate, escalate, rollback, post-incident)
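The decision tree in item 4 can be encoded as a pure function so the triggers are testable. Thresholds mirror item 3; the action strings, owner names, and the ordering (check the judge before trusting any other metric) are illustrative choices, not prescribed by the lesson.

```python
def rollback_decision(psi, ks_p, sentinel_drop_pp, kappa):
    """Map drift metrics to a response (thresholds from the monitoring plan)."""
    if kappa < 0.7:
        # Judge drift first: other metrics are untrustworthy until fixed
        return ("recalibrate judge", "Eval owner")
    if sentinel_drop_pp > 10:
        # Concept drift: correctness itself has degraded
        return ("rollback consideration", "Eng Manager")
    if psi > 0.25 or ks_p < 0.01:
        # Distribution shift: quality impact unknown, investigate first
        return ("investigate", "On-call eng")
    return ("no action", None)

# Week 4 readings from the dashboard
action, owner = rollback_decision(psi=0.32, ks_p=0.003, sentinel_drop_pp=22, kappa=0.72)
```

Encoding the tree as code also makes the SLA enforceable: the alert can carry the owner and action string, so nobody has to interpret thresholds under time pressure.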
Base (20-25 min)
Interpret PSI, set thresholds, apply decision tree, complete template
Extend (+10-15 min)
Implement PSI from scratch, add semantic drift detection, optimize sampling weights
Four-quadrant drift dashboard catches quality degradation 3 days before aggregate metrics would show it
Production Monitoring Dashboard: AI Data Analyst v2 (Days 1-30)
PSI (input drift): Week 4 = 0.32 (alert)
KS p-value (output drift): Week 4 = 0.003 (alert)
Sentinel correctness (concept): Weeks 3-4 = 72% (alert)
Sentinel Kappa (judge): all weeks above 0.7 (within threshold)
What I built: monitoring plan with signal-based sampling, multi-dimensional drift detection (PSI=0.32, p=0.003, 22pp drop), rollback triggers — caught 15% quality degradation 3 days early.
Over-weighted signals create sampling bias — phantom quality problems waste investigation cycles
Sampling bias
Negative feedback weight too high (10x instead of 3x) → weighted sample dominated by worst cases → passing rate 65% (biased estimate) vs 82% (true population rate) → team investigates phantom problem
Prevention
Calibrate weights on historical data where you know the actual correct answers · Set max weight limits (≤ 5x) · Track divergence between weighted and random sample rates
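One way to implement the divergence check: recover an unbiased population estimate from the weighted sample via inverse-weight (importance) correction, then compare it to the random-sample rate. A sketch under assumed field names; each sampled trace must carry the weight it was drawn with.

```python
def unbiased_pass_rate(sampled):
    """Self-normalized inverse-weight estimate: dividing each trace's
    contribution by its sampling weight undoes the oversampling, so the
    result tracks the population rate rather than the enriched-sample rate."""
    inv = [1.0 / t["weight"] for t in sampled]
    passed = sum(i for t, i in zip(sampled, inv) if t["passed"])
    return passed / sum(inv)

# Illustrative weighted sample: 40 high-risk traces drawn at weight 3
# (50% pass) plus 60 normal traces at weight 1 (90% pass).
sampled = (
    [{"weight": 3.0, "passed": i < 20} for i in range(40)]
    + [{"weight": 1.0, "passed": i < 54} for i in range(60)]
)
naive = sum(t["passed"] for t in sampled) / len(sampled)  # biased: 0.74
corrected = unbiased_pass_rate(sampled)                   # closer to population rate
```

If the corrected estimate and the small random sample disagree persistently, the weights (or the trace metadata feeding them) are miscalibrated.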
Alert fatigue and judge drift — two more failure modes that undermine monitoring confidence
Alert fatigue
PSI > 0.1 threshold too low → natural variation fires alerts weekly → Week 5 real drift (PSI=0.35) ignored because team stopped reading alerts
Judge changes hide real quality problems
Judge model upgraded, scores more leniently → passing rate rises 81% → 87% → real regression hides behind inflated scores
Prevention
Distinguish three threshold tiers: log (PSI > 0.1), alert (PSI > 0.2), rollback consideration (PSI > 0.25) · Pin judge to a specific dated version (e.g., gpt-4o-2024-11-20) so upgrades don't silently change scoring · Track sentinel agreement separately from system quality
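The three-tier scheme reduces to a tiny lookup; tier boundaries are the ones above, and the tier names are illustrative.

```python
def psi_tier(value):
    """Three-tier PSI response: log, alert, or rollback consideration."""
    if value > 0.25:
        return "rollback consideration"
    if value > 0.2:
        return "alert"
    if value > 0.1:
        return "log"
    return "none"
```

Only the top tier pages a human; the log tier builds the baseline of natural variation that makes the alert tier trustworthy.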
Knowledge Check 1: PSI = 0.18, comparison queries shifted +10pp — should you rollback, investigate, or take no action?
Scenario
Monitoring reports PSI = 0.18 on query intent distribution. Baseline: 25% lookup, 30% trend, 25% comparison, 20% aggregation. Current: 28% lookup, 22% trend, 35% comparison, 15% aggregation. Rollback threshold: PSI > 0.25.
ROLLBACK: PSI approaching threshold
INVESTIGATE: minor drift, monitor comparison query quality
NO ACTION: within threshold
Which category shifted most? What would you investigate first?
Knowledge Check 2: KS test p = 0.003 — does statistical significance mean you should act?
Scenario
A KS test comparing SQL JOIN counts between baseline (mean 2.3, std 1.1) and current period (mean 2.8, std 1.4) yields p = 0.003. Your alpha threshold is 0.01.
Statistical significance: p < alpha means the shift is unlikely due to chance
Practical significance: does the size of the shift actually matter to users?
Is a mean increase from 2.3 to 2.8 clauses a big deal in practice? What action would you recommend?
Knowledge Check 3: A colleague says signal-based sampling shows "true quality is 78%" — what's wrong?
Scenario
Signal-based sampling with negative feedback weight 5x and segment risk 2x produces a passing rate of 78%. Random sampling produces 85%. A colleague argues the signal-based sample is more accurate and the system's true quality is 78%.
Report 78%: the enriched sample is more informative
Report 85%: random sampling reflects the true population
Report both: different views for different questions
What is wrong with your colleague's reasoning?
Four-quadrant drift detection dashboard — PSI, KS, sentinel correctness, Kappa
Production Monitoring Dashboard: AI Data Analyst v2 (Days 1-30)
PSI (input drift): Week 4 = 0.32 (major)
KS p-value (output drift): Week 4 = 0.003 (significant)
Sentinel correctness (concept drift): Weeks 3-4 = 72% (22pp drop)
Sentinel Kappa (judge drift): all weeks 0.72-0.78 (stable)
Three drift types triggered alerts in the same monitoring window. Simultaneous drift across multiple dimensions is a signal to escalate, not to investigate each alert in isolation.
Next: Platform Monitoring Lab
Week 5 · Lesson 7
→
Implement the drift dashboard in Arize (optional platform lab)
→
Compare DIY monitoring vs platform-managed monitoring
→
Decision criteria: when to build vs buy monitoring infrastructure
Same monitoring concepts, production-grade tooling.