Monitoring for Drift
Week 5 Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab
How do you know whether your launch decision was correct?
Launch Readiness (L5.5)
Evidence Sufficiency
Ship Decision Memo
You shipped v2 three weeks ago — offline metrics said it was better, but users are complaining
Pre-Launch Evidence
✓ Must-pass metrics cleared
✓ Experiment confirmed improvement
✓ Staged rollout: no regressions
✓ Ship decision memo: approved
Week 3 Post-Launch
⚠ User complaints
⚠ CS escalations: "used to work"
⚠ Unusual SQL patterns detected
⚠ "Are we sure v2 is still working?"
The Monitoring Problem
50,000 queries/day · Evaluating all = $2,000/day + 800ms latency · You sample 100 traces (0.2%) — and miss the silent failures.
Signal-based filtering allocates your finite evaluation budget where it matters most
Random Sampling vs Signal-Based Filtering
Selection: 100 traces uniformly selected vs 100 traces weighted by risk signals
Weights: equal probability for every trace vs negative feedback 3x · high uncertainty 2x · failure-prone segment 2x
Passing rate: 85% vs 78%
Problem detection: low (misses edge cases) vs high (oversamples problems)
Signal-based filtering = prioritize traces with warning signs (negative feedback, low confidence scores, historically tricky query types). Same budget, smarter allocation.
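The weighting scheme can be sketched in a few lines. The trace fields, the uncertainty cutoff, and the failure-prone segment name below are illustrative assumptions, not a prescribed API; only the 3x/2x/2x weights come from the lesson.

```python
import random

def trace_weight(trace):
    """Multiply risk-signal weights (3x / 2x / 2x, per the lesson)."""
    w = 1.0
    if trace["negative_feedback"]:
        w *= 3.0
    if trace["uncertainty"] > 0.7:          # "low confidence" cutoff (assumed)
        w *= 2.0
    if trace["segment"] == "comparison":    # historically failure-prone (assumed)
        w *= 2.0
    return w

def sample_traces(traces, budget=100, seed=0):
    """Weighted sampling with replacement -- a sketch, not production sampling."""
    rng = random.Random(seed)
    weights = [trace_weight(t) for t in traces]
    return rng.choices(traces, weights=weights, k=budget)

# Illustrative population: 10% of traces carry negative feedback.
population = [
    {"id": i, "negative_feedback": i % 10 == 0, "uncertainty": 0.2, "segment": "lookup"}
    for i in range(1000)
]
sample = sample_traces(population, budget=100)
frac_negative = sum(t["negative_feedback"] for t in sample) / len(sample)
```

With a 3x weight on a 10% base rate, negative-feedback traces make up roughly a quarter of the sample instead of a tenth: same 100-trace budget, more of it spent where failures are likely.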
Input drift: the distribution of user queries changes over time
Baseline (Days 1-14): lookup 25% · trend 30% · compare 25% · agg 20%
Monitoring (Days 22-28): lookup 22% · trend 24% · compare 35% · agg 19%
PSI (Population Stability Index) = 0.32 — major drift
PSI measures how much a distribution shifted: < 0.1 stable | 0.1-0.25 minor | > 0.25 major drift · +10 percentage point shift in comparison queries
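PSI is straightforward to compute from two binned distributions. A minimal sketch using the standard formula with natural log; note that PSI values are sensitive to bin granularity, so compare against the bands (< 0.1 stable, 0.1-0.25 minor, > 0.25 major) using one fixed binning scheme rather than across schemes.

```python
import math

def psi(baseline, current, eps=1e-6):
    """Population Stability Index: sum((cur - base) * ln(cur / base)) over bins."""
    total = 0.0
    for b, c in zip(baseline, current):
        b = max(b, eps)  # guard against empty bins
        c = max(c, eps)
        total += (c - b) * math.log(c / b)
    return total

def classify(value):
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "minor drift"
    return "major drift"

# Intent shares from the lesson: lookup, trend, comparison, aggregation
baseline = [0.25, 0.30, 0.25, 0.20]
current = [0.22, 0.24, 0.35, 0.19]
value = psi(baseline, current)
```

Over only four coarse intent buckets the formula gives a small number; production PSI is usually computed over finer-grained bins, which is why dashboard values can be larger for the same underlying shift.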
Output drift: model response patterns change even when inputs stay the same
Baseline (Days 1-14), SQL clause count: mean 2.3, std 1.1
Monitoring (Days 22-28), SQL clause count: mean 2.8, std 1.4
KS test (compares two distributions): p = 0.003 — significant output drift
At alpha = 0.01 (stricter than the common 0.05), p < 0.01 means the output distribution shifted in a way that's very unlikely due to random chance. System generating more complex SQL.
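A two-sample KS test is one scipy call. The clause-count samples below are synthetic (seeded to roughly match the lesson's means and stds), since the real traces aren't available here.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Synthetic clause-count samples, matched to the lesson's summary stats
baseline = rng.normal(loc=2.3, scale=1.1, size=1000)
monitoring = rng.normal(loc=2.8, scale=1.4, size=1000)

result = ks_2samp(baseline, monitoring)
drifted = result.pvalue < 0.01  # alpha = 0.01, per the lesson
```

The KS statistic is the maximum gap between the two empirical CDFs, so it picks up shifts in shape (variance, tails) as well as in the mean.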
Concept drift: the relationship between inputs and correct outputs changes
User query: "What was revenue last quarter?" (unchanged)
Database schema: revenue_usd → total_revenue (CHANGED)
System SQL generation: ✗ sentinel set detects 22 percentage point correctness drop
Sentinel set = a small set of test queries with known correct answers, used to continuously check if the system still works correctly (like a canary in a coal mine).
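Operationally, a sentinel check is just a fixed set of known-answer queries replayed on a schedule. In the sketch below, `run_system` is a hypothetical stand-in for the real SQL-generation system, wired to show what a schema rename does to the score.

```python
# Sentinel set: queries with known correct answers (tiny illustrative set).
SENTINELS = [
    {"query": "What was revenue last quarter?",
     "expected": "SELECT SUM(revenue_usd) FROM sales"},
    {"query": "Top customer by spend",
     "expected": "SELECT customer FROM orders ORDER BY spend DESC LIMIT 1"},
]

def run_system(query):
    """Stand-in for the real system (hypothetical); returns generated SQL."""
    canned = {
        # After the schema change, the model emits the new column name:
        "What was revenue last quarter?": "SELECT SUM(total_revenue) FROM sales",
        "Top customer by spend": "SELECT customer FROM orders ORDER BY spend DESC LIMIT 1",
    }
    return canned[query]

def sentinel_correctness(sentinels):
    """Fraction of sentinel queries still answered correctly."""
    passed = sum(run_system(s["query"]) == s["expected"] for s in sentinels)
    return passed / len(sentinels)

score = sentinel_correctness(SENTINELS)  # the schema rename broke one of two queries
```

Note the failure mode this reveals cuts both ways: when the schema legitimately changes, the sentinel expectations themselves must be updated, or the canary reports a false alarm.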
Judge drift: the LLM judge's scoring behavior changes over time
Cohen's Kappa — measures how consistently the judge agrees with humans (0 = random, 1 = perfect; ≥ 0.7 = acceptable)
Weekly Kappa: Week 1 = 0.78 · Week 2 = 0.76 · Week 3 = 0.74 · Week 4 = 0.72 (threshold: 0.70)
No judge drift detected
Kappa ≥ 0.7 all weeks. Quality changes are real system changes, not scoring artifacts.
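For a binary pass/fail judge, Cohen's Kappa can be computed directly from a weekly human audit. A minimal sketch; the label arrays are illustrative, not real audit data.

```python
def cohens_kappa(human, judge):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal pass rates
    p_h = sum(human) / n
    p_j = sum(judge) / n
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)
    return (p_o - p_e) / (1 - p_e)

# Illustrative weekly audit of 20 traces: 1 = pass, 0 = fail
human = [1] * 12 + [0] * 8
judge = [0] + [1] * 11 + [1] + [0] * 7   # disagrees on two traces
kappa = cohens_kappa(human, judge)
acceptable = kappa >= 0.7
```

Kappa corrects raw agreement for chance: two raters who both pass ~60% of traces would agree often by luck alone, and p_e subtracts that out.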
Monitoring validates your ship decision continuously — rollback triggers operationalize decision authority
Ship Decision (L5.5): evidence from experiment · must-pass metrics passed · assumption: Week 1 query mix
→ Monitoring (L5.6): continuously test "v2 still better than v1" · detect drift (input/output/concept/judge)
→ Rollback Decision: IF drift exceeds thresholds THEN execute response plan (investigate, ramp down, or rollback) · Owner: Eng Manager · SLA: 1-4hr
Monitoring as ship decision validation
Without monitoring, ship decision is one-time judgment. With monitoring, it's a continuously validated hypothesis.
Prediction: will signal-based sampling show a higher, lower, or same passing rate?
Signal-based sampling intentionally selects more traces with negative feedback (3x more likely) and high uncertainty (2x more likely). Random sampling showed 85% passing. What will the weighted sample show?
HIGHER (> 85%) · LOWER (< 85%) · SAME (≈ 85%)
Choose one before continuing. Write your prediction and reasoning.
PSI = 0.32 from comparison query surge — input drift detected
Intent         Baseline (Days 1-14)   Monitoring (Days 22-28)   Change
lookup         25%                    22%                       -3pp
trend          30%                    24%                       -6pp
comparison     25%                    35%                       +10pp
aggregation    20%                    19%                       -1pp
PSI = 0.32 > 0.25 major drift threshold
Not a system failure — a change in how users use the system. Likely seasonal business reporting cycle (end of quarter).
Decision: Investigate retrieval and SQL generation quality for comparison queries. Do not roll back system-wide.
Signal-based sampling produces 78% passing rate vs 85% random — this is the feature, not the bug
Random Sampling: 100 traces uniformly selected → 85% passing (reflects population average)
Signal-Based Sampling: 100 traces weighted by risk (negative feedback 3x · high uncertainty 2x) → 78% passing (focused on high-risk cases)
Lower rate = higher problem detection
Weighted sample is intentionally focused on problems. Lower passing rate means you're finding failures that matter. Track trends over time, not absolute values.
KS test detects SQL complexity shift — mean clause count increased from 2.3 to 2.8
Baseline (Days 1-14): mean 2.3, std 1.1 → Monitoring (Days 22-28): mean 2.8, std 1.4
KS statistic = 0.18 (higher = more different) · p-value = 0.003
At alpha = 0.01, p < 0.01 → significant output drift
System generating more complex SQL. Could be input-driven (more comparison queries need multi-table JOINs) or a model behavior change.
Sentinel correctness dropped 22 percentage points — database schema change broke SQL generation
Period        Correctness    95% Confidence Interval   Status
Days 1-7      94% (47/50)    [86%, 98%]                ✓ Baseline
Days 8-14     92% (46/50)    [84%, 97%]                ✓ Stable
Days 15-21    84% (42/50)    [73%, 92%]                ⚠ Declining
Days 22-28    72% (36/50)    [59%, 83%]                ✗ Alert
Root cause: Schema update revenue_usd → total_revenue
Sentinel queries reference old column name. Expected breakage from known change.
Action: Update sentinel queries, coordinate with the schema team, re-run the check.
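Intervals like those in the table can be computed with a standard binomial CI; a sketch using scipy's Wilson method (rounding may differ by a point from the table depending on the method used).

```python
from scipy.stats import binomtest

def correctness_ci(passed, total, confidence=0.95):
    """Wilson score interval for a sentinel pass rate."""
    ci = binomtest(passed, total).proportion_ci(
        confidence_level=confidence, method="wilson"
    )
    return ci.low, ci.high

low, high = correctness_ci(36, 50)  # Days 22-28: 72% correct
```

Alerting can key off the interval rather than the point estimate: when the upper bound of the current window falls below the baseline point estimate, the drop is unambiguous rather than plausibly sampling noise.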
Build a complete monitoring plan with thresholds, owners, and rollback triggers
1. Sampling strategy (budget, signal weights, coverage)
2. Drift detection methods (PSI for input, KS test for output, sentinel for concept, Kappa for judge)
3. Alert thresholds (PSI > 0.25, p < 0.01, sentinel drop > 10pp, Kappa < 0.7)
4. Rollback decision tree (conditions → actions → owners → SLA)
5. Response workflow (investigate, escalate, rollback, post-incident)
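The decision tree in item 4 can be encoded as a pure function so the triggers are testable. Thresholds mirror item 3; the action strings, owner names, and the ordering (check the judge before trusting any other metric) are illustrative choices, not prescribed by the lesson.

```python
def rollback_decision(psi, ks_p, sentinel_drop_pp, kappa):
    """Map drift metrics to a response (thresholds from the monitoring plan)."""
    if kappa < 0.7:
        # Judge drift first: other metrics are untrustworthy until fixed
        return ("recalibrate judge", "Eval owner")
    if sentinel_drop_pp > 10:
        # Concept drift: correctness itself has degraded
        return ("rollback consideration", "Eng Manager")
    if psi > 0.25 or ks_p < 0.01:
        # Distribution shift: quality impact unknown, investigate first
        return ("investigate", "On-call eng")
    return ("no action", None)

# Week 4 readings from the dashboard
action, owner = rollback_decision(psi=0.32, ks_p=0.003, sentinel_drop_pp=22, kappa=0.72)
```

Encoding the tree as code also makes the SLA enforceable: the alert can carry the owner and action string, so nobody has to interpret thresholds under time pressure.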
Base (20-25 min)
Interpret PSI, set thresholds, apply decision tree, complete template
Extend (+10-15 min)
Implement PSI from scratch, add semantic drift detection, optimize sampling weights
Four-quadrant drift dashboard catches quality degradation 3 days before aggregate metrics would show it
Production Monitoring Dashboard: AI Data Analyst v2 (Days 1-30)
PSI (input drift): Week 4 = 0.32 (alert)
KS p-value (output drift): Week 4 = 0.003 (alert)
Sentinel correctness (concept): Weeks 3-4 = 72% (alert)
Sentinel Kappa (judge): all weeks above 0.7 (within threshold)
What I built: monitoring plan with signal-based sampling, multi-dimensional drift detection (PSI=0.32, p=0.003, 22pp drop), rollback triggers — caught 15% quality degradation 3 days early.
Over-weighted signals create sampling bias — phantom quality problems waste investigation cycles
Sampling bias
Negative feedback weight too high (10x instead of 3x) → weighted sample dominated by worst cases → passing rate 65% (biased estimate) vs 82% (true population rate) → team investigates phantom problem
Prevention
Calibrate weights on historical data where you know the actual correct answers · Set max weight limits (≤ 5x) · Track divergence between weighted and random sample rates
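One way to implement the divergence check: recover an unbiased population estimate from the weighted sample via inverse-weight (importance) correction, then compare it to the random-sample rate. A sketch under assumed field names; each sampled trace must carry the weight it was drawn with.

```python
def unbiased_pass_rate(sampled):
    """Self-normalized inverse-weight estimate: dividing each trace's
    contribution by its sampling weight undoes the oversampling, so the
    result tracks the population rate rather than the enriched-sample rate."""
    inv = [1.0 / t["weight"] for t in sampled]
    passed = sum(i for t, i in zip(sampled, inv) if t["passed"])
    return passed / sum(inv)

# Illustrative weighted sample: 40 high-risk traces drawn at weight 3
# (50% pass) plus 60 normal traces at weight 1 (90% pass).
sampled = (
    [{"weight": 3.0, "passed": i < 20} for i in range(40)]
    + [{"weight": 1.0, "passed": i < 54} for i in range(60)]
)
naive = sum(t["passed"] for t in sampled) / len(sampled)  # biased: 0.74
corrected = unbiased_pass_rate(sampled)                   # closer to population rate
```

If the corrected estimate and the small random sample disagree persistently, the weights (or the trace metadata feeding them) are miscalibrated.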
Alert fatigue and judge drift — two more failure modes that undermine monitoring confidence
Alert fatigue
PSI > 0.1 threshold too low → natural variation fires alerts weekly → Week 5 real drift (PSI=0.35) ignored because team stopped reading alerts
Judge changes hide real quality problems
Judge model upgraded, scores more leniently → passing rate rises 81% → 87% → real regression hides behind inflated scores
Prevention
Distinguish three threshold tiers: log (PSI > 0.1), alert (PSI > 0.2), rollback consideration (PSI > 0.25) · Pin judge to a specific dated version (e.g., gpt-4o-2024-11-20) so upgrades don't silently change scoring · Track sentinel agreement separately from system quality
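The three-tier scheme reduces to a tiny lookup; tier boundaries are the ones above, and the tier names are illustrative.

```python
def psi_tier(value):
    """Three-tier PSI response: log, alert, or rollback consideration."""
    if value > 0.25:
        return "rollback consideration"
    if value > 0.2:
        return "alert"
    if value > 0.1:
        return "log"
    return "none"
```

Only the top tier pages a human; the log tier builds the baseline of natural variation that makes the alert tier trustworthy.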
Knowledge Check 1: PSI = 0.18, comparison queries shifted +10pp — should you rollback, investigate, or take no action?
Scenario
Monitoring reports PSI = 0.18 on query intent distribution. Baseline: 25% lookup, 30% trend, 25% comparison, 20% aggregation. Current: 28% lookup, 22% trend, 35% comparison, 15% aggregation. Rollback threshold: PSI > 0.25.
ROLLBACK: PSI approaching threshold
INVESTIGATE: minor drift, monitor comparison query quality
NO ACTION: within threshold
Which category shifted most? What would you investigate first?
Knowledge Check 2: KS test p = 0.003 — does statistical significance mean you should act?
Scenario
A KS test comparing SQL JOIN counts between baseline (mean 2.3, std 1.1) and current period (mean 2.8, std 1.4) yields p = 0.003. Your alpha threshold is 0.01.
Statistical significance: p < alpha means the shift is unlikely due to chance
Practical significance: does the size of the shift actually matter to users?
Is a mean increase from 2.3 to 2.8 clauses a big deal in practice? What action would you recommend?
Knowledge Check 3: A colleague says signal-based sampling shows "true quality is 78%" — what's wrong?
Scenario
Signal-based sampling with negative feedback weight 5x and segment risk 2x produces a passing rate of 78%. Random sampling produces 85%. A colleague argues the signal-based sample is more accurate and the system's true quality is 78%.
Report 78%: the enriched sample is more informative
Report 85%: random sampling reflects the true population
Report both: different views for different questions
What is wrong with your colleague's reasoning?
Four-quadrant drift detection dashboard — PSI, KS, sentinel correctness, Kappa
Production Monitoring Dashboard: AI Data Analyst v2 (Days 1-30)
PSI (input drift): Week 4 = 0.32 (major)
KS p-value (output drift): Week 4 = 0.003 (significant)
Sentinel correctness (concept drift): Weeks 3-4 = 72% (22pp drop)
Sentinel Kappa (judge drift): all weeks 0.72-0.78 (stable)
Three drift types triggered alerts in the same monitoring window. Simultaneous drift across multiple dimensions is a signal to escalate, not to investigate each alert in isolation.
Next: Platform Monitoring Lab
Week 5 · Lesson 7
→
Implement the drift dashboard in Arize (optional platform lab)
→
Compare DIY monitoring vs platform-managed monitoring
→
Decision criteria: when to build vs buy monitoring infrastructure
Same monitoring concepts, production-grade tooling.