Platform Monitoring Lab

Week 5 Lesson 7 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What is the key difference between gradual drift and sudden drift in how you detect and respond to each?

Gradual Drift
Track quality trends over weeks to spot slow changes
Quality drifts from 0.82 → 0.78 over weeks without crossing thresholds
Sudden Drift
Alert immediately when a blocking threshold is crossed
Quality drops from 0.82 → 0.68 overnight, crossing blocking threshold

Your monitoring script operates on a 24-hour delay — users hit a broken system for 19 hours before you see the report

2 PM
sql_success_rate drops to 0.68
6 AM next day
Report generated
9 AM
PM reads email, investigates
19 hours — users experiencing failures
Batch monitoring is conceptually correct but operationally delayed

Batch-based monitoring has three operational limits — reporting delay, aggregate-only views, and no instant alerting

24-hour reporting delay
See yesterday's problems tomorrow
Aggregate-only views
Cannot drill down to individual failing requests without writing custom code
No instant alerting
Threshold violations require manual investigation after next report

Phase 1 — Log ingestion sends your AI system's request records to the monitoring platform

A trace = one logged AI request (the input, output, latency, and any errors)
AI Data Analyst v2
Production logger
Platform SDK
Send in 60-second batches
Platform Storage
Searchable trace database
Batching
60 seconds
Sampling
100% during ramp, 1-5% steady state
Retention
Configurable lookback period

Phase 2 — Dashboard configuration shows YOUR metrics from L4.6, not the platform's generic defaults

Platform Defaults
Model latency, Token count, Error rate
Generic LLM metrics that don't map to your release criteria
Your Configured Metrics
sql_success_rate (threshold: 0.75), P95 latency (threshold: 2500ms), Split by user type: Executive vs PM users
Custom metrics matching your metric specs from L4.6

Phase 3 — Alerting setup distinguishes blocking alerts that page immediately from awareness alerts that investigate during business hours

Blocking Metric Violation
Quality score (sql_success_rate) < 0.75 for 3+ hours
Page On-Call Immediately
PagerDuty, Slack, SMS
Awareness Alert
Cost per query +10% over 24h
Investigate During Business Hours
Email, Slack
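The two alert types can be sketched as simple rule checks. The function names and the hourly-sample input shape are illustrative, not a platform API:

```python
def should_page(hourly_quality: list[float],
                threshold: float = 0.75,
                sustained_hours: int = 3) -> bool:
    """Blocking alert: page on-call only if quality stays below the
    threshold for sustained_hours consecutive hours (ignores one-off blips)."""
    run = 0
    for q in hourly_quality:
        run = run + 1 if q < threshold else 0
        if run >= sustained_hours:
            return True
    return False

def should_notify_cost(cost_today: float, cost_baseline_24h: float,
                       pct_increase: float = 0.10) -> bool:
    """Awareness alert: cost per query up more than 10% over 24h."""
    return cost_today > cost_baseline_24h * (1 + pct_increase)
```

Note the consecutive-hours counter resets whenever quality recovers, which is what keeps a single bad hour from paging anyone at 3 AM.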

Statistical drift detection catches gradual quality degradation that threshold-based alerts miss

Baseline Period
0.82
Days 1-7
Current Week
0.78
Days 15-21
Blocking Threshold
0.75
Never crossed
Drift score = 0.08 (threshold: 0.05) — quality pattern changed
Quality trend shifted meaningfully even though no daily value crossed the blocking threshold
Catches gradual degradation before threshold violations · Drift score = KL divergence (measures how different this week's pattern looks from baseline)

Should you alert on any drop below 0.82, or use a statistical threshold? If statistical, what confidence level and why?

Context
Baseline: sql_success_rate = 0.82
Normal day-to-day fluctuation: ±0.03
Write down your prediction before continuing

Use a statistical threshold to filter out normal noise — alert only when the drop is almost certainly real

Alert on Any Drop < 0.82
False alarms ~50% of the time due to normal fluctuation
Alert fatigue within days — team mutes notifications
Statistical Threshold (filter out noise)
Alert only when there's less than a 5% (or 1%) chance the drop is random noise — triggers around 0.76 (95% confidence) to 0.75 (99%)
Distinguishes genuine degradation from normal noise
Recommendation
99% confidence threshold (~0.75) aligns with your blocking metric from L4.6
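The recommendation can be sanity-checked with a quick calculation, under the assumption that the ±0.03 day-to-day fluctuation is roughly one standard deviation of normal noise:

```python
def alert_threshold(baseline: float, daily_sigma: float, z: float) -> float:
    """Lower alert bound: baseline minus z standard deviations.
    Assumes daily noise is roughly normal with the given sigma."""
    return baseline - z * daily_sigma

# Assumption: the slide's +/-0.03 fluctuation is one standard deviation.
print(round(alert_threshold(0.82, 0.03, 1.96), 3))  # 0.761 (95% two-sided z)
print(round(alert_threshold(0.82, 0.03, 2.33), 3))  # 0.75  (99% one-sided z)
```

The z = 2.33 bound lands on 0.75, which is why the 99% confidence threshold conveniently coincides with the blocking metric.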

The platform demo uploads 100 traces, builds a time-series dashboard, and replaces 15 lines of matplotlib with a real-time chart

Upload Sample
100 logged AI requests (CSV or JSON)
Build Metric
sql_success_rate time series
Real-Time Chart
Browser-accessible
The same chart from L5.6 required 15-20 lines of matplotlib and ran as a batch job

The drift detector flags quality pattern shifts before threshold violations — alerting when the drift score reaches 0.08 against a 0.05 threshold

Baseline Pattern
0.82
Days 1-7
Current Week
0.78
Days 15-21
Drift Score (KL divergence)
0.08
Threshold: 0.05 → ALERT
Quality pattern shifted even though daily values stayed above 0.75 blocking threshold
Compares this week's quality pattern to the first week when the system was stable
Catches gradual degradation before threshold violations
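A minimal sketch of the drift score, assuming daily quality scores are binned into histograms before computing KL divergence. The binning scheme and epsilon smoothing here are illustrative; real platforms choose their own, so the toy numbers below do not reproduce the slide's 0.08:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def quality_histogram(daily_scores: list[float],
                      bins=(0.70, 0.75, 0.80, 0.85)) -> list[float]:
    """Bin daily quality scores into a normalized histogram (toy binning)."""
    counts = [0] * (len(bins) + 1)
    for s in daily_scores:
        counts[sum(s >= b for b in bins)] += 1
    return [c / len(daily_scores) for c in counts]

baseline = quality_histogram([0.82, 0.83, 0.81, 0.82, 0.84, 0.82, 0.81])  # days 1-7
current  = quality_histogram([0.78, 0.79, 0.77, 0.78, 0.78, 0.79, 0.77])  # days 15-21
drift_score = kl_divergence(current, baseline)
print(drift_score > 0.05)  # True: pattern shifted even with every day above 0.75
```

Every daily value stays above the 0.75 blocking threshold, yet the histograms barely overlap, so the divergence spikes well past the drift threshold.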

Configure metrics, dashboards, and alerts using the platform's web UI — focus on understanding what each option does and how it maps to L5.6 concepts

  • Create free-tier account (Arize or Braintrust)
  • Upload 100-trace sample
  • Build dashboard: sql_success_rate + avg(latency), segment by user_role
  • Configure alert: quality < 0.75 for 3h OR P95 latency > 3000ms for 1h
  • Trigger test alert
  • Configure drift detector (statistical, 7-day lookback, threshold 0.05)
  • Fill in Platform Monitoring Setup artifact
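The checklist's dashboard, alert, and drift settings can be captured as data, which is useful for the setup artifact. This dict is a hypothetical representation; Arize and Braintrust each configure these through their own web UI and field names:

```python
# Hypothetical config mirroring the lab checklist; field names are illustrative.
monitoring_config = {
    "dashboard": {
        "metrics": ["sql_success_rate", "avg_latency_ms"],
        "segment_by": "user_role",  # Executive vs PM split
    },
    "alerts": [
        {"metric": "sql_success_rate", "op": "<", "value": 0.75,
         "sustained_hours": 3, "severity": "blocking"},
        {"metric": "p95_latency_ms", "op": ">", "value": 3000,
         "sustained_hours": 1, "severity": "blocking"},
    ],
    "drift_detector": {
        "method": "statistical", "lookback_days": 7, "threshold": 0.05,
    },
}
```

Writing the settings down this way makes it easy to diff what you configured against what L5.6's scripts were checking.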

A production monitoring dashboard supporting ramp decisions in 30 seconds instead of 10 minutes

sql_success_rate (7-day)
0.83
Threshold: 0.75
P95 Latency (7-day)
2400ms
Threshold: 2500ms
Drift Status
0.02
NO DRIFT (threshold: 0.05)
Segment sql_success_rate
Executive 0.83
PM 0.84
Ramp decision evidence: 30 seconds instead of 10 minutes of pandas queries

Alert fatigue, platform lock-in without understanding, and sampling bias are the most common failure modes

Alert fatigue from tight thresholds
What happens: 23 alerts in week 1 → team mutes Slack → real regression buried in noise
Fix: Set wider thresholds (2-3x normal fluctuation) and require the metric to stay bad for 3+ hours before alerting

Platform lock-in without fallback
What happens: Platform outage during deployment → no monitoring visibility for 6 hours
Fix: Maintain the L5.6 Python scripts as a source of truth

Sampling bias at 1%
What happens: Rare Executive-user failures (0.5% of traffic) produce zero sampled traces
Fix: Smart sampling — 100% of high-risk user groups, 1% of low-risk traffic
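The smart-sampling fix fits in a few lines. The role names and rates match the table above, but the function itself is an illustrative sketch, not a platform feature:

```python
import random

def should_sample(user_role: str,
                  high_risk_roles: frozenset = frozenset({"executive"}),
                  low_risk_rate: float = 0.01,
                  rng=random.random) -> bool:
    """Smart sampling: keep 100% of traces from high-risk user groups
    and a small fraction (1% here) of everything else."""
    if user_role.lower() in high_risk_roles:
        return True
    return rng() < low_risk_rate
```

Injecting `rng` keeps the decision testable; in production the default `random.random` is fine because each trace is sampled independently.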

When alerts don't fire, when drift is real, and which features matter most for your use case

Scenario 1
sql_success_rate dropped from 0.82 to 0.79 over 3 hours. Alert threshold: 0.75. Why didn't the alert fire?
Scenario 2
Drift detector alerts on day 15. Investigation shows new query type became common. True drift or false positive?
Scenario 3
Choosing between Arize (strong drift detection) and Braintrust (eval integration). Primary need: catch regressions within 4 hours of deployment. Which feature matters most?

From trace ingestion to dashboard visualization to drift detection — all feeding the ramp decision

1
Trace Ingestion
Production logger → Platform SDK (batch 60s) → Trace storage. Sampling: 1-5% steady state, 100% during ramp.
2
Dashboard + Alerts
Your custom metrics (sql_success_rate, P95 latency) power real-time charts and the alert engine. Blocking = page immediately. Awareness = business hours.
3
Drift Detection
Statistical drift detector with 7-day lookback catches gradual degradation before threshold violations.
All three components feed evidence into the ramp decision — same decision from L5.6, now with real-time visibility and instant alerting

Next: Decision-making under uncertainty

When the data doesn't give you a clear answer, how do you decide to ship?
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 7 | aianalystlab.ai