Platform Monitoring Lab

Week 5 Lesson 7 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What is the key difference between gradual drift and sudden drift in how you detect and respond to each?

Gradual Drift
Track quality trends over weeks to spot slow changes
Quality drifts from 0.82 → 0.78 over weeks without crossing thresholds
Sudden Drift
Alert immediately when a blocking threshold is crossed
Quality drops from 0.82 → 0.68 overnight, crossing blocking threshold

Your monitoring script operates on a 24-hour delay — users hit a broken system for 19 hours before you see the report

2 PM
sql_success_rate drops to 0.68
6 AM next day
Report generated
9 AM
PM reads email, investigates
19 hours — users experiencing failures
Batch monitoring is conceptually correct but operationally delayed

Batch-based monitoring has three operational limits — reporting delay, aggregate-only views, and no instant alerting

24-hour reporting delay
See yesterday's problems tomorrow
Aggregate-only views
Cannot drill down to individual failing requests without writing custom code
No instant alerting
Threshold violations require manual investigation after next report

Phase 1 — Log ingestion sends your AI system's request records to the monitoring platform

A trace = one logged AI request (the input, output, latency, and any errors)
AI Data Analyst v2
Production logger
Platform SDK
Send in 60-second batches
Platform Storage
Searchable trace database
Batching
60 seconds
Sampling
100% during ramp, 1-5% steady state
Retention
Configurable lookback period

Phase 2 — Dashboard configuration shows YOUR metrics from L4.6, not the platform's generic defaults

Platform Defaults
Model latency, Token count, Error rate
Generic LLM metrics that don't map to your release criteria
Your Configured Metrics
sql_success_rate (threshold: 0.75), P95 latency (threshold: 2500ms), Split by user type: Executive vs PM users
Custom metrics matching your metric specs from L4.6

Phase 3 — Alerting setup distinguishes blocking alerts that page immediately from awareness alerts that investigate during business hours

Blocking Metric Violation
Quality score (sql_success_rate) < 0.75 for 3+ hours
Page On-Call Immediately
PagerDuty, Slack, SMS
Awareness Alert
Cost per query +10% over 24h
Investigate During Business Hours
Email, Slack
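The two alert types can be sketched as simple rule checks. The function names and the hourly-sample input shape are illustrative, not a platform API:

```python
def should_page(hourly_quality: list[float],
                threshold: float = 0.75,
                sustained_hours: int = 3) -> bool:
    """Blocking alert: page on-call only if quality stays below the
    threshold for sustained_hours consecutive hours (ignores one-off blips)."""
    run = 0
    for q in hourly_quality:
        run = run + 1 if q < threshold else 0
        if run >= sustained_hours:
            return True
    return False

def should_notify_cost(cost_today: float, cost_baseline_24h: float,
                       pct_increase: float = 0.10) -> bool:
    """Awareness alert: cost per query up more than 10% over 24h."""
    return cost_today > cost_baseline_24h * (1 + pct_increase)
```

Note the consecutive-hours counter resets whenever quality recovers, which is what keeps a single bad hour from paging anyone at 3 AM.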

Statistical drift detection catches gradual quality degradation that threshold-based alerts miss

Baseline Period
0.82
Days 1-7
Current Week
0.78
Days 15-21
Blocking Threshold
0.75
Never crossed
Drift score = 0.08 (threshold: 0.05) — quality pattern changed
Quality trend shifted meaningfully even though no daily value crossed the blocking threshold
Catches gradual degradation before threshold violations · Drift score = KL divergence (measures how different this week's pattern looks from baseline)

Should you alert on any drop below 0.82, or use a statistical threshold? If statistical, what confidence level and why?

Context
Baseline: sql_success_rate = 0.82
Normal day-to-day fluctuation: ±0.03
Write down your prediction before continuing

Use a statistical threshold to filter out normal noise — alert only when the drop is almost certainly real

Alert on Any Drop < 0.82
False alarms ~50% of the time due to normal fluctuation
Alert fatigue within days — team mutes notifications
Statistical Threshold (filter out noise)
Alert only when there's less than a 5% (or 1%) chance the drop is random noise — triggers around 0.76 (95% confidence) to 0.75 (99%)
Distinguishes genuine degradation from normal noise
Recommendation
99% confidence threshold (~0.75) aligns with your blocking metric from L4.6
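The recommendation can be sanity-checked with a quick calculation, under the assumption that the ±0.03 day-to-day fluctuation is roughly one standard deviation of normal noise:

```python
def alert_threshold(baseline: float, daily_sigma: float, z: float) -> float:
    """Lower alert bound: baseline minus z standard deviations.
    Assumes daily noise is roughly normal with the given sigma."""
    return baseline - z * daily_sigma

# Assumption: the slide's +/-0.03 fluctuation is one standard deviation.
print(round(alert_threshold(0.82, 0.03, 1.96), 3))  # 0.761 (95% two-sided z)
print(round(alert_threshold(0.82, 0.03, 2.33), 3))  # 0.75  (99% one-sided z)
```

The z = 2.33 bound lands on 0.75, which is why the 99% confidence threshold conveniently coincides with the blocking metric.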

The platform demo uploads 100 traces, builds a time-series dashboard, and replaces 15 lines of matplotlib with a real-time chart

Upload Sample
100 logged AI requests (CSV or JSON)
Build Metric
sql_success_rate time series
Real-Time Chart
Browser-accessible
The same chart from L5.6 required 15-20 lines of matplotlib and ran as a batch job

The drift detector flags quality pattern shifts before threshold violations — alerting when the drift score reaches 0.08 against a 0.05 threshold

Baseline Pattern
0.82
Days 1-7
Current Week
0.78
Days 15-21
Drift Score (KL divergence)
0.08
Threshold: 0.05 → ALERT
Quality pattern shifted even though daily values stayed above 0.75 blocking threshold
Compares this week's quality pattern to the first week when the system was stable
Catches gradual degradation before threshold violations
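A minimal sketch of the drift score, assuming daily quality scores are binned into histograms before computing KL divergence. The binning scheme and epsilon smoothing here are illustrative; real platforms choose their own, so the toy numbers below do not reproduce the slide's 0.08:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def quality_histogram(daily_scores: list[float],
                      bins=(0.70, 0.75, 0.80, 0.85)) -> list[float]:
    """Bin daily quality scores into a normalized histogram (toy binning)."""
    counts = [0] * (len(bins) + 1)
    for s in daily_scores:
        counts[sum(s >= b for b in bins)] += 1
    return [c / len(daily_scores) for c in counts]

baseline = quality_histogram([0.82, 0.83, 0.81, 0.82, 0.84, 0.82, 0.81])  # days 1-7
current  = quality_histogram([0.78, 0.79, 0.77, 0.78, 0.78, 0.79, 0.77])  # days 15-21
drift_score = kl_divergence(current, baseline)
print(drift_score > 0.05)  # True: pattern shifted even with every day above 0.75
```

Every daily value stays above the 0.75 blocking threshold, yet the histograms barely overlap, so the divergence spikes well past the drift threshold.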

Configure metrics, dashboards, and alerts using the platform's web UI — focus on understanding what each option does and how it maps to L5.6 concepts

  • Create free-tier account (Arize or Braintrust)
  • Upload 100-trace sample
  • Build dashboard: sql_success_rate + avg(latency), segment by user_role
  • Configure alert: quality < 0.75 for 3h OR P95 latency > 3000ms for 1h
  • Trigger test alert
  • Configure drift detector (statistical, 7-day lookback, threshold 0.05)
  • Fill in Platform Monitoring Setup artifact
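The checklist's dashboard, alert, and drift settings can be captured as data, which is useful for the setup artifact. This dict is a hypothetical representation; Arize and Braintrust each configure these through their own web UI and field names:

```python
# Hypothetical config mirroring the lab checklist; field names are illustrative.
monitoring_config = {
    "dashboard": {
        "metrics": ["sql_success_rate", "avg_latency_ms"],
        "segment_by": "user_role",  # Executive vs PM split
    },
    "alerts": [
        {"metric": "sql_success_rate", "op": "<", "value": 0.75,
         "sustained_hours": 3, "severity": "blocking"},
        {"metric": "p95_latency_ms", "op": ">", "value": 3000,
         "sustained_hours": 1, "severity": "blocking"},
    ],
    "drift_detector": {
        "method": "statistical", "lookback_days": 7, "threshold": 0.05,
    },
}
```

Writing the settings down this way makes it easy to diff what you configured against what L5.6's scripts were checking.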

A production monitoring dashboard supporting ramp decisions in 30 seconds instead of 10 minutes

sql_success_rate (7-day)
0.83
Threshold: 0.75
P95 Latency (7-day)
2400ms
Threshold: 2500ms
Drift Status
0.02
NO DRIFT (threshold: 0.05)
Segment sql_success_rate
Executive 0.83
PM 0.84
Ramp decision evidence: 30 seconds instead of 10 minutes of pandas queries

Alert fatigue, platform lock-in without understanding, and sampling bias are the most common failure modes

Alert fatigue from tight thresholds
What happens: 23 alerts in week 1 → team mutes Slack → real regression buried in noise
Fix: Set wider thresholds (2-3x normal fluctuation) and require the metric to stay bad for 3+ hours before alerting

Platform lock-in without fallback
What happens: Platform outage during deployment → no monitoring visibility for 6 hours
Fix: Maintain the L5.6 Python scripts as a source of truth

Sampling bias at 1%
What happens: Rare Executive-user failures (0.5% of traffic) produce zero sampled traces
Fix: Smart sampling — 100% of high-risk user groups, 1% of low-risk traffic
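The smart-sampling fix fits in a few lines. The role names and rates match the table above, but the function itself is an illustrative sketch, not a platform feature:

```python
import random

def should_sample(user_role: str,
                  high_risk_roles: frozenset = frozenset({"executive"}),
                  low_risk_rate: float = 0.01,
                  rng=random.random) -> bool:
    """Smart sampling: keep 100% of traces from high-risk user groups
    and a small fraction (1% here) of everything else."""
    if user_role.lower() in high_risk_roles:
        return True
    return rng() < low_risk_rate
```

Injecting `rng` keeps the decision testable; in production the default `random.random` is fine because each trace is sampled independently.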

When alerts don't fire, when drift is real, and which features matter most for your use case

Scenario 1
sql_success_rate dropped from 0.82 to 0.79 over 3 hours. Alert threshold: 0.75. Why didn't the alert fire?
Scenario 2
Drift detector alerts on day 15. Investigation shows new query type became common. True drift or false positive?
Scenario 3
Choosing between Arize (strong drift detection) and Braintrust (eval integration). Primary need: catch regressions within 4 hours of deployment. Which feature matters most?

From trace ingestion to dashboard visualization to drift detection — all feeding the ramp decision

1
Trace Ingestion
Production logger → Platform SDK (batch 60s) → Trace storage. Sampling: 1-5% steady state, 100% during ramp.
2
Dashboard + Alerts
Your custom metrics (sql_success_rate, P95 latency) power real-time charts and the alert engine. Blocking = page immediately. Awareness = business hours.
3
Drift Detection
Statistical drift detector with 7-day lookback catches gradual degradation before threshold violations.
All three components feed evidence into the ramp decision — same decision from L5.6, now with real-time visibility and instant alerting

Next: Decision-making under uncertainty

When the data doesn't give you a clear answer, how do you decide to ship?
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 7 | aianalystlab.ai