1. Measuring at the wrong granularity
Output-level metrics miss session-level failures
2. Treating all metrics as blocking
The team delays shipping over verbosity while SQL errors persist
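
One way to keep a cosmetic metric from holding up a release is to tag each metric as blocking or advisory and gate only on the blocking ones. The sketch below is a minimal illustration, not a prescribed implementation: the `MetricResult` type, the `release_decision` helper, the metric names, and all scores and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float      # observed score in [0, 1]
    threshold: float  # minimum acceptable score
    blocking: bool    # True if a miss should stop the release


def release_decision(results):
    """Return (ship, blockers, warnings) for a list of MetricResult."""
    missed = [r for r in results if r.score < r.threshold]
    blockers = [r.name for r in missed if r.blocking]
    warnings = [r.name for r in missed if not r.blocking]
    # Only blocking misses stop the release; advisory misses are reported.
    return (not blockers, blockers, warnings)


# Hypothetical scores: SQL validity is blocking, verbosity is advisory.
results = [
    MetricResult("sql_validity", score=0.91, threshold=0.99, blocking=True),
    MetricResult("verbosity", score=0.70, threshold=0.80, blocking=False),
]
ship, blockers, warnings = release_decision(results)
print(ship, blockers, warnings)
```

With these made-up numbers the release is held for `sql_validity` only, and `verbosity` is surfaced as a warning rather than a reason to delay.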
3. Aggregating incorrectly across sessions
Averaging across sessions dilutes a single failing session
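
The dilution effect can be seen by comparing a mean score with a worst-case view of the same data. This is a rough sketch under assumed inputs: the per-turn scores, the session IDs, the 0.5 floor, and the helper names are all hypothetical.

```python
from statistics import mean


def mean_score(session_scores):
    """Average of per-session mean scores (the aggregation that dilutes)."""
    return mean(mean(scores) for scores in session_scores.values())


def failed_sessions(session_scores, floor):
    """IDs of sessions whose worst turn falls below the floor."""
    return sorted(
        sid for sid, scores in session_scores.items() if min(scores) < floor
    )


def session_pass_rate(session_scores, floor):
    """Fraction of sessions with no turn below the floor."""
    failed = failed_sessions(session_scores, floor)
    return 1 - len(failed) / len(session_scores)


# Hypothetical per-turn scores for four sessions; s3 has one failed turn.
sessions = {
    "s1": [1.0, 1.0, 1.0, 1.0],
    "s2": [1.0, 1.0, 1.0, 1.0],
    "s3": [1.0, 1.0, 0.0, 1.0],
    "s4": [1.0, 1.0, 1.0, 1.0],
}

print(mean_score(sessions))              # 0.9375
print(failed_sessions(sessions, 0.5))    # ['s3']
print(session_pass_rate(sessions, 0.5))  # 0.75
```

The mean stays above 0.93, while the worst-case view shows that one session in four contained a failure.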
4. Mismatching archetype to feature type
Extraction metrics miss faithfulness
5. Setting thresholds without failure cost analysis
A 95% threshold sounds good, but the remaining 5% of failures costs $100K/day at volume
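
The arithmetic behind a figure like this can be sketched as follows. The request volume (1,000,000 per day), the cost per failure ($2), and the $10,000 budget are assumptions chosen only so that a 95% pass rate works out to roughly $100K/day; the source does not state them, and the function names are hypothetical.

```python
def daily_failure_cost(requests_per_day, pass_rate, cost_per_failure):
    """Expected daily cost of failures at a given pass rate."""
    failures = requests_per_day * (1 - pass_rate)
    return failures * cost_per_failure


def required_pass_rate(requests_per_day, cost_per_failure, daily_budget):
    """Lowest pass rate that keeps failure cost within the daily budget."""
    max_failures = daily_budget / cost_per_failure
    return max(0.0, 1 - max_failures / requests_per_day)


# Assumed inputs, not taken from the source.
REQUESTS_PER_DAY = 1_000_000
COST_PER_FAILURE = 2.0  # dollars

cost = daily_failure_cost(REQUESTS_PER_DAY, 0.95, COST_PER_FAILURE)
print(round(cost))  # about 100000 dollars per day

# Working backwards from a budget gives the threshold instead of guessing it.
target = required_pass_rate(REQUESTS_PER_DAY, COST_PER_FAILURE, daily_budget=10_000)
print(round(target, 4))  # about 0.995
```

Starting from an acceptable daily cost and solving for the pass rate ties the threshold to what a failure actually costs, rather than to a number that merely sounds high.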