Segmentation strategy

Week 4 Lesson 4 AI Evals for Product Dev
Shane Butler AI Analyst Lab

Which blocking metric fails first when multi-join quality degrades?

Blocking metrics from L4.2
  • SQL correctness (execution-based)
  • Chart validity
  • Answer completeness
If multi-join quality degrades...
Which metric fails first? How would you know without segment-level breakdowns?

86% aggregate correctness hid a 78% multi-join disaster

What the PM saw
SQL Correctness: 86%
Above 85% threshold
What the power users experienced
Simple queries: 96%
Multi-join: 78%
Advanced: 34%
Aggregate hid disaster

Without segmentation, you ship broken experiences to minority segments

A search engine ships a ranking update: +2% overall relevance, but health queries see -4%
A code completion tool reports 27% acceptance overall, but only 12% for less-common languages
A fraud detection model achieves 99.2% overall accuracy, but only 94% for new merchants

Breaking down quality metrics by dimensions that matter for user value and business decisions

Aggregate metric
82%
Ask: 82% for whom?
By query_complexity
By domain
By user_role
Goal: find hidden quality gaps between segments + identify which segments drive metric movement

Good segment dimensions are things you can observe before the system runs, not outcomes

Good dimensions
query_complexity
domain
query ambiguity (how specific or vague the question is)
time period
Bad dimensions
query succeeded vs. failed
Circular — segmenting by the metric itself tells you nothing new
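The "observable before the system runs" rule can be made concrete: a dimension like query_complexity can be derived from the query text alone, before any execution or grading. A minimal sketch, assuming a SQL analyst bot; the JOIN-count thresholds are illustrative assumptions, not canonical boundaries:

```python
import re

def query_complexity(sql: str) -> str:
    """Tag a query by an observable feature: how many JOINs it contains.

    The thresholds below are illustrative -- tune them to your own traffic.
    """
    n_joins = len(re.findall(r"\bJOIN\b", sql, flags=re.IGNORECASE))
    if n_joins == 0:
        return "simple"
    if n_joins <= 2:
        return "multi-join"
    return "advanced"

print(query_complexity("SELECT region, SUM(amount) FROM orders GROUP BY region"))
# -> simple
```

Because the label depends only on the input, the same tagging works offline on historical queries and online before the system responds.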

For each dimension, compute metrics separately with confidence intervals

Segment          Rate   n (sample size)   95% CI
Simple queries   0.96   600               [0.94, 0.98]
Multi-join       0.78   350               [0.73, 0.82]
Advanced         0.34   50                [0.21, 0.48]
The 95% CI tells you: if we resampled and recomputed the interval 100 times, about 95 of those intervals would contain the true rate. n=50 gives an enormous CI — the rate is unreliable. n=600 gives a tight CI — the rate is trustworthy.
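One way to compute these per-segment intervals, sketched with the Wilson score interval (better behaved than the plain normal approximation at small n or extreme rates). Success counts are back-computed from the rates and sample sizes above, so the bounds will differ slightly from the table's:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Success counts assumed from rate * n in the table above
segments = {"simple": (576, 600), "multi-join": (273, 350), "advanced": (17, 50)}
for name, (ok, n) in segments.items():
    lo, hi = wilson_ci(ok, n)
    print(f"{name}: {ok / n:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]  (n={n})")
```

In practice a library routine (e.g. a proportion-confidence-interval helper from a stats package) does the same job; the point is that every segment rate ships with its interval, never alone.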

Rank segments by volume × severity — not all segments matter equally

Priority 1: Fix first
High volume, high severity
multi-join: n=350, 78%
Safety-critical override
Low volume, high severity
finance: n=80, 71%
Monitor
High volume, low severity
Backlog
Low volume, low severity
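A toy triage of the quadrants above. Every threshold here (quality bar, volume cutoff, minimum n, safety override) is an assumption to replace with your own release criteria:

```python
def triage(n: int, rate: float, *, quality_bar: float = 0.85,
           min_n: int = 20, volume_cutoff: int = 100,
           safety_critical: bool = False) -> str:
    """Map a segment onto the volume x severity quadrants (illustrative rules)."""
    if n < min_n:
        return "backlog"      # too few samples to trust the rate at all
    severe = rate < quality_bar
    if severe and safety_critical:
        return "fix first"    # safety-critical override, even at low volume
    if severe and n >= volume_cutoff:
        return "fix first"    # high volume, high severity
    if n >= volume_cutoff:
        return "monitor"      # high volume, low severity
    return "backlog"          # low volume, low severity

print(triage(350, 0.78))                        # multi-join -> fix first
print(triage(80, 0.71, safety_critical=True))   # finance   -> fix first
print(triage(12, 0.75))                         # weekend   -> backlog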

Before running the analysis: which segment will have the highest correctness? The lowest? Will the worst segment drive the aggregate?

Stop-and-Predict: Segment Correctness Rates
  1. Which segment will have the highest correctness rate?
  2. Which will have the lowest?
  3. Will the worst segment drive the aggregate metric significantly, or will small sample size limit impact? (Hint: the advanced segment is only 5% of all queries. How much can 5% move the overall number?)
  4. How wide will the confidence interval be — how uncertain are we about the real rate — for advanced (n~50) vs. simple (n~600)?
Write your predictions before running the next cell.

Aggregate SQL correctness: 86%, CI [0.84, 0.89]

SQL Correctness (aggregate)
86%
95% CI: [0.84, 0.89]
The aggregate is the starting point, not the conclusion.

Simple 96%, multi-join 78%, advanced 34% — the aggregate hid a disaster

Simple queries
96%
n=600
Multi-join queries
78%
n=350
Advanced queries
34%
n=50
Aggregate of 86% was hiding a catastrophe for advanced queries

Advanced segment n=50 has wide CI [0.21, 0.48] — true rate is unknown

Confidence Intervals by Segment
Simple [0.94, 0.98]
Multi-join [0.73, 0.82]
Advanced [0.21, 0.48]
0.0 0.25 0.5 0.75 1.0
n=50: the true rate could be anywhere in this range. Even the upper bound (48%) is unacceptable.
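CI width scales as 1/sqrt(n). A quick check with the normal-approximation half-width z·sqrt(p(1−p)/n) shows why n=50 is so much less trustworthy than n=600:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(ci_half_width(0.34, 50), 3))   # advanced: 0.131 -- about +/-13 points
print(round(ci_half_width(0.96, 600), 3))  # simple:   0.016 -- about +/-1.6 points
```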

Segment SQL correctness by domain, identify the lowest-performing segment

Base version (all students)
  • Run worked example
  • Segment by domain
  • Interpret results
  • Run check at end (finance < 0.75)
  • Define new dimension
  • Compute segment metrics
  • Fill prioritization template
  • Assemble final schema
Extend version (DS/Eng)
  • Write segment computation from scratch
  • Define custom dimension requiring feature engineering
  • Compute 3+ quality dimensions
  • Check what happens when you combine two dimensions (e.g., are advanced finance queries even worse than advanced queries overall?)
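The last extend task — crossing two dimensions — can be sketched with a plain group-by. The records and field names below are hypothetical; note how quickly cross-segment n shrinks, which is exactly the over-segmentation trap covered next:

```python
from collections import defaultdict

# Hypothetical per-query eval records: (complexity, domain, sql_correct)
records = [
    ("advanced", "finance", False), ("advanced", "finance", False),
    ("advanced", "sales", True),
    ("simple", "finance", True), ("simple", "sales", True),
    ("multi-join", "finance", True), ("multi-join", "finance", False),
]

# Cross-segment: complexity x domain
totals, passes = defaultdict(int), defaultdict(int)
for complexity, domain, correct in records:
    key = (complexity, domain)
    totals[key] += 1
    passes[key] += correct

for key in sorted(totals):
    rate = passes[key] / totals[key]
    print(f"{key}: {rate:.2f} (n={totals[key]})")  # tiny n: treat as unreliable
```

With real traffic, any cross-segment whose n falls below your minimum (e.g. n<20) should be flagged rather than ranked.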

Segment Prioritization Schema with volume × severity ranking

Decision-ready artifact — feeds into release criteria (L4.6)
Segment           Metric           Rate   n    95% CI         Volume × Severity   Priority
advanced queries  SQL correctness  0.34   50   [0.21, 0.48]   HIGH                1
finance domain    SQL correctness  0.71   80   [0.63, 0.79]   HIGH (safety)       2
weekend queries   SQL correctness  0.75   12   [0.43, 0.93]   LOW (n<20)          backlog

Over-segmentation, small-sample false alarms, segmenting by outcome, aggregating without understanding drivers

Over-segmentation
8 dimensions × 5 values each = 5^8 = 390,625 possible cross-segments, most with n=1. No signal.
Small-sample false alarms
n=8, 75% failure rate, CI [35%, 97%]. True rate unknown.
Circular segmentation
Segmenting by "query succeeded vs. failed" is circular — you're segmenting by the thing you're measuring. Tells you nothing new. Segment by observable features instead.
Aggregating without drivers
'Finance is worse' reported, never investigated. Segmentation reveals patterns; driver analysis explains them (L4.5).

PM says 88% passes the 85% threshold — what do you tell them?

Scenario 1

Aggregate SQL correctness: 88%.

Segments: simple (n=200, 95%), multi-join (n=150, 84%), advanced (n=50, 62%).

PM says '88% is above our 85% threshold, let's ship.'

What do you tell them? What evidence from the segmentation analysis do you cite?

Segmentation Workflow: Define dimensions, compute metrics, prioritize segments

Phase 1: Define
Start with the number. Ask: for whom?
3-5 dimensions
query_complexity / domain / user_role
Avoid: 8+ dimensions
Phase 2: Compute
Segment-level metrics with CIs
Flag: n<20 unreliable
Phase 3: Prioritize
Volume × severity ranking
Output: Segment Prioritization Schema
Feeds into release criteria (L4.6). Segmentation reveals patterns; driver analysis (L4.5) explains why.

Next: Driver analysis — figuring out what causes the quality gaps between segments

  • Figure out what's causing the quality gaps between segments
  • Identify which factors explain segment differences
  • Choose intervention targets based on explanatory power
Builds on: Segment Prioritization Schema from L4.4
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 4 | aianalystlab.ai