Lesson 4.6 — Metric specs + release criteria

Week 4 · Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What was the strongest driver you identified, and how did knowing that driver change what intervention you would recommend first?

Take 30 seconds to answer from memory.

Your PM asks: "Are we shipping?" but nobody has a framework for resolving metric conflicts

Engineering
"74% SQL success is great"
Design
"0.79 narrative quality is a regression"
Data Science
"Latency increase might breach p95 SLA"
You have measurements but no specifications.

A metric without a specification is a number on a dashboard — with a spec it's a contract

Before
Dashboard showing: SQL Success: 74%, Narrative Quality: 0.79 — no context, no thresholds
After
sql_success_rate
Definition: successful executions / total attempts
Population: logged-in users, exclude test accounts
Threshold: >= 60% (absolute floor)
Owner: Product

Metric specs require 8 components: definition, unit, population, segments, sampling, aggregation, ownership, versioning

Definition
Exact calculation formula
Unit of analysis
What you count: each question separately, or grouped per user?
Population
Which traces included
Segments
Which breakdowns matter
Sampling
Full population or sample (and why you might not evaluate every single trace)
Aggregation
How you summarize: average, worst-case (P95), or success-on-first-try?
Ownership
Who owns the pipeline, the definition, and investigations
Versioning
Refresh cadence, change tracking
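The eight components above can be sketched as a typed record. This is an illustrative shape only, not a fixed schema from the lesson; field names and example values are assumptions.

```python
from dataclasses import dataclass

# A metric spec as a typed record; every field name here is an
# illustrative stand-in for one of the 8 components.
@dataclass
class MetricSpec:
    name: str
    definition: str        # exact calculation formula
    unit_of_analysis: str  # each question separately, or grouped per user
    population: str        # which traces are included
    segments: list         # which breakdowns matter
    sampling: str          # full population or sample, and why
    aggregation: str       # average, worst-case (P95), success-on-first-try
    owner: str             # who owns pipeline, definition, investigation
    version: str           # refresh cadence and change tracking

sql_success = MetricSpec(
    name="sql_success_rate",
    definition="successful executions / total attempts",
    unit_of_analysis="per question",
    population="logged-in users, exclude test accounts",
    segments=["user_role", "query_complexity"],
    sampling="full population (small enough to score every trace)",
    aggregation="mean over the evaluation window",
    owner="Product",
    version="v1, refreshed weekly",
)
```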

Release criteria classify metrics as blocking, optimization, or guardrail — and define thresholds

Blocking
Must pass or release stops

Examples: SQL success >= 60%, narrative quality delta < 5%
Guardrail
Blocking metrics that prevent harm

Examples: cost per query < +15%, p95 latency < 2000ms
Optimization
Tracked but not gating

Examples: chart rendering speed, query parse time
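The three categories can be captured as a simple mapping. The metric keys below are assumptions drawn from the lesson's examples; the point is that only blocking and guardrail metrics can ever stop a release.

```python
# Illustrative classification of the lesson's example metrics.
METRIC_CATEGORIES = {
    "sql_success_rate":  "blocking",      # must pass or release stops
    "narrative_quality": "blocking",
    "cost_per_query":    "guardrail",     # blocking metrics that prevent harm
    "p95_latency_ms":    "guardrail",
    "chart_render_rate": "optimization",  # tracked but not gating
}

def gating_metrics(categories: dict) -> list:
    """Metrics that can stop a release: blocking + guardrail."""
    return [m for m, c in categories.items() if c in ("blocking", "guardrail")]
```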

Four threshold patterns: absolute floors, delta thresholds, confidence-aware, segment-aware

Pattern            Example                                  Purpose
Absolute floor     SQL success >= 60%                       Minimum usability standard
Delta threshold    Narrative quality regression < 5%        Prevent meaningful regressions
Confidence-aware   CI check confirms improvement is real    Ensure improvement isn't noise
Segment-aware      No segment < 50%                         Prevent hiding broken segments
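The first, second, and fourth patterns reduce to small predicate functions. A sketch using the example values from the table; function names are illustrative.

```python
def passes_absolute_floor(value: float, floor: float) -> bool:
    """Absolute floor: minimum usability standard."""
    return value >= floor

def passes_delta_threshold(candidate: float, baseline: float,
                           max_regression: float = 0.05) -> bool:
    """Delta threshold: allow at most a 5% relative drop from baseline."""
    return candidate >= baseline * (1 - max_regression)

def passes_segment_floor(segment_values: dict, floor: float = 0.50) -> bool:
    """Segment-aware: every segment must clear the floor,
    since aggregates can hide broken segments."""
    return all(v >= floor for v in segment_values.values())

# Examples from the table:
passes_absolute_floor(0.74, 0.60)                      # True
passes_delta_threshold(0.79, 0.82)                     # 0.79 >= 0.779 -> True
passes_segment_floor({"Executive": 0.55, "PM": 0.72})  # True
```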

Your v1 baseline is 68% [65%-71%]. What threshold would you set and why?

68%
v1 SQL Success Rate
95% CI: [65% - 71%]
Question: What threshold: 68%, 65%, 70%, or something else?

Sub-question: Baseline-relative or absolute? How does CI affect your choice?

Write your prediction before continuing.

Set thresholds before evaluating the candidate model — looking at v2 first is p-hacking

✓ Correct
1. Compute v1 baselines [65%-71%]
2. Set threshold at 60% (below CI)
3. Evaluate v2
4. Compare v2 (74%) to threshold → Pass
✗ Incorrect
1. Evaluate v2 (74%)
2. Set threshold at 72%
3. v2 passes (guaranteed!)
Thresholds come from baselines and business requirements — never from candidate performance
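The correct ordering can be sketched in a few lines: the gate is frozen from the v1 baseline and the business minimum before v2 is ever scored.

```python
# Sketch of the correct workflow order (values from the lesson).
baseline_ci = (0.65, 0.71)   # v1 sql_success_rate, 95% CI
business_floor = 0.60        # product's absolute minimum

# The threshold sits at the business floor, below the baseline CI,
# so the candidate's own numbers can never move the goalposts.
threshold = min(business_floor, baseline_ci[0])   # 0.60

v2_score = 0.74              # measured only AFTER the gate is frozen
ships = v2_score >= threshold
```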

Executive users at 55%, ambiguous queries at 45% — segment baselines drive segment thresholds

By User Role
  Executive: 55% [49%-61%]
  PM: 72% [68%-76%]
  DS: 71% [66%-76%]
By Query Complexity
  Ambiguous: 45% [38%-52%]
  Moderate: 68% [63%-73%]
  Simple: 82% [78%-86%]
Overall: 68% [65%-71%] — but segments reveal where the system breaks
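Segment intervals like these can be computed with a normal-approximation CI for a proportion. A sketch, matching the "from scratch, no library" extension exercise; the counts below are hypothetical and only chosen to land near the slide's Executive segment, and for small segments a Wilson interval is safer.

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """95% normal-approximation CI for a success rate.
    Returns (point_estimate, lower, upper), clipped to [0, 1]."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical segment: 55 successes out of 100 Executive traces.
p, lo, hi = proportion_ci(55, 100)   # ~55% [45%-65%]
```

Note how the interval widens as the segment shrinks: the overall 68% [65%-71%] is tight because it pools every trace, while a small Executive segment is far less certain.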

Execution-based evaluation: did the SQL return the right answer, not just run without error?

Execution-based (sql_success)
"Did SQL run without error?"

Baseline: 68%
Limitation: SQL can execute but return wrong data
Correctness check (using known right answers)
"Did SQL return the expected result set?"

Baseline: 61%
Requires: A set of test questions where you already know the right answer

Oracle correctness is stricter than execution success

60% absolute floor (below CI), 5% delta for narrative quality, 50% segment floor

Absolute Floor
Metric: sql_success_rate
v1 baseline: 68% [65%-71%]
Threshold: >= 60%
Below lower CI bound — business minimum, not statistical artifact
Delta Threshold
Metric: narrative_quality
v1 baseline: 0.82 [0.79-0.85]
Threshold: >= 0.779 (0.82 × 0.95)
Catches meaningful regressions, allows measurement noise
Segment Floor
Metric: sql_success_rate by user_role
Threshold: no segment < 50%
Prevents shipping broken Executive experience

Complete the narrative quality metric spec + classify your metric inventory + write release gate logic

Base (25-30 min)
  • Complete narrative quality metric spec (population, segments, thresholds, ownership)
  • Compute narrative quality by segment with confidence intervals
  • Classify metric inventory: blocking (3-5), guardrail, optimization
  • Write release gate logic as Boolean conditions
  • Document one metric conflict (retrieval depth vs cost)
  • Assemble the artifact
Extend (DS/Eng, +10-15 min)
  • Implement statistical confidence intervals from scratch (no library)
  • Retrospective validation: apply criteria to v0-to-v1
  • Segment confidence interval overlap analysis

Metric spec with release criteria showing blocking thresholds, guardrails, and segment rules

Metric             Category      v1 Baseline [95% CI]   Release Threshold
sql_success_rate   Blocking      68% [65%-71%]          >= 60% overall, >= 50% all segments
narrative_quality  Blocking      0.82 [0.79-0.85]       >= 0.779 (delta < 5%)
oracle_pass_rate   Blocking      61% [57%-65%]          >= 55%
cost_per_query     Guardrail     $0.08 [$0.07-$0.09]    < $0.092 (delta < 15%)
p95_latency        Guardrail     1650ms [1500-1800]     < 2000ms
chart_rendering    Optimization  94% [91%-97%]          tracked, not gating
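The lesson's blocking thresholds and guardrails (sql_success >= 60% overall and >= 50% per segment, oracle_pass >= 55%, narrative delta < 5%, cost < +15%, p95 < 2000ms) can be written as Boolean gate logic. A sketch; the dict keys and function name are assumptions, not a fixed API.

```python
def release_gate(m: dict) -> bool:
    """Ship only if every blocking metric AND every guardrail passes.
    Optimization metrics are tracked elsewhere and never gate."""
    blocking = (
        m["sql_success_rate"] >= 0.60
        and min(m["sql_success_by_segment"].values()) >= 0.50
        and m["narrative_quality"] >= 0.82 * 0.95   # delta < 5% vs v1
        and m["oracle_pass_rate"] >= 0.55
    )
    guardrails = (
        m["cost_per_query"] < 0.08 * 1.15           # < +15% vs v1
        and m["p95_latency_ms"] < 2000
    )
    return blocking and guardrails

# Hypothetical v2 measurements:
v2 = {
    "sql_success_rate": 0.74,
    "sql_success_by_segment": {"Executive": 0.55, "PM": 0.78, "DS": 0.76},
    "narrative_quality": 0.79,
    "oracle_pass_rate": 0.58,
    "cost_per_query": 0.085,
    "p95_latency_ms": 1800,
}
ships = release_gate(v2)   # True: all blocking and guardrail checks pass
```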

Setting thresholds after looking at v2, ignoring segments, treating all metrics as blocking, ignoring baseline CI

⚠ Setting thresholds by looking at candidate results
Risk: Guaranteed to pass — p-hacking the ship decision
⚠ Forgetting segment-level thresholds
Risk: Aggregate hides broken Executive experience
⚠ Treating all metrics as blocking
Risk: Metric whack-a-mole — releases never pass
⚠ Ignoring baseline uncertainty
Risk: Blocking changes indistinguishable from current system

Should threshold be 68%, 65%, 70%, or something else? How do you decide blocking vs optimization metrics?

Question 1:
v1 baseline: 68% [65%-71%]. Set blocking threshold at 68%, 65%, 70%, or something else? Explain using the CI. What happens if threshold is inside vs below the CI?

Question 2:
12 metrics in dashboard. Colleague proposes making all 12 blocking. How many should be blocking? How do you decide? What is the risk of too many blocking metrics? Too few?

Question 3:
v2 narrative quality: 0.79. v1 baseline: 0.82. Release criteria: 'regression < 5%'. Does v2 pass? Show calculation. What if v1 CI is [0.79-0.85]?

Two layers: 8-component metric spec feeds into release criteria with blocking/guardrail/optimization gates

1
Metric Specification
8 components: Definition | Unit of Analysis | Population | Segments | Sampling | Aggregation | Ownership | Versioning
2
Release Criteria
Blocking (3-5): sql_success >= 60%, oracle_pass >= 55%, narrative delta < 5% | Guardrail: cost < +15%, p95_latency < 2000ms | Optimization (5-10): tracked, not gating
All blocking pass?
Check guardrails
Ship ✓

Thresholds set BEFORE evaluating candidate model

Next: Pipeline architecture

You have metrics and thresholds — now you need a system to evaluate 1000 traces in 3 minutes, not 3 hours
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 6 | aianalystlab.ai