Lesson 4.6 — Metric specs + release criteria

Week 4 · Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What was the strongest driver you identified, and how did knowing that driver change what intervention you would recommend first?

Take 30 seconds to answer from memory.

Your PM asks: "Are we shipping?" but nobody has a framework for resolving metric conflicts

Engineering
"74% SQL success is great"
Design
"0.79 narrative quality is a regression"
Data Science
"Latency increase might breach p95 SLA"
You have measurements but no specifications.

A metric without a specification is a number on a dashboard — with a spec it's a contract

Before
Dashboard showing: SQL Success: 74%, Narrative Quality: 0.79 — no context, no thresholds
After
sql_success_rate
Definition: successful executions / total attempts
Population: logged-in users, exclude test accounts
Threshold: >= 60% (absolute floor)
Owner: Product

Metric specs require 8 components: definition, unit, population, segments, sampling, aggregation, ownership, versioning

Definition
Exact calculation formula
Unit of analysis
What you count: each question separately, or grouped per user?
Population
Which traces included
Segments
Which breakdowns matter
Sampling
Full population or sample (and why you might not evaluate every single trace)
Aggregation
How you summarize: average, worst-case (P95), or success-on-first-try?
Ownership
Who owns the pipeline, the definition, and investigations
Versioning
Refresh cadence, change tracking
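The eight components above can be sketched as a typed record. This is an illustrative shape only, not a fixed schema from the lesson; field names and example values are assumptions.

```python
from dataclasses import dataclass

# A metric spec as a typed record; every field name here is an
# illustrative stand-in for one of the 8 components.
@dataclass
class MetricSpec:
    name: str
    definition: str        # exact calculation formula
    unit_of_analysis: str  # each question separately, or grouped per user
    population: str        # which traces are included
    segments: list         # which breakdowns matter
    sampling: str          # full population or sample, and why
    aggregation: str       # average, worst-case (P95), success-on-first-try
    owner: str             # who owns pipeline, definition, investigation
    version: str           # refresh cadence and change tracking

sql_success = MetricSpec(
    name="sql_success_rate",
    definition="successful executions / total attempts",
    unit_of_analysis="per question",
    population="logged-in users, exclude test accounts",
    segments=["user_role", "query_complexity"],
    sampling="full population (small enough to score every trace)",
    aggregation="mean over the evaluation window",
    owner="Product",
    version="v1, refreshed weekly",
)
```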

Release criteria classify metrics as blocking, optimization, or guardrail — and define thresholds

Blocking
Must pass or release stops

Examples: SQL success >= 60%, narrative quality delta < 5%
Guardrail
Blocking metrics that prevent harm

Examples: cost per query < +15%, p95 latency < 2000ms
Optimization
Tracked but not gating

Examples: chart rendering speed, query parse time
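The three categories can be captured as a simple mapping. The metric keys below are assumptions drawn from the lesson's examples; the point is that only blocking and guardrail metrics can ever stop a release.

```python
# Illustrative classification of the lesson's example metrics.
METRIC_CATEGORIES = {
    "sql_success_rate":  "blocking",      # must pass or release stops
    "narrative_quality": "blocking",
    "cost_per_query":    "guardrail",     # blocking metrics that prevent harm
    "p95_latency_ms":    "guardrail",
    "chart_render_rate": "optimization",  # tracked but not gating
}

def gating_metrics(categories: dict) -> list:
    """Metrics that can stop a release: blocking + guardrail."""
    return [m for m, c in categories.items() if c in ("blocking", "guardrail")]
```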

Four threshold patterns: absolute floors, delta thresholds, confidence-aware, segment-aware

Pattern            Example                                  Purpose
Absolute floor     SQL success >= 60%                       Minimum usability standard
Delta threshold    Narrative quality regression < 5%        Prevent meaningful regressions
Confidence-aware   CI check confirms improvement is real    Ensure improvement isn't noise
Segment-aware      No segment < 50%                         Prevent hiding broken segments
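The first, second, and fourth patterns reduce to small predicate functions. A sketch using the example values from the table; function names are illustrative.

```python
def passes_absolute_floor(value: float, floor: float) -> bool:
    """Absolute floor: minimum usability standard."""
    return value >= floor

def passes_delta_threshold(candidate: float, baseline: float,
                           max_regression: float = 0.05) -> bool:
    """Delta threshold: allow at most a 5% relative drop from baseline."""
    return candidate >= baseline * (1 - max_regression)

def passes_segment_floor(segment_values: dict, floor: float = 0.50) -> bool:
    """Segment-aware: every segment must clear the floor,
    since aggregates can hide broken segments."""
    return all(v >= floor for v in segment_values.values())

# Examples from the table:
passes_absolute_floor(0.74, 0.60)                      # True
passes_delta_threshold(0.79, 0.82)                     # 0.79 >= 0.779 -> True
passes_segment_floor({"Executive": 0.55, "PM": 0.72})  # True
```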

Your v1 baseline is 68% [65%-71%]. What threshold would you set and why?

68%
v1 SQL Success Rate
95% CI: [65% - 71%]
Question: What threshold: 68%, 65%, 70%, or something else?

Sub-question: Baseline-relative or absolute? How does CI affect your choice?

Write your prediction before continuing.

Set thresholds before evaluating the candidate model — looking at v2 first is p-hacking

✓ Correct
1. Compute v1 baselines [65%-71%]
2. Set threshold at 60% (below CI)
3. Evaluate v2
4. Compare v2 (74%) to threshold → Pass
✗ Incorrect
1. Evaluate v2 (74%)
2. Set threshold at 72%
3. v2 passes (guaranteed!)
Thresholds come from baselines and business requirements — never from candidate performance
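The correct ordering can be sketched in a few lines: the gate is frozen from the v1 baseline and the business minimum before v2 is ever scored.

```python
# Sketch of the correct workflow order (values from the lesson).
baseline_ci = (0.65, 0.71)   # v1 sql_success_rate, 95% CI
business_floor = 0.60        # product's absolute minimum

# The threshold sits at the business floor, below the baseline CI,
# so the candidate's own numbers can never move the goalposts.
threshold = min(business_floor, baseline_ci[0])   # 0.60

v2_score = 0.74              # measured only AFTER the gate is frozen
ships = v2_score >= threshold
```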

Executive users at 55%, ambiguous queries at 45% — segment baselines drive segment thresholds

By User Role
  Executive: 55% [49%-61%]
  PM: 72% [68%-76%]
  DS: 71% [66%-76%]
By Query Complexity
  Ambiguous: 45% [38%-52%]
  Moderate: 68% [63%-73%]
  Simple: 82% [78%-86%]
Overall: 68% [65%-71%] — but segments reveal where the system breaks
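Segment intervals like these can be computed with a normal-approximation CI for a proportion. A sketch, matching the "from scratch, no library" extension exercise; the counts below are hypothetical and only chosen to land near the slide's Executive segment, and for small segments a Wilson interval is safer.

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """95% normal-approximation CI for a success rate.
    Returns (point_estimate, lower, upper), clipped to [0, 1]."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical segment: 55 successes out of 100 Executive traces.
p, lo, hi = proportion_ci(55, 100)   # ~55% [45%-65%]
```

Note how the interval widens as the segment shrinks: the overall 68% [65%-71%] is tight because it pools every trace, while a small Executive segment is far less certain.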

Execution-based evaluation: did the SQL return the right answer, not just run without error?

Execution-based (sql_success)
"Did SQL run without error?"

Baseline: 68%
Limitation: SQL can execute but return wrong data
Correctness check (using known right answers)
"Did SQL return the expected result set?"

Baseline: 61%
Requires: A set of test questions where you already know the right answer

Oracle correctness is stricter than execution success

60% absolute floor (below CI), 5% delta for narrative quality, 50% segment floor

Absolute Floor
Metric: sql_success_rate
v1 baseline: 68% [65%-71%]
Threshold: >= 60%
Below lower CI bound — business minimum, not statistical artifact
Delta Threshold
Metric: narrative_quality
v1 baseline: 0.82 [0.79-0.85]
Threshold: >= 0.779 (0.82 × 0.95)
Catches meaningful regressions, allows measurement noise
Segment Floor
Metric: sql_success_rate by user_role
Threshold: no segment < 50%
Prevents shipping broken Executive experience

Complete the narrative quality metric spec + classify your metric inventory + write release gate logic

Base (25-30 min)
  • Complete narrative quality metric spec (population, segments, thresholds, ownership)
  • Compute narrative quality by segment with confidence intervals
  • Classify metric inventory: blocking (3-5), guardrail, optimization
  • Write release gate logic as Boolean conditions
  • Document one metric conflict (retrieval depth vs cost)
  • Assemble the artifact
Extend (DS/Eng, +10-15 min)
  • Implement statistical confidence intervals from scratch (no library)
  • Retrospective validation: apply criteria to v0-to-v1
  • Segment confidence interval overlap analysis

Metric spec with release criteria showing blocking thresholds, guardrails, and segment rules

Metric             Category      v1 Baseline [95% CI]   Release Threshold
sql_success_rate   Blocking      68% [65%-71%]          >= 60% overall, >= 50% all segments
narrative_quality  Blocking      0.82 [0.79-0.85]       >= 0.779 (delta < 5%)
oracle_pass_rate   Blocking      61% [57%-65%]          >= 55%
cost_per_query     Guardrail     $0.08 [$0.07-$0.09]    < $0.092 (delta < 15%)
p95_latency        Guardrail     1650ms [1500-1800]     < 2000ms
chart_rendering    Optimization  94% [91%-97%]          tracked, not gating
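The lesson's blocking thresholds and guardrails (sql_success >= 60% overall and >= 50% per segment, oracle_pass >= 55%, narrative delta < 5%, cost < +15%, p95 < 2000ms) can be written as Boolean gate logic. A sketch; the dict keys and function name are assumptions, not a fixed API.

```python
def release_gate(m: dict) -> bool:
    """Ship only if every blocking metric AND every guardrail passes.
    Optimization metrics are tracked elsewhere and never gate."""
    blocking = (
        m["sql_success_rate"] >= 0.60
        and min(m["sql_success_by_segment"].values()) >= 0.50
        and m["narrative_quality"] >= 0.82 * 0.95   # delta < 5% vs v1
        and m["oracle_pass_rate"] >= 0.55
    )
    guardrails = (
        m["cost_per_query"] < 0.08 * 1.15           # < +15% vs v1
        and m["p95_latency_ms"] < 2000
    )
    return blocking and guardrails

# Hypothetical v2 measurements:
v2 = {
    "sql_success_rate": 0.74,
    "sql_success_by_segment": {"Executive": 0.55, "PM": 0.78, "DS": 0.76},
    "narrative_quality": 0.79,
    "oracle_pass_rate": 0.58,
    "cost_per_query": 0.085,
    "p95_latency_ms": 1800,
}
ships = release_gate(v2)   # True: all blocking and guardrail checks pass
```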

Setting thresholds after looking at v2, ignoring segments, treating all metrics as blocking, ignoring baseline CI

⚠ Setting thresholds by looking at candidate results
Risk: Guaranteed to pass — p-hacking the ship decision
⚠ Forgetting segment-level thresholds
Risk: Aggregate hides broken Executive experience
⚠ Treating all metrics as blocking
Risk: Metric whack-a-mole — releases never pass
⚠ Ignoring baseline uncertainty
Risk: Blocking changes indistinguishable from current system

Should threshold be 68%, 65%, 70%, or something else? How do you decide blocking vs optimization metrics?

Question 1:
v1 baseline: 68% [65%-71%]. Set blocking threshold at 68%, 65%, 70%, or something else? Explain using the CI. What happens if threshold is inside vs below the CI?

Question 2:
12 metrics in dashboard. Colleague proposes making all 12 blocking. How many should be blocking? How do you decide? What is the risk of too many blocking metrics? Too few?

Question 3:
v2 narrative quality: 0.79. v1 baseline: 0.82. Release criteria: 'regression < 5%'. Does v2 pass? Show calculation. What if v1 CI is [0.79-0.85]?

Two layers: 8-component metric spec feeds into release criteria with blocking/guardrail/optimization gates

1
Metric Specification
8 components: Definition | Unit of Analysis | Population | Segments | Sampling | Aggregation | Ownership | Versioning
2
Release Criteria
Blocking (3-5): sql_success >= 60%, oracle_pass >= 55%, narrative delta < 5% | Guardrail: cost < +15%, p95_latency < 2000ms | Optimization (5-10): tracked, not gating
All blocking pass?
Check guardrails
Ship ✓

Thresholds set BEFORE evaluating candidate model

Next: Pipeline architecture

You have metrics and thresholds — now you need a system to evaluate 1000 traces in 3 minutes, not 3 hours
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 6 | aianalystlab.ai