Metric Strategy

4.1

Week 4 · Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which metrics from your Judge Report Card are trustworthy enough to use in a release gate, and which should only be used for continuous improvement?

Release Gate
Blocks deployment if threshold fails
Think: Which metrics would you trust to block a release?
Continuous Improvement
Tracked over time, doesn't block ship
Think: Which metrics are too noisy to gate?

12 metrics, one PM question: "Should we ship v2?"

12 Evaluation Metrics
Retrieval Recall@5, SQL Correctness, Narrative Faithfulness, Cost per Query, Latency p95...
↓
PM Question
"Should we ship v2?"
You can't answer it with "ship if all 12 metrics look good." That's evaluation theater, not a decision framework.

Classify metrics as blocking or optimization

Blocking Metric
Must pass threshold for release to proceed
🚧
Optimization Metric
Tracked for improvement over time but not gating releases
📈

The decision test: "If this fails but everything else passes, do we ship?"

Metric fails threshold
↓
Answer: No
→ Blocking metric
Answer: Note it and keep improving
→ Optimization metric

Build a metric tree: from North Star to drivers to low-level metrics

1
North Star Metric
Single measure of user value (e.g., User Trust)
2
Driver Metrics
Mid-level metrics that feed into the North Star (4-6 metrics)
3
Low-Level Metrics
Output quality metrics from traces (8-12 metrics)
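One way to make the three-level hierarchy concrete is to encode it as data. A minimal Python sketch using the lesson's example metrics; the dict structure and the helper name `low_level_metrics` are illustrative, not a prescribed schema:

```python
# Three-level metric tree: North Star -> drivers -> low-level metrics.
# Metric names follow the lesson's example; the shape is illustrative.
metric_tree = {
    "north_star": "User Trust",
    "drivers": {
        "Answer Correctness": ["SQL Correctness", "Retrieval Recall@5",
                               "Narrative Faithfulness"],
        "Response Latency": ["Latency p95"],
        "Cost per Query": ["Cost per Query"],
        "Multi-turn Coherence": ["Multi-turn Coherence"],
    },
}

def low_level_metrics(tree: dict) -> list[str]:
    """Flatten the driver leaves into one list of low-level metrics."""
    return [m for leaves in tree["drivers"].values() for m in leaves]
```

Encoding the tree this way lets a dashboard or report group low-level metrics under the driver they feed, instead of showing a flat list of 12 numbers.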

Every blocking threshold needs a documented rationale

Metric            | Threshold | Rationale
SQL Correctness   | >98%      | Execution-based: syntactically valid SQL is required to return results
Cost per Query    | <$0.15    | Business impact: $0.10 infra cost + $0.05 margin buffer at 10k queries/day
Policy Compliance | >97%      | Regulatory: violations have legal consequences
Why this matters
Without a rationale, the threshold is arbitrary and indefensible. When SQL correctness hovers around 91% and then dips to 89%, you need a principled basis for deciding whether to hold the release or ship anyway.
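The threshold table above can be expressed directly as a release gate: each blocking metric carries its threshold, comparison direction, and documented rationale, so a failed gate reports *why* the bar exists. A minimal sketch; the names `GateMetric` and `check_gate` are illustrative, not a real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateMetric:
    name: str
    threshold: float
    passes: Callable[[float], bool]  # direction of the comparison
    rationale: str                   # documented basis for the threshold

# Thresholds and rationales mirror the lesson's table.
GATES = [
    GateMetric("sql_correctness", 0.98, lambda v: v > 0.98,
               "Execution-based: SQL must parse to return results"),
    GateMetric("cost_per_query", 0.15, lambda v: v < 0.15,
               "Business impact: $0.10 infra + $0.05 margin at 10k queries/day"),
    GateMetric("policy_compliance", 0.97, lambda v: v > 0.97,
               "Regulatory: violations have legal consequences"),
]

def check_gate(results: dict) -> list[str]:
    """Return the rationale of every blocking metric that failed."""
    return [g.rationale for g in GATES if not g.passes(results[g.name])]

failures = check_gate({"sql_correctness": 0.991,
                       "cost_per_query": 0.12,
                       "policy_compliance": 0.96})
```

Here `failures` contains one entry, the regulatory rationale, because policy compliance at 96% misses its >97% bar while the other two gates pass.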

Four rationale sources: execution-based, business impact, competitive benchmarking, user tolerance

Execution-based correctness
Must be 100% valid to function. Example: SQL must parse.
Business impact modeling
Cost/revenue constraints. Example: Cost <$0.15 or unprofitable.
Competitive benchmarking
Match or exceed alternatives. Example: Quality >= manual dashboards.
User tolerance thresholds
Behavioral tipping points. Example: Latency <5s or users abandon.

How many of your 12 metrics would you classify as blocking?

Before reading the worked example:
How many of the 12 metrics would you classify as blocking (must-pass for release)?
Write your prediction: ___ out of 12 metrics
After the demo, compare your prediction against the reference classification.

The metric tree: three levels from User Trust to driver metrics

User Trust
% of queries → user action within 1 hour
Answer Correctness · Response Latency · Cost per Query · Multi-turn Coherence
Blocking metric
Optimization metric

Three blocking metrics with clear rationales

Metric                 | Blocking? | Threshold | Rationale Source
SQL Syntax Validity    | Yes       | >98%      | Execution-based correctness: if SQL doesn't parse, the system returns nothing
Cost per Query         | Yes       | <$0.15    | Business impact: $0.10 infra + $0.05 margin at 10k queries/day; unprofitable above this
Policy Compliance Rate | Yes       | >97%      | Regulatory: a single violation could have legal consequences
The remaining 9 metrics are classified as **optimization**: tracked for improvement, reviewed on a cadence, but not gating release.

Classify 4 ambiguous metrics and design release criteria for v2

Base Exercise
  • Classify 4 metrics: Retrieval Recall@5, Latency p95, Judge Agreement, Multi-turn Coherence
  • Apply decision test for each
  • Write blocking/optimization + rationale
  • Design release criteria for v1→v2 retrieval improvement
Extend Exercise (DS/Eng)
  • Quantify threshold rationales
  • Map metric dependencies explicitly
  • Compute acceptable ranges from current values
  • Include sensitivity analysis

A metric system spec with classification, tree, and release criteria

Classification Table
Blocking vs. optimization with threshold rationales
Metric Tree
3-level hierarchy linking output quality to user trust
Release Criteria
IF-THEN format with thresholds and acceptable tradeoffs
This goes in your portfolio
A PM uses this in a ship meeting. A DS uses this to prioritize metric build-out. An engineer uses this to implement monitoring and alerting. It's the translation layer between evaluation evidence and ship decisions.
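The release criteria deliverable's IF-THEN format translates naturally into a comparison between two eval runs. A hedged sketch for the v1→v2 retrieval change; the +5-point improvement bar and the function name `ship_v2` are illustrative assumptions, not values from the lesson:

```python
def ship_v2(v1: dict, v2: dict) -> bool:
    """IF all blocking thresholds hold on v2 AND retrieval recall improves
    by at least 5 points over v1, THEN ship; otherwise hold.
    The 5-point bar is a placeholder; set it from your own rationale."""
    blocking_ok = (v2["sql_correctness"] > 0.98
                   and v2["cost_per_query"] < 0.15
                   and v2["policy_compliance"] > 0.97)
    retrieval_improved = v2["recall_at_5"] >= v1["recall_at_5"] + 0.05
    return blocking_ok and retrieval_improved
```

Writing the criterion as code forces the tradeoff to be explicit: a v2 that improves recall but lets SQL correctness slip below 98% fails the gate regardless of the retrieval gain.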

Five ways metric strategy breaks

  1. Everything is blocking → system never ships
    With 10 independent gates each passing 90% of the time, all pass simultaneously only ~35% of the time (0.9^10 ≈ 0.35)

  2. No threshold rationale → arbitrary and indefensible thresholds
    When a metric hovers near the line, the team has no basis for a hold/ship decision

  3. Metric proliferation without hierarchy → dashboard paralysis
    Flat list of 12 metrics, no clear story, arguments about individual numbers

  4. Confusing blocking metrics with runtime checks → implementation mismatch
    Blocking = aggregate quality, gates releases. Runtime = per-request protection

  5. Static thresholds ignore context → one-size-fits-all quality bars
    Lookup queries need perfect correctness. Complex causal analysis tolerates uncertainty

Three judgment scenarios โ€” blocking vs. optimization in practice

Scenario 1
SQL correctness >95% (blocking). Change improves retrieval +10 percentage points and completeness +8 percentage points, but SQL drops to 93%. PM says overall quality is clearly better. Do you ship? Why or why not?

Scenario 2
Answer completeness = optimization metric, target >=80%, current 78%. Six months later: 83%, but PM says users aren't happier. What might this signal?

Scenario 3
Threshold: Latency p95 <5s because that's our SLA. Is this valid? What additional info would strengthen it, and which rationale source does it use?

Pause and write your answers
These scenarios test whether you understand blocking metrics, metric tree dependencies, and threshold rationale sources.

The metric tree: North Star, drivers, and low-level metrics with blocking/optimization labels

User Trust in AI Data Analyst
Proxy: % queries followed by user action within 1 hour
Answer Correctness · Response Latency · Cost per Query · Multi-turn Coherence
SQL Correctness (Blocking)
Threshold: >98% | Rationale: Execution-based correctness: syntactically valid SQL is a hard requirement for a data analyst.
Blocking metric (must pass threshold for release)
Optimization metric (continuous improvement target)

Next: Metric Design Patterns for AI Features

How to design the metrics in your tree: comparing text similarity, checking if the right documents were retrieved, using AI to judge nuanced quality, and running the code to verify correctness
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 1 | aianalystlab.ai