Metric design patterns

Week 4 Lesson 2 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

In Lesson 4.1, you classified metrics as blocking vs. optimization — what is the key difference in how these two metric types affect ship decisions?

Blocking metric           | Optimization metric
(mental comparison space) | (mental comparison space)

Take 30 seconds to answer from memory.

You have metrics but no measurement system

AI Data Analyst v1 metrics
  • Retrieval recall@5
  • SQL correctness
  • Narrative grounding
PM asks:
"Are we ready to ship?"
You have numbers, but no answer.

Metrics built bottom-up exist in isolation — no one knows which metric gates the release

Retrieval team → recall@5
SQL team → execution success rate
Narrative team → LLM grounding score

Which metric blocks the release?
Which metric is tracked for improvement?
At what granularity should we measure?

Measurement archetypes are reusable metric structures tailored to feature types

Each archetype specifies five things
  • Unit of measurement: output / task / session / workflow
  • Core dimensions: which quality aspects matter
  • Blocking metrics: must-pass thresholds
  • Optimization metrics: tracked for improvement
  • Aggregation strategy: how to roll up scores
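The five fields above can be captured in a small spec object. This is a minimal sketch; the class and field names are illustrative, not the lesson's canonical template.

```python
from dataclasses import dataclass

@dataclass
class Archetype:
    name: str
    unit: str                    # output / task / session / workflow
    core_dimensions: list        # which quality aspects matter
    blocking: dict               # metric name -> must-pass threshold
    optimization: dict           # metric name -> improvement target
    aggregation: str             # how to roll up scores

# Example instance for the Summarization archetype; thresholds are assumptions.
summarization = Archetype(
    name="Summarization",
    unit="output",
    core_dimensions=["faithfulness", "coverage", "conciseness"],
    blocking={"hallucinated_facts": 0.0, "key_info_coverage": 0.95},
    optimization={"conciseness": 0.85},
    aggregation="worst-case",
)
```

Writing the spec down this way forces the team to state up front which metrics gate the release and which are merely tracked.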

Six primary archetypes cover most AI features

Archetype        | Feature type                                  | Default unit               | Blocking examples
Drafting         | Content generation where users edit outputs   | Output + edit signal       | Grammar errors, safety violations
Summarization    | Condensing long content into key points       | Output (faithfulness)      | Hallucinated facts, missing key info
Extraction       | Structured data from unstructured input       | Output (precision/recall)  | Output format errors, missing required fields
RAG              | Retrieval-augmented generation                | Component-level            | Unsupported claims, retrieval misses
Agents           | Multi-step tool use                           | Session-level              | Task failure, wrong tool selection
Decision Support | Recommendations or guidance                   | Outcome-level              | Overconfident wrong recommendations

The AI Data Analyst is a hybrid — decompose it into components and apply a different archetype to each

Retrieval + Narrative → RAG archetype: recall@5
SQL Generation → Tool calling / extraction hybrid: SQL correctness
Narrative Generation → Summarization archetype: grounding

Stop and predict: which measurement unit is most appropriate for the AI Data Analyst's narrative generation component?

Output-level (one narrative per query)
Task-level (user's analytical question answered)
Session-level (coherence across multi-turn conversation)
Workflow-level (user makes a decision based on the analysis)

Write your prediction and justify it in 1-2 sentences before proceeding.

Component-level evaluation reveals which part failed — separating retrieval and narrative shows different problems

End-to-end evaluation
  • Combined score: 0.74
  • Can't tell which component to fix
Component-level evaluation
  • Retrieval recall@5: 0.80
  • Narrative grounding: 0.84
  • Different problems, different fixes
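A toy sketch of why component scores localize failure while a combined score does not. The per-query results below are illustrative, not the lesson's dataset.

```python
# Each record holds pass/fail for the two components on one query.
results = [
    {"retrieval_hit": True,  "grounded": True},
    {"retrieval_hit": True,  "grounded": False},
    {"retrieval_hit": True,  "grounded": True},
    {"retrieval_hit": True,  "grounded": True},
    {"retrieval_hit": False, "grounded": True},
]

n = len(results)
recall = sum(r["retrieval_hit"] for r in results) / n        # retrieval component
grounding = sum(r["grounded"] for r in results) / n          # narrative component
# End-to-end only counts queries where BOTH components succeeded.
end_to_end = sum(r["retrieval_hit"] and r["grounded"] for r in results) / n
```

Here `end_to_end` is 0.6 while both component scores are 0.8, and the component scores point at different queries, so each team knows which failures are theirs to fix.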

SQL correctness is a blocking metric at ≥ 90% — narrative conciseness is an optimization metric

Blocking metric: SQL correctness
  • Threshold: ≥ 90% · Current: 89% (FAIL)
  • SQL errors break user trust immediately. One bad query per session is unacceptable.
Optimization metric: Narrative conciseness
  • Target: 0.85 · Current: 0.72
  • Verbose narratives reduce readability but don't break functionality. Track the trend; don't gate releases.
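The asymmetry above can be encoded directly in the ship decision: blocking metrics gate the release, optimization metrics are only reported. A minimal sketch; the function name and return shape are assumptions.

```python
def ship_decision(scores, blocking, optimization):
    """Blocking metrics gate the release; optimization metrics are tracked only."""
    failures = {m: t for m, t in blocking.items() if scores[m] < t}
    tracked = {m: (scores[m], target) for m, target in optimization.items()}
    return {"ship": not failures, "blocking_failures": failures, "tracking": tracked}

# Numbers from the slide: SQL correctness 89% vs a 90% gate blocks the ship;
# conciseness 0.72 vs a 0.85 target is reported but does not gate.
decision = ship_decision(
    scores={"sql_correctness": 0.89, "conciseness": 0.72},
    blocking={"sql_correctness": 0.90},
    optimization={"conciseness": 0.85},
)
```

Note that no optimization value, however low, can flip `ship` to False; only a blocking threshold can.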

Apply archetypes to the AI Data Analyst components

Base version (20 min)
  • Examine entity extraction results, identify lowest-precision entity type
  • Set blocking threshold for refusal detection (≥ 95% precision)
  • Write justification: why is refusal detection precision blocking?
  • Compute session-level pass rates using two aggregation strategies
  • Compare worst-case (SQL) vs. average (narrative) aggregation
  • Fill Metric Archetype Selection Template for one component
Deliverable: Completed template
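For the aggregation-comparison step, a minimal sketch of worst-case vs. average session roll-up. The session data and thresholds are illustrative assumptions.

```python
# Per-session lists of per-query scores (1.0 = pass, 0.0 = fail).
sessions = [
    [1.0, 1.0, 0.0],   # one bad query in an otherwise clean session
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]

def session_pass_rate(sessions, agg, threshold):
    """agg='worst' gates a session on its worst query; agg='mean' averages them."""
    passed = 0
    for s in sessions:
        score = min(s) if agg == "worst" else sum(s) / len(s)
        passed += score >= threshold
    return passed / len(sessions)

worst = session_pass_rate(sessions, "worst", threshold=1.0)   # 1 of 3 sessions pass
mean = session_pass_rate(sessions, "mean", threshold=0.6)     # all 3 pass
```

Worst-case aggregation surfaces the "one bad query per session" failures that averaging dilutes, which is why it suits SQL correctness while averaging can suit narrative quality.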

Metric Archetype Selection Template with blocking thresholds and optimization targets

Component      | Metric Type  | Metric Name     | Threshold/Target | Current Score | Pass/Fail
SQL Generation | Blocking     | SQL correctness | ≥ 90%            | 89%           | FAIL
Narrative      | Optimization | Conciseness     | 0.85             | 0.72          | n/a (tracked, not gated)
Retrieval      | Blocking     | Recall@5        | ≥ 0.80           | 0.80          | PASS
Portfolio-ready artifact — distinguishes blocking metrics from optimization metrics with computed values and confidence intervals.

Five ways metric design patterns fail in practice

1. Measuring at the wrong granularity
Output-level metrics miss session-level failures
2. Treating all metrics as blocking
Team delays ship for verbosity while SQL errors persist
3. Aggregating incorrectly across sessions
Averaging dilutes single failures
4. Mismatching archetype to feature type
Extraction metrics miss faithfulness
5. Setting thresholds without failure cost analysis
95% sounds good but costs $100K/day at volume

Apply archetype selection to new scenarios

Scenario 1
AI Data Analyst narrative: grounding (0.84 ± 0.03) vs. includes all key numbers (0.91). Which is blocking?
Scenario 2
Drafting feature: tone 0.88, grammar 0.97, length 0.72. Which blocks the release?
Scenario 3
Agent feature: tool call correctness 0.94, task success 0.78, workflow completion 0.65. Which unit drives ship decision?

Metric Design Pattern Selection Flow

Stage 1: Match feature type to archetype
Edits it
→ Drafting
Reads for facts
→ Summarization
Validates data
→ Extraction
Asks follow-up
→ RAG
Multi-step goal
→ Agent
Makes decision
→ Decision Support
Stage 2: Classify into 2×2 matrix
             | Quality                    | Performance
Blocking     | SQL correctness, grounding | Latency cap
Optimization | Conciseness, tone          | Token efficiency
→ Release Criteria (L4.6)
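Stage 1 of the flow above reduces to a lookup from dominant user behavior to archetype. A minimal sketch; the dictionary name and key strings are assumptions mirroring the slide's labels.

```python
# Stage 1: match the user's dominant behavior with the feature to an archetype.
ARCHETYPE_BY_BEHAVIOR = {
    "edits it": "Drafting",
    "reads for facts": "Summarization",
    "validates data": "Extraction",
    "asks follow-up": "RAG",
    "multi-step goal": "Agent",
    "makes decision": "Decision Support",
}

def select_archetype(behavior):
    """Return the archetype for a behavior label, case-insensitively."""
    return ARCHETYPE_BY_BEHAVIOR[behavior.strip().lower()]
```

Stage 2 then places each of the archetype's metrics into one cell of the blocking/optimization × quality/performance matrix before writing release criteria.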

Next: Cost-Aware Evaluation — making your measurement system affordable

You'll learn to cut evaluation costs by up to 90% using stratified sampling and judge cascades — without losing visibility into the metrics that matter.
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 2 | aianalystlab.ai