Decision-making under uncertainty

Week 6 Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What was the primary tradeoff you identified between blocking metrics and optimization metrics?

Blocking Metrics

Must pass to release

Example: SQL correctness must pass

Optimization Metrics

Tracked for improvement

Example: Retrieval quality tracked but doesn't block

You have all the metrics — now what?

Metric Result
Retrieval precision +53%
Hallucination rate 15% → 9% (-40%)
SQL correctness 82% (unchanged)
Average latency +110ms
P95 latency 1.8s → 2.4s (SLA breach 12%)
Cost per query +10%
Should we ship v2?

Real ship decisions are not "ship or don't"

Ship
Don't Ship
Your team needs six decision types, each with different evidence requirements.

Six decision types match the nuance of the evidence

Decision Type What It Means When It Applies
Ship Full rollout to all users High confidence on all blocking metrics, acceptable tradeoffs
Ramp Gradual rollout (10%, 25%, 50%, 100%) Moderate confidence, monitoring detects issues fast
Hold Do not deploy; gather more evidence Evidence insufficient or mixed signals
Rollback Reverse a deployed change Blocking metric failure post-deploy
Scope-restrict Deploy to specific segments only Treatment effect varies by user group (some benefit, others don't)
Add-human-in-loop Deploy with manual oversight Quality improvement exists but edge cases need review

Four dimensions convert signals to decisions

  • Direction certainty — Is the change better, worse, or mixed?
  • Magnitude assessment — How much better or worse on each dimension?
  • Confidence level — How strong is the evidence? (sample size, statistical significance)
  • Risk containment — What could go wrong, and how do you detect it?
Different decisions demand different evidence strength.
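The four dimensions above can be sketched as a small decision function. This is a minimal illustration, not the course's official rubric: the function name, the string-valued inputs, and the thresholds are all assumptions chosen to mirror the slide's logic.

```python
def recommend(direction: str, magnitude: str, confidence: str,
              risk_contained: bool) -> str:
    """Map the four evidence dimensions to one of the six decision types.

    direction:  "better", "worse", or "mixed"
    magnitude:  "small", "moderate", or "large"
    confidence: "low", "moderate", or "high"
    risk_contained: monitoring + rollback path exist
    """
    if not risk_contained:
        return "Hold"          # can't detect or reverse a failure yet
    if direction == "worse":
        return "Rollback"      # blocking-metric failure post-deploy
    if confidence == "low":
        return "Hold"          # gather more evidence first
    if direction == "mixed":
        return "Ramp"          # gradual rollout, monitoring detects issues
    if confidence == "high" and magnitude == "large":
        return "Ship"
    return "Ramp"

print(recommend("mixed", "large", "high", True))   # → Ramp
```

Applied to the v2 evidence (mixed direction, high confidence, risk contained), this sketch lands on Ramp, matching the rubric slide later in the lesson.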

Every ship decision with conflicting signals must explicitly state tradeoffs

"We recommend [decision] because [primary improvement]. The risk is [regression]. We mitigate via [monitoring/rollback trigger/scope restriction]."
Four required elements:
• What improved (specific metrics and magnitudes)
• What regressed (specific metrics and magnitudes)
• Why the tradeoff is acceptable (justification tied to user value)
• What mitigation reduces the risk (monitoring, rollback trigger, scope restriction)
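The template above is just a fill-in-the-blanks string, which makes it easy to generate programmatically. A minimal sketch; the field values are taken from the v2 example in this lesson:

```python
# Tradeoff-transparency template from the slide, as a format string.
template = ("We recommend {decision} because {improvement}. "
            "The risk is {regression}. We mitigate via {mitigation}.")

statement = template.format(
    decision="Ramp with scope-restrict",
    improvement="retrieval precision improved +53% and hallucination rate fell -40%",
    regression="P95 latency rising from 1.8s to 2.4s (12% SLA breach)",
    mitigation="segment-level P95 monitoring with a pre-committed rollback trigger",
)
print(statement)
```

Keeping the template as one string (rather than free-form prose) makes it easy to lint memos for the four required elements.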

Predict which decision you would make for the v2 change

v2 Evidence Summary:

• Retrieval precision: +53%
• Hallucination rate: -40% (15% → 9%)
• Average latency: +110ms
• P95 latency: 1.8s → 2.4s (SLA breach 12%)
• Cost: +10%
• Segment differences: Executive segment sees negligible quality gain but pays the full latency cost
Write down your choice: Ship, Ramp, Hold, Rollback, Scope-restrict, or Add-human-in-loop.

Segment breakdown reveals different results by user type

Segment Retrieval Improvement SLA Breach Rate
PM +62% 10%
DS +48% 9%
Engineering +35% 11%
Executive +8% 18%

The rubric points to Ramp with scope-restrict

Dimension Assessment
Direction certainty Mixed — retrieval ↑, hallucination ↓, latency ↑
Magnitude assessment Large improvement (+53% retrieval), moderate regression (+110ms average, SLA breach 12%)
Confidence level High (sample size: 10,000, tight confidence intervals)
Risk containment Monitor P95 latency by segment every 2 hours; rollback trigger: P95 >2.5s for >10% of queries
Recommended Decision: Ramp with scope-restrict
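The rollback trigger in the risk-containment row can be expressed as a check over each monitoring window. A sketch under one assumption: "P95 >2.5s for >10% of queries" is read as "more than 10% of queries in the window exceed 2.5s".

```python
def rollback_triggered(latencies_s: list[float],
                       threshold_s: float = 2.5,
                       max_breach_frac: float = 0.10) -> bool:
    """True when more than 10% of queries in the 2-hour window exceed 2.5s."""
    breaches = sum(1 for t in latencies_s if t > threshold_s)
    return breaches / len(latencies_s) > max_breach_frac

window = [1.9] * 85 + [2.8] * 15        # 15% of queries over 2.5s
print(rollback_triggered(window))       # → True
```

Pre-committing to a check like this is what turns "debate for days during an incident" into an automatic decision.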

The memo states evidence, acknowledges regression, defines rollback trigger

Decision Memo

Recommendation: Ramp with scope-restrict
Primary evidence: Retrieval +53%, hallucination -40%
Tradeoff acknowledgment: Latency +110ms, SLA breach for Executive segment
Risk mitigation: P95 latency monitoring by segment
Rollback trigger: If P95 latency >2.5s for >10% of queries in any 2-hour window, roll back immediately
Decision confidence: Medium-High
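The six memo fields can be captured as a structured record so a verify cell can check completeness. A sketch: the class and function names are assumptions, not the course's actual verify-cell API.

```python
from dataclasses import dataclass, fields

@dataclass
class DecisionMemo:
    recommendation: str
    primary_evidence: str
    tradeoff_acknowledgment: str
    risk_mitigation: str
    rollback_trigger: str
    decision_confidence: str

def verify(memo: DecisionMemo) -> bool:
    """All six fields must be non-empty, mirroring what a verify cell would check."""
    return all(getattr(memo, f.name).strip() for f in fields(memo))

memo = DecisionMemo(
    recommendation="Ramp with scope-restrict",
    primary_evidence="Retrieval +53%, hallucination -40%",
    tradeoff_acknowledgment="Latency +110ms; SLA breach for Executive segment",
    risk_mitigation="P95 latency monitoring by segment",
    rollback_trigger="P95 >2.5s for >10% of queries in any 2-hour window",
    decision_confidence="Medium-High",
)
print(verify(memo))   # → True
```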

Write your own Decision Memo using the Ship Decision Framework

Base version (all students)
1. Complete the missing rubric dimensions: confidence level and risk containment
2. Write a complete Decision Memo in the structured six-field format
3. Run the verify cell to confirm all components are present
Extend version (DS/Eng)
4. Compute how much each segment improved or regressed
5. Calculate the SLA breach rate by segment
6. Estimate the daily cost impact at 100k queries/day
7. Write a monitoring query spec
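A starting point for extend tasks 5 and 6. Note the assumptions: the deck reports breach rates but not the underlying SLA cut-off or unit cost, so the 2.0s threshold and $0.002/query baseline below are illustrative placeholders.

```python
SLA_S = 2.0   # assumed SLA latency threshold in seconds

def sla_breach_rate(latencies_s: list[float], sla_s: float = SLA_S) -> float:
    """Fraction of queries slower than the SLA threshold."""
    return sum(t > sla_s for t in latencies_s) / len(latencies_s)

# Task 6: daily cost impact of the +10% cost-per-query change at 100k queries/day.
baseline_cost_per_query = 0.002          # USD, assumed baseline
queries_per_day = 100_000
extra_daily_cost = baseline_cost_per_query * 0.10 * queries_per_day
print(f"${extra_daily_cost:.2f}/day")    # → $20.00/day
```

Swap in the real per-query cost and SLA threshold from your own system before drawing conclusions.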

Ship decision memo with rollback condition

Decision Matrix
Metric v1 v2
Retrieval baseline +53%
Hallucination 15% 9%
Latency baseline +110ms
Rollback Trigger
P95 latency >2.5s for >10% of queries → roll back immediately

Segment breakdown informs scope decision

Segment Breakdown
Segment Retrieval Gain SLA Breach Rate
PM +62% 10%
DS +48% 9%
Engineering +35% 11%
Executive +8% 18%
Scope-Restrict Decision
Deploy to PM and DS segments where quality benefit justifies latency cost. Hold for Executive segment where quality gain is negligible.
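In serving code, the scope-restrict decision reduces to a segment gate. A minimal sketch with assumed names; the segment set comes straight from the breakdown above.

```python
# Segments where the quality benefit justifies the latency cost.
# Executive is held back: +8% retrieval gain, 18% SLA breach.
V2_SEGMENTS = {"PM", "DS"}

def model_version(segment: str) -> str:
    """Route a user's segment to the deployed model version."""
    return "v2" if segment in V2_SEGMENTS else "v1"

print(model_version("PM"))         # → v2
print(model_version("Executive"))  # → v1
```

Keeping the gate in one place (a set plus a routing function) makes later scope expansions a one-line change.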

Teams fail from single-metric anchoring, infinite evidence gathering, and no rollback plan

Failure Mode What Happens Prevention
Single-metric anchoring Ship on retrieval improvement alone, ignore latency regression Assess all blocking + optimization metrics
Infinite evidence gathering Hold indefinitely, claiming insufficient data An evidence-sufficiency rubric specifies when to act
No rollback plan Deploy without trigger, debate for days during incident Pre-committed rollback condition

Apply the decision rubric to these scenarios

Scenario 1
v2 shows: retrieval +40%, SQL +2%, latency +200ms (SLA breach 18%), cost +15%. All must-pass metrics pass. Which decision: Ship, Ramp, or Hold? Why?
Scenario 2
v2 improves average quality by 8%, but high-frequency power users (10% of base) experience -5% quality regression. What framework concept helps, and what would you recommend?
Scenario 3
Rollback trigger fires 2 days after deploying to 20% of users: "P95 latency >2.5s for >10% of queries." What do you do, and why does having a pre-committed condition matter?

Decision rubric matrix maps evidence to justified decisions

Decision Blocking Metrics (must pass) Optimization Metrics (tracked for improvement) Risk Containment
Ship All pass, high confidence Meaningful improvement, acceptable tradeoffs Monitoring active, rollback plan tested
Ramp All pass, moderate confidence OK Directional improvement, some uncertainty OK Monitoring active, rollback trigger defined
Hold Insufficient evidence or mixed signals Insufficient evidence or mixed signals Insufficient evidence or mixed signals
Scope-restrict Segment-specific pass/fail Segment-specific pass/fail Segment-specific pass/fail
Tradeoff transparency template:
"We recommend [decision] because [primary improvement]. The risk is [regression]. We mitigate via [monitoring/rollback trigger/scope restriction]."

Next: Translating evaluation signals to product actions

What happens when rollback triggers fire and signal-metric divergence appears?
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 1 | aianalystlab.ai