Prioritization

Lesson 6.3
Week 6 · AI Product Decisions
Shane Butler · AI Analyst Lab

Knowing WHAT to fix is not the same as knowing which fixes to do FIRST.

Findings from Lesson 6.2
  • SQL errors
  • Retrieval failures
  • Cost regressions
  • Policy violations
→ ?
Ranked Backlog
  • 1. ???
  • 2. ???
  • 3. ???

Eight failure modes, all backed by evidence — which one do you fix first?

  • SQL Errors: 12% frequency · High severity
  • Retrieval Failures: 18% frequency · Medium severity
  • Policy Violations: 0.5% frequency · Catastrophic severity
  • Formatting Issues: 15% frequency · Low severity
  • Cost Regressions: 8% frequency · Medium severity
  • Hallucinations: 3% frequency · High severity
  • Latency Spikes: 5% frequency · Low severity
  • Citation Errors: 7% frequency · Medium severity

Prioritization uses seven dimensions plus acceptance criteria

Work Item (failure mode or improvement)
→ 7-Dimension Matrix
  Impact: User Harm · Frequency · Business Criticality · Confidence
  Velocity: Fixability · Time-to-Learn · Reversibility
→ Priority Score + Acceptance Criteria
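
A minimal sketch of this structure as code: one record per work item, with the seven dimension scores and its acceptance criteria and evidence attached. The class and field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    name: str
    # Impact dimensions (scored 1-5, each backed by a cited artifact)
    user_harm: int
    frequency: int
    business_criticality: int
    confidence: int
    # Velocity dimensions (scored 1-5)
    fixability: int
    time_to_learn: int
    reversibility: int
    # "Done" definition: specific, measurable, evidence-backed
    acceptance_criteria: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)  # e.g. "Failure Taxonomy (W1)"
```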

User harm severity measures impact magnitude, including silent failures

Score Label Example
5 Catastrophic Safety or compliance violation
4 High Data loss or wrong decision
3 Medium Workflow blocked temporarily
2 Minor Extra step required
1 Cosmetic Formatting inconsistency
Silent failures count
Errors users don't notice but affect downstream decisions
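
If you score harm during triage, the rubric can live as a simple lookup so notes can reference a label instead of a bare number. A purely illustrative structure:

```python
# User-harm rubric from the table above. Silent failures are scored on
# downstream impact, not on whether the user noticed the error.
USER_HARM_SCALE = {
    5: ("Catastrophic", "Safety or compliance violation"),
    4: ("High", "Data loss or wrong decision"),
    3: ("Medium", "Workflow blocked temporarily"),
    2: ("Minor", "Extra step required"),
    1: ("Cosmetic", "Formatting inconsistency"),
}
```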

Frequency, business criticality, and confidence form the evidence base

  • Frequency: 18% of queries (from analysis of quality drivers)
  • Business Criticality: affects PM/DS segments (60% of revenue)
  • Confidence: High, backed by the failure taxonomy, quality analysis, and A/B test results
All three require evidence citations — you cannot assign a high score without pointing to a prior artifact

Fixability, time-to-learn, and reversibility control iteration velocity

Fast Iteration
AI instruction tweak → Pre-production testing (hours) → Instant rollback
✓ High velocity, low risk
Slow Iteration
Retrieval architecture change (how the AI finds context) → Production metrics (2 weeks) → Migration rollback (days)
⚠ Low velocity, high risk
Velocity vs. Risk tradeoff

Acceptance criteria define "done" — no perpetual refinement

Bad
Work Item: Improve retrieval quality
Status: In Progress (Sprint 3)
Good
Work Item: Retrieval finds the right context in the top 3 results 75% of the time (up from 68%) on metric definition queries
Status: Done — validated on test set
Vague direction vs. Specific threshold

Segment-aware and tail-risk weighting prevent frequency bias

Overall Frequency: 10%
⚠ Misleading — hides segment variance
Segment Failure Rate vs. Overall
PM 8% 0.8x
DS 40% 4.0x
Eng 3% 0.3x
Exec 12% 1.2x
Tail Risks
A failure at 0.5% frequency with catastrophic harm still requires manual upweighting
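
One way to make the tail-risk rule explicit is a guard that flags catastrophic-harm, low-frequency items for manual review rather than letting a frequency-heavy formula bury them. A sketch on plain dicts (the 1-5 scores follow the scales above; the cutoff of frequency <= 2 is an assumption):

```python
def flag_tail_risks(items: list[dict]) -> list[dict]:
    """Return items that need manual upweighting despite a low frequency score."""
    return [
        item for item in items
        if item["user_harm"] == 5 and item["frequency"] <= 2
    ]

backlog = [
    {"name": "Policy violations", "user_harm": 5, "frequency": 1},
    {"name": "Formatting issues", "user_harm": 1, "frequency": 4},
]
print([item["name"] for item in flag_tail_risks(backlog)])  # ['Policy violations']
```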

Retrieval quality (18% frequency, medium harm) vs. policy compliance (0.8% frequency, catastrophic harm) — which ships first?

Work Item Frequency User Harm Fixability Time-to-Learn Reversibility
Improve retrieval quality for metric definition queries 18% Medium High Fast High
Fix policy compliance failures on PII (Personally Identifiable Information) queries 0.8% Catastrophic Medium Slow Low
Before looking at the framework's ranking — which would YOU prioritize first and why?

Retrieval quality ranks #1 with priority score 28.5 — high fixability and fast learning win

Rank Work Item Priority Score
1 Retrieval quality (metric def queries) 28.5
2 Hallucination rate (revenue queries) 26.5
3 SQL logic errors (multi-table joins) 24.0
4 Policy compliance (PII queries) 22.0
5 Cost guardrails (Eng segment) 20.5
Weighted formula
(user_harm × 2) + (frequency × 1) + ... + (time_to_learn × 1.5)
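
A minimal scoring sketch of the weighted sum. Only three weights (user harm ×2, frequency ×1, time-to-learn ×1.5) are shown above; the others here are placeholders, so the resulting scores will not match the lesson's numbers, though the ordering logic is the same. Dimension scores for the two items are taken from the scoring matrix later in this lesson.

```python
WEIGHTS = {
    "user_harm": 2.0,             # from the slide
    "frequency": 1.0,             # from the slide
    "business_criticality": 1.0,  # assumed
    "confidence": 1.0,            # assumed
    "fixability": 1.0,            # assumed
    "time_to_learn": 1.5,         # from the slide
    "reversibility": 1.0,         # assumed
}

def priority_score(scores: dict[str, int]) -> float:
    """Weighted sum of the seven 1-5 dimension scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

backlog = {
    "Retrieval quality (metric def queries)": {
        "user_harm": 3, "frequency": 5, "business_criticality": 4,
        "confidence": 5, "fixability": 5, "time_to_learn": 5, "reversibility": 5,
    },
    "Policy compliance (PII queries)": {
        "user_harm": 5, "frequency": 1, "business_criticality": 5,
        "confidence": 4, "fixability": 3, "time_to_learn": 2, "reversibility": 2,
    },
}

ranked = sorted(backlog.items(), key=lambda kv: priority_score(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{priority_score(scores):5.1f}  {name}")
```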

Acceptance criteria specify metric threshold, segment, and evidence

  • What metric must improve by how much? → Retrieval finds the right context in the top 3 results 75% of the time (up from 68%)
  • What segment must see relief? → PM and Data Science user segments (60% of queries)
  • What evidence closes this ticket? → Validation on a test set of 50 queries with known-correct answers shows accuracy >= 75%
Specific, measurable, evidence-backed — no perpetual refinement
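
Because the criterion is a threshold on a fixed test set, the "done" check can be executed rather than debated. A sketch, where retrieve is a stand-in for your retrieval call and both names are hypothetical:

```python
THRESHOLD = 0.75  # "75% of the time (up from 68%)"

def top3_accuracy(test_set, retrieve) -> float:
    """Fraction of queries whose known-correct context appears in the top 3 results."""
    hits = sum(
        1 for query, correct_doc in test_set
        if correct_doc in retrieve(query, k=3)
    )
    return hits / len(test_set)

def acceptance_met(test_set, retrieve) -> bool:
    # test_set: 50 (query, known-correct context) pairs from the evaluation set
    return top3_accuracy(test_set, retrieve) >= THRESHOLD
```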

Segment check reveals Data Science users experience retrieval failures at 1.4x the overall rate

Segment Failure Rate vs. Overall
Overall 18% 1.0x (baseline)
PM 22% 1.2x
DS 25% 1.4x
Eng 8% 0.4x
Exec 15% 0.8x
Data Science segment is disproportionately affected — consider DS-specific acceptance criteria or early rollout
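
The segment check itself is a one-liner: each segment's failure rate divided by the overall rate. A sketch using the numbers above (the ~1.2x attention threshold is an assumption, not a lesson-specified cutoff):

```python
overall_rate = 0.18
segment_rates = {"PM": 0.22, "DS": 0.25, "Eng": 0.08, "Exec": 0.15}

multipliers = {seg: rate / overall_rate for seg, rate in segment_rates.items()}
for seg, mult in sorted(multipliers.items(), key=lambda kv: kv[1], reverse=True):
    flag = "  <- disproportionately affected" if mult >= 1.2 else ""
    print(f"{seg:5s} {segment_rates[seg]:.0%}  {mult:.1f}x{flag}")
```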

Build a ranked iteration backlog with acceptance criteria

  • Pull up your evaluation outputs from Weeks 1-5
  • Identify 8-10 improvement candidates
  • Score each across 7 dimensions (1-5, cite evidence)
  • Compute weighted priority scores
  • Rank descending by score
  • Segment check for #1 item
  • Write acceptance criteria for top 5 items
  • Export ranked backlog table
Base version: 20-25 min. Extend version: +10-15 min for sensitivity analysis and heatmap visualization.

Prioritization matrix heatmap showing tradeoffs across 7 dimensions

Work Item User Harm Frequency Biz Crit Confidence Fixability Time-to-Learn Reversibility
Retrieval quality 3 5 4 5 5 5 5
Hallucination rate 5 3 5 4 3 3 4
SQL logic errors 4 4 3 4 3 4 4
Policy compliance 5 1 5 4 3 2 2
Cost guardrails 3 3 4 5 4 3 4
High (4-5) Medium (3) Low (1-2)
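
For the extend version, a minimal heatmap sketch: rows are work items, columns are the seven dimensions, cell color encodes the 1-5 score. It assumes numpy and matplotlib are available; the scores are copied from the matrix above.

```python
import matplotlib.pyplot as plt
import numpy as np

dimensions = ["User Harm", "Frequency", "Biz Crit", "Confidence",
              "Fixability", "Time-to-Learn", "Reversibility"]
items = ["Retrieval quality", "Hallucination rate", "SQL logic errors",
         "Policy compliance", "Cost guardrails"]
scores = np.array([
    [3, 5, 4, 5, 5, 5, 5],
    [5, 3, 5, 4, 3, 3, 4],
    [4, 4, 3, 4, 3, 4, 4],
    [5, 1, 5, 4, 3, 2, 2],
    [3, 3, 4, 5, 4, 3, 4],
])

fig, ax = plt.subplots(figsize=(8, 3.5))
im = ax.imshow(scores, cmap="YlGnBu", vmin=1, vmax=5)
ax.set_xticks(range(len(dimensions)))
ax.set_xticklabels(dimensions, rotation=30, ha="right")
ax.set_yticks(range(len(items)))
ax.set_yticklabels(items)
for i in range(len(items)):
    for j in range(len(dimensions)):
        ax.text(j, i, scores[i, j], ha="center", va="center")
fig.colorbar(im, ax=ax, label="Score (1-5)")
plt.tight_layout()
plt.show()
```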

Knowledge Check

Question 1: You write an acceptance criterion: "Improve task completion rate." A reviewer says this is not specific enough. Rewrite it as a well-formed acceptance criterion using the format from this lesson.

Question 2: Your #1 ranked item is "Fix retrieval failures on schema lookups." You perform a segment check and discover this issue affects 35% of Data Science user queries but only 5% of PM user queries. Data Science users generate 40% of revenue. Does this change your prioritization? Why or why not?

Ignoring tail risks because frequency is low — 0.5% policy violations get deprioritized

Mistake
Backlog ranking:
...
8. Policy violations (0.5% frequency)
Low frequency → deprioritized
Consequence
⚠ Compliance Escalation
Safety Violation in Production
Catastrophic harm realized

Stakeholder says "prioritize B — it affects way more users" — what's missing from this reasoning?

Scenario
Item A: Fix hallucinations (fabricated data) on revenue queries (0.8% frequency, catastrophic harm, low fixability, slow learning)
Item B: Improve SQL success on simple lookups (15% frequency, minor harm, high fixability, fast learning)
💬 Stakeholder:
"Prioritize B — it affects way more users."
Question: What is missing from this reasoning? How would you respond using the prioritization framework?

Evidence → Scoring Matrix → Ranked Backlog with Acceptance Criteria

Inputs
Failure Taxonomy (W1) · Quality Analysis (W4) · A/B Test Results (W5)
Scoring Matrix
Rows = work items · Columns = 7 dimensions · Priority Score = Sum of (Dimension × Weight)
Ranked Backlog
Sorted by score · Top 3 items have acceptance criteria · Segment check for #1
This turns "is this good enough?" into "have we addressed all priority-1 and priority-2 items with acceptance criteria met?" — evidence-based prioritization, not opinion-based.

Next: Ownership model, the AI Reliability Lead, and decision rights

👤
Ownership
🔀
Decision Rights
👥
AI Reliability Lead
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 3 | aianalystlab.ai