Ownership model

Week 6 Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Who on your team would sign off on the ship decision, and what evaluation evidence would they need?

Who signs off?
  • PM: owns the ship decision
  • ML Engineer: owns model quality
  • Product Eng: owns deployment
What evidence do they need?
  • Metric specs
  • Experiment results
  • Judge calibration report
  • Monitoring plan

Without clear ownership, evaluation systems decay silently

At Launch
  • Judges calibrated
  • Regression suite complete
  • Monitoring assigned
  • Thresholds documented
Six Months Later
  • Judges uncalibrated
  • Regression suite stale
  • No one watching dashboard
  • Thresholds lost in Slack

Every evaluation activity needs exactly one Accountable owner

Activity            PM   ML Engineer   Domain Expert   QA
Judge calibration   I    A             C               R
Ship decision       A    C             C               I
Critical rule: exactly one 'A' per row
R (Responsible): Does the work
A (Accountable): Owns the outcome
C (Consulted): Provides input
I (Informed): Receives updates
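The "exactly one A per row" rule is easy to check mechanically. A minimal sketch (the matrix contents mirror the two rows above; the function name `validate_raci` is illustrative):

```python
# Hypothetical RACI matrix: activity -> {role: letter}.
RACI = {
    "Judge calibration": {"PM": "I", "ML Engineer": "A", "Domain Expert": "C", "QA": "R"},
    "Ship decision":     {"PM": "A", "ML Engineer": "C", "Domain Expert": "C", "QA": "I"},
}

def validate_raci(matrix):
    """Return the activities that violate the 'exactly one A per row' rule."""
    return [
        activity for activity, roles in matrix.items()
        if list(roles.values()).count("A") != 1
    ]

print(validate_raci(RACI))  # → [] (every activity has exactly one Accountable owner)
```

A row with zero or two A's shows up immediately, which is the failure mode the "Everyone is Responsible" slide warns about.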

The debt register makes invisible work visible and trackable

Item ID  | Description                                            | Risk | Owner       | Target | Status
DEBT-001 | 4 judges uncalibrated (6 months)                       | H    | ML Engineer | Mar 15 | Open
DEBT-002 | Regression suite incomplete (3 query types not tested) | M    | Product Eng | Mar 30 | Open
DEBT-003 | Thresholds only in Slack                               | M    | PM          | Feb 28 | Open
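The register above is just structured data with owners and dates, which means it can be queried instead of re-read. A sketch using the three items from the table (the year on the short dates is assumed from DEBT-001's full target date of 2026-03-15):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DebtItem:
    item_id: str
    description: str
    risk: str          # "H" / "M" / "L"
    owner: str
    target: date
    status: str = "Open"

register = [
    DebtItem("DEBT-001", "4 judges uncalibrated (6 months)", "H", "ML Engineer", date(2026, 3, 15)),
    DebtItem("DEBT-002", "Regression suite incomplete", "M", "Product Eng", date(2026, 3, 30)),
    DebtItem("DEBT-003", "Thresholds only in Slack", "M", "PM", date(2026, 2, 28)),
]

def overdue(items, today):
    """Open items whose target date has passed — the monthly deep dive's first question."""
    return [d.item_id for d in items if d.status == "Open" and d.target < today]

print(overdue(register, date(2026, 3, 1)))  # → ['DEBT-003']
```

Running the overdue check at each monthly review is what keeps the register from becoming a graveyard.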

Judges drift — quarterly audit is the minimum for production AI

Week 1
Weekly quality review: metrics, judge drift, failure modes
Month 1
Monthly deep dive: 30-day trends, debt register
Quarter 1
Quarterly audit: judge recalibration, regression coverage
Repeat
Cycle continues

Clear escalation prevents alert fatigue and ensures critical issues reach decision-makers

Medium: Regression suite coverage <80% → escalate within 1 week
High: Judge drift >10% → escalate within 48 hours
High: Primary metric degrades >5% over 7 days → escalate within 24 hours
Critical: Safety violation in production → escalate immediately (within 1 hour)

Your SQL judge was calibrated six months ago — what happened to its false positive rate?

At launch: Judge calibrated for GPT-4 (60% correct SQL)
Two months ago: System upgraded to GPT-4o (78% correct SQL)
Prediction: What happened to false positive rate (flagging correct SQL as incorrect)?
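One way to reason about the prediction: even if the judge's per-class error rates stay exactly where they were at calibration, the base rate shift changes what the judge's flags mean. A sketch with illustrative error rates (the 10% false positive rate and 15% false negative rate are assumptions, not measured values):

```python
def false_positive_share(p_correct, fpr=0.10, fnr=0.15):
    """Among outputs the judge flags as incorrect, what fraction are
    actually correct SQL? Assumes the judge's per-class error rates
    (fpr, fnr — illustrative values) are unchanged since calibration."""
    false_pos = p_correct * fpr                # correct SQL wrongly flagged
    true_pos = (1 - p_correct) * (1 - fnr)     # incorrect SQL correctly flagged
    return false_pos / (false_pos + true_pos)

print(round(false_positive_share(0.60), 3))  # at calibration: 60% correct SQL → 0.15
print(round(false_positive_share(0.78), 3))  # after GPT-4o upgrade: 78% correct SQL → 0.294
```

With these illustrative numbers, the share of flags that are false positives nearly doubles purely from the base rate shift, before accounting for any drift in the judge's behavior on the new model's output style.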

Six months after launch: four uncalibrated judges, stale regression suite, no one watching the dashboard

Component            | Finding                     | Status
4 LLM judges         | Last calibrated at launch   | TPR/FPR unknown
Regression suite     | 80 examples, no updates     | 3 new query types not covered
Monitoring dashboard | Exists, but no one assigned | No one watching
Metric thresholds    | Documented in Slack thread  | New team members can't find them

"Everyone is Responsible" means no one is Accountable

Current State
Judge calibration: R = ?, A = none
Desired State
Judge calibration: R = QA, A = ML Eng
Every row needs exactly one A

DEBT-001: high risk, ML Engineer assigned, target March 15

DEBT-001
Description: 4 LLM judges uncalibrated since launch (6 months ago)
Risk: High
Impact if unaddressed: False positive rate unknown; may be flagging correct outputs; ship decisions based on unreliable evidence
Owner: ML Engineer
Target Date: 2026-03-15
Status: Open
Notes: Schedule 2-day calibration sprint

Weekly quality review, monthly deep dive, quarterly audit

Weekly
Attendees: PM, ML Eng, Domain Expert
Review: Metrics dashboard, signs of judge drift (accuracy changes)
Monthly
Attendees: PM, ML Eng, Product Eng, Domain Expert
Review: 30-day trends, debt register
Quarterly
Attendees: All roles
Review: Judge Report Card, regression coverage, thresholds

Safety violation = critical, escalate immediately to PM + ML Engineer

Trigger                                 | Severity | Escalate From  | Escalate To | Response Time
Safety metric fails (policy violation)  | Critical | On-call Eng    | PM + ML Eng | Immediate (1 hour)
Primary metric degrades >5% over 7 days | High     | ML Engineer    | PM          | 24 hours
Judge drift >10%                        | High     | ML Engineer    | PM          | 48 hours
Experiment shows segment regression     | Medium   | Data Scientist | PM          | Before ship decision
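The trigger table is a fixed priority order: check the most severe condition first, fire the first match. A sketch of that routing logic, with thresholds taken from the slides (the signal names and dict shape are illustrative):

```python
def escalation(signals):
    """Map monitoring signals to (severity, response window).
    Checks triggers in descending severity; returns the first match,
    or None if nothing fires. Signal keys are hypothetical."""
    if signals.get("safety_violation"):
        return ("Critical", "immediate (1 hour)")
    if signals.get("primary_metric_drop_pct_7d", 0) > 5:
        return ("High", "24 hours")
    if signals.get("judge_drift_pct", 0) > 10:
        return ("High", "48 hours")
    if signals.get("regression_coverage_pct", 100) < 80:
        return ("Medium", "1 week")
    return None  # nothing to escalate

print(escalation({"judge_drift_pct": 12}))  # → ('High', '48 hours')
```

Encoding the ladder as code (or alerting-rule config) is what prevents the "dashboard exists, no one watches it" failure: the severity and response window are decided once, not re-debated per incident.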

Design ownership for your AI Data Analyst — RACI, debt register, review cadences, escalation triggers

1. Complete RACI matrix
10 activities, exactly one A per row
2. Identify eval debt
5+ unmaintained components from Weeks 2-5
3. Build debt register
Risk, owner, target date for each item
4. Define review cadences + escalation
3 review types, 3 escalation triggers

RACI matrix + evaluation debt register with 5 items tracked

RACI Matrix
Judge calibration R=QA, A=ML Eng
Monitoring R=Product Eng, A=PM
Ship decision A=PM
Debt Register
H Uncalibrated judges → ML Eng (Mar 15)
M Stale regression suite → Product Eng (Mar 30)
M Thresholds in Slack → PM (Feb 28)

Everyone is Responsible, no one is Accountable — evaluation decays silently

  • Everyone is R, no one is A → judges uncalibrated, regression suite stale, no one owns it
  • Debt register never reviewed → register becomes a graveyard, debt compounds for a year
  • No escalation path for alerts → dashboard exists, no one watches it, issues discovered via customer escalations

Can one person be both R and A? Is uncalibrated judge low risk or high risk? When do you escalate?

Question 1
Your RACI lists ML Engineer as both R and A for judge calibration, and Domain Expert as A for threshold setting. What is wrong?
Question 2
Debt item: Judge uncalibrated for 6 months. Classify as low risk because judge was accurate at launch and metrics look stable. Is this correct?
Question 3
Monitoring shows SQL success rate dropped from 74% to 68% over 5 days. Escalation trigger: >5% degradation over 7 days. Do you escalate now or wait?

Three layers: RACI Matrix, Debt Register, Review & Escalation Flow

1
RACI Matrix
Who does what — exactly one A per activity
2
Evaluation Debt Register
What needs fixing — tracked items with owners and target dates
3
Review & Escalation Flow
When/how it's reviewed — weekly quality review → monthly deep dive → escalation triggers

Next: Evaluation cadence and governance

How do you balance evaluation rigor with iteration speed?
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 4 | aianalystlab.ai