Evaluation cadence

Week 6 Lesson 5 AI Evals for Product Dev
Shane Butler AI Analyst Lab

In Lesson 6.1, you made a ship/ramp/hold decision for the v2 change — who in your organization has the authority to override that recommendation, and what evidence threshold would justify an override?

Who has override authority?
?
What evidence threshold?
?

Six months post-ship, judge drift goes undetected — nobody owns calibration, nobody checks Kappa, production incident forces emergency remediation

  • Month 0 — ship v2; judge agreement 0.72 (good)
  • Month 2 — drift to 0.64; degrading, no detection
  • Month 4 — drift to 0.48; failing, no detection
  • Month 6 — production incident; emergency fix

"Monthly deep dive got cancelled twice, never rescheduled."

The gap isn't technical capability — it's organizational discipline: who owns evaluation activities, how often they occur, who acts on findings, and what happens when work gets deferred

Who owns it?
How often?
Who acts on findings?
What happens when deferred?

Three-tier framework balances thoroughness (catching real issues) with practicality (not drowning teams in meetings) — weekly checks, monthly deep dives, quarterly audits

  • Weekly — 52×/year · 30-45 min · acute issues
  • Monthly — 12×/year · 90 min · chronic degradation (metrics declining)
  • Quarterly — 4×/year · half day · strategic alignment (right metrics)

Weekly quality check (30-45 min) — automated metric review, dashboard review, incident triage — flags issues for deeper analysis or immediate rollback

  • Automated metric review (15 min) — latency, cost, error rates, blocking metrics (must-pass thresholds for deploys)
  • Dashboard review (20 min) — any metrics outside acceptable ranges?
  • Incident triage (10 min) — new failure modes or user reports requiring investigation
Owner: AI Reliability Lead + on-call engineer
Output: flag/escalate threshold violations
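The "blocking metrics" idea above can be sketched as a simple threshold check. This is a minimal illustration, not the lesson's actual tooling: the metric names, threshold values, and `review_metrics` helper are all assumptions.

```python
# Hypothetical blocking thresholds for the weekly automated metric review.
# Any metric over its limit blocks deploys and gets flagged for escalation.
BLOCKING_THRESHOLDS = {
    "p95_latency_ms": 2000,      # must stay at or below 2 s
    "cost_per_query_usd": 0.05,  # unit-economics guardrail
    "error_rate": 0.02,          # 2% hard ceiling
}

def review_metrics(current: dict) -> list[str]:
    """Return the blocking metrics that exceed their threshold."""
    return [
        name for name, limit in BLOCKING_THRESHOLDS.items()
        if current.get(name, float("inf")) > limit
    ]

violations = review_metrics(
    {"p95_latency_ms": 1850, "cost_per_query_usd": 0.07, "error_rate": 0.01}
)
print(violations)  # ['cost_per_query_usd'] -> flag/escalate
```

A missing metric is treated as a violation (`float("inf")` default), so a broken pipeline surfaces in the same weekly check rather than silently passing.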

Monthly deep dive (90 min) — metric trend analysis, judge re-calibration check, segment health review, evaluation debt review — updates thresholds or escalates risks

  • Metric trend analysis (30 min) — improving, stable, or degrading?
  • Judge calibration check (30 min) — sample 20 real user queries; check whether judge scores match human reviewers
  • Segment health review (20 min) — which user groups or query types show quality degradation?
  • Evaluation debt review (10 min) — what eval work was deferred, and what's the current risk?
Owner: cross-functional team (PM, DS, ML Eng, AI Reliability Lead)
Output: metric updates, risk escalation
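The judge calibration check boils down to comparing judge verdicts against human labels on a sample and computing an agreement statistic such as Cohen's Kappa (the metric the drift example tracks). A minimal sketch, with made-up labels and the 0.65 warning cutoff read off the drift example rather than any standard:

```python
def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge)
    labels = set(judge) | set(human)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical sample of 20 real user queries scored by judge and humans.
judge = ["pass"] * 14 + ["fail"] * 6
human = ["pass"] * 12 + ["fail"] * 6 + ["pass"] * 2

kappa = cohens_kappa(judge, human)
if kappa < 0.65:  # warning threshold assumed from the drift example
    print(f"kappa={kappa:.2f}: flag for re-calibration")
```

Raw percent agreement here is 80%, but Kappa is only ~0.52 because much of that agreement is expected by chance; that correction is exactly why the monthly check tracks Kappa rather than accuracy.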

Quarterly regression audit (half day) — regression suite refresh, full judge re-calibration, experiment backlog prioritization, evaluation system health — major updates or roadmap adjustments

  • Regression suite refresh (90 min) — update the test case library: remove outdated examples, add new failure modes
  • Full judge re-calibration (90 min) — test the judge on 200+ reserved examples: true positive rate (correctly catches bad SQL), true negative rate (correctly approves good SQL); fix any bias
  • Experiment backlog prioritization (60 min) — which improvements are worth testing next?
  • Evaluation system health (60 min) — are we measuring what matters? Aligned with user value?
Owner: extended team + leadership (VP Eng, Head of Product)
Output: strategic changes, tooling updates
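The full re-calibration step reduces to two rates over the reserved, human-labeled examples. A sketch under assumed conventions ("bad" = SQL the judge should reject; the function name and data shape are illustrative):

```python
def recalibration_report(results: list[tuple[str, str]]) -> dict[str, float]:
    """results: (human_label, judge_verdict) pairs, labels 'good'/'bad'."""
    tp = sum(1 for h, j in results if h == "bad" and j == "bad")    # caught bad SQL
    fn = sum(1 for h, j in results if h == "bad" and j == "good")   # missed bad SQL
    tn = sum(1 for h, j in results if h == "good" and j == "good")  # approved good SQL
    fp = sum(1 for h, j in results if h == "good" and j == "bad")   # rejected good SQL
    return {
        "true_positive_rate": tp / (tp + fn),  # correctly catches bad SQL
        "true_negative_rate": tn / (tn + fp),  # correctly approves good SQL
    }

# Hypothetical reserved set: 10 bad and 10 good examples.
results = ([("bad", "bad")] * 8 + [("bad", "good")] * 2
           + [("good", "good")] * 9 + [("good", "bad")] * 1)
print(recalibration_report(results))
```

A gap between the two rates is the "bias" the audit fixes: a judge with high TNR but sagging TPR has drifted toward rubber-stamping, which is exactly the silent-failure mode in the drift example.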

RACI matrix — every evaluation activity has exactly one Accountable person and at least one Responsible person — blank Accountable cells mean nobody owns the decision

Activity                   | AI Rel Lead | DS   | PM   | ML Eng | VP Eng
Weekly quality check       | A, R        | I    | I    | C      | I
Monthly judge calibration  | C           | A, R | I    | R      | I
Monthly segment health     | C           | R    | A, R | C      | I
Quarterly system health    | C           | R    | R    | R      | A

Legend: R = does the work, A = owns the outcome, C = provides input, I = receives updates

Evaluation debt register makes invisible risk visible — tracks deferred work with risk level, target paydown date, and actual paydown date

Deferred work                                                                      | Risk   | Deferred   | Target paydown | Status
Judge re-calibration (SQL correctness): verify judge still accurate                | Medium | 2025-01-15 | 2025-02-28     | Open
New failure mode instrumentation (multi-turn drift)                                | High   | 2025-01-08 | 2025-02-15     | Open
Stale ground truth queries (Q4 2024 schema): test cases use old database structure | Medium | 2025-01-22 | 2025-02-28     | Open
Cap at 10 active entries — if adding an 11th, either pay down an existing one or explicitly deprioritize the lowest-risk item
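The cap rule can be sketched as a small data structure. This version resolves an 11th entry by automatically deprioritizing the lowest-risk open item; in practice the team would make that call explicitly, and the field names here are assumptions:

```python
MAX_ACTIVE = 10
RISK_RANK = {"High": 2, "Medium": 1, "Low": 0}

def add_entry(register: list[dict], entry: dict) -> list[dict]:
    """Add a deferred-work entry, enforcing the 10-entry cap by dropping
    the lowest-risk item (standing in for 'explicitly deprioritize')."""
    register = register + [entry]
    if len(register) > MAX_ACTIVE:
        lowest = min(register, key=lambda e: RISK_RANK[e["risk"]])
        register.remove(lowest)
    return register

# Hypothetical register already at the cap: nine High items plus one Low.
register = [{"work": f"deferred item {i}", "risk": "High"} for i in range(9)]
register.append({"work": "stale docs check", "risk": "Low"})

register = add_entry(register, {"work": "judge re-calibration", "risk": "Medium"})
print(len(register))  # 10 -> still capped
```

The point of the cap isn't the mechanism, it's forcing a decision: every add beyond ten makes some deferral visible instead of letting the register grow silently.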

Stop-and-predict: Which tier is most likely to get skipped when deadlines are tight, and what's the first symptom the team will notice if they skip that tier for 2 months?

  • Tier 1 — weekly (30-45 min)
  • Tier 2 — monthly (90 min)
  • Tier 3 — quarterly (half day)

What symptom surfaces after 2 months?

Cadence calendar shows 6.2 hours/month for a 5-person team — well under 10% of team capacity — weekly checks, monthly deep dives, and quarterly audits are operationally sustainable

Tier                       | Time per occurrence | Occurrences/month | Monthly total
Weekly quality check       | 45 min              | ~4                | 180 min
Monthly deep dive          | 90 min              | 1                 | 90 min
Quarterly regression audit | 300 min (half day)  | 1/3               | 100 min

Total per month: 370 min ≈ 6.2 hours
Per-person load: 6.2 hours ÷ 5 people ≈ 1.2 hours/month per person — well under 10% of capacity, operationally sustainable
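The arithmetic behind the calendar, reproduced as a quick check (team size and durations are the lesson's figures):

```python
# Monthly minutes per tier: duration × occurrences per month.
monthly_minutes = {
    "weekly quality check":       45 * 4,   # ~4 weeks/month -> 180 min
    "monthly deep dive":          90 * 1,   # 90 min
    "quarterly regression audit": 300 / 3,  # half day amortized -> 100 min
}

total_min = sum(monthly_minutes.values())   # 370 minutes
hours = total_min / 60                      # ~6.2 hours/month for the team
per_person = hours / 5                      # ~1.2 hours/month per person

print(f"{total_min:.0f} min = {hours:.1f} h/month, {per_person:.1f} h/person")
```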

Monthly deep dive would have caught judge drift in Month 2 — Kappa 0.64 triggers 'Full re-calibration needed' debt entry — quarterly audit fixes it before 6 months of silent failures

WITHOUT CADENCE
  • Month 0 — judge agreement 0.72 (good)
  • Month 2 — 0.64 (degrading); no detection
  • Month 6 — 0.48 (failing); production incident

WITH CADENCE
  • Month 0 — judge agreement 0.72 (good)
  • Month 1 — check: 0.70 (still acceptable)
  • Month 2 — 0.64 (warning threshold); debt entry added
  • Month 3 — quarterly audit; fixed
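The timeline implies a threshold ladder for the monthly check. A sketch with cutoffs read off this example (0.70 acceptable, 0.64 warning, 0.48 failing); the exact values are assumptions, not a published standard:

```python
def kappa_status(kappa: float) -> str:
    """Map a judge-agreement Kappa to the action implied by the timeline."""
    if kappa >= 0.65:
        return "acceptable"
    if kappa >= 0.50:
        return "warning: add debt entry, fix at next quarterly audit"
    return "failing: escalate immediately"

for month, kappa in [(0, 0.72), (1, 0.70), (2, 0.64), (6, 0.48)]:
    print(f"Month {month}: kappa={kappa:.2f} -> {kappa_status(kappa)}")
```

The "with cadence" path works because the warning band exists: 0.64 isn't yet an incident, but it's bad enough to create a tracked debt entry that the next quarterly audit must either pay down or explicitly deprioritize.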

Build a three-document Evaluation Cadence Package — cadence calendar, RACI matrix, evaluation debt register — operational governance that keeps evaluation from becoming invisible work

  • Cadence calendar — 11 activities across 3 tiers with duration, owner, and output
  • RACI matrix — ownership for 6 activities × 5 roles (exactly one A per activity)
  • Evaluation debt register — 3 deferred-work entries with risk levels and dates

Time estimate — base: 25 min | extend: +15 min (escalation playbook + skip simulation)

Skipping Tier 2 (the monthly deep dive) accumulates silent risk — chronic degradation is invisible to weekly checks, and quarterly audits catch it too late — the judge drift example shows 12 weeks of incorrectly approved SQL

WEEKLY CHECKS CATCH — acute issues: system down, error spike
GAP ZONE — chronic degradation (the monthly deep dive catches this):
  • Judge quality declining (approving more bad SQL over time)
  • Test coverage shrinking for certain user groups
QUARTERLY AUDITS CATCH — strategic drift: wrong metrics

Apply cadence reasoning to judge drift, RACI conflicts, and evaluation debt risk translation — knowledge check scenarios

⚖️
Knowledge check: Judge drift undetected
A team has a weekly 30-minute quality check and a quarterly 4-hour regression audit. But no monthly deep dive. Nothing in the middle. Three months after shipping v2, they discover their SQL correctness judge has drifted from Kappa 0.72 (good agreement with human reviewers) to Kappa 0.48 (poor agreement — failing threshold). What Tier-2 activity would have caught this drift earlier? How often should it run?

Three-tier pyramid — weekly checks (52×/year, acute issues), monthly deep dives (12×/year, chronic degradation), quarterly audits (4×/year, strategic alignment)

  • Quarterly (4×/year) — regression suite, full re-calibration, backlog, system health · owner: extended team + leadership
  • Monthly (12×/year) — trend analysis, judge check, segment review, debt review · owner: cross-functional team
  • Weekly (52×/year) — automated metrics, dashboard, incident triage · owner: AI Reliability Lead + on-call

RACI ownership assigns one Accountable person per tier — debt register tracks deferred work with risk levels

RACI ownership
  • Tier 1 — AI Reliability Lead accountable
  • Tier 2 — Data Scientist accountable (most activities)
  • Tier 3 — VP Engineering accountable
Debt register
  • Risk levels: Low / Medium / High
  • Cap at 10 active entries

Next: Communicating AI product impact

How to translate evaluation evidence into executive-ready impact stories, product-ready decision memos, and engineering-ready debugging reports
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 5 | aianalystlab.ai