Evaluation cadence

Week 6 Lesson 5 AI Evals for Product Dev
Shane Butler AI Analyst Lab

In Lesson 6.1, you made a ship/ramp/hold decision for the v2 change — who in your organization has the authority to override that recommendation, and what evidence threshold would justify an override?

Who has override authority?
?
What evidence threshold?
?

Six months post-ship, judge drift goes undetected — nobody owns calibration, nobody checks Kappa, production incident forces emergency remediation

  • Month 0 — ship v2; judge agreement 0.72 (good)
  • Month 2 — drift to 0.64; degrading, no detection
  • Month 4 — drift to 0.48; failing, no detection
  • Month 6 — production incident; emergency fix

"Monthly deep dive got cancelled twice, never rescheduled."

The gap isn't technical capability — it's organizational discipline: who owns evaluation activities, how often they occur, who acts on findings, and what happens when work gets deferred

Who owns it?
How often?
Who acts on findings?
What happens when deferred?

Three-tier framework balances thoroughness (catching real issues) with practicality (not drowning teams in meetings) — weekly checks, monthly deep dives, quarterly audits

  • Weekly — 52×/year · 30-45 min · acute issues
  • Monthly — 12×/year · 90 min · chronic degradation (metrics declining)
  • Quarterly — 4×/year · half day · strategic alignment (right metrics)

Weekly quality check (30-45 min) — automated metric review, dashboard review, incident triage — flags issues for deeper analysis or immediate rollback

  • Automated metric review (15 min) — latency, cost, error rates, blocking metrics (must-pass thresholds for deploys)
  • Dashboard review (20 min) — any metrics outside acceptable ranges?
  • Incident triage (10 min) — new failure modes or user reports requiring investigation
Owner: AI Reliability Lead + on-call engineer
Output: flag/escalate threshold violations
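The "blocking metrics" idea above can be sketched as a simple threshold check. This is a minimal illustration, not the lesson's actual tooling: the metric names, threshold values, and `review_metrics` helper are all assumptions.

```python
# Hypothetical blocking thresholds for the weekly automated metric review.
# Any metric over its limit blocks deploys and gets flagged for escalation.
BLOCKING_THRESHOLDS = {
    "p95_latency_ms": 2000,      # must stay at or below 2 s
    "cost_per_query_usd": 0.05,  # unit-economics guardrail
    "error_rate": 0.02,          # 2% hard ceiling
}

def review_metrics(current: dict) -> list[str]:
    """Return the blocking metrics that exceed their threshold."""
    return [
        name for name, limit in BLOCKING_THRESHOLDS.items()
        if current.get(name, float("inf")) > limit
    ]

violations = review_metrics(
    {"p95_latency_ms": 1850, "cost_per_query_usd": 0.07, "error_rate": 0.01}
)
print(violations)  # ['cost_per_query_usd'] -> flag/escalate
```

A missing metric is treated as a violation (`float("inf")` default), so a broken pipeline surfaces in the same weekly check rather than silently passing.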

Monthly deep dive (90 min) — metric trend analysis, judge re-calibration check, segment health review, evaluation debt review — updates thresholds or escalates risks

  • Metric trend analysis (30 min) — improving, stable, or degrading?
  • Judge calibration check (30 min) — sample 20 real user queries; check whether judge scores match human reviewers
  • Segment health review (20 min) — which user groups or query types show quality degradation?
  • Evaluation debt review (10 min) — what eval work was deferred, and what's the current risk?
Owner: cross-functional team (PM, DS, ML Eng, AI Reliability Lead)
Output: metric updates, risk escalation
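The judge calibration check boils down to comparing judge verdicts against human labels on a sample and computing an agreement statistic such as Cohen's Kappa (the metric the drift example tracks). A minimal sketch, with made-up labels and the 0.65 warning cutoff read off the drift example rather than any standard:

```python
def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge)
    labels = set(judge) | set(human)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical sample of 20 real user queries scored by judge and humans.
judge = ["pass"] * 14 + ["fail"] * 6
human = ["pass"] * 12 + ["fail"] * 6 + ["pass"] * 2

kappa = cohens_kappa(judge, human)
if kappa < 0.65:  # warning threshold assumed from the drift example
    print(f"kappa={kappa:.2f}: flag for re-calibration")
```

Raw percent agreement here is 80%, but Kappa is only ~0.52 because much of that agreement is expected by chance; that correction is exactly why the monthly check tracks Kappa rather than accuracy.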

Quarterly regression audit (half day) — regression suite refresh, full judge re-calibration, experiment backlog prioritization, evaluation system health — major updates or roadmap adjustments

  • Regression suite refresh (90 min) — update the test case library: remove outdated examples, add new failure modes
  • Full judge re-calibration (90 min) — test the judge on 200+ reserved examples: true positive rate (correctly catches bad SQL), true negative rate (correctly approves good SQL); fix any bias
  • Experiment backlog prioritization (60 min) — which improvements are worth testing next?
  • Evaluation system health (60 min) — are we measuring what matters? Aligned with user value?
Owner: extended team + leadership (VP Eng, Head of Product)
Output: strategic changes, tooling updates
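The full re-calibration step reduces to two rates over the reserved, human-labeled examples. A sketch under assumed conventions ("bad" = SQL the judge should reject; the function name and data shape are illustrative):

```python
def recalibration_report(results: list[tuple[str, str]]) -> dict[str, float]:
    """results: (human_label, judge_verdict) pairs, labels 'good'/'bad'."""
    tp = sum(1 for h, j in results if h == "bad" and j == "bad")    # caught bad SQL
    fn = sum(1 for h, j in results if h == "bad" and j == "good")   # missed bad SQL
    tn = sum(1 for h, j in results if h == "good" and j == "good")  # approved good SQL
    fp = sum(1 for h, j in results if h == "good" and j == "bad")   # rejected good SQL
    return {
        "true_positive_rate": tp / (tp + fn),  # correctly catches bad SQL
        "true_negative_rate": tn / (tn + fp),  # correctly approves good SQL
    }

# Hypothetical reserved set: 10 bad and 10 good examples.
results = ([("bad", "bad")] * 8 + [("bad", "good")] * 2
           + [("good", "good")] * 9 + [("good", "bad")] * 1)
print(recalibration_report(results))
```

A gap between the two rates is the "bias" the audit fixes: a judge with high TNR but sagging TPR has drifted toward rubber-stamping, which is exactly the silent-failure mode in the drift example.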

RACI matrix — every evaluation activity has exactly one Accountable person and at least one Responsible person — blank Accountable cells mean nobody owns the decision

Activity                   | AI Rel Lead | DS   | PM   | ML Eng | VP Eng
Weekly quality check       | A, R        | I    | I    | C      | I
Monthly judge calibration  | C           | A, R | I    | R      | I
Monthly segment health     | C           | R    | A, R | C      | I
Quarterly system health    | C           | R    | R    | R      | A

Legend: R = does the work, A = owns the outcome, C = provides input, I = receives updates

Evaluation debt register makes invisible risk visible — tracks deferred work with risk level, target paydown date, and actual paydown date

Deferred work                                                                      | Risk   | Deferred   | Target paydown | Status
Judge re-calibration (SQL correctness): verify judge still accurate                | Medium | 2025-01-15 | 2025-02-28     | Open
New failure mode instrumentation (multi-turn drift)                                | High   | 2025-01-08 | 2025-02-15     | Open
Stale ground truth queries (Q4 2024 schema): test cases use old database structure | Medium | 2025-01-22 | 2025-02-28     | Open
Cap at 10 active entries — if adding an 11th, either pay down an existing one or explicitly deprioritize the lowest-risk item
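The cap rule can be sketched as a small data structure. This version resolves an 11th entry by automatically deprioritizing the lowest-risk open item; in practice the team would make that call explicitly, and the field names here are assumptions:

```python
MAX_ACTIVE = 10
RISK_RANK = {"High": 2, "Medium": 1, "Low": 0}

def add_entry(register: list[dict], entry: dict) -> list[dict]:
    """Add a deferred-work entry, enforcing the 10-entry cap by dropping
    the lowest-risk item (standing in for 'explicitly deprioritize')."""
    register = register + [entry]
    if len(register) > MAX_ACTIVE:
        lowest = min(register, key=lambda e: RISK_RANK[e["risk"]])
        register.remove(lowest)
    return register

# Hypothetical register already at the cap: nine High items plus one Low.
register = [{"work": f"deferred item {i}", "risk": "High"} for i in range(9)]
register.append({"work": "stale docs check", "risk": "Low"})

register = add_entry(register, {"work": "judge re-calibration", "risk": "Medium"})
print(len(register))  # 10 -> still capped
```

The point of the cap isn't the mechanism, it's forcing a decision: every add beyond ten makes some deferral visible instead of letting the register grow silently.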

Stop-and-predict: Which tier is most likely to get skipped when deadlines are tight, and what's the first symptom the team will notice if they skip that tier for 2 months?

  • Tier 1 — weekly (30-45 min)
  • Tier 2 — monthly (90 min)
  • Tier 3 — quarterly (half day)

What symptom surfaces after 2 months?

Cadence calendar shows 6.2 hours/month for a 5-person team — well under 10% of team capacity — weekly checks, monthly deep dives, and quarterly audits are operationally sustainable

Tier                       | Time per occurrence | Occurrences/month | Monthly total
Weekly quality check       | 45 min              | ~4                | 180 min
Monthly deep dive          | 90 min              | 1                 | 90 min
Quarterly regression audit | 300 min (half day)  | 1/3               | 100 min

Total per month: 370 min ≈ 6.2 hours
Per-person load: 6.2 hours ÷ 5 people ≈ 1.2 hours/month per person — well under 10% of capacity, operationally sustainable
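The arithmetic behind the calendar, reproduced as a quick check (team size and durations are the lesson's figures):

```python
# Monthly minutes per tier: duration × occurrences per month.
monthly_minutes = {
    "weekly quality check":       45 * 4,   # ~4 weeks/month -> 180 min
    "monthly deep dive":          90 * 1,   # 90 min
    "quarterly regression audit": 300 / 3,  # half day amortized -> 100 min
}

total_min = sum(monthly_minutes.values())   # 370 minutes
hours = total_min / 60                      # ~6.2 hours/month for the team
per_person = hours / 5                      # ~1.2 hours/month per person

print(f"{total_min:.0f} min = {hours:.1f} h/month, {per_person:.1f} h/person")
```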

Monthly deep dive would have caught judge drift in Month 2 — Kappa 0.64 triggers 'Full re-calibration needed' debt entry — quarterly audit fixes it before 6 months of silent failures

WITHOUT CADENCE
  • Month 0 — judge agreement 0.72 (good)
  • Month 2 — 0.64 (degrading); no detection
  • Month 6 — 0.48 (failing); production incident

WITH CADENCE
  • Month 0 — judge agreement 0.72 (good)
  • Month 1 — check: 0.70 (still acceptable)
  • Month 2 — 0.64 (warning threshold); debt entry added
  • Month 3 — quarterly audit; fixed
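The timeline implies a threshold ladder for the monthly check. A sketch with cutoffs read off this example (0.70 acceptable, 0.64 warning, 0.48 failing); the exact values are assumptions, not a published standard:

```python
def kappa_status(kappa: float) -> str:
    """Map a judge-agreement Kappa to the action implied by the timeline."""
    if kappa >= 0.65:
        return "acceptable"
    if kappa >= 0.50:
        return "warning: add debt entry, fix at next quarterly audit"
    return "failing: escalate immediately"

for month, kappa in [(0, 0.72), (1, 0.70), (2, 0.64), (6, 0.48)]:
    print(f"Month {month}: kappa={kappa:.2f} -> {kappa_status(kappa)}")
```

The "with cadence" path works because the warning band exists: 0.64 isn't yet an incident, but it's bad enough to create a tracked debt entry that the next quarterly audit must either pay down or explicitly deprioritize.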

Build a three-document Evaluation Cadence Package — cadence calendar, RACI matrix, evaluation debt register — operational governance that keeps evaluation from becoming invisible work

  • Cadence calendar — 11 activities across 3 tiers with duration, owner, and output
  • RACI matrix — ownership for 6 activities × 5 roles (exactly one A per activity)
  • Evaluation debt register — 3 deferred-work entries with risk levels and dates

Time estimate — base: 25 min | extend: +15 min (escalation playbook + skip simulation)

Skipping Tier 2 (the monthly deep dive) accumulates silent risk — chronic degradation is invisible to weekly checks, and quarterly audits catch it too late — the judge drift example shows 12 weeks of incorrectly approved SQL

WEEKLY CHECKS CATCH — acute issues: system down, error spike
GAP ZONE — chronic degradation (the monthly deep dive catches this):
  • Judge quality declining (approving more bad SQL over time)
  • Test coverage shrinking for certain user groups
QUARTERLY AUDITS CATCH — strategic drift: wrong metrics

Apply cadence reasoning to judge drift, RACI conflicts, and evaluation debt risk translation — knowledge check scenarios

⚖️
Knowledge check: Judge drift undetected
A team has a weekly 30-minute quality check and a quarterly 4-hour regression audit. But no monthly deep dive. Nothing in the middle. Three months after shipping v2, they discover their SQL correctness judge has drifted from Kappa 0.72 (good agreement with human reviewers) to Kappa 0.48 (poor agreement — failing threshold). What Tier-2 activity would have caught this drift earlier? How often should it run?

Three-tier pyramid — weekly checks (52×/year, acute issues), monthly deep dives (12×/year, chronic degradation), quarterly audits (4×/year, strategic alignment)

  • Quarterly (4×/year) — regression suite, full re-calibration, backlog, system health · owner: extended team + leadership
  • Monthly (12×/year) — trend analysis, judge check, segment review, debt review · owner: cross-functional team
  • Weekly (52×/year) — automated metrics, dashboard, incident triage · owner: AI Reliability Lead + on-call

RACI ownership assigns one Accountable person per tier — debt register tracks deferred work with risk levels

RACI ownership
  • Tier 1 — AI Reliability Lead accountable
  • Tier 2 — Data Scientist accountable (most activities)
  • Tier 3 — VP Engineering accountable
Debt register
  • Risk levels: Low / Medium / High
  • Cap at 10 active entries

Next: Communicating AI product impact

How to translate evaluation evidence into executive-ready impact stories, product-ready decision memos, and engineering-ready debugging reports
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 5 | aianalystlab.ai