Dataset Lifecycle

Lesson 5.2
Week 5 · Evaluation Infrastructure
Shane Butler · AI Analyst Lab

In Lesson 5.1, you built a transition failure matrix for the AI Data Analyst pipeline. Which pipeline transition had the highest failure rate, and what does that tell you about where to instrument more carefully in future versions?

Query (natural language) → Retrieval (schema fetch) → SQL (query gen) → Execution (run query) → Viz (chart render) → Narrative (explanation)
Each transition edge: Where did failures cluster?

78% offline, 52% in production — the test set drifted and no longer matched production

Pre-Launch Testing (Q1): 200 test cases from 6 months ago, dominated by simple lookup queries. Pass rate: 78%.
Production (Q3): users now ask multi-table joins and trend comparisons. Pass rate: 52%.
The drift arrow: 6 months. Your offline metric measured yesterday's distribution, not today's.

Four stages with disciplined transitions keep your evaluation datasets valid across the life of your product

  • Creation: split data first
  • Development: version everything
  • Test Set Management: refresh regularly
  • Retirement: archive with full history
Each transition has a decision gate — you don't skip stages or cycle backward

Split-first rule — partition data into train/dev/holdout before any metric development

traces_v1_full.jsonl: 1000 traces from production
  • Train (10%): 100 traces, for training the AI judge (the system that checks outputs)
  • Dev (45%): 450 traces, where you refine and test quality checks
  • Holdout (45%): 450 traces, LOCKED, read-only until final pre-ship validation
Split BEFORE any judge development or metric calibration. Otherwise you're teaching to the test: your quality metrics look good on the data you tuned against, but fail on new data.
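The split-first rule can be sketched in a few lines. This is a minimal illustration, not the lab's required code; the file and field names are assumptions.

```python
# Minimal sketch of the split-first rule: partition production traces into
# train / dev / holdout BEFORE any judge or metric development.
import random

def split_traces(traces, seed=13):
    """Shuffle once, then cut 10% / 45% / 45% (train / dev / holdout)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = traces[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.10)
    n_dev = int(n * 0.45)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

traces = [{"id": i} for i in range(1000)]  # stand-in for traces_v1_full.jsonl
train, dev, holdout = split_traces(traces)
print(len(train), len(dev), len(holdout))  # 100 450 450
```

Locking the holdout is a process rule, not a code feature: once written, the holdout file should be read-only until final pre-ship validation.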

Versioning begins here — every dataset change gets a version tag and changelog entry

v1.0 (2025-09-15): 200 traces, initial dataset from production sample
v1.1 (2025-10-10): 220 traces, added 20 multi-table join queries
v1.2 (2025-11-01): 220 traces, corrected 5 labels after judge review
Train and Dev are active — Holdout remains locked
Judge prompts refined on dev, metric thresholds calibrated on dev. Holdout never gets touched.
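One way to make "version everything" concrete is a small manifest with one changelog entry per version. The schema and helper below are illustrative assumptions, not the lesson's required format.

```python
# Sketch of dataset versioning: each change to train/dev gets a version tag,
# a changelog note, and a content hash so archived files can be audited later.
import hashlib

def record_version(manifest, version, date, n_traces, note, dataset_bytes):
    """Append a changelog entry with a content hash for later audit."""
    manifest["versions"].append({
        "version": version,
        "date": date,
        "n_traces": n_traces,
        "note": note,
        # the hash lets you verify an archived file is the exact one you used
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    })

manifest = {"name": "AI Data Analyst Test Set", "versions": []}
record_version(manifest, "v1.0", "2025-09-15", 200,
               "Initial dataset from production sample", b"<file bytes>")
record_version(manifest, "v1.1", "2025-10-10", 220,
               "Added 20 multi-table join queries", b"<new file bytes>")
print(manifest["versions"][-1]["version"])  # v1.1
```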

Rotation is the fix for inevitable staleness — refresh every N months or after major product changes

Test Set v2.3: 200 examples
  • Quarterly rotation cadence: retire the bottom 20% by age, add fresh examples from production
  • Drift detection: KS test (distribution difference check) with a p-value threshold of 0.05
  • If the test set no longer matches production, rotate immediately
Most ML models experience measurable performance degradation within their first year in production
Staleness is inevitable. Rotation is the mitigation.
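The drift check above can be sketched with scipy's two-sample KS test on a numeric "query complexity" score (e.g. number of tables referenced). The scores below are synthetic draws for illustration, not real data.

```python
# Sketch of the rotation trigger: compare query-complexity distributions
# between the test set and recent production traces with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
test_set_complexity = rng.poisson(lam=1.2, size=200)    # mostly simple lookups
production_complexity = rng.poisson(lam=2.0, size=500)  # more multi-table joins

stat, p_value = ks_2samp(test_set_complexity, production_complexity)
if p_value < 0.05:                 # the rotation threshold from the slide
    print(f"Drift detected (KS={stat:.2f}, p={p_value:.3f}) -> rotate now")
else:
    print("No significant drift; keep current test set")
```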

Archive with full provenance — reproducibility requires access to historical datasets

Retired Datasets — Read-Only Archive
Full provenance for reproducibility and audit trail
  • Regression Suite v1.8, retired 2025-Q3: product architecture change
  • Judge Calibration Set v2.0, retired 2025-Q4: distribution drift exceeded threshold
  • Initial Dev Set v1.0, retired 2025-Q2: graduated to regression suite
Do not delete: ship decisions from Q2 that reference v2.3 must remain verifiable in Q4.

Five rules govern the entire lifecycle

  • Split-first rule — partition before any development work
  • Versioning rule — every change gets a tag and changelog
  • Holdout protection rule — read-only until final pre-ship validation
  • Rotation cadence — refresh quarterly or after major product changes
  • Contamination detection — monitor whether test set still matches production patterns
These rules prevent teaching to the test and staleness
Without these disciplines, your offline metrics stop reflecting production reality

Will holdout judge agreement score be higher, lower, or the same as 0.82?

Scenario: a 150-query test set. The judge's dev-set agreement score (Kappa, how often the AI judge agrees with human experts) improved from 0.65 to 0.82 over 12 rounds of refining the judge's instructions.
Higher
Lower
Same

Split statistics: 20/90/90, stratified to match proportions across failure categories

Split      Count   Retrieval Failures   SQL Errors   Hallucinations   Clean
Train      20      15%                  25%          10%              50%
Dev        90      16%                  24%          11%              49%
Holdout    90      14%                  26%          10%              50%
Original   200     15%                  25%          10%              50%
Proportions preserved — each split matches original distribution
If train had 50% retrieval failures but original had 15%, the split didn't maintain proportions correctly
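A stratified split produces proportions like the table above by splitting within each failure category. The sketch below uses synthetic labels matching the slide's 15/25/10/50 mix; exact split sizes can be off by one or two due to rounding.

```python
# Sketch of a stratified split: shuffle and cut each failure category
# separately so every split keeps roughly the original category mix.
import random
from collections import Counter

# 200 synthetic traces with the slide's category shares (15/25/10/50%)
CATEGORIES = (["retrieval"] * 30 + ["sql"] * 50 +
              ["hallucination"] * 20 + ["clean"] * 100)

def stratified_split(labels, fracs=(0.10, 0.45, 0.45), seed=7):
    """Return (train, dev, holdout) index lists, stratified by label."""
    rng = random.Random(seed)
    splits = ([], [], [])
    by_cat = {}
    for i, cat in enumerate(labels):
        by_cat.setdefault(cat, []).append(i)
    for idxs in by_cat.values():
        rng.shuffle(idxs)
        a = int(len(idxs) * fracs[0])
        b = a + int(len(idxs) * fracs[1])
        splits[0].extend(idxs[:a])
        splits[1].extend(idxs[a:b])
        splits[2].extend(idxs[b:])
    return splits

train, dev, holdout = stratified_split(CATEGORIES)
for name, idxs in [("train", train), ("dev", dev), ("holdout", holdout)]:
    mix = Counter(CATEGORIES[i] for i in idxs)
    print(name, len(idxs), {c: round(n / len(idxs), 2) for c, n in mix.items()})
```

A plain random split would usually land close to these proportions too, but stratifying guarantees it, which matters for a small 20-trace train split.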

Distribution divergence: KS = 0.18, p = 0.03 on query complexity — test set no longer matches production

Query type          Test Set   Production (recent)
Simple lookups      55%        35%
Multi-table joins   25%        40%
Trend analyses      15%        18%
Comparisons         5%         7%
KS statistic = 0.18, p = 0.03 → Test set drifted
Production users now ask more complex queries than your test set represents
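The slide reports a KS result on a numeric complexity score; for the categorical query-type shares above, a chi-square test of homogeneity is a common alternative. Using chi-square here is my substitution for illustration, not the lesson's stated method, and the counts are derived from the slide's percentages.

```python
# Sketch: compare categorical query-type distributions between the test set
# and recent production traces with a chi-square test of homogeneity.
from scipy.stats import chi2_contingency

# Counts implied by the slide's shares: 200 test examples, 1000 prod traces
types = ["simple", "join", "trend", "comparison"]
test_counts = [110, 50, 30, 10]      # 55 / 25 / 15 / 5 %
prod_counts = [350, 400, 180, 70]    # 35 / 40 / 18 / 7 %

chi2, p, dof, _ = chi2_contingency([test_counts, prod_counts])
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4f}")
if p < 0.05:
    print("Query-type mix has drifted; test set no longer matches production")
```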

Build a Dataset Management Spec with rotation policy, contamination checks, and retirement criteria

Base Version (20-25 min)
  • Split 200 traces into train/dev/holdout
  • Version the dataset
  • Run divergence detection
  • Identify rotation candidates
  • Fill in Dataset Management Spec template
Extend Version (+10-15 min)
  • Implement divergence detection from scratch using scipy
  • Investigate split ratio effects on holdout power
  • Simulate rotation impact on metric stability
Deliverable: Dataset Management Spec with rotation policy and contamination checks
Plus divergence detection results showing which dimensions drifted

Dataset Management Spec flagged 0.18 KS divergence on query complexity

Dataset Management Spec
Dataset Name: AI Data Analyst Test Set
Current Version: v2.3
Lifecycle Stage: Test Set Management
Split Config: 10% / 45% / 45%
Rotation Policy: Quarterly or KS p < 0.05
Contamination Checks: Monthly divergence scan
Holdout Protection: Locked until final validation
Retirement Criteria: Drift threshold or architecture change
Divergence Detection
KS = 0.18, p = 0.03
Query complexity distribution has drifted significantly. Production users now favor multi-table joins (40%) vs test set (25%).
This is share-ready work — evidence that your evaluation data is trustworthy
In lesson 5.5, the launch readiness gate will ask: is the evaluation dataset current?

Never rotating your suite means measuring yesterday's distribution, not today's

Q1 (build suite): 150 examples, 70% simple lookups
Q2-Q3 (behavior shifts): multi-turn queries increase
Q4 (ship decision): 78% pass offline, 55% in production. The offline metric measured the wrong distribution.
Most ML models experience measurable performance degradation within their first year in production. Staleness is not an edge case.
Fix: Run distribution divergence quarterly and rotate when KS p-value < 0.05

Three judgment scenarios on drift, splits, and rotation

Scenario 1
Test set created 9 months ago. KS p-values (distribution difference test): 0.03 (query complexity), 0.42 (user role), 0.01 (failure category). Which dimensions drifted? What should you do?
Scenario 2
Split: 80% dev, 20% holdout. After 5 months of iteration: dev-set agreement score (Kappa) = 0.85, holdout agreement score = 0.71. Your colleague says 'the judge is bad.' What's a more likely explanation?
Scenario 3
200 test set examples. Production generates 5000 new traces per quarter. Quarterly rotation policy. How many to retire each quarter? What criteria guide which ones to retire?
Pause here and work through these judgment questions

Four-stage timeline: Creation → Development → Test Set Management → Retirement

Stage 1: Creation
Split BEFORE any development work. Train (10%), Dev (45%), Holdout (45%) — locked immediately.
Stage 2: Development
Iterate on train and dev only. Version everything (like code): v1.0 → v1.1 → v1.2. Holdout stays locked.
Stage 3: Test Set Management
Dev graduates to locked test set. Rotation begins: retire old examples, add fresh production samples. Monitor drift.
Stage 4: Retirement
Archive with full history when drift exceeds threshold or product changes. Read-only for reproducibility.
Five governing rules: split-first, version everything, protect holdout, rotate on cadence, detect contamination
Each transition has a decision gate — you don't skip stages or cycle backward

Next: Experiment Design for AI Systems (Dealing with Randomness)

Prompt A: metric ± uncertainty range
Prompt B: metric ± uncertainty range
Statistical test: is the difference real or random noise?
How do you design an A/B test when your system gives different answers each time?
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 2 | aianalystlab.ai