Ground truth + regression

Week 3 Lesson 2 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

In Lesson 1.3, you performed failure discovery on v0 traces and built a failure taxonomy through bottom-up clustering. Which failure category had the highest user impact, and why did you prioritize it for evaluation coverage?

Write your answer before moving forward.

"90% compared to what?"

Metric: SQL correctness at 90%
Question: "90% compared to what?"

Half the "expected answers" came from a wrong model

Previous model: 18% error rate
Copied to suite: 200 test cases
Tests report: everything passing
Production: wrong numbers
Ground truth is the foundation of every metric you build in this course.

Match the right ground truth source to the right task

Source                               | Cost          | Delay         | Use for
Execution oracles (automated checks) | Near zero     | Seconds       | SQL correctness, tool outputs
Expert annotation                    | 2-3 min/trace | Hours to days | Narrative quality, subjective judgments
User feedback signals                | None          | Days to weeks | Production validation (noisy)
Use execution oracles wherever you can. Reserve expert annotation for tasks requiring human judgment.

Different SQL producing the same correct result gets full credit

Two queries
Query A: uses the quarter column
Query B: uses explicit date filters
Oracle check: run both queries and compare result sets
aieval.oracles.score_sql() → pass if results match

Reserve annotation budget for tasks requiring human judgment

Don't annotate (use an oracle instead): SQL correctness, tool outputs, retrieval results
Do annotate (requires human judgment): narrative quality, faithfulness, completeness
Oracle: seconds, $0.001. Annotation: 2-3 min, $1-3 labor cost.

The suite is the blocking gate in your release process

System v2 (new version) → Regression suite (run all tests) → 100% pass? → Decision gate: ship or triage
Start: 10-20 high-impact cases → Promotion cycles → Evolves with product
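The blocking-gate logic above can be sketched as a small function. A minimal sketch: `run_case` and the case list are hypothetical stand-ins for your own suite and runner; the point illustrated is that any failure blocks the release until triaged.

```python
# Sketch of a regression suite used as a blocking release gate.
# `cases` and `run_case` are hypothetical placeholders for your
# own test cases and per-case runner.

def run_suite(cases, run_case):
    """Run every test case; return (all_passed, failing_cases)."""
    failures = [case for case in cases if not run_case(case)]
    return len(failures) == 0, failures

def release_decision(cases, run_case):
    all_pass, failures = run_suite(cases, run_case)
    if all_pass:
        return "SHIP"
    # Anything less than 100% pass blocks the release.
    return f"TRIAGE: {len(failures)} failing case(s)"
```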

Four gates prevent bloat and maintain relevance

  • Reproducible — with current instrumentation
  • Distinct — represents failure mode not already covered
  • Verified ground truth — from appropriate source
  • Justifies cost — severity or frequency warrants maintenance
PROMOTE: novel SQL logic error affecting revenue queries
REJECT: third variation of missing-semicolon syntax error
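The four gates can be made explicit as a checklist. A minimal sketch, with hypothetical field names; the part being illustrated is that a candidate must pass all four gates to be promoted.

```python
# Sketch of the four promotion gates as an explicit checklist.
# Field names are hypothetical; a candidate is promoted only if
# every gate passes.

from dataclasses import dataclass

@dataclass
class Candidate:
    reproducible: bool           # reproduces with current instrumentation
    distinct: bool               # failure mode not already covered
    verified_ground_truth: bool  # ground truth from an appropriate source
    justifies_cost: bool         # severity or frequency warrants maintenance

def promote(c: Candidate) -> str:
    gates = [c.reproducible, c.distinct, c.verified_ground_truth, c.justifies_cost]
    return "PROMOTE" if all(gates) else "REJECT"
```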

Fill coverage gaps with structured test case engineering

1. Identify uncovered segments — from your eval surface map, the matrix of what your AI does and where it can fail (L1.2)
2. Define attribute combinations — e.g. (ambiguous name, trend question, complex query)
3. Generate queries — an LLM constrained by those attributes
4. Filter for realism — "Would a real user ask this?"

Which 3 failure categories should you prioritize for initial regression suite coverage, and why?

You have the AI Data Analyst's failure taxonomy from Week 1 (8+ distinct categories). Limited annotation budget.
Write your answer before running the next cell.

Oracle evaluation is result-based, not string-based

Code demonstration
aieval.oracles.score_sql(generated_sql, expected_results, db_path, tolerance_config)
Translation: Run both queries against the database, compare results, allow for small rounding differences
Generated SQL: incomplete data (missing months)
Expected SQL: complete data
Result: FAIL — result sets do not match
Different SQL producing the same correct results gets full credit.
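The idea behind the oracle can be implemented from scratch (this is also the first Extend exercise below). A minimal sketch, not the `aieval` implementation: run both queries against the same SQLite database and compare result sets, allowing small rounding differences on numeric cells. The tolerance value and the cell types handled (numbers and text) are assumptions.

```python
# From-scratch sketch of result-based SQL comparison: run both
# queries against the same database and compare result sets, with
# a numeric tolerance for rounding. Assumes cells are numbers or
# text; not the aieval library's actual implementation.

import math
import sqlite3

def results_match(db_path, generated_sql, expected_sql, tol=1e-6):
    con = sqlite3.connect(db_path)
    try:
        got = con.execute(generated_sql).fetchall()
        want = con.execute(expected_sql).fetchall()
    finally:
        con.close()
    if len(got) != len(want):
        return False  # e.g. missing months -> FAIL
    # Sort rows so logically equal but differently ordered results match.
    for row_g, row_w in zip(sorted(got), sorted(want)):
        if len(row_g) != len(row_w):
            return False
        for a, b in zip(row_g, row_w):
            if isinstance(a, float) or isinstance(b, float):
                if not math.isclose(a, b, abs_tol=tol):
                    return False  # numeric cells differ beyond tolerance
            elif a != b:
                return False
    return True
```

Different SQL producing the same result rows passes; a query returning incomplete data (fewer rows, e.g. missing months) fails on the length check before any cell comparison.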

Calibration on 20 shared examples was the highest-impact quality investment

Initial: agreement 0.45 (poor)
Rubric refinement: 3 rounds of calibration
After calibration: agreement 0.78 (good)
Cohen's kappa measures how consistently two reviewers agree beyond chance (0 = chance-level agreement, 1 = perfect)
Rubric dimension | Question                           | Scoring
Faithfulness     | Does the narrative match the data? | Pass/Fail
Completeness     | Does it cover the key findings?    | Pass/Fail
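Cohen's kappa is also the second Extend exercise below, and it is short enough to implement from scratch. A minimal sketch for two annotators with categorical (e.g. pass/fail) labels: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label rates.

```python
# From-scratch Cohen's kappa for two annotators with categorical
# labels (e.g. "pass"/"fail" per rubric dimension).

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    # Observed agreement: fraction of traces with the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal label rate.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    if p_e == 1:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1 - p_e)
```

Raw agreement can look high just because one label dominates; kappa subtracts that chance component, which is why it is the score to check after annotating your 3 narrative traces.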

Trace A is promoted; Traces B and C are rejected

Trace   | Reproducible | Distinct | Verified GT | Justifies cost    | Decision
Trace A | ✓            | ✓        | ✓ (oracle)  | ✓ (high severity) | PROMOTE
Trace B | —            | ✗        | —           | —                 | REJECT — fails distinctiveness
Trace C | —            | —        | —           | ✗ (user error)    | REJECT — fails cost justification

Build a regression suite promotion plan with verified ground truth

Base version (20-25 min)
1. Complete oracle comparison for 3 SQL traces
2. Annotate 3 narrative traces, check your agreement score (Kappa)
3. Evaluate 4 candidate traces (promote exactly 2)
4. List 3 coverage gaps from evaluation surface map
5. Apply realism rubric to 5 synthetic queries
6. Fill Regression Suite Promotion Plan template
Extend version (+10-15 min)
1. Implement SQL result comparison from scratch
2. Implement Cohen's Kappa from scratch
3. Define attribute combinations, generate synthetic queries
4. Implement realism filter

Regression suite coverage table showing before/after promotion

Before
SQL logic errors: 2
Retrieval failures: 0
Hallucinations: 1
SQL syntax errors: 3
Missing context: 0
After
SQL logic errors: 5 (+3)
Retrieval failures: 3 (+3)
Hallucinations: 2 (+1)
SQL syntax errors: 3
Missing context: 2 (+2)
What I built — a regression suite promotion plan with 7 new test cases covering SQL correctness (oracle-verified) and narrative quality (expert-annotated), plus a synthetic data generation recipe for retrieval failure gaps.

Promoting duplicates bloats the suite; unrealistic synthetics create false baselines

Bloat from duplicates: 30 test cases, all missing-semicolon variations → 15-minute run time → team stops running the suite
Unrealistic synthetics: LLM judge pass rate 95% vs. production user satisfaction 70% → test vs. reality gap

Wrong ground truth sources and skipped realism filtering

Scenario 1
AI generates SQL that runs successfully but returns the wrong result. What ground truth source should you use, and why?
Scenario 2
Team has 200-case regression suite, 15 min run time. PM asks: "Why can't we run this in CI on every commit?" What's the problem, what do you recommend?
Scenario 3
You generate 10 synthetic queries, skip realism filtering, promote all 10. Three months later: LLM judge 95% pass, production user satisfaction 70%. What went wrong, how do you fix it?

Regression Suite Construction Workflow

1. Ground Truth Sources: execution oracles (automated checks, $0.001, seconds) • expert annotation (2-3 min/trace) • user feedback (noisy)
2. Regression Suite Gate: reproducible • distinct • verified ground truth • justifies cost
3. Evolution Loops: synthetic generation (coverage gaps → dimension tuples → realism filter) • production discovery

Next: Deriving evaluation signals from available ground truth

You'll learn how to extract evaluation signals when you don't have perfect ground truth — using proxy metrics, partial verification, and signal degradation analysis.
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 2 | aianalystlab.ai