Ground truth + regression

Week 3 Lesson 2 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

In Lesson 1.3, you performed failure discovery on v0 traces and built a failure taxonomy through bottom-up clustering. Which failure category had the highest user impact, and why did you prioritize it for evaluation coverage?

Write your answer before moving forward.

"90% compared to what?"

Metric: SQL correctness at 90%
Question: "90% compared to what?"

Half the "expected answers" came from a wrong model

Previous model: 18% error rate
Copied to suite: 200 test cases
Tests report: everything passing
Production: wrong numbers
Ground truth is the foundation of every metric you build in this course.

Match the right ground truth source to the right task

Source                               | Cost          | Delay         | Use for
Execution oracles (automated checks) | Near zero     | Seconds       | SQL correctness, tool outputs
Expert annotation                    | 2-3 min/trace | Hours to days | Narrative quality, subjective judgments
User feedback signals                | None          | Days to weeks | Production validation (noisy)
Use execution oracles wherever you can. Reserve expert annotation for tasks requiring human judgment.

Different SQL producing the same correct result gets full credit

Two queries
Query A: uses the quarter column
Query B: uses explicit date filters
Oracle check: run both queries and compare result sets
aieval.oracles.score_sql() → pass if results match

Reserve annotation budget for tasks requiring human judgment

Don't annotate (use an oracle instead): SQL correctness, tool outputs, retrieval results
Do annotate (requires human judgment): narrative quality, faithfulness, completeness
Oracle: seconds, $0.001. Annotation: 2-3 min, $1-3 labor cost.

The suite is the blocking gate in your release process

System v2 (new version) → Regression suite (run all tests) → 100% pass? → Decision gate: ship or triage
Start: 10-20 high-impact cases → Promotion cycles → Evolves with product
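The blocking-gate logic above can be sketched as a small function. A minimal sketch: `run_case` and the case list are hypothetical stand-ins for your own suite and runner; the point illustrated is that any failure blocks the release until triaged.

```python
# Sketch of a regression suite used as a blocking release gate.
# `cases` and `run_case` are hypothetical placeholders for your
# own test cases and per-case runner.

def run_suite(cases, run_case):
    """Run every test case; return (all_passed, failing_cases)."""
    failures = [case for case in cases if not run_case(case)]
    return len(failures) == 0, failures

def release_decision(cases, run_case):
    all_pass, failures = run_suite(cases, run_case)
    if all_pass:
        return "SHIP"
    # Anything less than 100% pass blocks the release.
    return f"TRIAGE: {len(failures)} failing case(s)"
```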

Four gates prevent bloat and maintain relevance

  • Reproducible — with current instrumentation
  • Distinct — represents failure mode not already covered
  • Verified ground truth — from appropriate source
  • Justifies cost — severity or frequency warrants maintenance
PROMOTE: novel SQL logic error affecting revenue queries
REJECT: third variation of missing-semicolon syntax error
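The four gates can be made explicit as a checklist. A minimal sketch, with hypothetical field names; the part being illustrated is that a candidate must pass all four gates to be promoted.

```python
# Sketch of the four promotion gates as an explicit checklist.
# Field names are hypothetical; a candidate is promoted only if
# every gate passes.

from dataclasses import dataclass

@dataclass
class Candidate:
    reproducible: bool           # reproduces with current instrumentation
    distinct: bool               # failure mode not already covered
    verified_ground_truth: bool  # ground truth from an appropriate source
    justifies_cost: bool         # severity or frequency warrants maintenance

def promote(c: Candidate) -> str:
    gates = [c.reproducible, c.distinct, c.verified_ground_truth, c.justifies_cost]
    return "PROMOTE" if all(gates) else "REJECT"
```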

Fill coverage gaps with structured test case engineering

1. Identify uncovered segments — from your eval surface map, the matrix of what your AI does and where it can fail (L1.2)
2. Define attribute combinations — e.g. (ambiguous name, trend question, complex query)
3. Generate queries — an LLM constrained by those attributes
4. Filter for realism — "Would a real user ask this?"

Which 3 failure categories should you prioritize for initial regression suite coverage, and why?

You have the AI Data Analyst's failure taxonomy from Week 1 (8+ distinct categories). Limited annotation budget.
Write your answer before running the next cell.

Oracle evaluation is result-based, not string-based

Code demonstration
aieval.oracles.score_sql(generated_sql, expected_results, db_path, tolerance_config)
Translation: Run both queries against the database, compare results, allow for small rounding differences
Generated SQL: incomplete data (missing months)
Expected SQL: complete data
Result: FAIL — result sets do not match
Different SQL producing the same correct results gets full credit.
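The idea behind the oracle can be implemented from scratch (this is also the first Extend exercise below). A minimal sketch, not the `aieval` implementation: run both queries against the same SQLite database and compare result sets, allowing small rounding differences on numeric cells. The tolerance value and the cell types handled (numbers and text) are assumptions.

```python
# From-scratch sketch of result-based SQL comparison: run both
# queries against the same database and compare result sets, with
# a numeric tolerance for rounding. Assumes cells are numbers or
# text; not the aieval library's actual implementation.

import math
import sqlite3

def results_match(db_path, generated_sql, expected_sql, tol=1e-6):
    con = sqlite3.connect(db_path)
    try:
        got = con.execute(generated_sql).fetchall()
        want = con.execute(expected_sql).fetchall()
    finally:
        con.close()
    if len(got) != len(want):
        return False  # e.g. missing months -> FAIL
    # Sort rows so logically equal but differently ordered results match.
    for row_g, row_w in zip(sorted(got), sorted(want)):
        if len(row_g) != len(row_w):
            return False
        for a, b in zip(row_g, row_w):
            if isinstance(a, float) or isinstance(b, float):
                if not math.isclose(a, b, abs_tol=tol):
                    return False  # numeric cells differ beyond tolerance
            elif a != b:
                return False
    return True
```

Different SQL producing the same result rows passes; a query returning incomplete data (fewer rows, e.g. missing months) fails on the length check before any cell comparison.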

Calibration on 20 shared examples was the highest-impact quality investment

Initial: agreement 0.45 (poor)
Rubric refinement: 3 rounds of calibration
After calibration: agreement 0.78 (good)
Cohen's kappa measures how consistently two reviewers agree beyond chance (0 = chance-level agreement, 1 = perfect)
Rubric dimension | Question                           | Scoring
Faithfulness     | Does the narrative match the data? | Pass/Fail
Completeness     | Does it cover the key findings?    | Pass/Fail
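Cohen's kappa is also the second Extend exercise below, and it is short enough to implement from scratch. A minimal sketch for two annotators with categorical (e.g. pass/fail) labels: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label rates.

```python
# From-scratch Cohen's kappa for two annotators with categorical
# labels (e.g. "pass"/"fail" per rubric dimension).

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    # Observed agreement: fraction of traces with the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal label rate.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    if p_e == 1:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1 - p_e)
```

Raw agreement can look high just because one label dominates; kappa subtracts that chance component, which is why it is the score to check after annotating your 3 narrative traces.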

Trace A is promoted; Traces B and C are rejected

Trace   | Reproducible | Distinct | Verified GT | Justifies cost    | Decision
Trace A | ✓            | ✓        | ✓ (oracle)  | ✓ (high severity) | PROMOTE
Trace B | —            | ✗        | —           | —                 | REJECT — fails distinctiveness
Trace C | —            | —        | —           | ✗ (user error)    | REJECT — fails cost justification

Build a regression suite promotion plan with verified ground truth

Base version (20-25 min)
1. Complete oracle comparison for 3 SQL traces
2. Annotate 3 narrative traces, check your agreement score (Kappa)
3. Evaluate 4 candidate traces (promote exactly 2)
4. List 3 coverage gaps from evaluation surface map
5. Apply realism rubric to 5 synthetic queries
6. Fill Regression Suite Promotion Plan template
Extend version (+10-15 min)
1. Implement SQL result comparison from scratch
2. Implement Cohen's Kappa from scratch
3. Define attribute combinations, generate synthetic queries
4. Implement realism filter

Regression suite coverage table showing before/after promotion

Before
SQL logic errors: 2
Retrieval failures: 0
Hallucinations: 1
SQL syntax errors: 3
Missing context: 0
After
SQL logic errors: 5 (+3)
Retrieval failures: 3 (+3)
Hallucinations: 2 (+1)
SQL syntax errors: 3
Missing context: 2 (+2)
What I built — a regression suite promotion plan with 7 new test cases covering SQL correctness (oracle-verified) and narrative quality (expert-annotated), plus a synthetic data generation recipe for retrieval failure gaps.

Promoting duplicates bloats the suite; unrealistic synthetics create false baselines

Bloat from duplicates: 30 test cases, all missing-semicolon variations → 15-minute run time → team stops running the suite
Unrealistic synthetics: LLM judge pass rate 95% vs. production user satisfaction 70% → test vs. reality gap

Wrong ground truth sources and skipped realism filtering

Scenario 1
AI generates SQL that runs successfully but returns the wrong result. What ground truth source should you use, and why?
Scenario 2
Team has 200-case regression suite, 15 min run time. PM asks: "Why can't we run this in CI on every commit?" What's the problem, what do you recommend?
Scenario 3
You generate 10 synthetic queries, skip realism filtering, promote all 10. Three months later: LLM judge 95% pass, production user satisfaction 70%. What went wrong, how do you fix it?

Regression Suite Construction Workflow

1. Ground Truth Sources: execution oracles (automated checks, $0.001, seconds) • expert annotation (2-3 min/trace) • user feedback (noisy)
2. Regression Suite Gate: reproducible • distinct • verified ground truth • justifies cost
3. Evolution Loops: synthetic generation (coverage gaps → dimension tuples → realism filter) • production discovery

Next: Deriving evaluation signals from available ground truth

You'll learn how to extract evaluation signals when you don't have perfect ground truth — using proxy metrics, partial verification, and signal degradation analysis.
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 2 | aianalystlab.ai