Regression safety net

Week 2 Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which trace fields did v1 add that make regression testing possible?

A "minor" prompt update ships — and breaks multi-table joins 20 minutes later

Code Review
Prompt update improves SQL readability
Deploy
Merge and deploy to production
20 minutes later
Slack: System won't handle multi-table joins
The new prompt template omits a critical instruction about foreign key relationships, reintroducing a failure mode fixed 6 months ago

Traditional regression testing assumes deterministic outputs. AI systems don't give you that.

Traditional Software
Run test → check exact match
Same input → same output
Pass once = always passes
AI Systems
Same input → different outputs
Suite passes run 1, fails run 2
Random variation looks like real breakage
Your regression suite might pass on one run and fail on the next purely due to variance

Start with your failure taxonomy — add one trace per "evaluator-needed" category

Failure taxonomy
From L1.3
Filter
Needs ongoing measurement
Sample
One trace per category
Expand
Add edge cases, known-correct answers
Suite
Curated collection
Pick test cases so every major failure category is represented — you're not testing everything, just the cases you care about
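The filter-then-sample step above can be sketched in a few lines. The trace records and field names below are hypothetical, standing in for whatever your L1.3 taxonomy export looks like:

```python
import random

# Hypothetical trace records from the failure taxonomy: each carries the
# failure category it exercises and whether that category still needs an
# evaluator (vs. a one-off fix that needs no ongoing measurement).
traces = [
    {"id": "t1", "category": "multi_table_join", "needs_evaluator": True},
    {"id": "t2", "category": "multi_table_join", "needs_evaluator": True},
    {"id": "t3", "category": "date_filtering",   "needs_evaluator": True},
    {"id": "t4", "category": "formatting_nit",   "needs_evaluator": False},
]

def seed_suite(traces, rng=None):
    """Filter to evaluator-needed categories, then sample one trace per
    category as the seed of the regression suite."""
    rng = rng or random.Random(0)
    by_category = {}
    for t in traces:
        if t["needs_evaluator"]:
            by_category.setdefault(t["category"], []).append(t)
    return [rng.choice(group) for group in by_category.values()]

suite = seed_suite(traces)
```

From this seed you then expand with edge cases and known-correct answers per category.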

Not all failures block deployment — distinguish blocking from optimization metrics

Metric Type   | Pass Criteria                      | Gate Behavior                     | Example
Blocking      | Must pass 100% on regression suite | Block deployment if any fail      | SQL correctness, policy compliance
Optimization  | Must not regress below threshold   | Block if regress beyond tolerance | Narrative quality, latency
Informational | No gate — logged only              | Never blocks                      | Retrieval diversity, token usage
Blocking metrics should be deterministic checks that return the same result on every run. Optimization metrics are expected to improve over time and only stop a release when they regress beyond tolerance.
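The three gate behaviors can be sketched as a small dispatch function. Names here are illustrative, not from the lesson's codebase:

```python
from enum import Enum

class MetricType(Enum):
    BLOCKING = "blocking"            # must pass 100%
    OPTIMIZATION = "optimization"    # must not regress beyond tolerance
    INFORMATIONAL = "informational"  # logged only, never gates

def gate_action(metric_type, passed, beyond_tolerance=False):
    """Map one metric result to a gate decision."""
    if metric_type is MetricType.BLOCKING:
        return "pass" if passed else "block"
    if metric_type is MetricType.OPTIMIZATION:
        if beyond_tolerance:
            return "block"           # regressed past the tolerance band
        return "pass" if passed else "warn"
    return "log"                     # informational: record and continue
```

Note the asymmetry: a blocking metric fails the gate on any failure, while an optimization metric only blocks when it drops past its tolerance band and otherwise just warns.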

Run each test case multiple times. Use pass@k for capability, reliable@k for consistency.

Single Trial
Run once → Pass or Fail
Variance looks like regression
Flaky CI
Multi-Trial (run it 3 times)
Run 3 times → aggregate
Capability (pass@3): at least 1 of 3 trials passes
Consistency (reliable@3): all 3 trials pass
Example: pass@3 = 7/10 means 7 out of 10 test cases passed at least once in 3 tries
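The two aggregates can be sketched in a few lines; the helper names below are illustrative, not a library API:

```python
def pass_at_k(trials):
    """Capability: did the case pass in at least one of k trials?"""
    return any(trials)

def reliable_at_k(trials):
    """Consistency: did the case pass in all k trials?"""
    return all(trials)

def suite_rates(results):
    """results: {case_id: [bool per trial]} -> (pass@k rate, reliable@k rate)."""
    n = len(results)
    return (sum(pass_at_k(t) for t in results.values()) / n,
            sum(reliable_at_k(t) for t in results.values()) / n)

# Three cases, three trials each: one flaky, one solid, one broken.
example = {
    "case_a": [True, False, True],    # passes pass@3, fails reliable@3
    "case_b": [True, True, True],     # passes both
    "case_c": [False, False, False],  # fails both
}
```

On the example above, pass@3 counts 2 of 3 cases and reliable@3 counts 1 of 3: the flaky case inflates capability but not consistency.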

The regression suite is not a dashboard — it's an automated decision gate

PR submitted
CI triggered
GitHub Actions runs suite
All gates pass?
No → PR blocked
Review required
Yes → PR approved
Merge allowed
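A sketch of the decision step a CI job (e.g. a GitHub Actions step) could run over the collected metric results; the field names are assumptions, not a fixed schema:

```python
def ci_gate(results):
    """results: list of dicts like
       {"name": ..., "blocking": bool, "passed": bool, "beyond_tolerance": bool}
    Returns whether the PR may merge and which metrics blocked it."""
    blocked_by = [
        r["name"] for r in results
        if (r["blocking"] and not r["passed"])                    # blocking: any fail blocks
        or (not r["blocking"] and r.get("beyond_tolerance", False))  # optimization: block past tolerance
    ]
    return {"approved": not blocked_by, "blocked_by": blocked_by}

decision = ci_gate([
    {"name": "sql_correctness", "blocking": True, "passed": False},
    {"name": "latency_p95", "blocking": False, "passed": True},
])
```

The `blocked_by` list is what you would surface in the PR review message so the failure details are linked, not just a red X.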

Add new failures when discovered in production. Retire cases that no longer add value.

Production failure discovered
Add to regression suite
Suite version incremented
3+ releases with 100% pass → Retire case
The suite is a living artifact
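The retirement rule above can be sketched as a one-liner (hypothetical helper; the 3-release streak comes from the slide):

```python
def should_retire(pass_history, streak=3):
    """pass_history: per-release pass/fail booleans for one case, newest last.
    Retire the case after `streak` consecutive releases at 100% pass."""
    return len(pass_history) >= streak and all(pass_history[-streak:])
```

Any production failure resets the streak, so a case only retires once it has stopped catching anything for several releases in a row.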

48 of 50 cases passed. You re-run without changing code. How many pass the second time?

First run: 48 / 50
You re-run the same suite · No code changes
Prediction: _____ / 50 pass
Write your prediction and explain your reasoning. We'll compare to actual results next.

First run: 9/10 pass. Second run: 8/10 pass. Judge variance causes the flip.

Test Case         | Run 1 Result      | Run 2 Result      | Metric Type
SQL oracle 1      | Pass              | Pass              | Known-answer check (stable)
SQL oracle 2      | Pass              | Pass              | Known-answer check (stable)
SQL oracle 3      | Fail              | Fail              | Known-answer check (stable)
Narrative judge 1 | Pass (2/3 trials) | Pass (2/3 trials) | LLM judge (varies)
Narrative judge 2 | Pass (3/3 trials) | Fail (1/3 trials) | LLM judge (varies)
Judge variance caused one case to flip from pass to fail — same code, different gate outcome

Oracle checks are stable. Judge checks vary. Single-trial blocking gates become flaky CI.

Oracle-based (SQL)
9/10
Run 1 and Run 2
~0% variance
Judge-based (narrative)
pass@3
Run 1: 7/10 → Run 2: 6/10
~15% variance
Oracle-based metrics can be blocking. Judge-based metrics should be optimization with tolerance bands.
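One way to operationalize that classification: measure each metric's run-to-run spread on the unchanged system and only let empirically stable metrics be blocking. The 2% stability cutoff below is an illustrative assumption, not a rule from the lesson:

```python
def run_spread(pass_rates):
    """pass_rates: the same suite's pass rate across repeated runs,
    with no code changes in between, e.g. [0.7, 0.6, 0.7]."""
    return max(pass_rates) - min(pass_rates)

def classify_gate(pass_rates, stable_below=0.02):
    """Blocking only if the metric is empirically stable across re-runs."""
    return "blocking" if run_spread(pass_rates) <= stable_below else "optimization"
```

Oracle checks typically show near-zero spread and qualify as blocking; judge-based checks usually do not.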

Build a 20-case regression suite with metric classification and CI gating rules

  • Review L1.3 failure taxonomy — identify "evaluator-needed" categories
  • Select 20 test cases using scaffolded template (≥5 categories, ≥5 oracle queries)
  • Classify each metric as blocking / optimization / informational
  • Run regression suite with multi-trial protocols
  • Analyze variance — check how much your pass rates bounce around across re-runs
  • Define CI gating rules in plain English
  • Test rules against 5 simulated PR scenarios
  • Document suite evolution policy
Time estimate: Base: 20-25 min | Extend: +10-15 min (executable gating logic + power analysis)

A threshold table showing which metrics block deployment and which warn

Metric            | Type          | Threshold      | Current   | Status
SQL correctness   | Blocking      | 100%           | 95%       | BLOCKS
Policy compliance | Blocking      | 100%           | 100%      | Pass
Latency p95       | Optimization  | ≤110% baseline | 108%      | Pass
Narrative quality | Optimization  | ≥80% pass@3    | 75%       | WARN
Token usage       | Informational | N/A            | 1,200 avg | Log
What I built: A 20-case regression suite with automated CI gates that block deployment when SQL correctness fails and warn when latency or quality regress beyond tolerance.

Over-gating creates flaky CI. Under-gating ships regressions. Tolerance mismatch does both.

Over-gating (flaky CI)
Blocking on high-variance metrics
Every PR fails due to noise
Team ignores gates
Fix: Multi-trial + tolerance bands
Under-gating (meaningless suite)
All metrics "informational"
Nothing blocks deploy
Regressions ship anyway
Fix: At least 1 blocking metric
Tolerance mismatch
Too tight → flaky CI
Too loose → regressions slip
Not based on empirical variance
Fix: Derive from observed variance
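A sketch of deriving the tolerance band from observed variance: record the metric over several baseline runs of the unchanged system, then block only when a new run falls more than z standard deviations below the baseline mean. The choice of z=2 is illustrative:

```python
import statistics

def block_threshold(baseline_rates, z=2.0):
    """Tolerance derived from empirical run-to-run variance.
    Requires at least 2 baseline runs of the unchanged system."""
    return statistics.mean(baseline_rates) - z * statistics.stdev(baseline_rates)

def regressed(new_rate, baseline_rates, z=2.0):
    """True when a new run falls below the derived tolerance band."""
    return new_rate < block_threshold(baseline_rates, z)
```

For baseline runs of 80%, 75%, and 85%, the mean is 80% and the standard deviation 5%, so the block threshold at z=2 lands at 70%: a 72% run warns at most, while a 65% run blocks.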

A judge scores Pass, Fail, Pass across 3 trials. Should it be blocking or optimization?

Scenario: Narrative quality on a complex query
Judge results (3 trials):
Trial 1: Pass
Trial 2: Fail
Trial 3: Pass

Your gating rule: "Block deployment if any blocking metric fails"

Question: Should narrative quality be classified as blocking or optimization? Why?
A
Blocking — majority passed
B
Optimization — shows variance

The regression suite as a quality gate in the deployment pipeline

1. PR submitted
2. CI triggered (GitHub Actions)
3. Regression suite execution
Blocking metrics
Oracle checks → Binary pass/fail → If fail → BLOCK
Optimization metrics
Multi-trial → Compare to threshold → If regress → BLOCK
Informational metrics
Log only → Continue

Gate outcome: block or approve based on metric results

PR blocked
Review required (failure details linked)
PR approved
Merge allowed
Production feedback loop
New production failures → Add to suite
Suite version incremented

How instrumentation requirements differ by system type

Next: Lesson 2.5

RAG pipeline
Linear traces with retrieval and generation stages
Agent workflow
Branching traces with tool use and reasoning loops
Batch processing
Aggregate-level metrics instead of per-query traces
Same evaluation principles — different trace structures and checkpoint placement
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 4 | aianalystlab.ai