Regression safety net

Week 2 Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which trace fields did v1 add that make regression testing possible?

A "minor" prompt update ships — and breaks multi-table joins 20 minutes later

Code Review
Prompt update improves SQL readability
Deploy
Merge and deploy to production
20 minutes later
Slack: System won't handle multi-table joins
The new prompt template omits a critical instruction about foreign key relationships, reintroducing a failure mode fixed 6 months ago

Traditional regression testing assumes deterministic outputs. AI systems don't give you that.

Traditional Software
Run test → check exact match
Same input → same output
Pass once = always passes
AI Systems
Same input → different outputs
Suite passes run 1, fails run 2
Random variation looks like real breakage
Your regression suite might pass on one run and fail on the next purely due to variance

Start with your failure taxonomy — add one trace per "evaluator-needed" category

Failure taxonomy
From L1.3
Filter
Needs ongoing measurement
Sample
One trace per category
Expand
Add edge cases, known-correct answers
Suite
Curated collection
Pick test cases so every major failure category is represented — you're not testing everything, just the cases you care about
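The filter-then-sample step above can be sketched in a few lines. The trace records and field names below are hypothetical, standing in for whatever your L1.3 taxonomy export looks like:

```python
import random

# Hypothetical trace records from the failure taxonomy: each carries the
# failure category it exercises and whether that category still needs an
# evaluator (vs. a one-off fix that needs no ongoing measurement).
traces = [
    {"id": "t1", "category": "multi_table_join", "needs_evaluator": True},
    {"id": "t2", "category": "multi_table_join", "needs_evaluator": True},
    {"id": "t3", "category": "date_filtering",   "needs_evaluator": True},
    {"id": "t4", "category": "formatting_nit",   "needs_evaluator": False},
]

def seed_suite(traces, rng=None):
    """Filter to evaluator-needed categories, then sample one trace per
    category as the seed of the regression suite."""
    rng = rng or random.Random(0)
    by_category = {}
    for t in traces:
        if t["needs_evaluator"]:
            by_category.setdefault(t["category"], []).append(t)
    return [rng.choice(group) for group in by_category.values()]

suite = seed_suite(traces)
```

From this seed you then expand with edge cases and known-correct answers per category.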

Not all failures block deployment — distinguish blocking from optimization metrics

Metric Type   | Pass Criteria                      | Gate Behavior                     | Example
Blocking      | Must pass 100% on regression suite | Block deployment if any fail      | SQL correctness, policy compliance
Optimization  | Must not regress below threshold   | Block if regress beyond tolerance | Narrative quality, latency
Informational | No gate — logged only              | Never blocks                      | Retrieval diversity, token usage
Blocking metrics should be deterministic checks that return the same result on every run. Optimization metrics are expected to improve over time and only stop a release when they regress beyond tolerance.
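The three gate behaviors can be sketched as a small dispatch function. Names here are illustrative, not from the lesson's codebase:

```python
from enum import Enum

class MetricType(Enum):
    BLOCKING = "blocking"            # must pass 100%
    OPTIMIZATION = "optimization"    # must not regress beyond tolerance
    INFORMATIONAL = "informational"  # logged only, never gates

def gate_action(metric_type, passed, beyond_tolerance=False):
    """Map one metric result to a gate decision."""
    if metric_type is MetricType.BLOCKING:
        return "pass" if passed else "block"
    if metric_type is MetricType.OPTIMIZATION:
        if beyond_tolerance:
            return "block"           # regressed past the tolerance band
        return "pass" if passed else "warn"
    return "log"                     # informational: record and continue
```

Note the asymmetry: a blocking metric fails the gate on any failure, while an optimization metric only blocks when it drops past its tolerance band and otherwise just warns.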

Run each test case multiple times. Use pass@k for capability, reliable@k for consistency.

Single Trial
Run once → Pass or Fail
Variance looks like regression
Flaky CI
Multi-Trial (run it 3 times)
Run 3 times → aggregate
Capability (pass@3): at least 1 of 3 trials passes
Consistency (reliable@3): all 3 trials pass
Example: pass@3 = 7/10 means 7 out of 10 test cases passed at least once in 3 tries
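The two aggregates can be sketched in a few lines; the helper names below are illustrative, not a library API:

```python
def pass_at_k(trials):
    """Capability: did the case pass in at least one of k trials?"""
    return any(trials)

def reliable_at_k(trials):
    """Consistency: did the case pass in all k trials?"""
    return all(trials)

def suite_rates(results):
    """results: {case_id: [bool per trial]} -> (pass@k rate, reliable@k rate)."""
    n = len(results)
    return (sum(pass_at_k(t) for t in results.values()) / n,
            sum(reliable_at_k(t) for t in results.values()) / n)

# Three cases, three trials each: one flaky, one solid, one broken.
example = {
    "case_a": [True, False, True],    # passes pass@3, fails reliable@3
    "case_b": [True, True, True],     # passes both
    "case_c": [False, False, False],  # fails both
}
```

On the example above, pass@3 counts 2 of 3 cases and reliable@3 counts 1 of 3: the flaky case inflates capability but not consistency.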

The regression suite is not a dashboard — it's an automated decision gate

PR submitted
CI triggered
GitHub Actions runs suite
All gates pass?
No → PR blocked
Review required
Yes → PR approved
Merge allowed
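A sketch of the decision step a CI job (e.g. a GitHub Actions step) could run over the collected metric results; the field names are assumptions, not a fixed schema:

```python
def ci_gate(results):
    """results: list of dicts like
       {"name": ..., "blocking": bool, "passed": bool, "beyond_tolerance": bool}
    Returns whether the PR may merge and which metrics blocked it."""
    blocked_by = [
        r["name"] for r in results
        if (r["blocking"] and not r["passed"])                    # blocking: any fail blocks
        or (not r["blocking"] and r.get("beyond_tolerance", False))  # optimization: block past tolerance
    ]
    return {"approved": not blocked_by, "blocked_by": blocked_by}

decision = ci_gate([
    {"name": "sql_correctness", "blocking": True, "passed": False},
    {"name": "latency_p95", "blocking": False, "passed": True},
])
```

The `blocked_by` list is what you would surface in the PR review message so the failure details are linked, not just a red X.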

Add new failures when discovered in production. Retire cases that no longer add value.

Production failure discovered
Add to regression suite
Suite version incremented
3+ releases with 100% pass → Retire case
The suite is a living artifact
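The retirement rule above can be sketched as a one-liner (hypothetical helper; the 3-release streak comes from the slide):

```python
def should_retire(pass_history, streak=3):
    """pass_history: per-release pass/fail booleans for one case, newest last.
    Retire the case after `streak` consecutive releases at 100% pass."""
    return len(pass_history) >= streak and all(pass_history[-streak:])
```

Any production failure resets the streak, so a case only retires once it has stopped catching anything for several releases in a row.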

48 of 50 cases passed. You re-run without changing code. How many pass the second time?

First run: 48 / 50
You re-run the same suite · No code changes
Prediction: _____ / 50 pass
Write your prediction and explain your reasoning. We'll compare to actual results next.

First run: 9/10 pass. Second run: 8/10 pass. Judge variance causes the flip.

Test Case         | Run 1 Result      | Run 2 Result      | Metric Type
SQL oracle 1      | Pass              | Pass              | Known-answer check (stable)
SQL oracle 2      | Pass              | Pass              | Known-answer check (stable)
SQL oracle 3      | Fail              | Fail              | Known-answer check (stable)
Narrative judge 1 | Pass (2/3 trials) | Pass (2/3 trials) | LLM judge (varies)
Narrative judge 2 | Pass (3/3 trials) | Fail (1/3 trials) | LLM judge (varies)
Judge variance caused one case to flip from pass to fail — same code, different gate outcome

Oracle checks are stable. Judge checks vary. Single-trial blocking gates become flaky CI.

Oracle-based (SQL)
9/10
Run 1 and Run 2
~0% variance
Judge-based (narrative)
pass@3
Run 1: 7/10 → Run 2: 6/10
~15% variance
Oracle-based metrics can be blocking. Judge-based metrics should be optimization with tolerance bands.
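One way to operationalize that classification: measure each metric's run-to-run spread on the unchanged system and only let empirically stable metrics be blocking. The 2% stability cutoff below is an illustrative assumption, not a rule from the lesson:

```python
def run_spread(pass_rates):
    """pass_rates: the same suite's pass rate across repeated runs,
    with no code changes in between, e.g. [0.7, 0.6, 0.7]."""
    return max(pass_rates) - min(pass_rates)

def classify_gate(pass_rates, stable_below=0.02):
    """Blocking only if the metric is empirically stable across re-runs."""
    return "blocking" if run_spread(pass_rates) <= stable_below else "optimization"
```

Oracle checks typically show near-zero spread and qualify as blocking; judge-based checks usually do not.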

Build a 20-case regression suite with metric classification and CI gating rules

  • Review L1.3 failure taxonomy — identify "evaluator-needed" categories
  • Select 20 test cases using scaffolded template (≥5 categories, ≥5 oracle queries)
  • Classify each metric as blocking / optimization / informational
  • Run regression suite with multi-trial protocols
  • Analyze variance — check how much your pass rates bounce around across re-runs
  • Define CI gating rules in plain English
  • Test rules against 5 simulated PR scenarios
  • Document suite evolution policy
Time estimate: Base: 20-25 min | Extend: +10-15 min (executable gating logic + power analysis)

A threshold table showing which metrics block deployment and which warn

Metric            | Type          | Threshold      | Current   | Status
SQL correctness   | Blocking      | 100%           | 95%       | BLOCKS
Policy compliance | Blocking      | 100%           | 100%      | Pass
Latency p95       | Optimization  | ≤110% baseline | 108%      | Pass
Narrative quality | Optimization  | ≥80% pass@3    | 75%       | WARN
Token usage       | Informational | N/A            | 1,200 avg | Log
What I built: A 20-case regression suite with automated CI gates that block deployment when SQL correctness fails and warn when latency or quality regress beyond tolerance.

Over-gating creates flaky CI. Under-gating ships regressions. Tolerance mismatch does both.

Over-gating (flaky CI)
Blocking on high-variance metrics
Every PR fails due to noise
Team ignores gates
Fix: Multi-trial + tolerance bands
Under-gating (meaningless suite)
All metrics "informational"
Nothing blocks deploy
Regressions ship anyway
Fix: At least 1 blocking metric
Tolerance mismatch
Too tight → flaky CI
Too loose → regressions slip
Not based on empirical variance
Fix: Derive from observed variance
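A sketch of deriving the tolerance band from observed variance: record the metric over several baseline runs of the unchanged system, then block only when a new run falls more than z standard deviations below the baseline mean. The choice of z=2 is illustrative:

```python
import statistics

def block_threshold(baseline_rates, z=2.0):
    """Tolerance derived from empirical run-to-run variance.
    Requires at least 2 baseline runs of the unchanged system."""
    return statistics.mean(baseline_rates) - z * statistics.stdev(baseline_rates)

def regressed(new_rate, baseline_rates, z=2.0):
    """True when a new run falls below the derived tolerance band."""
    return new_rate < block_threshold(baseline_rates, z)
```

For baseline runs of 80%, 75%, and 85%, the mean is 80% and the standard deviation 5%, so the block threshold at z=2 lands at 70%: a 72% run warns at most, while a 65% run blocks.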

A judge scores Pass, Fail, Pass across 3 trials. Should it be blocking or optimization?

Scenario: Narrative quality on a complex query
Judge results (3 trials):
Trial 1: Pass
Trial 2: Fail
Trial 3: Pass

Your gating rule: "Block deployment if any blocking metric fails"

Question: Should narrative quality be classified as blocking or optimization? Why?
A
Blocking — majority passed
B
Optimization — shows variance

The regression suite as a quality gate in the deployment pipeline

1. PR submitted
2. CI triggered (GitHub Actions)
3. Regression suite execution
Blocking metrics
Oracle checks → Binary pass/fail → If fail → BLOCK
Optimization metrics
Multi-trial → Compare to threshold → If regress → BLOCK
Informational metrics
Log only → Continue

Gate outcome: block or approve based on metric results

PR blocked
Review required (failure details linked)
PR approved
Merge allowed
Production feedback loop
New production failures → Add to suite
Suite version incremented

How instrumentation requirements differ by system type

Next: Lesson 2.5

RAG pipeline
Linear traces with retrieval and generation stages
Agent workflow
Branching traces with tool use and reasoning loops
Batch processing
Aggregate-level metrics instead of per-query traces
Same evaluation principles — different trace structures and checkpoint placement
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 4 | aianalystlab.ai