Pipeline architecture

Week 5 Lesson 1 AI Evals for Product Dev
Shane Butler AI Analyst Lab

Which metrics did you designate as "blocking" (must pass for ship), and what thresholds did you set based on v1 baseline performance?

Blocking Metrics
Thresholds
Take 30 seconds to answer from memory.

Every time you run an evaluation, you lose the results

Monday: SQL correctness 87%
Posted in Slack, no metadata saved
Wednesday: SQL correctness 83%
Different person, different run, posted in Slack
Question: Is this a regression?
Can't answer — no dataset version, model version, judge config, or reproduction steps
Missing:
Dataset version
Judge config
Model version
Reproduction steps

An evaluation pipeline is persistent infrastructure, not a script you run once

Without Pipeline
One-off script
  • Lost results
  • No history
  • Can't compare runs
With Pipeline
Defined infrastructure
  • Stored results
  • Queryable history
  • Version-controlled

Six stages: sampling → judging → aggregation → storage → reporting → alerts

1. Sampling
traces.sample()
2. Judging
metrics.judge()
3. Aggregation
metrics.aggregate()
4. Storage
warehouse.write_run()
5. Reporting
reports.metric_card()
6. Alerts
alerts.check_threshold()
Each stage has a defined input/output contract; the aieval library provides a function for each
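The six stages can be chained end to end. A minimal self-contained sketch: the function names mirror the aieval calls above, but the bodies, the trace fields (`sql_ok`, `narrative_ok`), and the 80% blocking threshold are illustrative stand-ins, not the real library.

```python
import random

# Stand-ins for the six pipeline stages; names mirror the aieval
# functions above, but the bodies are illustrative.

def sample_traces(traces, n, seed=0):                   # 1. traces.sample()
    return random.Random(seed).sample(traces, min(n, len(traces)))

def judge(trace):                                       # 2. metrics.judge()
    return trace["sql_ok"] and trace["narrative_ok"]

def aggregate(verdicts):                                # 3. metrics.aggregate()
    return sum(verdicts) / len(verdicts)

def write_run(warehouse, run_id, pass_rate):            # 4. warehouse.write_run()
    warehouse[run_id] = {"pass_rate": pass_rate}

def metric_card(warehouse, run_id):                     # 5. reports.metric_card()
    return f"{run_id}: {warehouse[run_id]['pass_rate']:.0%} pass"

def check_threshold(warehouse, run_id, blocking=0.80):  # 6. alerts.check_threshold()
    return warehouse[run_id]["pass_rate"] >= blocking

# Toy traces: every fifth one fails the SQL check.
traces = [{"sql_ok": i % 5 != 0, "narrative_ok": True} for i in range(100)]
warehouse = {}
batch = sample_traces(traces, 50)
rate = aggregate([judge(t) for t in batch])
write_run(warehouse, "eval_demo_001", rate)
print(metric_card(warehouse, "eval_demo_001"))
print("ship ok" if check_threshold(warehouse, "eval_demo_001") else "blocked")
```

The point of the shape, not the bodies: each stage takes the previous stage's output, so any stage can be swapped (a different sampler, a different judge) without touching the rest.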

Run metadata makes evaluation results interpretable

Field | Example Value
run_id | eval_2024_03_15_001
dataset_version | traces_v1_full.jsonl
model_version | v1.3
judge_version | judge_v2.1
judge_config_hash | a3f5b8c2
timestamp | 2024-03-15 14:22:00
commit_sha | 7d3a91f
sample_size | 200
sampling_strategy | random
Change any single field → different evaluation
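The metadata record can be captured as a single typed object written with every run. A sketch: the field names follow the table above, but the dataclass and the judge_config_hash helper are assumptions about how you might implement it, not the aieval API.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def config_hash(config: dict) -> str:
    """Stable short hash of a judge config (illustrative derivation)."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]

@dataclass(frozen=True)  # frozen: metadata is immutable once written
class RunMetadata:
    run_id: str
    dataset_version: str
    model_version: str
    judge_version: str
    judge_config_hash: str
    timestamp: str
    commit_sha: str
    sample_size: int
    sampling_strategy: str

meta = RunMetadata(
    run_id="eval_2024_03_15_001",
    dataset_version="traces_v1_full.jsonl",
    model_version="v1.3",
    judge_version="judge_v2.1",
    judge_config_hash=config_hash({"temperature": 0, "rubric": "v2"}),
    timestamp="2024-03-15 14:22:00",
    commit_sha="7d3a91f",
    sample_size=200,
    sampling_strategy="random",
)
print(asdict(meta))
```

Hashing the judge config (rather than storing it inline) gives a compact equality check: two runs are comparable on judging only if their hashes match.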

Match evaluation depth to deployment stage

Stage | Metrics | Intensity
Offline (pre-prod) | All metrics, all segments, all traces | 100%
Shadow (parallel, no users) | Sampled, blocking metrics only | 50%
Beta | Targeted segments, optimization tracked | 30%
Experiment | Statistical comparison, decision metrics | 50%
Production | Alerts on blocking, cost-constrained | 15%
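The ladder can be encoded as configuration so each environment automatically gets the right evaluation depth. The stage names and intensities come from the table; the dict layout and the sample_budget helper are illustrative, not part of aieval.

```python
# Sampling intensity and metric scope per deployment stage (from the table).
EVAL_LADDER = {
    "offline":    {"intensity": 1.00, "metrics": "all"},
    "shadow":     {"intensity": 0.50, "metrics": "blocking"},
    "beta":       {"intensity": 0.30, "metrics": "targeted"},
    "experiment": {"intensity": 0.50, "metrics": "decision"},
    "production": {"intensity": 0.15, "metrics": "blocking"},
}

def sample_budget(stage: str, traces_available: int) -> int:
    """How many traces to evaluate at a given deployment stage."""
    return round(traces_available * EVAL_LADDER[stage]["intensity"])

print(sample_budget("production", 1000))  # 150
print(sample_budget("offline", 200))      # 200
```

Keeping the ladder in one config object means the cost/coverage trade-off is reviewed in one place instead of being re-decided ad hoc per run.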

Can you confidently say Run B is better? What run metadata would you need to know whether the 4-point improvement is real?

Run A: 78% pass rate
judge_v2.1
200 traces from traces_v1_full.jsonl
Run B: 82% pass rate
judge_v2.3
200 traces from traces_v1_full.jsonl
Question: Is Run B better?
Write down your prediction before running the next cell.

Walk through sampling to warehouse storage

Sample 50 traces
aieval.traces.sample(dataset="traces_v1_full.jsonl", n=50)
Apply oracle SQL check + LLM judge
Two-part scoring: execution correctness + narrative quality
Aggregate: 76% composite pass rate
Fraction of traces where both SQL and narrative pass
Package with run metadata + write to eval_runs
run_id, dataset_version, judge_version, sample_size, timestamp, commit_sha
Query back to verify
Confirm record exists with correct metadata in warehouse
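The aggregation step above can be made concrete: a trace passes the composite only if both the oracle SQL check and the LLM judge pass. A toy sketch with made-up verdicts chosen to reproduce the 76% figure; in the exercise the verdicts come from the actual oracle and judge.

```python
# Toy verdicts for 50 traces: (sql_pass, narrative_pass).
# 38 pass both, 7 fail narrative only, 5 fail SQL only -> 76% composite.
verdicts = [(True, True)] * 38 + [(True, False)] * 7 + [(False, True)] * 5

composite = sum(sql and narrative for sql, narrative in verdicts) / len(verdicts)
print(f"{composite:.0%}")  # 76%
```

Note the composite is stricter than either metric alone: SQL correctness here is 45/50 (90%) and narrative quality 43/50 (86%), yet only 76% pass both.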

Build the failure matrix — where the pipeline breaks

          query  intent  context  sql_gen  sql_exec  chart  narrative
query       -      -       -        -        -        -        -
intent      -      -       -        -        -        -        -
context     -      -       -        -        -        -        -
sql_gen     -      -       -        -       23%       -        -
sql_exec    -      -       -        -        -        -        -
chart       -      -       -        -        -        -       18%
narrative   -      -       -        -        -        -        -
Hottest cells = where to focus debugging
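One way to derive the matrix, sketched under the assumption that each trace records the last stage it completed (the real instrumentation from Lesson 2.5 may differ); the toy counts are chosen to reproduce the 23% and 18% hot cells above.

```python
from collections import Counter

STAGES = ["query", "intent", "context", "sql_gen", "sql_exec", "chart", "narrative"]

def failure_matrix(last_completed):
    """Failure rate for each transition a -> b: of the traces that
    completed stage a, the fraction that did not complete stage b."""
    counts = Counter(last_completed)
    # A trace that died at stage s completed every stage up to s,
    # so "reached at least stage i" sums all later-or-equal stages.
    at_least = [sum(n for s, n in counts.items() if STAGES.index(s) >= i)
                for i in range(len(STAGES))]
    return {
        (STAGES[i], STAGES[i + 1]): 1 - at_least[i + 1] / at_least[i]
        for i in range(len(STAGES) - 1) if at_least[i]
    }

# Toy data: each entry is the last stage a trace completed.
data = ["narrative"] * 63 + ["sql_gen"] * 23 + ["chart"] * 14
rates = failure_matrix(data)
print(f"sql_gen -> sql_exec: {rates[('sql_gen', 'sql_exec')]:.0%}")  # 23%
print(f"chart -> narrative: {rates[('chart', 'narrative')]:.0%}")    # 18%
```

Because rates are conditional on reaching the source stage, a hot late-pipeline cell can sit on few traces; check the denominator before prioritizing a cell.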

Design Run B, query both runs, interpret the difference

Field | Run A | Run B
Pass rate | 76% | 81%
Dataset | traces_v1_full.jsonl | traces_v1_full.jsonl
Sample | Random 50 | Filtered: PM users only
Judge version | judge_v2.1 | judge_v2.1
Sample size | 50 | 32
Interpretation
The 5-point improvement is attributable to filtering for PM users, who ask simpler queries with higher pass rates; it is a sampling change, not evidence the system got better
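Querying both runs back from the warehouse is what makes this interpretation possible. A sketch using an in-memory SQLite table as a stand-in for eval_runs; the columns follow the schema fields on the slide and the rows follow the Run A / Run B values above, but the actual warehouse and its client are whatever your stack provides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE eval_runs (
    run_id TEXT PRIMARY KEY,
    dataset_version TEXT,
    judge_version TEXT,
    sampling_strategy TEXT,
    sample_size INTEGER,
    pass_rate REAL)""")
con.executemany("INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?, ?)", [
    ("run_a", "traces_v1_full.jsonl", "judge_v2.1", "random", 50, 0.76),
    ("run_b", "traces_v1_full.jsonl", "judge_v2.1", "pm_users_only", 32, 0.81),
])

# Side-by-side comparison: any field that differs besides pass_rate
# is a candidate explanation for the delta.
rows = list(con.execute(
    "SELECT run_id, sampling_strategy, sample_size, pass_rate "
    "FROM eval_runs ORDER BY run_id"))
for row in rows:
    print(row)
```

Here the query surfaces the confound immediately: dataset and judge match, but sampling_strategy and sample_size do not, so the 5-point delta cannot be read as improvement.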

Run full pipeline + transition matrix + Run B comparison

  • Run the worked example (Run A)
    Sample 50 traces, apply oracle + judge, aggregate, write to warehouse
  • Build transition failure matrix
    Identify top 2 failure transitions, connect to instrumentation from Lesson 2.5
  • Design and execute Run B
    Choose one deliberate change (segment, sample, threshold), run full pipeline, write with new run_id, query both runs
  • Assemble Evaluation Pipeline Blueprint artifact
Base: 20-25 min | Extend: +10-15 min (lineage tracking, custom queries)

Evaluation Pipeline Blueprint with transition failure heatmap

Transition Failure Heatmap
7x7 grid with heat colors
sql_generated → sql_executed: 23% failure rate
eval_runs Schema
run_id, dataset_version, judge_version, timestamp, commit_sha, sample_size
Run A
Run B
Queryable Time Series
Portfolio-ready: infrastructure for repeatable evaluation

Four ways evaluation infrastructure breaks

Failure Mode | Consequence
Results live only in notebooks or Slack | "Has SQL correctness improved?" → no one knows
Run metadata is incomplete | Cannot tell if 87% → 83% is a regression or a methodology change
No lineage tracking | "How has our eval evolved?" → reconstruct from memory
Treating all environments the same | Full suite on prod = $40K/month in judge costs

Metadata decisions, heatmap interpretation, environment ladder

1. Metadata check
Run A: 87% with judge_v2.1. Run B: 91% with judge_v2.3. PM asks: "Should we ship based on this 4-point improvement?" What metadata would you check?
2. Heatmap diagnosis
Transition heatmap shows context_retrieved → sql_generated has 15% failure rate (highest in pipeline). What does this tell you?
3. Environment ladder
Colleague proposes: run full suite (500 traces, 8 metrics, 3 LLM judges) on every production request. Why is this infeasible? What's the alternative?

Pipeline stages + run metadata + transition matrix = decision evidence

Pipeline Stages
1. Sampling
2. Judging
3. Aggregation
4. Storage
5. Reporting
6. Alerts
Run Metadata
run_id
dataset_version
judge_version
timestamp
commit_sha
History tracking
Transition Matrix
7x7 heatmap
AI Data Analyst
pipeline states
Hottest cells
annotated
Decision Evidence: Run A vs. Run B comparison with full provenance

Next: Dataset Lifecycle

How evaluation datasets grow, version, and stay aligned to production without becoming stale
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 1 | aianalystlab.ai