Distributional thinking

Week 1 · Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

How many distinct failure categories did you discover in the v0 traces, and which appeared most often?

Quick recall from Lesson 1.3 — you built a failure taxonomy by clustering traces bottom-up.
Your Failure Taxonomy
Category 1: _______________
Category 2: _______________
Category 3: _______________
Most Frequent
Category: _______________
Frequency: _____ %

62% success rate from 100 traces — your PM asks "is that good enough to ship?" and you cannot answer

100 traces
62% produce acceptable outputs
PM asks
Is 62% good enough to ship?
Three unknowns
You cannot answer
Would re-running yield 58% or 67%?
Re-run uncertainty — no error bars
Is it 5% catastrophic or 38% mediocre?
Failure distribution shape — no segments
System non-determinism or measurement noise?
Variance source — not separated
Quality is a distribution, not a number.

Without distributional thinking, teams make three systematic errors that all lead to shipping on noise

Error | What Teams Do | What Goes Wrong
Optimistic single-sample estimates | "62% pass rate — ship it" | Re-running yields 55%. The estimate did not generalize.
Capability vs. reliability confusion | "pass@5 (at least 1 of 5 succeeds) = 0.90 — works great" | reliable@5 (all 5 succeed) = 0.60. 40% of users hit at least one failure in 5 tries.
Variance source confusion | "System is too random — lower the temperature" | 85% of variance was evaluator noise, not system non-determinism.
Distributional thinking is the paradigm shift: from "did it work?" to "how often does it work, under what conditions, and with what variance?"

Every quality metric needs error bars — a point estimate from 5 trials could be 40% or 80%

Point Estimate
pass@5 = 0.67
No context. Could be signal or noise. Cannot defend this claim.
With Confidence Interval
pass@5 = 0.67
[95% CI: 0.40, 0.80]
Wide CI = insufficient evidence. True value could plausibly be 0.40 or 0.80.
Bootstrap: take your results, randomly re-draw samples (allowing repeats), recompute the metric each time. The spread of results gives you the confidence interval — no assumptions about the data shape needed.
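That procedure can be sketched in a few lines of Python. This is a standalone illustration only: the exercise later uses the course's `aieval.uncertainty.bootstrap_ci()`, and the function name and defaults here are mine.

```python
import random

def percentile_bootstrap_ci(outcomes, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample outcomes with replacement, recompute
    the metric each time, and take the central (1 - alpha) span as the CI."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(outcomes) for _ in outcomes]
        stats.append(metric(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical example: 5 trial outcomes (1 = pass), pass rate as the metric.
outcomes = [1, 0, 0, 1, 0]
low, high = percentile_bootstrap_ci(outcomes, lambda xs: sum(xs) / len(xs))
# With only 5 trials the interval is very wide, which is the slide's point.
```

With real data you would resample per-query results (e.g. pass@5 indicators) rather than raw trials, but the mechanics are identical.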

Aggregate pass@5 of 0.78 hides that multi-table joins succeed at half the rate of simple lookups

Simple Lookup
pass@5 = 0.92
Revenue totals, single-table queries
Multi-Table Join
pass@5 = 0.45
Cross-segment comparisons, complex SQL
Trend Analysis
pass@5 = 0.78
Time-series patterns, period-over-period
Aggregate pass@5 = 0.78 — looks acceptable. But your power users running complex analytical queries hit failures at more than twice the aggregate rate (55% vs. 22%).

System variance and evaluator variance need different fixes — decompose before you mitigate

System Variance
Re-run the AI Data Analyst on the same query, get different outputs each time.
Fix with: temperature reduction, retry logic, prompt engineering
Evaluator Variance
Re-run the evaluator on the same fixed output, get different scores each time.
Fix with: evaluator calibration, rubric refinement, consensus scoring
In this lesson, our correctness checks (comparing against known right answers) are deterministic — evaluator variance is near zero. In L3.5, you will encounter cases where evaluator variance dominates. If you have not decomposed, you waste effort fixing the wrong thing.
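One rough way to decompose the two sources, sketched below under the assumption that each system output is scored several times by the evaluator. Function and variable names are mine, not the course's tooling: spread between outputs approximates system variance, spread within one output's repeated scores approximates evaluator variance.

```python
from statistics import mean, pvariance

def decompose_variance(scores):
    """scores[i][j] = evaluator score j for system output i (same query).
    Between-output spread ~ system variance;
    within-output spread ~ evaluator variance."""
    system_var = pvariance([mean(row) for row in scores])
    evaluator_var = mean(pvariance(row) for row in scores)
    return system_var, evaluator_var

# Hypothetical: 3 outputs for one query, each scored 3 times.
# A deterministic correctness check gives identical scores within each row.
scores = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sys_var, eval_var = decompose_variance(scores)
# eval_var = 0.0: here all the variance is the system's.
```

If instead the rows were near-identical and the within-row scores varied, evaluator variance would dominate, and a sprint spent re-prompting the model would be wasted effort.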

Full rollout, gradual ramp, or hold — the decision depends on CIs, segments, and variance, not a single number

Full Rollout
Narrow CIs + all must-pass metrics clear thresholds. Evidence is sufficient.
Gradual Ramp
Wider CIs tolerated if direction is positive. Deploy to 10%, monitor, expand.
Hold / Rollback
Evidence insufficient or guardrails fail. CIs include the threshold — cannot distinguish pass from fail.
The question is not "what is the metric?" — it is "does the evidence support the proposed action?"
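One possible encoding of the three outcomes as a decision rule. This is an illustrative sketch of the slide's criteria, not a prescribed API; real ship gates would also check guardrails and segment-level metrics.

```python
def ship_decision(ci_low, ci_high, threshold):
    """Compare a must-pass metric's 95% CI against its ship threshold."""
    if ci_low >= threshold:
        return "full rollout"      # entire CI clears the bar: evidence sufficient
    if ci_high < threshold:
        return "hold / rollback"   # entire CI sits below the bar
    return "gradual ramp or hold"  # CI straddles the threshold:
                                   # cannot distinguish pass from fail

# Example: reliable@5 CI of [0.44, 0.60] against a 0.80 ship threshold.
decision = ship_decision(0.44, 0.60, 0.80)  # "hold / rollback"
```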

Five trials — predict pass@5, reliable@5, and dominant variance source

Query: "What was Q4 mobile checkout conversion?" — 5 runs, temp 0.7
Trial Outcome
1 Correct SQL, correct narrative
2 Correct SQL, minor hallucination
3 SQL syntax error
4 Correct SQL, correct narrative
5 Correct SQL, overstated confidence
  • Predict pass@5 (at least 1 trial succeeds)
  • Predict reliable@5 (all 5 succeed)
  • Dominant variance source: system or judge?
Write predictions before seeing computed results.

pass@5 = 0.67 but reliable@5 = 0.01 — a 66-point gap on a single query

pass@5
0.672
reliable@5
0.010
Gap
0.662
66 percentage points
Single-trial success rate: p = 2/5 = 0.40
pass@5 (chance at least 1 of 5 works) = 1 - 0.6^5 = 0.672
reliable@5 (chance all 5 work) = 0.4^5 = 0.010

Bootstrap CI for pass@5 is [0.40, 0.80] — five trials is not enough evidence for a ship decision

Bootstrap Distribution of pass@5
2.5th percentile: 0.40 | observed: 0.67 | 97.5th percentile: 0.80
Results
pass@5
0.67 [0.40, 0.80]
reliable@5
0.01 [0.00, 0.10]
CI width of 0.40 on pass@5. The true value could be anywhere from a coin flip to strong capability. Five trials is not enough evidence.

Across 50 queries with known correct answers, pass@5 = 0.78 but reliable@5 = 0.52 — system variance accounts for 90%

k | pass@k [95% CI] | reliable@k [95% CI]
1 | 0.62 [0.54, 0.70] | 0.62 [0.54, 0.70]
3 | 0.72 [0.64, 0.80] | 0.24 [0.18, 0.32]
5 | 0.78 [0.70, 0.85] | 0.52 [0.44, 0.60]
10 | 0.94 [0.88, 0.98] | 0.28 [0.20, 0.36]
Variance Decomposition
System variance: 90%. Remaining: 10% — our correctness checks give the same answer every time, so evaluator variance is near zero.
The 26-point gap between pass@5 and reliable@5 means nearly half of users experience at least one failure in 5 interactions. System variance drives 90% — focus mitigation there.
Note: the empirical aggregate reliable@k is non-monotonic (0.62 → 0.24 → 0.52 → 0.28) because each k is estimated from a different finite draw of trials, so sampling noise moves the aggregate around. For any single query, the true reliable@k = p^k always decreases as k increases.
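Computed from a matrix of per-query trial outcomes, the two metrics differ only in "any" versus "all." A minimal sketch over hypothetical data, not the lesson's 50-query dataset:

```python
def pass_at_k(trials, k):
    """Fraction of queries with at least one success among the first k trials."""
    return sum(any(row[:k]) for row in trials) / len(trials)

def reliable_at_k(trials, k):
    """Fraction of queries where all of the first k trials succeed."""
    return sum(all(row[:k]) for row in trials) / len(trials)

# Hypothetical trial matrix: 4 queries x 5 trials (1 = pass, 0 = fail).
trials = [
    [1, 1, 1, 1, 1],  # stable query: always passes
    [1, 0, 1, 1, 0],  # flaky query
    [0, 1, 0, 0, 1],  # mostly fails
    [0, 0, 0, 0, 0],  # always fails
]
# The gap widens with k: pass@5 = 0.75 but reliable@5 = 0.25 here.
```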

Interpret segment-level distributions, compute variance gaps, and write evidence-based ship criteria

Base Version (all students, 20 min)
1. Read segment summary from w1_l4_segment_summary.csv
2. Compute variance gap per segment: gap = pass@5 - reliable@5
3. Identify which segment has the largest gap
4. Write ship criteria: "Ship if reliable@5 > [threshold] with CI width < [threshold]"
5. Recommend ship/ramp/hold per segment
Deliverable: Three evidence-based ship/ramp/hold decisions
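Steps 2 and 3 of the base version amount to a short computation. The segment numbers below are the ones quoted elsewhere in this lesson; the dict layout is illustrative, not the schema of w1_l4_segment_summary.csv:

```python
# pass@5 / reliable@5 per segment, as reported earlier in the lesson.
segments = {
    "simple_lookup":    {"pass@5": 0.92, "reliable@5": 0.85},
    "multi_table_join": {"pass@5": 0.45, "reliable@5": 0.20},
    "trend_analysis":   {"pass@5": 0.78, "reliable@5": 0.55},
}
# Variance gap per segment: gap = pass@5 - reliable@5
gaps = {name: round(m["pass@5"] - m["reliable@5"], 2)
        for name, m in segments.items()}
largest = max(gaps, key=gaps.get)  # multi_table_join, gap 0.25
```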
Extend Version (DS/Eng, +15 min)
1. Load 50 queries x 5 trials from w1_l4_multitrial_traces.jsonl
2. Compute pass@k and reliable@k for k = 1, 3, 5
3. Bootstrap 95% CIs using aieval.uncertainty.bootstrap_ci()
4. Segment by query complexity and compare variance
5. Plot pass@5 vs reliable@5 by segment with CI error bars
Deliverable: Distributional quality report with self-computed CIs

Your distributional quality report quantifies the 26-point gap that blocks shipping

Portfolio Artifact — Distributional Quality Report
Segment | pass@5 | reliable@5
Simple | 0.92 | 0.85
Multi-join | 0.45 | 0.20
Trend | 0.78 | 0.55
What I built — A distributional quality report that quantifies the capability-reliability gap by segment, with evidence-based ship decisions. Simple: Ship. Multi-join: Hold. Trend: Ramp.

Five ways teams misuse distributional evidence — all lead to shipping on noise

Confusing pass@k with reliability
Report pass@5 = 0.90 as "works 90% of the time" — but reliable@5 = 0.60
Point estimates without CIs
Ship on "62% success" without knowing CI is [0.52, 0.72] — includes hold threshold
Attributing all variance to the system
Spend a sprint re-prompting when 85% of variance was evaluator noise
Using pass@k alone for ship decisions
Ship on pass@10 = 0.98 — but reliable@10 = 0.28, users hit failures constantly
Ignoring worst-segment risk in aggregates
Aggregate pass@5 = 0.85 looks fine — but multi-join segment has pass@5 = 0.45

Three judgment calls: segment gaps, CI-vs-threshold reasoning, and variance decomposition priority

Scenario 1
PM says "80% pass rate — ship it" but multi-join segment has reliable@5 = 0.22. What evidence do you present?
Scenario 2
pass@5 = 0.78 [95% CI: 0.68, 0.88]. Ship criterion is reliable@5 > 0.80. Can you ship?
Scenario 3
System variance = 15% of total. Judge variance = 85%. Your team wants to re-prompt the model. What do you prioritize?

Pass@5 vs reliable@5 by segment — the gap is user experience risk

Segment | pass@5 | reliable@5 | Gap
Simple | 0.92 | 0.85 | small
Multi-join | 0.45 | 0.20 | largest
Trend | 0.78 | 0.55 | moderate
pass@5 measures capability (can it do it?); reliable@5 measures reliability (does it always do it?).
Gap = user experience risk

Next: The cost-latency-quality frontier

You quantified the gap between capability and consistency. Next, you confront the tradeoff: reducing variance costs money and time. Retries improve reliability but increase latency. Better models improve quality but increase cost. The frontier defines what is achievable — and forces you to choose.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 4 | aianalystlab.ai