Distributional thinking

Week 1 · Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

How many distinct failure categories did you discover in the v0 traces, and which appeared most often?

Quick recall from Lesson 1.3 — you built a failure taxonomy by clustering traces bottom-up.
Your Failure Taxonomy
Category 1: _______________
Category 2: _______________
Category 3: _______________
Most Frequent
Category: _______________
Frequency: _____ %

62% success rate from 100 traces — your PM asks "is that good enough to ship?" and you cannot answer

100 traces
62% produce acceptable outputs
PM asks
Is 62% good enough to ship?
Three unknowns
You cannot answer
Would re-running yield 58% or 67%?
Re-run uncertainty — no error bars
Is it 5% catastrophic or 38% mediocre?
Failure distribution shape — no segments
System non-determinism or measurement noise?
Variance source — not separated
Quality is a distribution, not a number.

Without distributional thinking, teams make three systematic errors that all lead to shipping on noise

Error | What Teams Do | What Goes Wrong
Optimistic single-sample estimates | "62% pass rate — ship it" | Re-running yields 55%. The estimate did not generalize.
Capability vs. reliability confusion | "pass@5 (at least 1 of 5 succeeds) = 0.90 — works great" | reliable@5 (all 5 succeed) = 0.60. 40% of users hit at least one failure in 5 tries.
Variance source confusion | "System is too random — lower the temperature" | 85% of variance was evaluator noise, not system non-determinism.
Distributional thinking is the paradigm shift: from "did it work?" to "how often does it work, under what conditions, and with what variance?"

Every quality metric needs error bars — a point estimate from 5 trials could be 40% or 80%

Point Estimate
pass@5 = 0.67
No context. Could be signal or noise. Cannot defend this claim.
With Confidence Interval
pass@5 = 0.67
[95% CI: 0.40, 0.80]
Wide CI = insufficient evidence. True value could plausibly be 0.40 or 0.80.
Bootstrap: take your results, randomly re-draw samples (allowing repeats), recompute the metric each time. The spread of results gives you the confidence interval — no assumptions about the data shape needed.
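That procedure can be sketched in a few lines of Python. This is a standalone illustration only: the exercise later uses the course's `aieval.uncertainty.bootstrap_ci()`, and the function name and defaults here are mine.

```python
import random

def percentile_bootstrap_ci(outcomes, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample outcomes with replacement, recompute
    the metric each time, and take the central (1 - alpha) span as the CI."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(outcomes) for _ in outcomes]
        stats.append(metric(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical example: 5 trial outcomes (1 = pass), pass rate as the metric.
outcomes = [1, 0, 0, 1, 0]
low, high = percentile_bootstrap_ci(outcomes, lambda xs: sum(xs) / len(xs))
# With only 5 trials the interval is very wide, which is the slide's point.
```

With real data you would resample per-query results (e.g. pass@5 indicators) rather than raw trials, but the mechanics are identical.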

Aggregate pass@5 of 0.78 hides that multi-table joins succeed at half the rate of simple lookups

Simple Lookup
pass@5 = 0.92
Revenue totals, single-table queries
Multi-Table Join
pass@5 = 0.45
Cross-segment comparisons, complex SQL
Trend Analysis
pass@5 = 0.78
Time-series patterns, period-over-period
Aggregate pass@5 = 0.78 — looks acceptable. But your power users running complex analytical queries hit failures at more than twice the aggregate rate (55% vs. 22%).

System variance and evaluator variance need different fixes — decompose before you mitigate

System Variance
Re-run the AI Data Analyst on the same query, get different outputs each time.
Fix with: temperature reduction, retry logic, prompt engineering
Evaluator Variance
Re-run the evaluator on the same fixed output, get different scores each time.
Fix with: evaluator calibration, rubric refinement, consensus scoring
In this lesson, our correctness checks (comparing against known right answers) are deterministic — evaluator variance is near zero. In L3.5, you will encounter cases where evaluator variance dominates. If you have not decomposed, you waste effort fixing the wrong thing.
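One rough way to decompose the two sources, sketched below under the assumption that each system output is scored several times by the evaluator. Function and variable names are mine, not the course's tooling: spread between outputs approximates system variance, spread within one output's repeated scores approximates evaluator variance.

```python
from statistics import mean, pvariance

def decompose_variance(scores):
    """scores[i][j] = evaluator score j for system output i (same query).
    Between-output spread ~ system variance;
    within-output spread ~ evaluator variance."""
    system_var = pvariance([mean(row) for row in scores])
    evaluator_var = mean(pvariance(row) for row in scores)
    return system_var, evaluator_var

# Hypothetical: 3 outputs for one query, each scored 3 times.
# A deterministic correctness check gives identical scores within each row.
scores = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sys_var, eval_var = decompose_variance(scores)
# eval_var = 0.0: here all the variance is the system's.
```

If instead the rows were near-identical and the within-row scores varied, evaluator variance would dominate, and a sprint spent re-prompting the model would be wasted effort.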

Full rollout, gradual ramp, or hold — the decision depends on CIs, segments, and variance, not a single number

Full Rollout
Narrow CIs + all must-pass metrics clear thresholds. Evidence is sufficient.
Gradual Ramp
Wider CIs tolerated if direction is positive. Deploy to 10%, monitor, expand.
Hold / Rollback
Evidence insufficient or guardrails fail. CIs include the threshold — cannot distinguish pass from fail.
The question is not "what is the metric?" — it is "does the evidence support the proposed action?"
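One possible encoding of the three outcomes as a decision rule. This is an illustrative sketch of the slide's criteria, not a prescribed API; real ship gates would also check guardrails and segment-level metrics.

```python
def ship_decision(ci_low, ci_high, threshold):
    """Compare a must-pass metric's 95% CI against its ship threshold."""
    if ci_low >= threshold:
        return "full rollout"      # entire CI clears the bar: evidence sufficient
    if ci_high < threshold:
        return "hold / rollback"   # entire CI sits below the bar
    return "gradual ramp or hold"  # CI straddles the threshold:
                                   # cannot distinguish pass from fail

# Example: reliable@5 CI of [0.44, 0.60] against a 0.80 ship threshold.
decision = ship_decision(0.44, 0.60, 0.80)  # "hold / rollback"
```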

Five trials — predict pass@5, reliable@5, and dominant variance source

Query: "What was Q4 mobile checkout conversion?" — 5 runs, temp 0.7
Trial Outcome
1 Correct SQL, correct narrative
2 Correct SQL, minor hallucination
3 SQL syntax error
4 Correct SQL, correct narrative
5 Correct SQL, overstated confidence
  • Predict pass@5 (at least 1 trial succeeds)
  • Predict reliable@5 (all 5 succeed)
  • Dominant variance source: system or judge?
Write predictions before seeing computed results.

pass@5 = 0.67 but reliable@5 = 0.01 — a 66-point gap on a single query

pass@5
0.672
reliable@5
0.010
Gap
0.662
66 percentage points
Single-trial success rate: p = 2/5 = 0.40
pass@5 (chance at least 1 of 5 works) = 1 - 0.6^5 = 0.672
reliable@5 (chance all 5 work) = 0.4^5 = 0.010

Bootstrap CI for pass@5 is [0.40, 0.80] — five trials is not enough evidence for a ship decision

Bootstrap Distribution of pass@5
2.5th percentile: 0.40 | observed: 0.67 | 97.5th percentile: 0.80
Results
pass@5
0.67 [0.40, 0.80]
reliable@5
0.01 [0.00, 0.10]
CI width of 0.40 on pass@5. The true value could be anywhere from a coin flip to strong capability. Five trials is not enough evidence.

Across 50 queries with known correct answers, pass@5 = 0.78 but reliable@5 = 0.52 — system variance accounts for 90%

k | pass@k [95% CI] | reliable@k [95% CI]
1 | 0.62 [0.54, 0.70] | 0.62 [0.54, 0.70]
3 | 0.72 [0.64, 0.80] | 0.24 [0.18, 0.32]
5 | 0.78 [0.70, 0.85] | 0.52 [0.44, 0.60]
10 | 0.94 [0.88, 0.98] | 0.28 [0.20, 0.36]
Variance Decomposition
System variance: 90%. Remaining: 10% — our correctness checks give the same answer every time, so evaluator variance is near zero.
The 26-point gap between pass@5 and reliable@5 means nearly half of users experience at least one failure in 5 interactions. System variance drives 90% — focus mitigation there.
Note: the empirical aggregate reliable@k is non-monotonic (0.62 → 0.24 → 0.52 → 0.28) because each k is estimated from a different finite draw of trials, so sampling noise moves the aggregate around. For any single query, the true reliable@k = p^k always decreases as k increases.
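Computed from a matrix of per-query trial outcomes, the two metrics differ only in "any" versus "all." A minimal sketch over hypothetical data, not the lesson's 50-query dataset:

```python
def pass_at_k(trials, k):
    """Fraction of queries with at least one success among the first k trials."""
    return sum(any(row[:k]) for row in trials) / len(trials)

def reliable_at_k(trials, k):
    """Fraction of queries where all of the first k trials succeed."""
    return sum(all(row[:k]) for row in trials) / len(trials)

# Hypothetical trial matrix: 4 queries x 5 trials (1 = pass, 0 = fail).
trials = [
    [1, 1, 1, 1, 1],  # stable query: always passes
    [1, 0, 1, 1, 0],  # flaky query
    [0, 1, 0, 0, 1],  # mostly fails
    [0, 0, 0, 0, 0],  # always fails
]
# The gap widens with k: pass@5 = 0.75 but reliable@5 = 0.25 here.
```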

Interpret segment-level distributions, compute variance gaps, and write evidence-based ship criteria

Base Version (all students, 20 min)
1. Read segment summary from w1_l4_segment_summary.csv
2. Compute variance gap per segment: gap = pass@5 - reliable@5
3. Identify which segment has the largest gap
4. Write ship criteria: "Ship if reliable@5 > [threshold] with CI width < [threshold]"
5. Recommend ship/ramp/hold per segment
Deliverable: Three evidence-based ship/ramp/hold decisions
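Steps 2 and 3 of the base version amount to a short computation. The segment numbers below are the ones quoted elsewhere in this lesson; the dict layout is illustrative, not the schema of w1_l4_segment_summary.csv:

```python
# pass@5 / reliable@5 per segment, as reported earlier in the lesson.
segments = {
    "simple_lookup":    {"pass@5": 0.92, "reliable@5": 0.85},
    "multi_table_join": {"pass@5": 0.45, "reliable@5": 0.20},
    "trend_analysis":   {"pass@5": 0.78, "reliable@5": 0.55},
}
# Variance gap per segment: gap = pass@5 - reliable@5
gaps = {name: round(m["pass@5"] - m["reliable@5"], 2)
        for name, m in segments.items()}
largest = max(gaps, key=gaps.get)  # multi_table_join, gap 0.25
```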
Extend Version (DS/Eng, +15 min)
1. Load 50 queries x 5 trials from w1_l4_multitrial_traces.jsonl
2. Compute pass@k and reliable@k for k = 1, 3, 5
3. Bootstrap 95% CIs using aieval.uncertainty.bootstrap_ci()
4. Segment by query complexity and compare variance
5. Plot pass@5 vs reliable@5 by segment with CI error bars
Deliverable: Distributional quality report with self-computed CIs

Your distributional quality report quantifies the 26-point gap that blocks shipping

Portfolio Artifact — Distributional Quality Report
Segment | pass@5 | reliable@5
Simple | 0.92 | 0.85
Multi-join | 0.45 | 0.20
Trend | 0.78 | 0.55
What I built — A distributional quality report that quantifies the capability-reliability gap by segment, with evidence-based ship decisions. Simple: Ship. Multi-join: Hold. Trend: Ramp.

Five ways teams misuse distributional evidence — all lead to shipping on noise

Confusing pass@k with reliability
Report pass@5 = 0.90 as "works 90% of the time" — but reliable@5 = 0.60
Point estimates without CIs
Ship on "62% success" without knowing CI is [0.52, 0.72] — includes hold threshold
Attributing all variance to the system
Spend a sprint re-prompting when 85% of variance was evaluator noise
Using pass@k alone for ship decisions
Ship on pass@10 = 0.98 — but reliable@10 = 0.28, users hit failures constantly
Ignoring worst-segment risk in aggregates
Aggregate pass@5 = 0.85 looks fine — but multi-join segment has pass@5 = 0.45

Three judgment calls: segment gaps, CI-vs-threshold reasoning, and variance decomposition priority

Scenario 1
PM says "80% pass rate — ship it" but multi-join segment has reliable@5 = 0.22. What evidence do you present?
Scenario 2
pass@5 = 0.78 [95% CI: 0.68, 0.88]. Ship criterion is reliable@5 > 0.80. Can you ship?
Scenario 3
System variance = 15% of total. Judge variance = 85%. Your team wants to re-prompt the model. What do you prioritize?

Pass@5 vs reliable@5 by segment — the gap is user experience risk

Segment | pass@5 | reliable@5 | Gap
Simple | 0.92 | 0.85 | small
Multi-join | 0.45 | 0.20 | largest
Trend | 0.78 | 0.55 | moderate
pass@5 measures capability (can it do it?); reliable@5 measures reliability (does it always do it?).
Gap = user experience risk

Next: The cost-latency-quality frontier

You quantified the gap between capability and consistency. Next, you confront the tradeoff: reducing variance costs money and time. Retries improve reliability but increase latency. Better models improve quality but increase cost. The frontier defines what is achievable — and forces you to choose.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 4 | aianalystlab.ai