Confusing pass@k with reliability
Report pass@5 = 0.90 as "works 90% of the time" — but reliable@5 = 0.60
Point estimates without CIs
Ship on "62% success" without knowing the CI is [0.52, 0.72], which includes the hold threshold
Attributing all variance to the system
Spend a sprint re-prompting when 85% of variance was evaluator noise
Using pass@k alone for ship decisions
Ship on pass@10 = 0.98 when reliable@10 = 0.28, so users hit failures constantly
Ignoring worst-segment risk in aggregates
Aggregate pass@5 = 0.85 looks fine, but the multi-join segment has pass@5 = 0.45
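
The gap between pass@k and reliable@k, and the width of a CI around a single success rate, can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the list above: pass@k is taken as the fraction of tasks where at least one of k attempts succeeds, reliable@k as the fraction where all k attempts succeed, and the interval is a Wilson score interval at z = 1.96. The helper names, the toy data, and the sample size of 100 are hypothetical.

```python
from math import sqrt


def pass_at_k(results):
    """Fraction of tasks where at least one attempt succeeded (assumed definition)."""
    return sum(any(attempts) for attempts in results) / len(results)


def reliable_at_k(results):
    """Fraction of tasks where every attempt succeeded (assumed definition)."""
    return sum(all(attempts) for attempts in results) / len(results)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical data: 10 tasks, 5 attempts each.
results = (
    [[True] * 5] * 6                              # 6 tasks pass every attempt
    + [[True, False, False, True, False]] * 3     # 3 tasks pass only sometimes
    + [[False] * 5]                               # 1 task never passes
)

print(pass_at_k(results))      # 0.9
print(reliable_at_k(results))  # 0.6

# Hypothetical sample: 62 successes out of 100 trials.
low, high = wilson_interval(62, 100)
print(round(low, 3), round(high, 3))
```

With this toy data the same run yields pass@5 = 0.9 and reliable@5 = 0.6, because three of the ten tasks succeed only intermittently. The interval for 62 out of 100 spans roughly ten points on either side of the point estimate, which is why a ship decision should compare the whole interval, not just 0.62, against the threshold.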