How little is enough

Lesson 2.2
Week 2: Instrumentation and Observability
Shane Butler · AI Analyst Lab

v0 vs v1: what changes enable correctness measurement?

v0 logging
  • user_query
  • sql_generated
  • chart_type
  • chart_url
v1 logging
  • user_query
  • sql_generated
  • sql_executed
  • execution_success
  • result_row_count
  • oracle_sql
  • correctness_match

Your PM wants to ship — but 50 samples give you ±10% CI when you need ±5%

Current evidence: 42/50 correct (84% success rate)
95% Confidence Interval (CI): [73.8%, 94.2%] ± 10.2%
Ship threshold: PM wants 90% confidence that true accuracy > 80%

Gap: CI too wide to support decision.

Minimum viable evaluation data uses three inputs to calculate sample size

Decision type
Ship vs hold vs experiment requires different evidence strength
Confidence level
Driven by risk tolerance and cost of error
System uncertainty
How noisy is your AI's quality? And how noisy is the evaluator grading it? More noise = more samples (n) needed

The formula tells you exactly how many samples you need for your decision

In plain English: tighter precision or higher confidence = more samples. If your success rate is near 50%, you need the most samples of all.
n = z² × p̂(1 - p̂) / ε²
z = lookup number for your confidence level (1.96 for 95%, 1.645 for 90%)
p̂ = your current success rate (e.g., 0.84 from 42/50)
ε = your desired precision (e.g., ±5% means ε = 0.05)
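The formula above can be sketched as a small Python helper (the function name is illustrative):

```python
import math

def required_n(p_hat: float, epsilon: float, z: float = 1.96) -> int:
    """Samples needed for a CI of half-width epsilon around p_hat."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / epsilon**2)

# 42/50 correct, want ±5% at 95% confidence
print(required_n(0.84, 0.05))         # 207
# Same precision at 90% confidence (z = 1.645)
print(required_n(0.84, 0.05, 1.645))  # 146
```

Note the ε² in the denominator: halving the desired CI width quadruples the required n.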

Evaluator reliability: less-than-perfect evaluators inflate required sample sizes

Kappa measures how consistent your evaluator is, from 0 (random) to 1 (perfect). Most LLM evaluators score 0.6-0.8.

Human expert with answer key (kappa = 1.0)
n = 180 samples → 180 effective samples
LLM-based evaluator (kappa = 0.7)
n = 180 samples → ~126 effective samples
Requires ~257 samples for equivalent confidence
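One simple way to fold this in (a sketch; dividing by kappa is a common heuristic, not the only possible adjustment):

```python
def kappa_adjusted_n(n_required: int, kappa: float) -> int:
    """Inflate sample size so an imperfect evaluator yields roughly
    the same effective evidence as ground-truth labels."""
    return round(n_required / kappa)

print(kappa_adjusted_n(180, 0.7))  # ~257
print(kappa_adjusted_n(180, 1.0))  # 180
```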

Ship needs ±5%, hold tolerates ±10%, experiments use power analysis

Decision                     Precision requirement                      Typical n
Ship (100% rollout)          ±3-5% CI                                   200-400
Hold / Ramp (limited beta)   ±7-10% CI                                  50-150
Experiment (v1 vs v2)        80% chance to detect a real improvement    Depends on effect size
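For the experiment row, a standard two-proportion power calculation gives the per-arm sample size; this sketch uses the stdlib `statistics.NormalDist` for the z-quantiles (the function name is illustrative):

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples per arm to detect a change from p1 to p2
    (two-sided test at significance alpha with the given power)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

print(n_per_arm(0.84, 0.89))  # ~733 per arm for a 5-point gain
```

Small effects are expensive: detecting a 5-point gain takes far more samples per arm than estimating a single rate to ±5%.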

You have 84% success rate with n=50 — is that enough to ship with 90% confidence?

  1. Do you think 50 samples is enough for this decision? Why or why not?
  2. If not enough, estimate how many samples you think you'd need. Write down your guess.
  3. What would change your answer — higher confidence requirement? Wider acceptable margin?

Prediction: ______ samples needed

84% ± 10.2% means you're 95% confident quality is between 74% and 94%

Baseline: 42/50 correct = 84% success rate
95% CI: [73.8%, 94.2%] ± 10.2%

This range is too wide to make a ship decision — CI includes values both above and below 80% threshold.
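The interval above is the normal-approximation (Wald) CI; a minimal sketch:

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a success rate."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = wald_ci(42, 50)
print(f"{lo:.1%} to {hi:.1%}")  # 73.8% to 94.2%
```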

To get ±5% precision, you need 207 samples — not 50

Current state
n = 50 → ±10.2% CI
Ship requirement
n = 207 → ±5% CI

Gap: 157 samples needed

90% confidence drops required n from 207 to 146 samples

95% confidence
207 samples
90% confidence
146 samples

Tighter precision costs quadratically more samples (n scales with 1/ε²)

±15%
n=23
±10%
n=52
±5%
n=207
±3%
n=574

All values use the observed p̂ = 0.84 at 95% confidence; with worst-case p = 0.5, each n nearly doubles (e.g., 385 at ±5%).
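These thresholds can be regenerated from the formula; this sketch uses the observed p̂ = 0.84 at 95% confidence (worst-case p = 0.5 would give larger n at every level):

```python
import math

p_hat, z = 0.84, 1.96  # observed success rate, 95% confidence
for eps in (0.15, 0.10, 0.05, 0.03):
    n = math.ceil(z**2 * p_hat * (1 - p_hat) / eps**2)
    print(f"±{eps:.0%} -> n = {n}")
```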

Fill a 6-scenario decision matrix with sample sizes and evidence sufficiency arguments

Decision Type     90% Confidence        95% Confidence
Ship (±5% CI)     n = ___, rationale    n = ___, rationale
Hold (±10% CI)    n = ___, rationale    n = ___, rationale
Experiment        n = ___, rationale    n = ___, rationale

Base version: read n from reference table. Extend version: calculate n from scratch + judge adjustment.

CI width plot: current n=50, ship threshold n=207, diminishing returns

CI width narrows as n grows, but with diminishing returns
±22.7%
n=10
±10.2%
n=50 (you are here)
±5.0%
n=207 (ship target)
±3.6%
n=400 (diminishing)

Magic number thinking, ignoring judge imperfection, premature precision

  • Magic number thinking → "We always use 100 samples" systematically under-samples low-quality systems
  • Ignoring evaluator imperfection → An evaluator with kappa = 0.6 yields only ~60% of the evidence per sample; skipping the upward adjustment to n leaves you with ~40% less evidence than you think
  • Premature precision → Collecting 500 samples for ±3% CI when PM would accept ±8% wastes budget

Calculate CI for 70% with n=50, power analysis for 5-point gain, judge kappa adjustment

Question 1
Your v1 has 35/50 queries correct (70% success rate). PM asks: can we ship if we're 95% confident quality is above 65%? Calculate the 95% CI. Is the lower bound above 65%?
Question 2
You're planning v1 vs v2 experiment. You believe v2 will improve SQL accuracy from 84% to 89% (5-point gain). You want an 80% chance of detecting this improvement if it's real (that's called '80% power'). Approx how many samples per version? If each sample costs $0.50, what's total experiment cost?
Question 3
Your LLM evaluator for tone has a reliability score (kappa) of 0.68 compared to human reviewers. You calculated n=180 assuming perfect ground truth. How many samples do you actually need accounting for evaluator reliability?

The sample size decision tree from decision type to required n

What decision?
Ship (±5%) | Hold (±10%) | Experiment
Confidence level?
90% / 95% / 99%
Calculate & adjust n
Apply formula, adjust for evaluator reliability (kappa)
Output
Required n, current n, gap
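The whole tree can be collapsed into one small function (a sketch; the names, default thresholds, and dict shape are illustrative, following the deck's defaults):

```python
import math

PRECISION = {"ship": 0.05, "hold": 0.10}       # decision type -> epsilon
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}     # confidence level -> z

def sample_size_gap(decision: str, confidence: float, p_hat: float,
                    current_n: int, kappa: float = 1.0) -> dict:
    """Walk the tree: decision -> epsilon, confidence -> z, then
    compute n, adjust for evaluator reliability, and report the gap."""
    eps = PRECISION[decision]
    n = math.ceil(Z[confidence]**2 * p_hat * (1 - p_hat) / eps**2)
    n = math.ceil(n / kappa)  # inflate for an imperfect evaluator
    return {"required_n": n, "current_n": current_n,
            "gap": max(0, n - current_n)}

print(sample_size_gap("ship", 0.95, 0.84, current_n=50))
# {'required_n': 207, 'current_n': 50, 'gap': 157}
```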

Trace design and reproducibility, with a tooling lab

What you built today
Sample size calculator for ship/hold/experiment decisions
Next lesson: Trace design and reproducibility
How to structure traces so you can debug failures and reproduce evaluations
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 2 | aianalystlab.ai