How little is enough

Lesson 2.2
Week 2: Instrumentation and Observability
Shane Butler · AI Analyst Lab

v0 vs v1: what changes enable correctness measurement?

v0 logging
  • user_query
  • sql_generated
  • chart_type
  • chart_url
v1 logging
  • user_query
  • sql_generated
  • sql_executed
  • execution_success
  • result_row_count
  • oracle_sql
  • correctness_match

Your PM wants to ship — but 50 samples give you ±10% CI when you need ±5%

Current evidence: 42/50 correct (84% success rate)
95% Confidence Interval (CI): [73.8%, 94.2%] ± 10.2%
Ship threshold: PM wants 90% confidence that true accuracy > 80%

Gap: CI too wide to support decision.

Minimum viable evaluation data uses three inputs to calculate sample size

Decision type
Ship vs hold vs experiment requires different evidence strength
Confidence level
Driven by risk tolerance and cost of error
System uncertainty
How noisy is your AI's quality? And how noisy is the evaluator grading it? More noise = more samples (n) needed

The formula tells you exactly how many samples you need for your decision

In plain English: tighter precision or higher confidence = more samples. If your success rate is near 50%, you need the most samples of all.
n = z² × p̂(1 - p̂) / ε²
z = lookup number for your confidence level (1.96 for 95%, 1.645 for 90%)
p̂ = your current success rate (e.g., 0.84 from 42/50)
ε = your desired precision (e.g., ±5% means ε = 0.05)
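The formula above can be sketched as a small Python helper (the function name is illustrative):

```python
import math

def required_n(p_hat: float, epsilon: float, z: float = 1.96) -> int:
    """Samples needed for a CI of half-width epsilon around p_hat."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / epsilon**2)

# 42/50 correct, want ±5% at 95% confidence
print(required_n(0.84, 0.05))         # 207
# Same precision at 90% confidence (z = 1.645)
print(required_n(0.84, 0.05, 1.645))  # 146
```

Note the ε² in the denominator: halving the desired CI width quadruples the required n.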

Evaluator reliability: less-than-perfect evaluators inflate required sample sizes

Kappa measures how consistent your evaluator is, from 0 (random) to 1 (perfect). Most LLM evaluators score 0.6-0.8.

Human expert with answer key (kappa = 1.0)
n = 180 samples → 180 effective samples
LLM-based evaluator (kappa = 0.7)
n = 180 samples → ~126 effective samples
Requires ~257 samples for equivalent confidence
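One simple way to fold this in (a sketch; dividing by kappa is a common heuristic, not the only possible adjustment):

```python
def kappa_adjusted_n(n_required: int, kappa: float) -> int:
    """Inflate sample size so an imperfect evaluator yields roughly
    the same effective evidence as ground-truth labels."""
    return round(n_required / kappa)

print(kappa_adjusted_n(180, 0.7))  # ~257
print(kappa_adjusted_n(180, 1.0))  # 180
```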

Ship needs ±5%, hold tolerates ±10%, experiments use power analysis

Decision                     Precision requirement                      Typical n
Ship (100% rollout)          ±3-5% CI                                   200-400
Hold / Ramp (limited beta)   ±7-10% CI                                  50-150
Experiment (v1 vs v2)        80% chance to detect a real improvement    Depends on effect size
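For the experiment row, a standard two-proportion power calculation gives the per-arm sample size; this sketch uses the stdlib `statistics.NormalDist` for the z-quantiles (the function name is illustrative):

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples per arm to detect a change from p1 to p2
    (two-sided test at significance alpha with the given power)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

print(n_per_arm(0.84, 0.89))  # ~733 per arm for a 5-point gain
```

Small effects are expensive: detecting a 5-point gain takes far more samples per arm than estimating a single rate to ±5%.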

You have 84% success rate with n=50 — is that enough to ship with 90% confidence?

  1. Do you think 50 samples is enough for this decision? Why or why not?
  2. If not enough, estimate how many samples you think you'd need. Write down your guess.
  3. What would change your answer — higher confidence requirement? Wider acceptable margin?

Prediction: ______ samples needed

84% ± 10.2% means you're 95% confident quality is between 74% and 94%

Baseline: 42/50 correct = 84% success rate
95% CI: [73.8%, 94.2%] ± 10.2%

This range is too wide to make a ship decision — CI includes values both above and below 80% threshold.
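The interval above is the normal-approximation (Wald) CI; a minimal sketch:

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a success rate."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = wald_ci(42, 50)
print(f"{lo:.1%} to {hi:.1%}")  # 73.8% to 94.2%
```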

To get ±5% precision, you need 207 samples — not 50

Current state
n = 50 → ±10.2% CI
Ship requirement
n = 207 → ±5% CI

Gap: 157 samples needed

90% confidence drops required n from 207 to 146 samples

95% confidence
207 samples
90% confidence
146 samples

Tighter precision costs quadratically more samples (n scales with 1/ε²)

±15%
n=23
±10%
n=52
±5%
n=207
±3%
n=574

All values use the observed p̂ = 0.84 at 95% confidence; with worst-case p = 0.5, each n nearly doubles (e.g., 385 at ±5%).
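These thresholds can be regenerated from the formula; this sketch uses the observed p̂ = 0.84 at 95% confidence (worst-case p = 0.5 would give larger n at every level):

```python
import math

p_hat, z = 0.84, 1.96  # observed success rate, 95% confidence
for eps in (0.15, 0.10, 0.05, 0.03):
    n = math.ceil(z**2 * p_hat * (1 - p_hat) / eps**2)
    print(f"±{eps:.0%} -> n = {n}")
```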

Fill a 6-scenario decision matrix with sample sizes and evidence sufficiency arguments

Decision Type     90% Confidence        95% Confidence
Ship (±5% CI)     n = ___, rationale    n = ___, rationale
Hold (±10% CI)    n = ___, rationale    n = ___, rationale
Experiment        n = ___, rationale    n = ___, rationale

Base version: read n from reference table. Extend version: calculate n from scratch + judge adjustment.

CI width plot: current n=50, ship threshold n=207, diminishing returns

CI width narrows as n grows, but with diminishing returns
±22.7%
n=10
±10.2%
n=50 (you are here)
±5.0%
n=207 (ship target)
±3.6%
n=400 (diminishing)

Magic number thinking, ignoring judge imperfection, premature precision

  • Magic number thinking → "We always use 100 samples" systematically under-samples low-quality systems
  • Ignoring evaluator imperfection → An evaluator with kappa = 0.6 yields only ~60% of the evidence per sample; skipping the upward adjustment to n leaves you with ~40% less evidence than you think
  • Premature precision → Collecting 500 samples for ±3% CI when PM would accept ±8% wastes budget

Calculate CI for 70% with n=50, power analysis for 5-point gain, judge kappa adjustment

Question 1
Your v1 has 35/50 queries correct (70% success rate). PM asks: can we ship if we're 95% confident quality is above 65%? Calculate the 95% CI. Is the lower bound above 65%?
Question 2
You're planning v1 vs v2 experiment. You believe v2 will improve SQL accuracy from 84% to 89% (5-point gain). You want an 80% chance of detecting this improvement if it's real (that's called '80% power'). Approx how many samples per version? If each sample costs $0.50, what's total experiment cost?
Question 3
Your LLM evaluator for tone has a reliability score (kappa) of 0.68 compared to human reviewers. You calculated n=180 assuming perfect ground truth. How many samples do you actually need accounting for evaluator reliability?

The sample size decision tree from decision type to required n

What decision?
Ship (±5%) | Hold (±10%) | Experiment
Confidence level?
90% / 95% / 99%
Calculate & adjust n
Apply formula, adjust for evaluator reliability (kappa)
Output
Required n, current n, gap
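The whole tree can be collapsed into one small function (a sketch; the names, default thresholds, and dict shape are illustrative, following the deck's defaults):

```python
import math

PRECISION = {"ship": 0.05, "hold": 0.10}       # decision type -> epsilon
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}     # confidence level -> z

def sample_size_gap(decision: str, confidence: float, p_hat: float,
                    current_n: int, kappa: float = 1.0) -> dict:
    """Walk the tree: decision -> epsilon, confidence -> z, then
    compute n, adjust for evaluator reliability, and report the gap."""
    eps = PRECISION[decision]
    n = math.ceil(Z[confidence]**2 * p_hat * (1 - p_hat) / eps**2)
    n = math.ceil(n / kappa)  # inflate for an imperfect evaluator
    return {"required_n": n, "current_n": current_n,
            "gap": max(0, n - current_n)}

print(sample_size_gap("ship", 0.95, 0.84, current_n=50))
# {'required_n': 207, 'current_n': 50, 'gap': 157}
```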

Trace design and reproducibility, with a tooling lab

What you built today
Sample size calculator for ship/hold/experiment decisions
Next lesson: Trace design and reproducibility
How to structure traces so you can debug failures and reproduce evaluations
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 2 | aianalystlab.ai