Why AI eval is different

Week 1 · Lesson 1 | AI Evals for Product Dev
Shane Butler | AI Analyst Lab

"Same question, different answers" — and the code is working correctly

Your team's investigation
  • Code review passed
  • API returning 200s
  • SQL is valid
  • Data pipeline healthy
User reports (Week 1)
"Same question gives different answers"

15% of users reported this

A single test run tells you the system CAN do this — it doesn't tell you the system WILL

Query: "What was Q4 revenue?" run 5 times at temperature 0.7 (randomness setting)
Results: different outputs
  • 4 correct runs: different SQL approaches, all correct answers
  • 1 wrong run: SQL used Q3 instead of Q4, a factual error
Different outputs. Same query. That's distributional behavior.
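The five-run demo is just a multi-trial loop. A minimal sketch, where `fake_model` is a hypothetical stand-in for your system's actual model call (the real client and prompt are not shown here):

```python
import random
from collections import Counter

def distinct_outputs(query, ask_model, n=5):
    """Run the same query n times and count how many distinct outputs appear."""
    return Counter(ask_model(query) for _ in range(n))

rng = random.Random(0)

def fake_model(query):
    # Simulates temperature-0.7 sampling with a seeded RNG (illustrative only);
    # swap in your own model client here.
    return rng.choice(["sql_variant_a", "sql_variant_b", "sql_variant_c"])

counts = distinct_outputs("What was Q4 revenue?", fake_model)
print(len(counts), "distinct outputs across 5 runs")
```

A single call answers "can it?"; the loop answers "how often does it?"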

Teams ship on pass@k alone — reliable@k reveals what users actually experience

Metric | What it measures | Example
pass@k | At least 1 of k trials succeeds | pass@5 = 1.0 (system can do this)
reliable@k | All k trials succeed | reliable@5 = 0.4 (system doesn't do it consistently)
Gap | Capability vs. consistency | 0.6 gap = unreliable even when capable
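Both metrics are a few lines of code once you have per-trial pass/fail results. A minimal sketch at the dataset level (per query, "all k succeed" is binary; the fractional scores come from averaging across queries; the trial data below is illustrative, not the course dataset):

```python
def pass_at_k(trials_per_query):
    """Fraction of queries where at least one of the k trials succeeded."""
    return sum(any(t) for t in trials_per_query) / len(trials_per_query)

def reliable_at_k(trials_per_query):
    """Fraction of queries where all k trials succeeded."""
    return sum(all(t) for t in trials_per_query) / len(trials_per_query)

# Illustrative dataset: 5 queries x 5 trials (True = correct output).
trials = [
    [True, True, True, True, True],    # consistent
    [True, True, True, False, True],   # flaky (the Q4 demo pattern)
    [True, False, True, True, True],   # flaky
    [True, True, True, True, True],    # consistent
    [False, True, True, True, False],  # flaky
]
print(pass_at_k(trials))      # 1.0 -- every query succeeds at least once
print(reliable_at_k(trials))  # 0.4 -- only 2 of 5 queries succeed every time
```

The gap (here 0.6) is the consistency risk users feel but a single test run never shows.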

Temperature = 0 does not eliminate variance — 3-8% inconsistency is typical

Production models at temperature = 0 (randomness off):
  • 92-97% of re-runs produce consistent outputs
  • 3-8% produce different outputs on re-run
Due to floating-point non-determinism across GPU types, batching, and parallelism
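You can measure this band empirically instead of assuming determinism. A minimal sketch: re-run the same query many times and report the fraction matching the most common output (the 19-vs-1 split below is simulated for illustration):

```python
from collections import Counter

def consistency_rate(outputs):
    """Fraction of runs that match the most common output."""
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

# 20 simulated re-runs at temperature 0: 19 identical, 1 differs.
runs = ["SELECT SUM(revenue) FROM sales WHERE quarter = 'Q4'"] * 19 + \
       ["SELECT SUM(s.revenue) FROM sales s WHERE s.quarter = 'Q4'"]
print(consistency_rate(runs))  # 0.95 -- inside the observed 92-97% band
```

Run this against your real system before trusting temperature = 0 to mean "deterministic."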

System variance (the AI varies) vs. judge variance (your eval method varies)

System variance
Non-determinism from the AI system itself
Example: Re-run same query → different SQL output
Judge variance
Non-determinism from the evaluation method
Example: LLM-as-judge scores same output twice → different scores
This lesson focuses on system variance. Judge variance is introduced in Week 3.

When a PM says "evaluation," they mean rubrics — when a DS says it, they mean benchmarks

Role | What "evaluation" means | Example
PM | Rubrics (quality criteria) | "Does it feel high-quality to users?"
Data Scientist | Benchmarks (scored metrics) | "How does it compare to baseline?"
Engineer | Test suites (CI/CD checks) | "Did the tests pass?"
This course teaches: ship/ramp/hold/rollback decisions (a synthesis of all three)

Benchmark performance doesn't predict production quality

System | Benchmark Result | Production Outcome
Google Bard | Passed internal quality checks | Factually wrong answer in public demo; $100B market-cap loss (Feb 2023)
IBM Watson for Oncology | Promising results in controlled settings | Unsafe treatment recommendations on diverse real patient populations
High benchmark performance is necessary but not sufficient for shipping

Before running the demo: How many of 5 outputs will be identical?

1. How many of the 5 outputs will be identical?
2. Will any outputs be factually incorrect (wrong numbers or wrong time period)?
3. Will the generated SQL be the same across all 5 runs?
Write down your predictions before advancing.

All five SQL queries are different — four correct, one wrong

Run | SQL excerpt | Correctness
1 | SELECT SUM(revenue) FROM sales WHERE quarter = 'Q4' | ✓ Correct
2 | SELECT total_revenue FROM quarterly_summary WHERE q = 4 | ✓ Correct
3 | SELECT SUM(amount) FROM transactions WHERE date >= '2025-10-01' | ✓ Correct
4 | SELECT SUM(revenue) FROM sales WHERE quarter = 'Q3' | ✗ Wrong quarter
5 | SELECT revenue_total FROM revenue_summary WHERE period = 'Q4-2025' | ✓ Correct
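Since the SQL strings differ on every run, grading by string comparison would fail even the correct runs. One workable approach is to execute each generated query and compare results against ground truth. A minimal sketch using a toy in-memory schema (the table and values are assumptions for illustration, not the course dataset):

```python
import sqlite3

# Toy warehouse: a few Q3/Q4 sales rows (illustrative schema only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (revenue REAL, quarter TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(100.0, "Q4"), (250.0, "Q4"), (50.0, "Q3")])

def grade(sql, expected):
    """Mark a generated query correct if its result matches ground truth."""
    return con.execute(sql).fetchone()[0] == expected

expected_q4 = 350.0  # ground-truth Q4 revenue in the toy data
run_1 = "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q4'"
run_4 = "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q3'"
print(grade(run_1, expected_q4))  # True
print(grade(run_4, expected_q4))  # False -- wrong quarter
```

Result-level grading is what lets runs 1, 2, 3, and 5 all count as correct despite different SQL.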

pass@5 = 1.0 but reliable@5 = 0.4 — a 0.6 gap between capability and consistency

Q4 revenue query:
  • pass@5 (capability) = 1.0: at least 1 of 5 runs succeeds
  • reliable@5 (reliability) = 0.4: measures whether all 5 runs succeed
  • Gap = 0.6 = consistency risk

The gap maps to ship decisions: gap ≤ 0.2 = ship, gap ≥ 0.6 = hold

Metrics | Gap | Decision | Reasoning
pass@5 = 1.0, reliable@5 = 0.8 | 0.2 | Ship | Capable and mostly consistent
pass@5 = 1.0, reliable@5 = 0.4 | 0.6 | Hold | Capable but unreliable; needs mitigation
pass@5 = 0.4, reliable@5 = 0.0 | 0.4 | Hold | Not capable; a model-quality problem
Different gaps = different problems = different solutions
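The decision logic above can be sketched as a small function. The 0.9 capability cutoff is an assumption added for illustration (the table only contrasts pass@5 = 1.0 with 0.4); treat all thresholds as course heuristics, not universal constants:

```python
def ship_decision(pass_k, reliable_k):
    """Map a (capability, consistency) pair to a ship/hold call.

    Thresholds follow the decision table; the 0.9 capability
    cutoff is an illustrative assumption, not a fixed rule.
    """
    if pass_k < 0.9:
        return "hold: not capable -- model quality problem"
    gap = pass_k - reliable_k
    if gap <= 0.2:
        return "ship: capable and mostly consistent"
    return "hold: capable but unreliable -- needs mitigation"

print(ship_decision(1.0, 0.8))  # ship
print(ship_decision(1.0, 0.4))  # hold (consistency problem)
print(ship_decision(0.4, 0.0))  # hold (capability problem)
```

Checking the gap before pass@k would misdiagnose the third case: a 0.4 gap with low pass@5 is a capability problem, not a consistency one.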

Build a Non-Determinism Report: pass@5 vs reliable@5 for 2 queries + interpretation

Base version (all students, 20 min)
  • Compute pass@5 and reliable@5 for 3 queries
  • Identify query with largest gap
  • Select 2 queries from full dataset
  • Create bar chart (pass@5 vs reliable@5)
  • Write 2-sentence interpretation
Extend version (DS/Eng, +10 min)
  • Compute pairwise similarity across 5 outputs
  • Report mean/SD of similarity scores
  • Identify query with highest variance
  • Write variance hypothesis
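For the extend version, pairwise string similarity is one simple way to quantify output variance. A minimal sketch using `difflib` (whether string similarity is the right proxy for your outputs is itself an assumption worth stating in your variance hypothesis):

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, stdev

def pairwise_similarity(outputs):
    """Mean and SD of string similarity across all pairs of outputs."""
    scores = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(outputs, 2)]
    return mean(scores), stdev(scores)

# The five SQL outputs from the demo.
outputs = [
    "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q4'",
    "SELECT total_revenue FROM quarterly_summary WHERE q = 4",
    "SELECT SUM(amount) FROM transactions WHERE date >= '2025-10-01'",
    "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q3'",
    "SELECT revenue_total FROM revenue_summary WHERE period = 'Q4-2025'",
]
m, sd = pairwise_similarity(outputs)
print(round(m, 2), round(sd, 2))  # low mean similarity = high variance
```

Rank queries by mean similarity: the query with the lowest mean (or highest SD) is your highest-variance candidate.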

Your Non-Determinism Report quantifies the consistency risk your PM needs to see

Portfolio Artifact: Non-Determinism Report for AI Data Analyst v0
What I built: a Non-Determinism Report demonstrating pass@5 = 1.0 but reliable@5 = 0.4 for the Q4 revenue query, quantifying a 0.6 gap between capability and consistency.

Ship based on pass@k without checking reliable@k = users see flakiness you didn't measure

Failure Mode | What Happens | What To Do Instead
Ship on pass@k alone | pass@5 = 0.95, reliable@5 = 0.3: a 0.65 gap that users experience as flakiness | Always compute both; if gap > 0.2, add mitigation
Assume temp=0 = deterministic | Single test passes, but production shows 3-8% variance from hardware differences | Measure variance empirically with multi-trial testing
Use benchmarks as ship decision | 92% on benchmark, fails on real user queries | Benchmarks screen; product eval decides

A system has pass@5 = 0.92, reliable@5 = 0.41 — is it ready to ship?

1. A system has pass@5 = 0.92 and reliable@5 = 0.41. A PM asks, "Is this ready to ship to all users?" What do you tell them and why?
2. You run the same query through an AI system 3 times at temperature 0.7 and get three different SQL queries. All three SQL queries return the same result set. Is this a failure? Why or why not?
3. A benchmark shows your model achieves 89% accuracy on a standard evaluation set. Your PM says, "Great, let's ship." What question should you ask before agreeing?

Left: 5 runs, different SQL, one failure — Right: bar chart showing capability-reliability gap

5 runs of same query:
Run | Result
1 | Correct SQL
2 | Correct SQL
3 | Correct SQL
4 | Wrong quarter
5 | Correct SQL
Non-determinism produces different outputs — some correct, some not
Capability vs. Reliability: pass@5 = 1.0, reliable@5 = 0.4 → consistency risk
The gap quantifies capability vs. reliability

Next: Product evaluation framework for AI systems

You'll map the evaluation surface of any AI feature and identify where it can fail
AI Analyst Lab | AI Evals for Product Dev | Week 1 · Lesson 1 | aianalystlab.ai