Semantic metrics / LLM judges

Week 3 Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What failure mode can occur when retrieval succeeds but the generated narrative still fails?

Retrieval Success
Right documents retrieved
Precision@k = 0.90
Narrative Failure
Hallucinated revenue number
Omitted key finding

SQL is correct, chart is accurate — but the narrative tells the PM that Q4 revenue grew 12% when the data shows 8%

SQL Execution
Returns 8% growth
Chart Rendering
Displays 8% bar
Narrative Generation
"Q4 revenue grew 12%"
Hallucination
The failure happened in narrative generation — SQL and chart metrics can't catch this

An LLM judge is a model that evaluates another model's output using a structured rubric

User Query
"How did Q4 revenue perform?"
AI Data Analyst
Generates narrative: "Revenue grew 12%"
LLM Judge
Evaluates: Does narrative match data? → Verdict: Fail

Binary Pass/Fail produces more reliable ratings than 1-5 scales

1-5 Rating Scale
Run 1: Score = 3
Run 2: Score = 5

Same narrative, different scores
Binary Pass/Fail
Run 1: Fail
Run 2: Fail

Consistent across runs

Has anyone here debugged a system output and had no idea why it failed? That's what happens without structured reasoning.

Judge Output Format
Evaluation: [step-by-step reasoning]
"All numbers match. Gap correct. Monthly claim supported by data."
Verdict: Pass
Reasoning before verdict
When the judge disagrees with a human, you can diagnose why
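This reasoning-before-verdict format can be enforced in the judge prompt and parsed mechanically. A minimal sketch — the prompt wording and the `parse_verdict` helper are illustrative, not part of the lesson's codebase:

```python
# Sketch: a judge prompt that forces reasoning before the verdict,
# plus a parser for the structured response. Names are illustrative.
JUDGE_PROMPT = """You are evaluating whether a narrative faithfully
reflects the underlying data.

Data: {data}
Narrative: {narrative}

Respond in exactly this format:
Evaluation: [step-by-step reasoning comparing narrative claims to data]
Verdict: Pass or Fail
"""

def parse_verdict(judge_response: str) -> tuple[str, str]:
    """Split a judge response into (reasoning, verdict)."""
    reasoning, verdict = "", ""
    for line in judge_response.splitlines():
        if line.startswith("Evaluation:"):
            reasoning = line.removeprefix("Evaluation:").strip()
        elif line.startswith("Verdict:"):
            verdict = line.removeprefix("Verdict:").strip()
    return reasoning, verdict

reasoning, verdict = parse_verdict(
    "Evaluation: Narrative claims 12% but data shows 8%.\nVerdict: Fail"
)
```

Because the reasoning is captured alongside the verdict, every disagreement with a human label comes with the judge's stated rationale attached.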

Data partitioning: train 10%, dev 45%, test 45% — the test set is sacred

Train (~10%)
Initial rubric testing
Few-shot examples
Dev (~45%)
Prompt iteration
Track Kappa across versions
Test (~45%)
Sacred: never iterate on it

Simple percent agreement is misleading — two raters can agree much of the time by chance alone, especially when one label dominates, so high agreement by itself tells you nothing

Kappa Range Interpretation
< 0.00 Poor
0.00 - 0.20 Slight
0.21 - 0.40 Fair
0.41 - 0.60 Moderate
0.61 - 0.80 Substantial
0.81 - 1.00 Almost perfect
Cohen's Kappa: Agreement Beyond Chance
Target: Kappa ≥ 0.7 (Substantial agreement) — below 0.6 is too unreliable for ship decisions
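The Extend track asks you to implement Kappa from scratch; a minimal sketch for binary Pass/Fail labels (the worked example uses the 4/1/3/2 counts from the agreement grid later in this lesson):

```python
def cohens_kappa(human, judge):
    """Cohen's Kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is
    observed agreement and p_e is the agreement expected by chance
    from each rater's marginal label rates."""
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    p_e = sum(
        (human.count(lbl) / n) * (judge.count(lbl) / n) for lbl in labels
    )
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always give one label
    return (p_o - p_e) / (1 - p_e)

# Grid counts 4 (P/P), 1 (P/F), 3 (F/P), 2 (F/F): p_o = 0.6, p_e = 0.5
human = ["Pass"] * 4 + ["Pass"] + ["Fail"] * 3 + ["Fail"] * 2
judge = ["Pass"] * 4 + ["Fail"] + ["Pass"] * 3 + ["Fail"] * 2
cohens_kappa(human, judge)  # ≈ 0.2 — "Fair" on the Landis-Koch scale
```

Note how 60% raw agreement collapses to Kappa ≈ 0.2 once chance agreement is subtracted — exactly why percent agreement alone overstates judge quality.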

What percentage agreement do you expect between your first-draft judge and human labels?

Your prediction: _______________

Write your prediction before running the next cell

Train split: 70% agreement, Kappa = 0.52 (Moderate)

Trace ID Human Label Judge Verdict Agreement
t_0042 Fail Fail Yes
t_0108 Pass Pass Yes
t_0215 Fail Pass No
t_0317 Pass Pass Yes
t_0422 Fail Fail Yes
Raw Agreement: 70% · Cohen's Kappa: 0.52 · Interpretation: Moderate

The agreement grid reveals the pattern: the judge is too lenient on partial omissions

              Judge: Pass   Judge: Fail
Human: Pass        4             1
Human: Fail        3             2
Judge is too lenient
Rates edge cases as Pass when humans say Fail — especially partial omissions above 5% threshold
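Building this grid from labeled traces is a one-liner with `Counter`; the five sample pairs below come from the earlier train-split table (the helper name is illustrative):

```python
from collections import Counter

def agreement_grid(human, judge):
    """Tally (human label, judge verdict) pairs into a 2x2 grid."""
    return Counter(zip(human, judge))

# Five sample traces: t_0042 F/F, t_0108 P/P, t_0215 F/P, t_0317 P/P, t_0422 F/F
grid = agreement_grid(
    ["Fail", "Pass", "Fail", "Pass", "Fail"],
    ["Fail", "Pass", "Pass", "Pass", "Fail"],
)
too_lenient = grid[("Fail", "Pass")]  # human said Fail, judge said Pass
```

The `("Fail", "Pass")` cell is the one to watch: it counts exactly the leniency failures flagged above.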

Iterative refinement on dev set + test-set validation

Base Version
(PM-accessible)
  • Load dev split (~45 examples)
  • Run initial judge prompt
  • Compute Cohen's Kappa
  • Examine disagreements
  • Modify prompt
  • Re-run and log iteration
  • Stop when Kappa plateaus or ≥ 0.7
Extend Version
(DS/Eng)
  • Implement Kappa from scratch
  • Analyze difficulty-level disagreement
  • Investigate Kappa vs percent agreement
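The Base-track loop can be sketched as a small driver; `run_judge` and `kappa_fn` are caller-supplied callables here, stand-ins for whatever judge-invocation and Kappa code you build:

```python
def iterate_judge(dev_examples, human_labels, prompt_versions, run_judge, kappa_fn):
    """Run each judge prompt version over the dev split, log Kappa per
    version, and stop once Kappa reaches 0.7 or stops improving."""
    history = []
    best = float("-inf")
    for version, prompt in enumerate(prompt_versions):
        verdicts = [run_judge(prompt, ex) for ex in dev_examples]
        kappa = kappa_fn(human_labels, verdicts)
        history.append((version, kappa))
        if kappa >= 0.7 or kappa <= best:  # target hit or plateau
            break
        best = kappa
    return history
```

Logging `(version, kappa)` pairs gives you the iteration trail the lesson asks for, and the early stop keeps you from over-tuning the prompt to the dev split.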

Semantic Rubric with dev Kappa 0.72, test Kappa 0.69 — no significant overfit

Semantic Rubric: Narrative Faithfulness
Evaluation Dimension: Narrative Faithfulness

Dev-Set Cohen's Kappa: 0.72 — Substantial (Landis-Koch)
Test-Set Cohen's Kappa: 0.69 — Substantial (Landis-Koch)

Overfit Check: Difference = 0.03 — No significant overfit
Portfolio-ready artifact
Documented judge with substantial agreement to human evaluators, validated on held-out test set with known limitations
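The overfit check itself is a one-line comparison; the 0.05 tolerance below is an illustrative default, not a standard:

```python
def overfit_check(dev_kappa, test_kappa, tolerance=0.05):
    """Flag a dev-to-test Kappa drop larger than a chosen tolerance."""
    gap = dev_kappa - test_kappa
    return {"gap": round(gap, 2), "overfit": gap > tolerance}

overfit_check(0.72, 0.69)  # → {"gap": 0.03, "overfit": False}
overfit_check(0.75, 0.58)  # → {"gap": 0.17, "overfit": True}
```

The first call is this lesson's result; the second matches the discussion scenario, where the 0.17 gap suggests the prompt was tuned to quirks of the dev split.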

A judge with Kappa < 0.6 is unreliable for ship decisions

Common Failure Mode
Judge with Kappa < 0.6 used to support ship decision

If your judge and human evaluators agree only 55% of the time after accounting for chance, you don't know whether the metric measures quality or noise.
The semantic rubric is your calibration certificate. Your proof.

Dev Kappa 0.75, test Kappa 0.58 — what does this gap suggest, and what should you do?

Scenario 1
Your judge performs well on examples you tested it on, but worse on new examples.

Dev Kappa = 0.75 (substantial)
Test Kappa = 0.58 (moderate)
Gap = 0.17

What does this suggest?
Scenario 2
Colleague proposes 1-5 scale instead of binary Pass/Fail because "it gives us more information."

What are the tradeoffs?
Scenario 3
Judge has Kappa = 0.72, pass rate = 68%

PM asks "Can we trust this 68%?"

What do you tell them?

Five-step LLM judge development lifecycle

1. Rubric Design
Binary Pass/Fail, few-shot examples, structured reasoning
2. Data Partitioning
Train 10%, Dev 45%, Test 45%
3. Prompt Iteration
Refine on dev, track Kappa
4. Agreement Measurement
Cohen's Kappa, Landis-Koch interpretation
5. Test-Set Validation
One-time run, overfit check

Next: Scaling semantic evaluation with statistical confidence

Test for position and verbosity bias · Apply Rogan-Gladen correction for imperfect judges · Establish monthly re-alignment cadence on sentinel set
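As a preview of the Rogan-Gladen correction named above: it adjusts an observed pass rate for a judge's known error rates, true_rate = (observed + specificity − 1) / (sensitivity + specificity − 1). A sketch with illustrative numbers (sensitivity and specificity would come from your judge-vs-human labeled set):

```python
def rogan_gladen(observed_pass_rate, sensitivity, specificity):
    """Correct an observed pass rate for an imperfect judge:
    true_rate = (observed + spec - 1) / (sens + spec - 1)."""
    corrected = (observed_pass_rate + specificity - 1) / (
        sensitivity + specificity - 1
    )
    return min(1.0, max(0.0, corrected))  # clamp to a valid rate

rogan_gladen(0.68, 0.9, 0.85)  # illustrative inputs, ≈ 0.71
```

With these assumed error rates, the judge's measured 68% pass rate would understate the true rate slightly — the kind of adjustment the next lesson covers in depth.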
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 5 | aianalystlab.ai