Semantic metrics / LLM judges

Week 3 Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What failure mode can occur when retrieval succeeds but the generated narrative still fails?

Retrieval Success
Right documents retrieved
Precision@k = 0.90
Narrative Failure
Hallucinated revenue number
Omitted key finding

SQL is correct, chart is accurate — but the narrative tells the PM that Q4 revenue grew 12% when the data shows 8%

SQL Execution
Returns 8% growth
Chart Rendering
Displays 8% bar
Narrative Generation
"Q4 revenue grew 12%"
Hallucination
The failure happened in narrative generation — SQL and chart metrics can't catch this

An LLM judge is a model that evaluates another model's output using a structured rubric

User Query
"How did Q4 revenue perform?"
AI Data Analyst
Generates narrative: "Revenue grew 12%"
LLM Judge
Evaluates: Does narrative match data? → Verdict: Fail

Binary Pass/Fail produces more reliable ratings than 1-5 scales

1-5 Rating Scale
Run 1: Score = 3
Run 2: Score = 5

Same narrative, different scores
Binary Pass/Fail
Run 1: Fail
Run 2: Fail

Consistent across runs

Has anyone here debugged a system output and had no idea why it failed? That's what happens without structured reasoning.

Judge Output Format
Evaluation: [step-by-step reasoning]
"All numbers match. Gap correct. Monthly claim supported by data."
Verdict: Pass
Reasoning before verdict
When the judge disagrees with a human, you can diagnose why
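This reasoning-before-verdict format can be enforced in the judge prompt and parsed mechanically. A minimal sketch — the prompt wording and the `parse_verdict` helper are illustrative, not part of the lesson's codebase:

```python
# Sketch: a judge prompt that forces reasoning before the verdict,
# plus a parser for the structured response. Names are illustrative.
JUDGE_PROMPT = """You are evaluating whether a narrative faithfully
reflects the underlying data.

Data: {data}
Narrative: {narrative}

Respond in exactly this format:
Evaluation: [step-by-step reasoning comparing narrative claims to data]
Verdict: Pass or Fail
"""

def parse_verdict(judge_response: str) -> tuple[str, str]:
    """Split a judge response into (reasoning, verdict)."""
    reasoning, verdict = "", ""
    for line in judge_response.splitlines():
        if line.startswith("Evaluation:"):
            reasoning = line.removeprefix("Evaluation:").strip()
        elif line.startswith("Verdict:"):
            verdict = line.removeprefix("Verdict:").strip()
    return reasoning, verdict

reasoning, verdict = parse_verdict(
    "Evaluation: Narrative claims 12% but data shows 8%.\nVerdict: Fail"
)
```

Because the reasoning is captured alongside the verdict, every disagreement with a human label comes with the judge's stated rationale attached.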

Data partitioning: train 10%, dev 45%, test 45% — the test set is sacred

Train (~10%)
Initial rubric testing
Few-shot examples
Dev (~45%)
Prompt iteration
Track Kappa across versions
Test (~45%)
Sacred: never iterate on it

Simple percent agreement is misleading — two raters can agree much of the time by chance alone, especially when one label dominates, so high agreement by itself tells you nothing

Kappa Range Interpretation
< 0.00 Poor
0.00 - 0.20 Slight
0.21 - 0.40 Fair
0.41 - 0.60 Moderate
0.61 - 0.80 Substantial
0.81 - 1.00 Almost perfect
Cohen's Kappa: Agreement Beyond Chance
Target: Kappa ≥ 0.7 (Substantial agreement) — below 0.6 is too unreliable for ship decisions
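The Extend track asks you to implement Kappa from scratch; a minimal sketch for binary Pass/Fail labels (the worked example uses the 4/1/3/2 counts from the agreement grid later in this lesson):

```python
def cohens_kappa(human, judge):
    """Cohen's Kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is
    observed agreement and p_e is the agreement expected by chance
    from each rater's marginal label rates."""
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    p_e = sum(
        (human.count(lbl) / n) * (judge.count(lbl) / n) for lbl in labels
    )
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always give one label
    return (p_o - p_e) / (1 - p_e)

# Grid counts 4 (P/P), 1 (P/F), 3 (F/P), 2 (F/F): p_o = 0.6, p_e = 0.5
human = ["Pass"] * 4 + ["Pass"] + ["Fail"] * 3 + ["Fail"] * 2
judge = ["Pass"] * 4 + ["Fail"] + ["Pass"] * 3 + ["Fail"] * 2
cohens_kappa(human, judge)  # ≈ 0.2 — "Fair" on the Landis-Koch scale
```

Note how 60% raw agreement collapses to Kappa ≈ 0.2 once chance agreement is subtracted — exactly why percent agreement alone overstates judge quality.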

What percentage agreement do you expect between your first-draft judge and human labels?

Your prediction: _______________

Write your prediction before running the next cell

Train split: 70% agreement, Kappa = 0.52 (Moderate)

Trace ID Human Label Judge Verdict Agreement
t_0042 Fail Fail Yes
t_0108 Pass Pass Yes
t_0215 Fail Pass No
t_0317 Pass Pass Yes
t_0422 Fail Fail Yes
Raw Agreement: 70% · Cohen's Kappa: 0.52 · Interpretation: Moderate

The agreement grid reveals the pattern: the judge is too lenient on partial omissions

              Judge: Pass   Judge: Fail
Human: Pass        4             1
Human: Fail        3             2
Judge is too lenient
Rates edge cases as Pass when humans say Fail — especially partial omissions above 5% threshold
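Building this grid from labeled traces is a one-liner with `Counter`; the five sample pairs below come from the earlier train-split table (the helper name is illustrative):

```python
from collections import Counter

def agreement_grid(human, judge):
    """Tally (human label, judge verdict) pairs into a 2x2 grid."""
    return Counter(zip(human, judge))

# Five sample traces: t_0042 F/F, t_0108 P/P, t_0215 F/P, t_0317 P/P, t_0422 F/F
grid = agreement_grid(
    ["Fail", "Pass", "Fail", "Pass", "Fail"],
    ["Fail", "Pass", "Pass", "Pass", "Fail"],
)
too_lenient = grid[("Fail", "Pass")]  # human said Fail, judge said Pass
```

The `("Fail", "Pass")` cell is the one to watch: it counts exactly the leniency failures flagged above.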

Iterative refinement on dev set + test-set validation

Base Version
(PM-accessible)
  • Load dev split (~45 examples)
  • Run initial judge prompt
  • Compute Cohen's Kappa
  • Examine disagreements
  • Modify prompt
  • Re-run and log iteration
  • Stop when Kappa plateaus or ≥ 0.7
Extend Version
(DS/Eng)
  • Implement Kappa from scratch
  • Analyze difficulty-level disagreement
  • Investigate Kappa vs percent agreement
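The Base-track loop can be sketched as a small driver; `run_judge` and `kappa_fn` are caller-supplied callables here, stand-ins for whatever judge-invocation and Kappa code you build:

```python
def iterate_judge(dev_examples, human_labels, prompt_versions, run_judge, kappa_fn):
    """Run each judge prompt version over the dev split, log Kappa per
    version, and stop once Kappa reaches 0.7 or stops improving."""
    history = []
    best = float("-inf")
    for version, prompt in enumerate(prompt_versions):
        verdicts = [run_judge(prompt, ex) for ex in dev_examples]
        kappa = kappa_fn(human_labels, verdicts)
        history.append((version, kappa))
        if kappa >= 0.7 or kappa <= best:  # target hit or plateau
            break
        best = kappa
    return history
```

Logging `(version, kappa)` pairs gives you the iteration trail the lesson asks for, and the early stop keeps you from over-tuning the prompt to the dev split.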

Semantic Rubric with dev Kappa 0.72, test Kappa 0.69 — no significant overfit

Semantic Rubric: Narrative Faithfulness
Evaluation Dimension: Narrative Faithfulness

Dev-Set Cohen's Kappa: 0.72 — Substantial (Landis-Koch)
Test-Set Cohen's Kappa: 0.69 — Substantial (Landis-Koch)

Overfit Check: Difference = 0.03 — No significant overfit
Portfolio-ready artifact
Documented judge with substantial agreement to human evaluators, validated on held-out test set with known limitations
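The overfit check itself is a one-line comparison; the 0.05 tolerance below is an illustrative default, not a standard:

```python
def overfit_check(dev_kappa, test_kappa, tolerance=0.05):
    """Flag a dev-to-test Kappa drop larger than a chosen tolerance."""
    gap = dev_kappa - test_kappa
    return {"gap": round(gap, 2), "overfit": gap > tolerance}

overfit_check(0.72, 0.69)  # → {"gap": 0.03, "overfit": False}
overfit_check(0.75, 0.58)  # → {"gap": 0.17, "overfit": True}
```

The first call is this lesson's result; the second matches the discussion scenario, where the 0.17 gap suggests the prompt was tuned to quirks of the dev split.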

A judge with Kappa < 0.6 is unreliable for ship decisions

Common Failure Mode
Judge with Kappa < 0.6 used to support ship decision

If your judge and human evaluators agree only 55% of the time after accounting for chance, you don't know whether the metric measures quality or noise.
The semantic rubric is your calibration certificate. Your proof.

Dev Kappa 0.75, test Kappa 0.58 — what does this gap suggest, and what should you do?

Scenario 1
Your judge performs well on examples you tested it on, but worse on new examples.

Dev Kappa = 0.75 (substantial)
Test Kappa = 0.58 (moderate)
Gap = 0.17

What does this suggest?
Scenario 2
Colleague proposes 1-5 scale instead of binary Pass/Fail because "it gives us more information."

What are the tradeoffs?
Scenario 3
Judge has Kappa = 0.72, pass rate = 68%

PM asks "Can we trust this 68%?"

What do you tell them?

Five-step LLM judge development lifecycle

1. Rubric Design
Binary Pass/Fail, few-shot examples, structured reasoning
2. Data Partitioning
Train 10%, Dev 45%, Test 45%
3. Prompt Iteration
Refine on dev, track Kappa
4. Agreement Measurement
Cohen's Kappa, Landis-Koch interpretation
5. Test-Set Validation
One-time run, overfit check

Next: Scaling semantic evaluation with statistical confidence

Test for position and verbosity bias · Apply Rogan-Gladen correction for imperfect judges · Establish monthly re-alignment cadence on sentinel set
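As a preview of the Rogan-Gladen correction named above: it adjusts an observed pass rate for a judge's known error rates, true_rate = (observed + specificity − 1) / (sensitivity + specificity − 1). A sketch with illustrative numbers (sensitivity and specificity would come from your judge-vs-human labeled set):

```python
def rogan_gladen(observed_pass_rate, sensitivity, specificity):
    """Correct an observed pass rate for an imperfect judge:
    true_rate = (observed + spec - 1) / (sens + spec - 1)."""
    corrected = (observed_pass_rate + specificity - 1) / (
        sensitivity + specificity - 1
    )
    return min(1.0, max(0.0, corrected))  # clamp to a valid rate

rogan_gladen(0.68, 0.9, 0.85)  # illustrative inputs, ≈ 0.71
```

With these assumed error rates, the judge's measured 68% pass rate would understate the true rate slightly — the kind of adjustment the next lesson covers in depth.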
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 5 | aianalystlab.ai