Judge Reliability + Bias

Week 3 Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What was your judge's Cohen's Kappa score on the dev set, and what does that number mean using the interpretation scale?

YOUR ANSWER
Your Cohen's Kappa score:
_______
HOW TO READ KAPPA
< 0.00: Poor
0.00-0.20: Slight
0.21-0.40: Fair
0.41-0.60: Moderate
0.61-0.80: Substantial
0.81-1.00: Almost Perfect
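The scale above interprets a number you can compute directly from two label lists. A minimal sketch of Cohen's Kappa for binary pass/fail labels (the `human` and `judge` lists here are made-up illustrative data, not from the lesson):

```python
def cohens_kappa(human, judge):
    """Cohen's kappa for two binary (0/1) label lists of equal length."""
    n = len(human)
    # Observed agreement: fraction of examples where both raters agree
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement under chance, from each rater's marginal pass rate
    p_h1 = sum(human) / n
    p_j1 = sum(judge) / n
    p_e = p_h1 * p_j1 + (1 - p_h1) * (1 - p_j1)
    return (p_o - p_e) / (1 - p_e)

human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # ground-truth labels
judge = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]  # judge labels
print(round(cohens_kappa(human, judge), 2))  # 0.52
```

Note that raw agreement here is 80%, yet kappa is only 0.52 (Moderate): kappa discounts the agreement you would get by chance alone.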

Your judge reports a 78% pass rate, but the judge itself only has TPR = 0.90 and TNR = 0.85

WHAT THE JUDGE REPORTS
Pass Rate
78%
WHAT YOU KNOW ABOUT THE JUDGE
TPR = 0.90 (catches 90% of real passes, misses 10%)
TNR = 0.85 (catches 85% of real failures, lets 15% slip through as false passes)
The judge is imperfect
The Problem
When the judge says 78% pass rate, the real pass rate might be higher or lower. Without correction, every metric is systematically biased.

Chen, Zaharia & Zou (2023): GPT-4 accuracy on a prime number identification task dropped from 97.6% to 2.4%

MARCH 2023
97.6%
JUNE 2023
2.4%
Teams using GPT-4 as a judge during this period saw evaluation pipelines silently produce unreliable scores

The Rogan-Gladen correction estimates the true rate from an imperfect judge

1
Observed Rate (from judge)
Example: 78%
2
Judge Error Rates
TPR = 0.90, TNR = 0.85
3
Corrected True Rate
true_rate = (observed_rate + TNR - 1) / (TPR + TNR - 1)
Adjusts observed rate up or down based on how often the judge gets it wrong in each direction
Validity Constraint
Only valid when TPR + TNR > 1 (judge is better than random). When the denominator approaches zero, the correction becomes unstable.
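A minimal sketch of steps 2 and 3 in Python, including the validity check (the function name is mine, not part of any library):

```python
def rogan_gladen(observed_rate, tpr, tnr):
    """Estimate the true pass rate from an imperfect judge's observed rate."""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        # Judge is no better than random; the correction is meaningless
        raise ValueError("Correction invalid: requires TPR + TNR > 1")
    # Clamp to [0, 1]: with noisy TPR/TNR estimates the raw value can fall outside
    return min(1.0, max(0.0, (observed_rate + tnr - 1.0) / denom))

print(round(rogan_gladen(0.78, 0.90, 0.85), 2))  # 0.84
```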

Position bias: the judge systematically favors whichever answer appears first or second

ORDER 1
Output A | Output B
Winner: A
ORDER 2
Output B | Output A
(same outputs, swapped)
Winner: B
Zheng et al. 2023: approximately 40% inconsistency when order is swapped
Most dangerous in close-call evaluations — precisely when the judge's preference matters most

Verbosity bias: the judge favors longer answers regardless of quality

STANDARD AUTO-EVAL
0.94
Correlation with human preferences
AFTER LENGTH CORRECTION
0.98
Correlation with human preferences (and the correction changed benchmark rankings)
SHORT CORRECT ANSWER
3 lines of text
Correct and concise
Score: 7/10
VERBOSE CORRECT ANSWER
8 lines of text
Correct but lengthy
Score: 9/10

Your judge reports 78% pass rate on 200 examples (TPR = 0.90, TNR = 0.85). Is the true pass rate higher or lower?

Hint: Think about what TNR = 0.85 means. If the judge has 85% specificity, what fraction of actual failures does it incorrectly label as passes?
Your Prediction: ___________

The corrected rate is 84% — higher than the observed 78%

Numerator: observed_rate + TNR - 1 = 0.78 + 0.85 - 1 = 0.63
Denominator: TPR + TNR - 1 = 0.90 + 0.85 - 1 = 0.75
Corrected rate: 0.63 / 0.75 = 0.84
Why higher?
The judge was, on net, too strict. TPR = 0.90 means 10% of true passes were marked as failures, pulling the observed rate down; TNR = 0.85 means 15% of true failures slipped through as passes, pushing it up. Because most examples actually pass, the missed passes outweigh the false passes, so the corrected estimate rises.

How much should you trust the corrected estimate? Bootstrap confidence intervals tell you.

Observed Rate (single point)
78%
Corrected Rate (Rogan-Gladen) with 95% CI
84%
80% — 88%
Why is the range wide?
Your TPR and TNR come from 100 labeled examples. Bootstrapping resamples the data 1000 times to see how much the corrected rate could change with different samples.
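One way to sketch that bootstrap, resampling both the calibration labels (the source of TPR/TNR) and the evaluated examples (the source of the observed rate). The calibration counts below are hypothetical values I chose to match TPR = 0.90 (36/40) and TNR = 0.85 (51/60); they are not the lesson's actual data:

```python
import random

def bootstrap_corrected_ci(obs_passes, n_obs, tp, fn, tn, fp, n_boot=1000, seed=0):
    """95% bootstrap CI for the Rogan-Gladen corrected rate."""
    rng = random.Random(seed)
    pos = [1] * tp + [0] * fn  # true passes: 1 = judge said pass
    neg = [1] * tn + [0] * fp  # true failures: 1 = judge said fail
    p_obs = obs_passes / n_obs
    estimates = []
    while len(estimates) < n_boot:
        # Resample calibration labels -> new TPR/TNR each iteration
        tpr = sum(rng.choice(pos) for _ in pos) / len(pos)
        tnr = sum(rng.choice(neg) for _ in neg) / len(neg)
        # Resample the evaluated set -> new observed rate each iteration
        obs = sum(rng.random() < p_obs for _ in range(n_obs)) / n_obs
        denom = tpr + tnr - 1.0
        if denom > 0.1:  # drop resamples where the correction is unstable
            est = (obs + tnr - 1.0) / denom
            estimates.append(min(1.0, max(0.0, est)))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

lo, hi = bootstrap_corrected_ci(obs_passes=156, n_obs=200, tp=36, fn=4, tn=51, fp=9)
print(f"corrected 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The interval is wide precisely because TPR and TNR are themselves estimates from a small calibration set; a larger calibration set narrows it.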

Position bias test: 25% of pairs flipped when order was swapped

Order 1 Winner   Order 2 Winner   Consistent?
A                A                Yes
A                B                No
B                B                Yes
B                A                No
POSITION BIAS SCORE
0.25
25% of pairs inconsistent — one in four driven by position, not quality
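A sketch of how such a score could be computed (my own illustration, not the actual `aieval.judges.position_bias_test()` implementation). The key detail: winners in both orders are recorded by output identity (A or B), not by slot position, so a flip means the judge changed its mind when only the order changed:

```python
def position_bias_score(order1_winners, order2_winners):
    """Fraction of pairs whose winner (by identity, A/B) flips when the
    presentation order is swapped. 0.0 = fully consistent judge."""
    assert len(order1_winners) == len(order2_winners)
    flips = sum(w1 != w2 for w1, w2 in zip(order1_winners, order2_winners))
    return flips / len(order1_winners)

# Illustrative data: four pairs, one of which flips under order swap
order1 = ["A", "A", "B", "B"]
order2 = ["A", "B", "B", "B"]
print(position_bias_score(order1, order2))  # 0.25
```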

Run position and verbosity bias tests, then apply the full correction pipeline to the test split

  • Run aieval.judges.position_bias_test() on position pairs from bias test set
  • Run aieval.judges.verbosity_bias_test() on verbosity pairs
  • Apply full correction pipeline to test split: compute TPR/TNR, apply Rogan-Gladen correction with bootstrap CIs
  • Compare test-split TPR/TNR to dev-split values. Flag if difference > 15 points
  • Write one-paragraph summary: observed vs corrected rate, bias detection results
Extend Version
Implement Rogan-Gladen from scratch and investigate the edge case where TPR + TNR approaches 1.0
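A quick illustration of that edge case (values are made up): as TPR + TNR falls toward 1, the denominator shrinks, and a single-point change in the observed rate swings the corrected rate by many points.

```python
def rg(obs, tpr, tnr):
    # Raw Rogan-Gladen, without clamping, to expose the instability
    return (obs + tnr - 1.0) / (tpr + tnr - 1.0)

for tpr, tnr in [(0.90, 0.85), (0.60, 0.55), (0.55, 0.50)]:
    # Sensitivity: how much does corrected move when observed moves 1 point?
    delta = rg(0.79, tpr, tnr) - rg(0.78, tpr, tnr)
    print(f"TPR+TNR = {tpr + tnr:.2f}: +1pt observed -> {delta * 100:+.1f}pt corrected")
```

Near the validity boundary the raw estimate also escapes [0, 1] entirely, which is why a clamped implementation and wide confidence intervals are both essential there.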

Judge Report Card: 12 fields documenting whether to trust your judge

Judge Prompt Version
v2.1
Calibration Set Size
n=200
Dev/Test Split
50/50
Dev TPR / TNR
0.90 / 0.85
Test TPR / TNR
0.88 / 0.86
Cohen's Kappa (dev)
0.78
Observed Pass Rate
78%
Corrected Pass Rate
84% [80%-88%]
Position Bias Score
0.25 (detected)
Verbosity Bias
Not detected
Correction Stability
76%-92%
Deployment Rec
Approve with CI

Reporting raw judge scores without correction changes ship decisions

Scenario                   Raw Judge Score   Corrected Rate   Ship Decision
Release threshold = 75%    78%               72%              SHIP → HOLD
Release threshold = 70%    68%               74%              HOLD → SHIP
Same data, different decision
Every team that reports uncorrected judge scores is making decisions on biased evidence. The correction is not overhead — it changes real outcomes.

Three additional patterns that corrupt judge scores

Ignoring CI width
Hides decision-critical uncertainty
Skipping bias tests
Corrupts A/B experiments
Overfitting to dev
Test TPR drops 14 points

TPR = 0.95, TNR = 0.70, observed = 80%. Will the corrected rate be higher or lower?

Question 1
TPR = 0.95, TNR = 0.70, observed = 80%. Will corrected rate be higher or lower? Why?
Question 2
Position bias score = 0.35. Your colleague says "that is fine, it is below 50%." For reference, Zheng et al. found GPT-4 showed ~40% position bias. Is 35% acceptable?
Question 3
Dev-set TPR = 0.92, test-set TPR = 0.78. That is a 14-point gap. What does this suggest?
Hint for Q1
Think about what a low TNR means for false passes.

Two-panel summary: correction pipeline + bias detection

CORRECT FOR IMPERFECT MEASUREMENT
1. Raw Judge Scores (observed_rate = 0.78)
↓ compare to ground truth
2. Confusion Matrix (TPR = 0.90, TNR = 0.85)
↓ apply correction
3. Rogan-Gladen Formula
↓ bootstrap 1000x
4. Corrected Rate + Bootstrap CI
(corrected_rate = 0.84, 95% CI: [80%, 88%])
DETECT SYSTEMATIC BIAS
Position Bias:
Output A | Output B
Output B | Output A (swapped)
→ Did preference flip? Yes=bias, No=ok

Verbosity Bias:
Concise (correct)
Verbose (correct)
→ Did score increase with length? Yes=bias, No=ok
Both pipelines feed into Judge Report Card (12 fields)
Complete characterization of your measurement instrument

Next: Metric strategy, blocking metrics vs optimization metrics

Not all metrics have the same job — learn how to design a metric system that separates release gates from improvement levers
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 6 | aianalystlab.ai