CORRECT FOR IMPERFECT MEASUREMENT
1. Raw Judge Scores (observed_rate = 0.78)
↓ compare to ground truth
2. Confusion Matrix (TPR = 0.90, TNR = 0.85)
↓ apply correction
3. Rogan-Gladen Formula
↓ bootstrap 1000x
4. Corrected Rate + Bootstrap CI
(corrected_rate = 0.84, 95% CI: [0.80, 0.88])
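The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names (`rogan_gladen`, `bootstrap_ci`) and the input shapes (`judge_labels`, `calib_pairs`) are hypothetical, and the bootstrap details (resampling both the judge labels and the ground-truth calibration set, then taking a percentile interval) are assumptions, since the diagram only says "bootstrap 1000x".

```python
import random

def rogan_gladen(observed_rate, tpr, tnr):
    """Corrected rate = (observed + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1]."""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance (TPR + TNR <= 1)")
    corrected = (observed_rate + tnr - 1.0) / denom
    return min(1.0, max(0.0, corrected))

def bootstrap_ci(judge_labels, calib_pairs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the corrected rate.

    judge_labels: list of bool, the judge's pass/fail verdicts on unlabeled data.
    calib_pairs:  list of (truth, judge) bools from the ground-truth comparison.
    Both inputs are hypothetical shapes chosen for this sketch.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample both sources of uncertainty with replacement.
        labels = rng.choices(judge_labels, k=len(judge_labels))
        calib = rng.choices(calib_pairs, k=len(calib_pairs))
        pos = [j for t, j in calib if t]
        neg = [j for t, j in calib if not t]
        if not pos or not neg:
            continue  # cannot estimate TPR or TNR from this resample
        tpr = sum(pos) / len(pos)
        tnr = sum(1 for j in neg if not j) / len(neg)
        if tpr + tnr <= 1:
            continue  # correction undefined for this resample
        estimates.append(rogan_gladen(sum(labels) / len(labels), tpr, tnr))
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[max(0, int((1 - alpha / 2) * len(estimates)) - 1)]
    return lo, hi
```

With the values from the diagram, `rogan_gladen(0.78, 0.90, 0.85)` evaluates to (0.78 + 0.85 - 1) / (0.90 + 0.85 - 1) = 0.63 / 0.75, which is approximately 0.84. The width of the interval depends on the sizes of both samples, so the [0.80, 0.88] shown above should be read as an example, not something this sketch will reproduce for arbitrary data.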
DETECT SYSTEMATIC BIAS
Position Bias:
Output A | Output B
Output B | Output A (swapped)
→ Did the preferred output change after the swap? Yes=bias, No=ok
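One way to run the swap test is sketched below, assuming a pairwise judge exposed as a callable that returns "first" or "second". The name `position_flip_rate` and that return convention are hypothetical, introduced only for illustration.

```python
def position_flip_rate(judge, pairs):
    """Fraction of pairs where the judge's preferred output changes after swapping.

    judge(first, second) -> "first" or "second" (assumed interface).
    pairs: list of (output_a, output_b) tuples.
    """
    flips = 0
    for a, b in pairs:
        original = a if judge(a, b) == "first" else b
        swapped = b if judge(b, a) == "first" else a
        if original != swapped:
            flips += 1  # the winner depended on position, not content
    return flips / len(pairs)
```

A judge that always picks whatever is shown first would score 1.0 here, while a judge that decides purely on content would score 0.0. In practice some flips are expected from noise alone, so what counts as "bias" is a threshold to choose, not something this sketch decides.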
Verbosity Bias:
Concise (correct)
Verbose (correct)
→ Did score increase with length? Yes=bias, No=ok
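The verbosity check can be sketched the same way, assuming a scoring judge exposed as a callable that returns a number and a set of answer pairs already verified to be equally correct. The name `verbosity_bias` and its output keys are hypothetical.

```python
from statistics import mean

def verbosity_bias(score, pairs):
    """Compare judge scores for concise vs. verbose versions of a correct answer.

    score(answer) -> float (assumed interface).
    pairs: list of (concise, verbose) answers that are equally correct.
    """
    deltas = [score(verbose) - score(concise) for concise, verbose in pairs]
    return {
        "mean_delta": mean(deltas),  # > 0 means longer answers score higher
        "verbose_favored": sum(d > 0 for d in deltas) / len(deltas),
    }
```

Because both versions are correct, a consistently positive `mean_delta` suggests the judge is rewarding length and not quality. A score difference near zero is the "ok" branch of the diagram.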