Judge Reliability + Bias

Week 3 Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What was your judge's Cohen's Kappa score on the dev set, and what does that number mean using the interpretation scale?

YOUR ANSWER
Your Cohen's Kappa score:
_______
HOW TO READ KAPPA
< 0.00: Poor
0.00-0.20: Slight
0.21-0.40: Fair
0.41-0.60: Moderate
0.61-0.80: Substantial
0.81-1.00: Almost Perfect
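The scale above interprets a number you can compute directly from two label lists. A minimal sketch of Cohen's Kappa for binary pass/fail labels (the `human` and `judge` lists here are made-up illustrative data, not from the lesson):

```python
def cohens_kappa(human, judge):
    """Cohen's kappa for two binary (0/1) label lists of equal length."""
    n = len(human)
    # Observed agreement: fraction of examples where both raters agree
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement under chance, from each rater's marginal pass rate
    p_h1 = sum(human) / n
    p_j1 = sum(judge) / n
    p_e = p_h1 * p_j1 + (1 - p_h1) * (1 - p_j1)
    return (p_o - p_e) / (1 - p_e)

human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # ground-truth labels
judge = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]  # judge labels
print(round(cohens_kappa(human, judge), 2))  # 0.52
```

Note that raw agreement here is 80%, yet kappa is only 0.52 (Moderate): kappa discounts the agreement you would get by chance alone.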

Your judge reports a 78% pass rate, but the judge itself only has TPR = 0.90 and TNR = 0.85

WHAT THE JUDGE REPORTS
Pass Rate
78%
WHAT YOU KNOW ABOUT THE JUDGE
TPR = 0.90 (catches 90% of real passes, misses 10%)
TNR = 0.85 (catches 85% of real failures, lets 15% slip through as false passes)
The judge is imperfect
The Problem
When the judge says 78% pass rate, the real pass rate might be higher or lower. Without correction, every metric is systematically biased.

Chen, Zaharia & Zou (2023): GPT-4 accuracy on a prime number identification task dropped from 97.6% to 2.4%

MARCH 2023
97.6%
JUNE 2023
2.4%
Teams using GPT-4 as a judge during this period saw evaluation pipelines silently produce unreliable scores

The Rogan-Gladen correction estimates the true rate from an imperfect judge

1
Observed Rate (from judge)
Example: 78%
2
Judge Error Rates
TPR = 0.90, TNR = 0.85
3
Corrected True Rate
true_rate = (observed_rate + TNR - 1) / (TPR + TNR - 1)
Adjusts observed rate up or down based on how often the judge gets it wrong in each direction
Validity Constraint
Only valid when TPR + TNR > 1 (judge is better than random). When the denominator approaches zero, the correction becomes unstable.
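A minimal sketch of steps 2 and 3 in Python, including the validity check (the function name is mine, not part of any library):

```python
def rogan_gladen(observed_rate, tpr, tnr):
    """Estimate the true pass rate from an imperfect judge's observed rate."""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        # Judge is no better than random; the correction is meaningless
        raise ValueError("Correction invalid: requires TPR + TNR > 1")
    # Clamp to [0, 1]: with noisy TPR/TNR estimates the raw value can fall outside
    return min(1.0, max(0.0, (observed_rate + tnr - 1.0) / denom))

print(round(rogan_gladen(0.78, 0.90, 0.85), 2))  # 0.84
```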

Position bias: the judge systematically favors whichever answer appears first or second

ORDER 1
Output A | Output B
Winner: A
ORDER 2
Output B | Output A
(same outputs, swapped)
Winner: B
Zheng et al. 2023: approximately 40% inconsistency when order is swapped
Most dangerous in close-call evaluations — precisely when the judge's preference matters most

Verbosity bias: the judge favors longer answers regardless of quality

STANDARD AUTO-EVAL
0.94
Correlation with human preferences
AFTER LENGTH CORRECTION
0.98
Correlation with human preferences (and the correction changed benchmark rankings)
SHORT CORRECT ANSWER
3 lines of text
Correct and concise
Score: 7/10
VERBOSE CORRECT ANSWER
8 lines of text
Correct but lengthy
Score: 9/10

Your judge reports 78% pass rate on 200 examples (TPR = 0.90, TNR = 0.85). Is the true pass rate higher or lower?

Hint: Think about what TNR = 0.85 means. If the judge has 85% specificity, what fraction of actual failures does it incorrectly label as passes?
Your Prediction: ___________

The corrected rate is 84% — higher than the observed 78%

Numerator: observed_rate + TNR - 1 = 0.78 + 0.85 - 1 = 0.63
Denominator: TPR + TNR - 1 = 0.90 + 0.85 - 1 = 0.75
Corrected rate: 0.63 / 0.75 = 0.84
Why higher?
The judge was, on net, too strict. TPR = 0.90 means 10% of true passes were marked as failures, pulling the observed rate down; TNR = 0.85 means 15% of true failures slipped through as passes, pushing it up. Because most examples actually pass, the missed passes outweigh the false passes, so the corrected estimate rises.

How much should you trust the corrected estimate? Bootstrap confidence intervals tell you.

Observed Rate (single point)
78%
Corrected Rate (Rogan-Gladen) with 95% CI
84%
80% — 88%
Why is the range wide?
Your TPR and TNR come from 100 labeled examples. Bootstrapping resamples the data 1000 times to see how much the corrected rate could change with different samples.
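One way to sketch that bootstrap, resampling both the calibration labels (the source of TPR/TNR) and the evaluated examples (the source of the observed rate). The calibration counts below are hypothetical values I chose to match TPR = 0.90 (36/40) and TNR = 0.85 (51/60); they are not the lesson's actual data:

```python
import random

def bootstrap_corrected_ci(obs_passes, n_obs, tp, fn, tn, fp, n_boot=1000, seed=0):
    """95% bootstrap CI for the Rogan-Gladen corrected rate."""
    rng = random.Random(seed)
    pos = [1] * tp + [0] * fn  # true passes: 1 = judge said pass
    neg = [1] * tn + [0] * fp  # true failures: 1 = judge said fail
    p_obs = obs_passes / n_obs
    estimates = []
    while len(estimates) < n_boot:
        # Resample calibration labels -> new TPR/TNR each iteration
        tpr = sum(rng.choice(pos) for _ in pos) / len(pos)
        tnr = sum(rng.choice(neg) for _ in neg) / len(neg)
        # Resample the evaluated set -> new observed rate each iteration
        obs = sum(rng.random() < p_obs for _ in range(n_obs)) / n_obs
        denom = tpr + tnr - 1.0
        if denom > 0.1:  # drop resamples where the correction is unstable
            est = (obs + tnr - 1.0) / denom
            estimates.append(min(1.0, max(0.0, est)))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

lo, hi = bootstrap_corrected_ci(obs_passes=156, n_obs=200, tp=36, fn=4, tn=51, fp=9)
print(f"corrected 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The interval is wide precisely because TPR and TNR are themselves estimates from a small calibration set; a larger calibration set narrows it.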

Position bias test: 25% of pairs flipped when order was swapped

Order 1 Winner   Order 2 Winner   Consistent?
A                A                Yes
A                B                No
B                B                Yes
B                A                No
POSITION BIAS SCORE
0.25
25% of pairs inconsistent — one in four driven by position, not quality
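A sketch of how such a score could be computed (my own illustration, not the actual `aieval.judges.position_bias_test()` implementation). The key detail: winners in both orders are recorded by output identity (A or B), not by slot position, so a flip means the judge changed its mind when only the order changed:

```python
def position_bias_score(order1_winners, order2_winners):
    """Fraction of pairs whose winner (by identity, A/B) flips when the
    presentation order is swapped. 0.0 = fully consistent judge."""
    assert len(order1_winners) == len(order2_winners)
    flips = sum(w1 != w2 for w1, w2 in zip(order1_winners, order2_winners))
    return flips / len(order1_winners)

# Illustrative data: four pairs, one of which flips under order swap
order1 = ["A", "A", "B", "B"]
order2 = ["A", "B", "B", "B"]
print(position_bias_score(order1, order2))  # 0.25
```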

Run position and verbosity bias tests, then apply the full correction pipeline to the test split

  • Run aieval.judges.position_bias_test() on position pairs from bias test set
  • Run aieval.judges.verbosity_bias_test() on verbosity pairs
  • Apply full correction pipeline to test split: compute TPR/TNR, apply Rogan-Gladen correction with bootstrap CIs
  • Compare test-split TPR/TNR to dev-split values. Flag if difference > 15 points
  • Write one-paragraph summary: observed vs corrected rate, bias detection results
Extend Version
Implement Rogan-Gladen from scratch and investigate the edge case where TPR + TNR approaches 1.0
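A quick illustration of that edge case (values are made up): as TPR + TNR falls toward 1, the denominator shrinks, and a single-point change in the observed rate swings the corrected rate by many points.

```python
def rg(obs, tpr, tnr):
    # Raw Rogan-Gladen, without clamping, to expose the instability
    return (obs + tnr - 1.0) / (tpr + tnr - 1.0)

for tpr, tnr in [(0.90, 0.85), (0.60, 0.55), (0.55, 0.50)]:
    # Sensitivity: how much does corrected move when observed moves 1 point?
    delta = rg(0.79, tpr, tnr) - rg(0.78, tpr, tnr)
    print(f"TPR+TNR = {tpr + tnr:.2f}: +1pt observed -> {delta * 100:+.1f}pt corrected")
```

Near the validity boundary the raw estimate also escapes [0, 1] entirely, which is why a clamped implementation and wide confidence intervals are both essential there.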

Judge Report Card: 12 fields documenting whether to trust your judge

Judge Prompt Version
v2.1
Calibration Set Size
n=200
Dev/Test Split
50/50
Dev TPR / TNR
0.90 / 0.85
Test TPR / TNR
0.88 / 0.86
Cohen's Kappa (dev)
0.78
Observed Pass Rate
78%
Corrected Pass Rate
84% [80%-88%]
Position Bias Score
0.25 (detected)
Verbosity Bias
Not detected
Correction Stability
76%-92%
Deployment Rec
Approve with CI

Reporting raw judge scores without correction changes ship decisions

Scenario                   Raw Judge Score   Corrected Rate   Ship Decision
Release threshold = 75%    78%               72%              SHIP → HOLD
Release threshold = 70%    68%               74%              HOLD → SHIP
Same data, different decision
Every team that reports uncorrected judge scores is making decisions on biased evidence. The correction is not overhead — it changes real outcomes.

Three additional patterns that corrupt judge scores

Ignoring CI width
Hides decision-critical uncertainty
Skipping bias tests
Corrupts A/B experiments
Overfitting to dev
Test TPR drops 14 points

TPR = 0.95, TNR = 0.70, observed = 80%. Will the corrected rate be higher or lower?

Question 1
TPR = 0.95, TNR = 0.70, observed = 80%. Will corrected rate be higher or lower? Why?
Question 2
Position bias score = 0.35. Your colleague says "that is fine, it is below 50%." For reference, Zheng et al. found GPT-4 showed ~40% position bias. Is 35% acceptable?
Question 3
Dev-set TPR = 0.92, test-set TPR = 0.78. That is a 14-point gap. What does this suggest?
Hint for Q1
Think about what a low TNR means for false passes.

Two-panel summary: correction pipeline + bias detection

CORRECT FOR IMPERFECT MEASUREMENT
1. Raw Judge Scores (observed_rate = 0.78)
↓ compare to ground truth
2. Confusion Matrix (TPR = 0.90, TNR = 0.85)
↓ apply correction
3. Rogan-Gladen Formula
↓ bootstrap 1000x
4. Corrected Rate + Bootstrap CI
(corrected_rate = 0.84, 95% CI: [80%, 88%])
DETECT SYSTEMATIC BIAS
Position Bias:
Output A | Output B
Output B | Output A (swapped)
→ Did preference flip? Yes=bias, No=ok

Verbosity Bias:
Concise (correct)
Verbose (correct)
→ Did score increase with length? Yes=bias, No=ok
Both pipelines feed into Judge Report Card (12 fields)
Complete characterization of your measurement instrument

Next: Metric strategy, blocking metrics vs optimization metrics

Not all metrics have the same job — learn how to design a metric system that separates release gates from improvement levers
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 6 | aianalystlab.ai