1. Rubric Design
Binary Pass/Fail, few-shot examples, structured reasoning
↓
2. Data Partitioning
Train 10%, Dev 45%, Test 45%
↓
3. Prompt Iteration
Refine on dev set, track Cohen's Kappa
↓
4. Agreement Measurement
Cohen's Kappa, Landis-Koch interpretation
↓
5. Test-Set Validation
One-time run on held-out test set, overfit check
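
Steps 2 and 4 above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation. The function names (`split_labeled`, `cohens_kappa`, `landis_koch`), the fixed seed, and the sample labels are all hypothetical. The split fractions come from step 2. The kappa formula and the Landis-Koch bands are the standard published ones.

```python
import random
from collections import Counter


def split_labeled(items, train=0.10, dev=0.45, seed=0):
    """Shuffle labeled items and split them into train/dev/test.

    The test split receives whatever remains after train and dev,
    so the three parts always cover every item exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value to its Landis-Koch agreement band."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical Pass/Fail labels: human annotator vs. LLM judge on ten dev items.
human = ["Pass", "Pass", "Fail", "Fail", "Pass", "Fail", "Pass", "Pass", "Fail", "Pass"]
judge = ["Pass", "Pass", "Fail", "Pass", "Pass", "Fail", "Pass", "Fail", "Fail", "Pass"]

kappa = cohens_kappa(human, judge)
band = landis_koch(kappa)
train_set, dev_set, test_set = split_labeled(range(100))
```

On these sample labels the raters agree on 8 of 10 items, chance agreement is 0.52, and kappa comes out at about 0.58, which falls in the "moderate" band. During step 3 the same function would be rerun on the dev split after each prompt revision. The test split is scored only once, in step 5.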