Execution-based checks (oracles)
Executable outputs (SQL, code, API calls)
SQL result comparison
Tool call outcome check
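As a minimal sketch of the SQL result comparison above, a generated query can be run next to a trusted reference query and the returned rows compared. The names `results_match` and `make_demo_db`, the `orders` table, and the use of SQLite are assumptions made for illustration only. The comparison here ignores row order, which may be the wrong choice when the task requires a specific ORDER BY.

```python
import sqlite3
from collections import Counter


def results_match(conn, candidate_sql, reference_sql):
    """Return True if both queries return the same multiset of rows.

    Row order is ignored; duplicate rows are counted.
    A candidate query that fails to execute counts as a mismatch.
    """
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False
    reference_rows = conn.execute(reference_sql).fetchall()
    return Counter(candidate_rows) == Counter(reference_rows)


def make_demo_db():
    """Build a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (2, 25.5), (3, 7.0)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(results_match(
        conn,
        "SELECT id FROM orders WHERE total > 8",
        "SELECT id FROM orders WHERE total >= 10 ORDER BY id DESC",
    ))
```

Two differently worded queries pass this check whenever they return the same rows, which is the point of an execution-based oracle: it scores the outcome and not the query text.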
Structural signals
Format and schema properties
JSON validity
SQL syntax check
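The two structural checks above can be sketched without executing the output against real data. The helper names `is_valid_json` and `sql_compiles` and the example schema are hypothetical. The SQL check assumes the SQLite dialect: it asks SQLite to compile the statement with `EXPLAIN` against an empty in-memory schema, so it also rejects references to unknown tables or columns, which is slightly stricter than a pure syntax check.

```python
import json
import sqlite3


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def sql_compiles(sql, schema_ddl):
    """Return True if SQLite can compile `sql` against `schema_ddl`.

    EXPLAIN makes SQLite prepare the statement and return its bytecode
    without running the statement itself.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + sql)
    except (sqlite3.Error, sqlite3.Warning):
        # Covers syntax errors, unknown names, and multi-statement input.
        return False
    finally:
        conn.close()
    return True


if __name__ == "__main__":
    schema = "CREATE TABLE orders (id INTEGER, total REAL);"
    print(is_valid_json('{"a": [1, 2]}'))
    print(sql_compiles("SELECT id FROM orders WHERE total > 8", schema))
```

Checks like these are cheap and deterministic, but they only confirm that the output is well formed, not that it is correct.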
Semantic signals (judges)
Semantic qualities (relevance, tone)
Human annotation scores
LLM judge accuracy rating