Lesson 3.3: Evaluation signals

Week 3 · Lesson 3 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What criteria did you use to decide which examples to include versus exclude in your regression suite?

Include
  • Covers common failure mode
  • Representative of production traffic
  • Has clear ground truth
Exclude
  • Edge case with no production frequency
  • Ambiguous expected output
  • Duplicate of existing coverage

Data without signals is noise

Your AI Data Analyst is fully instrumented. Every stage emits data: retrieval scores, document IDs, SQL strings, execution results, narrative text, latencies, token counts.
The PM asks: "Is v1 ready to ship?" You cannot answer, not because data is missing, but because nobody decided what to measure.

Evaluation signals bridge traces to decisions through type and role

Instrumented traces (what the system logged): retrieval scores, SQL strings, narrative text
    ↓
Evaluation signals (measurable indicators): signal type + role
    ↓
Decisions: ship, ramp, hold, rollback
Every signal has two properties: type (how it's computed) and role (what decision it informs).

Execution-based oracles verify functional correctness, not textual similarity

String matching approach
  SELECT ... FROM (SELECT ...)
  WITH temp_table AS (...) SELECT ...
  ✗ Different syntax → marked wrong
Oracle approach
  Run both queries and compare result sets
  ✓ Identical results → marked correct
Correctness oracle = run the artifact, check the result.
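The oracle approach can be sketched with Python's built-in sqlite3. This is a minimal illustration, not the course's aieval.oracles.score_sql(): the helper name, toy schema, and queries are all my own, and the row comparison is deliberately order-insensitive.

```python
import sqlite3

def results_match(conn, candidate_sql: str, oracle_sql: str) -> bool:
    """Execution-based oracle: run both queries and compare result sets.

    Sorting the rows makes the check order-insensitive, so two
    syntactically different but equivalent queries score as correct.
    """
    try:
        got = sorted(conn.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return False  # candidate doesn't even run -> wrong
    want = sorted(conn.execute(oracle_sql).fetchall())
    return got == want

# Toy example: a CTE and a plain GROUP BY that return identical results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

candidate = ("WITH t AS (SELECT region, SUM(amount) AS s "
             "FROM sales GROUP BY region) SELECT * FROM t")
oracle = "SELECT region, SUM(amount) FROM sales GROUP BY region"
print(results_match(conn, candidate, oracle))  # True: same results, different syntax
```

A string comparison would mark the candidate wrong; running both queries marks it correct.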

Structural signals check format and schema — fast, limited to form over substance

  • JSON validity: parses without errors
  • SQL runs without errors: no syntax errors
  • Required fields present: schema validation passes
  • Response length within bounds: character count in range
Fast and cheap — but catches formatting failures, not logic errors.
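These checks cost almost nothing to implement. A minimal sketch, where the required field names ("sql", "narrative") and the 4000-character bound are illustrative assumptions, not a schema from the course:

```python
import json

def structural_checks(raw: str,
                      required_fields=("sql", "narrative"),
                      max_chars=4000) -> dict:
    """Cheap structural signals: JSON validity, required fields, length bounds.

    These catch formatting failures only; a well-formed response with
    wrong logic passes every one of them.
    """
    checks = {"json_valid": False, "fields_present": False, "length_ok": False}
    checks["length_ok"] = len(raw) <= max_chars
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return checks  # not valid JSON; field check is moot
    checks["json_valid"] = True
    checks["fields_present"] = (isinstance(obj, dict)
                                and all(f in obj for f in required_fields))
    return checks

# A well-formed response passes all three checks; a truncated or
# malformed payload fails json_valid immediately.
print(structural_checks('{"sql": "SELECT 1", "narrative": "ok"}'))
```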

Semantic signals require rubrics and introduce judge variance

1. Semantic signal: relevance, completeness, factual accuracy (does the output match the source data?), tone
2. Judge: a human annotator or an LLM, scoring against a rubric (the scoring criteria). Judge variance: re-scoring the same output produces different scores.
3. System variance: re-running the AI feature on the same input produces different outputs.
Always specify which variance source you're measuring.

Gates block releases, diagnostics aid troubleshooting, drivers explain variance

Gates (blocking): SQL correctness ≥ 90%. Must pass to ship.
Diagnostics (troubleshooting): Retrieval Hit Rate@5. Explains why correctness dropped.
Drivers (variance): correctness by query complexity. 92% on simple queries vs 55% on aggregations.
Design signals based on role, not availability.

What Hit Rate@5 does the AI Data Analyst achieve on 300 test queries?

Given 300 test queries with known relevant documents, what Hit Rate@5 does the system achieve?
Hit Rate@5 = for what percentage of queries does at least one relevant document appear in the top-5 results?
Write your prediction before running the analysis cell.
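Before running the analysis cell, it helps to see what the metric actually computes. A minimal sketch with toy doc ids (the function names here are my own; the course library exposes aieval.metrics.retrieval_hit_rate_at_k() for the real computation), including a percentile-bootstrap CI of the kind reported in this lesson:

```python
import random

def hit_rate_at_k(retrieved, relevant, k=5):
    """Per-query hit: 1 if any relevant doc id appears in the top-k results."""
    return int(any(doc in relevant for doc in retrieved[:k]))

def bootstrap_ci(hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query 0/1 hits."""
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(hits, k=len(hits))) / len(hits)
                   for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Toy data: top-5 doc ids per query, plus the known relevant set.
runs = [
    (["d1", "d9", "d3", "d7", "d2"], {"d3"}),        # hit
    (["d4", "d5", "d6", "d8", "d0"], {"d2"}),        # miss
    (["d2", "d8", "d1", "d5", "d6"], {"d2", "d9"}),  # hit
]
hits = [hit_rate_at_k(retrieved, relevant) for retrieved, relevant in runs]
print(sum(hits) / len(hits))  # 2 of 3 queries have a relevant doc in the top 5
```

On 300 real queries the same two functions produce the point estimate and the CI quoted on the results slide.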

32% of queries miss relevant context — retrieval failures cascade to SQL errors

Hit Rate@5 = 0.68 (95% CI: 0.64, 0.72, n=300 queries)
32% of queries miss relevant context in top-5 results
Retrieval failure (no relevant doc in top-5)
    ↓
Missing context (no metric definitions, schema, or business logic)
    ↓
SQL generation without context (the model operates without the info it needs)
    ↓
SQL errors + hallucinations (bad queries, wrong narratives)

Compute Precision@5, MRR, SQL correctness, and build your signal catalog

  • Compute Precision@5 and MRR with bootstrap CIs
  • Interpret the precision-recall relationship
  • Implement SQL correctness checking with oracle queries
  • Fill signal catalog with 5 signals (execution, structural, semantic, diagnostic, gate)
Extend version (+15 min)
  • Implement MRR from scratch
  • Handle approximate numeric matches in SQL oracle
  • Add semantic signal with judge variance documentation
  • Compute inter-metric correlation
  • Expand to 10 signals
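For the extend version, MRR and Precision@5 each fit in a few lines. A sketch under the usual convention that a query with no relevant doc retrieved contributes 0 to MRR (the toy runs are illustrative):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def mrr(runs):
    """Mean Reciprocal Rank: average over queries of 1/rank of the
    first relevant doc; 0 when no relevant doc is retrieved."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

runs = [
    (["d1", "d3", "d2"], {"d3"}),  # first relevant at rank 2 -> 1/2
    (["d4", "d5", "d6"], {"d9"}),  # no relevant retrieved    -> 0
    (["d2", "d1", "d7"], {"d2"}),  # first relevant at rank 1 -> 1
]
print(mrr(runs))  # (1/2 + 0 + 1) / 3 = 0.5
```

Note the design choice: MRR rewards ranking a relevant doc early, while Hit Rate@k only asks whether one appears at all, which is why the two can diverge on the same runs.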

A signal catalog mapping ground truth to actionable metrics, color-coded by role

Signal Name          | Type       | What You Check Against | Role       | Cost   | Computation Method
Retrieval Hit Rate@5 | Execution  | Relevant doc IDs       | Diagnostic | Low    | aieval.metrics.retrieval_hit_rate_at_k()
SQL Correctness      | Execution  | Oracle queries         | Gate       | Medium | aieval.oracles.score_sql()
Narrative Accuracy   | Semantic   | Human rubric           | Driver     | High   | LLM judge
JSON Validity        | Structural | Schema definition      | Diagnostic | Low    | json.loads() validation
Query Complexity     | Structural | SQL structure analysis | Driver     | Low    | Parse SQL, count aggregations

Color-coded by role: Gates · Diagnostics · Drivers

Measuring everything available instead of designing based on roles

Symptom: a dashboard with 20+ metric tiles. A model change shows improvement on half and regression on the other half; no one can make a decision.
Fix: a focused signal list of 5 metrics, each labeled with its role (gate / diagnostic / driver).
Define signal roles first. Every signal must answer a specific question.

76% correctness, 71% Hit Rate@5 with CI — can you ship?

Scenario 1
You have 50 oracle queries; 38 produce results matching the oracle. SQL correctness point estimate = ?

What additional information is needed for decision-ready evidence?
Scenario 2
PM says: "Ship if Hit Rate@5 > 70%." You measure Hit Rate@5 = 71% (95% CI: 65-77%, n=300).

Would you ship based on this signal alone?
Apply the concepts — don't just recall definitions.
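For Scenario 1, the point estimate is 38/50 = 76%, but decision-ready evidence also needs an uncertainty estimate and the sample size. A normal-approximation sketch (a percentile bootstrap, as used elsewhere in this lesson, would serve the same purpose):

```python
import math

# Scenario 1: 38 of 50 oracle queries match.
k, n = 38, 50
p = k / n                        # point estimate = 0.76
se = math.sqrt(p * (1 - p) / n)  # normal-approximation standard error
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"SQL correctness = {p:.0%}, approx 95% CI [{lo:.0%}, {hi:.0%}]")
# -> 76%, roughly [64%, 88%]: a 24-point-wide interval at n=50
```

For Scenario 2, the same reasoning applies in reverse: the reported 95% CI (65-77%) straddles the 70% threshold, so the 71% point estimate alone does not settle the ship decision.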

Signal taxonomy decision tree — from ground truth to catalog

What ground truth is available?
  • Executable outputs (SQL, code, API calls) → execution-based checks (oracles): SQL result comparison, tool call outcome check
  • Format and schema properties → structural signals: JSON validity, SQL syntax check
  • Semantic qualities (relevance, tone) → semantic signals (judges): human annotation scores, LLM judge accuracy rating
All three branches feed the Signal Catalog (artifact).
Signal role determines usage: Gates (must-pass), Diagnostics (troubleshooting), Drivers (variance explanation)

Next: Similarity metrics and retrieval-specific metrics

You've defined signals by type and role. Next, we'll implement similarity metrics (ROUGE, BLEU, embedding cosine) and retrieval metrics (precision, recall, MRR, NDCG) — when to use each, how to interpret them, and why execution-based oracles beat them when applicable.
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 3 | aianalystlab.ai