Evaluation Surface Map

Lesson 1.2 — Week 1 · Introduction to AI Product Evaluation
Shane Butler · AI Analyst Lab

In L1.1, you ran 5 trials and saw variance — but could you tell WHERE failures happened?

What you measured in L1.1
pass@5
1.0
reliable@5
0.4
0.6 gap between metrics
What you couldn't answer
Was the SQL wrong?
Was the retrieval bad?
Did the narrative hallucinate?

Your AI feature fails in production and you can't tell which stage broke

Query Parsing
Intent + params
->
Retrieval
Context docs
->
SQL Gen
Build query
->
SQL Exec
Run + results
->
Charts
Visualize
->
Narrative
Summarize
"Works great for simple lookups but fails on anything complex"
"Sometimes it hallucinates metrics that don't exist"
"SQL queries occasionally fail silently"
Which stage caused each failure? With the current minimal logging, impossible to tell.

Without mapping the surface, you're shipping with blind spots you can't even name

Without surface map
  • Can't prioritize what to test
  • Can't assess coverage completeness
  • Can't communicate risk to stakeholders
  • Pass basic tests, fail in production
With surface map
  • Know what CAN fail (functional)
  • Know what attackers COULD exploit
  • Know what you're NOT testing (gaps)

Map failure modes by pipeline stage — each stage breaks in its own way

Pipeline Stage Example Failure Modes
Query Parsing Ambiguous intent, multi-intent queries, unsupported query types
Context Retrieval Wrong docs retrieved, missing context, low relevance scores
SQL Generation Syntax errors, wrong joins, missing filters, non-existent columns
Narrative Synthesis Hallucinated numbers, claims not in results, incoherent summaries
Each stage is a distinct evaluation surface with its own set of failure types
4 of 6 stages shown — SQL Execution and Chart Rendering covered in the exercise

Predict failures from architecture first, then validate with traces

Step 1: Architecture-Driven
Examine inputs, outputs, dependencies of each stage
  • Predict failure categories
  • No data required
  • Reason from system design
Step 2: Trace-Driven
Inspect v0 traces to confirm and refine
  • Validate predictions
  • Discover sub-types
  • Find what you missed
Prediction gives you a head start. Traces refine it.

Adversarial inputs are part of the evaluation surface from Day 1

Prompt Injection
Crafted inputs manipulate system behavior
Jailbreaks
Bypass safety constraints
Policy Violations
Generate prohibited content
Private Data Leaks
Extract personal information from context
Resource Exhaustion
Overload system with expensive queries
Prompt injection attacks can succeed at very high rates — often above 80% in controlled studies (Greshake et al., 2023)
Adversarial testing is not optional — it's part of the evaluation surface from Day 1

A surface map with no gaps listed is overconfident, not thorough

Unmeasured Failure Modes
Failures you suspect exist but haven't observed in traces
"Narrative hallucination rate unknown"
Rare Scenarios
Edge cases not in your trace sample
"No multi-turn sessions > 4 turns in sample"
Logging Blind Spots
System behaviors you CANNOT observe with current logging
"v0 doesn't log retrieved context"
Every map MUST list blind spots — a map with no gaps is overconfident, not thorough

Split your map into "can evaluate now" vs. "need better logging first"

Can evaluate now
  • SQL syntax errors (sql_error field exists)
  • Execution timeouts (timeout logs exist)
  • Query parse failures (error responses visible)
Cannot evaluate yet
  • Narrative hallucination rate (no context logged)
  • Multi-turn coherence (no session tracking)
  • Retrieval relevance (no relevance scores)
The "cannot evaluate" list becomes your Week 2 logging priorities

Which pipeline stage has the MOST failure modes?

Query Parsing
->
Retrieval
->
SQL Gen
->
SQL Exec
->
Charts
->
Narrative
Write your prediction in one sentence before advancing

SQL generation produces 4 distinct failure types — traces reveal sub-types

Predicted Failure Type Trace Evidence Surprise from Traces
Syntax errors sql_error: "near WHERE: syntax error" Confirmed as predicted
Wrong table joins SQL references tables not matching intent Splits into 2 sub-types
Missing filters SQL executes but returns too many rows WHERE clause omitted
Non-existent columns sql_error: "no such column" Confirmed as predicted
Architecture predicted 4 types. Traces refined "wrong joins" into 2 sub-types.

Generation failures outnumber retrieval failures 78% to 22%

78.3%
Generation
21.7%
Retrieval
Most teams predict retrieval is the bigger problem
Source: Clinical text-to-SQL system, 2025

Build your evaluation surface map

Base version (all students, 20-25 min)
  • Fill in retrieval + chart rendering failure modes (30 traces)
  • Categorize 100 adversarial traces by attack type
  • Identify 3+ coverage gaps
  • Evidence sufficiency: can evaluate vs. cannot
Extend version (DS/Eng, +15-20 min)
  • Quantitative failure distribution (500 traces)
  • Which adversarial attacks do functional tests already catch?
  • Have you found all the failure types, or are there more hiding?

Your surface map identifies 15+ failure modes, 5 attack categories, and 3 coverage gaps

Evaluation Surface Map — AI Data Analyst (v0)
Functional Surface
10+ failure modes across 6 pipeline stages, each with predicted vs. observed status
Adversarial Surface
5 attack categories mapped to vulnerable pipeline stages
Coverage Gaps
3+ critical blind spots: unmeasured, rare, logging
What I built
An evaluation surface map covering functional, adversarial, and coverage dimensions — portfolio-ready artifact that answers "What can fail?" and "What haven't we tested?"

Teams annotate observed failures without predicting from architecture

Failure Mode What Happens What To Do Instead
Annotation without prediction Only map what you've seen — miss rare catastrophic failures Predict from architecture BEFORE inspecting traces
Functional + adversarial treated separately Adversarial deferred to "security review later" — ships without it Map adversarial surface alongside functional from Day 1
No coverage gaps identified False confidence — you think you've tested everything Every map MUST list blind spots and instrumentation limits

92% of 100 queries passed — but all 100 are simple lookups. Should you ship?

1
Surface coverage
100 queries, 92% pass — but all simple lookups. No joins, no multi-turn. What does this reveal?
2
Missing attack category
6 functional + 2 adversarial mapped, but indirect prompt injection via poisoned retrieved documents not caught. Why?
3
Completeness check
4 SQL failure types from 50 traces. Have you found all the types that matter?

Three layers: functional failures, adversarial vectors, coverage gaps

F
Functional Surface
Query Parsing | Retrieval | SQL Gen | SQL Exec | Charts | Narrative — each with distinct failure modes
A
Adversarial Surface
Prompt Injection | Jailbreak | Policy Violation | PII Extraction | Resource Exhaustion
G
Coverage Gaps
Unmeasured failure modes | Rare scenarios | Instrumentation blind spots
A complete surface map covers all three layers. Most teams only build the first.

Next: Failure Discovery

You mapped where failures CAN happen. Next you'll systematically find the ones that ARE happening.
AI Analyst Lab | AI Evals for Product Dev | Week 1 · Lesson 2 | aianalystlab.ai