Failure discovery

Lesson 1.3 · Week 1 · Introduction to AI Product Evaluation
Shane Butler · AI Analyst Lab

In L1.2, you mapped where failures CAN happen — but which surfaces could you not evaluate with v0's limited logs?

What you built in L1.2
Functional failures
SQL errors, wrong outputs, latency
Adversarial vectors
Prompt injection, deliberate misuse, edge cases
Coverage gaps
What you can't evaluate yet
The gap you identified
Which failure modes landed in the "cannot evaluate" column?
What did you list as logging blind spots?

Fifty complaints in seven days, and no two are described the same way

"Gave me last quarter's data when I asked for this quarter"
"SQL failed but no error message"
"Narrative said 'slight decline' when revenue dropped 40%"
"Completely hallucinated a metric that doesn't exist"
50 complaints. No two described the same way. Where do you even start?

Predefined checklists miss the failures you haven't imagined yet

Checklist approach
  • Hallucination?
  • SQL syntax error?
  • Latency spike?
  • Missing data?
Misses failures that don't fit these buckets
Discovery approach
  • Correct SQL, wrong time range
  • Narrative minimizes a 40% drop
  • System pulled wrong background info for the question
Captures what actually happened

Read the traces first, let the categories emerge — don't start with a checklist

Top-down: Start with categories, force-fit traces
You see what you expect. You miss what you don't.
Bottom-up: Start with traces, let categories emerge
You see what's actually there. Categories come from the data.
Categories come from the data, not from prior assumptions. Read first. Name second.

Five steps from raw traces to actionable taxonomy: read, note, cluster, saturate, triage

Read traces
Open mind, no categories
Freeform notes
One note per trace, your words
Cluster bottom-up
Group similar notes
Check saturation
New categories still emerging?
Triage
Prompt-fix / Evaluator-needed / System-fix
Five steps. Read, note, cluster, saturate, triage. Simple to describe. The discipline is in not skipping ahead.
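The first steps above can be sketched as plain data. This is a minimal illustration, not a prescribed format: the record shapes are assumptions, and the notes mirror the demo traces later in this lesson.

```python
# Steps 1-2: one freeform note per trace, in the annotator's own words.
# No categories exist yet at this point.
annotations = [
    {"trace_id": 1, "note": "Correct SQL execution, wrong time range scope"},
    {"trace_id": 2, "note": "No visible failure"},
    {"trace_id": 3, "note": "SQL syntax error - failed to connect tables"},
    {"trace_id": 4, "note": "Narrative minimizes significant finding"},
    {"trace_id": 5, "note": "Wrong background info returned for the question"},
]

# Step 3: cluster bottom-up. Category names are invented only AFTER
# similar notes are grouped; "no visible failure" traces stay unclustered.
clusters = {
    "Scoping error": [1],
    "Generation error": [3],
    "Characterization error": [4],
    "Retrieval error": [5],
}
```

The point of the shape: the note field comes first and the category key comes later, never the other way around.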

If you discover a new category on trace 30, you haven't read enough

Category discovery curve
[Chart: categories found plotted against traces read (from 5 to 30 traces)]
Decision at trace 30: if a new category still appears, keep reading; if no new categories emerge, proceed to triage.
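The saturation check can be expressed as a small predicate: a fresh batch of traces should introduce no category you have not already seen. A minimal sketch; the function name and category strings are illustrative.

```python
def saturation_reached(categories_so_far, new_batch_categories):
    """True if a fresh batch of traces introduced no new categories."""
    return set(new_batch_categories) <= set(categories_so_far)

# At trace 30: compare the last batch's categories against what we have.
known = {"Scoping error", "Generation error", "Characterization error", "Retrieval error"}
batch = ["Scoping error", "Generation error", "Scoping error"]
print(saturation_reached(known, batch))  # True: proceed to triage

batch_with_new = batch + ["Tone error"]
print(saturation_reached(known, batch_with_new))  # False: keep reading
```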

Every failure category is either a prompt-fix, an evaluator-needed, or a system-fix

Prompt-fix
Improve with prompt engineering this sprint.
Example: Narrative minimizes findings — add severity language to prompt.
Evaluator-needed
Build a metric to monitor. This is Week 3's focus.
Example: Wrong time range — need an automated check comparing query intent to SQL filters.
System-fix
Requires code or architecture change.
Example: Silent SQL failures — need the system to surface errors to the user, not a prompt tweak.

You have 500 traces — how many distinct failure categories do you think exist?

You have 500 v0 traces from the AI Data Analyst.
Before reading any of them:
1. How many distinct failure categories do you think exist? Write a number.
2. List 3 failure types you expect to find.
Write your predictions now. We'll compare after the demo.

Five traces, four distinct failure notes — and none of them fit "hallucination" or "SQL error" cleanly

Trace | Observation | Freeform Note
1 | Revenue query, SQL used Q2 data for Q3 question | "Correct SQL execution, wrong time range scope"
2 | Straightforward query, output looks correct | "No visible failure"
3 | SQL failed to connect the right data tables | "SQL syntax error — failed to connect tables"
4 | Narrative says "slight decline" for 40% drop | "Narrative minimizes significant finding"
5 | Engagement query, system pulled revenue context | "Wrong background info returned for the question"
Five traces. Four distinct failure notes. One "no visible failure." The notes are messy. That's intentional.

Four traces produced four distinct categories — if this diversity holds, the full dataset will reveal many more

Scoping error
Correct SQL execution, wrong time range (Trace 1)
Generation error
SQL syntax failure — failed to connect tables (Trace 3)
Characterization error
Narrative minimizes a 40% drop as "slight decline" (Trace 4)
Retrieval error
Wrong background info pulled for the query (Trace 5)
4 traces, 4 distinct categories. A predefined checklist of "hallucination" and "SQL error" would have caught 1 of these 4.

The triage step turns a taxonomy into a sprint plan

Category | Severity | Triage | Next Action
Scoping error (wrong time range) | Critical | Evaluator-needed | Build a check that compares the time period asked vs. the time period queried
Generation error (SQL syntax) | Major | System-fix | Add a check that catches SQL errors before the query runs
Characterization error (misleading narrative) | Critical | Prompt-fix | Add a severity-accuracy instruction to the prompt
Retrieval error (wrong context) | Major | Evaluator-needed | Build a check that the right background info was pulled
The taxonomy tells you WHAT is failing. The triage tells you WHAT TO DO about it — and who owns the fix.
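Grouping categories by their triage label is one way to turn the taxonomy into that sprint plan. A minimal sketch using the example categories from this lesson; the tuple shape is an assumption.

```python
from collections import defaultdict

# Taxonomy rows from the triage step: (category, severity, triage label).
taxonomy = [
    ("Scoping error", "Critical", "Evaluator-needed"),
    ("Generation error", "Major", "System-fix"),
    ("Characterization error", "Critical", "Prompt-fix"),
    ("Retrieval error", "Major", "Evaluator-needed"),
]

# Group by triage label: each bucket becomes a different owner's work item.
plan = defaultdict(list)
for category, severity, triage in taxonomy:
    plan[triage].append(category)

# Prompt-fix items go into this sprint, Evaluator-needed feeds Week 3
# metric design, System-fix goes to the engineering backlog.
for triage, cats in plan.items():
    print(f"{triage}: {', '.join(cats)}")
```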

Build a failure taxonomy from 50 v0 traces: annotate, cluster, check saturation, triage

Base version (all students, 20-25 min)
  • Annotate traces 6-15, then continue through trace 30+ from a curated sample of 50
  • Cluster into named categories with example trace IDs
  • Assign severity and triage label per category
  • Run saturation check on 10 additional traces
Deliverable: Taxonomy table with 6+ categories. Complete your own annotation before comparing to any reference.
Extend version (DS/Eng, +10-15 min)
  • Compute approximate frequency per category
  • Assess how hard each failure is to detect automatically
  • Identify which categories need better logging in the next version
Deliverable: Extended taxonomy with frequency, detection difficulty, and logging gaps.
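The frequency computation in the extend version can be as simple as counting category labels over annotated traces. A sketch with made-up annotations; real counts come from your 50-trace sample.

```python
from collections import Counter

# Hypothetical annotated traces: (trace_id, category or None for no failure).
annotated = [
    (1, "Scoping error"), (2, None), (3, "Generation error"),
    (4, "Characterization error"), (5, "Retrieval error"),
    (6, "Scoping error"), (7, None), (8, "Scoping error"),
]

counts = Counter(cat for _, cat in annotated if cat is not None)
total = len(annotated)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} = {n / total:.0%}")
# e.g. Scoping error: 3/8 = 38%
```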

Your failure taxonomy identifies 6+ failure categories discovered bottom-up from raw traces, classified by severity and triage type

Portfolio Artifact
Category | Description | Example | Severity | Triage
Scoping error | Wrong time range in SQL | Trace 1 | Critical | Evaluator-needed
Generation error | SQL syntax failure | Trace 3 | Major | System-fix
Characterization error | Narrative misrepresents data | Trace 4 | Critical | Prompt-fix
Retrieval error | Wrong context returned | Trace 5 | Major | Evaluator-needed
N categories | M traces annotated | Saturation verified
What I built — A failure taxonomy for the AI Data Analyst with 6+ distinct failure categories discovered bottom-up from raw v0 traces, classified by severity and triage type.

Starting with predefined categories and force-fitting traces misses the failures that matter most

Failure Mode | What Happens | What To Do Instead
Premature categorization | Starting with predefined buckets and force-fitting traces misses unexpected failure types | Read 30+ traces with freeform notes BEFORE creating any categories
Insufficient reading depth | Reading 10 traces and declaring saturation misses 3+ categories that emerge after trace 20 | Run the saturation check: 10 more traces, zero new categories
Conflating severity with frequency | Rare-but-critical failures get marked minor because they are infrequent | Severity = impact per incident. Frequency = how often it happens. Keep the two separate
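Keeping severity and frequency as separate fields makes the distinction concrete: ranking by frequency alone buries rare-but-critical failures, while triaging on severity first surfaces them. A sketch with illustrative values.

```python
# Severity and frequency live in separate fields; values are illustrative.
categories = [
    {"name": "Wrong time range scope", "severity": "Critical", "frequency": 0.03},
    {"name": "Verbose narrative", "severity": "Minor", "frequency": 0.15},
]

# Ranking by frequency alone buries the rare-but-critical failure:
by_frequency = sorted(categories, key=lambda c: -c["frequency"])
print(by_frequency[0]["name"])  # Verbose narrative

# Triage on severity first, with frequency as a tiebreaker:
rank = {"Critical": 0, "Major": 1, "Minor": 2}
by_severity = sorted(categories, key=lambda c: (rank[c["severity"]], -c["frequency"]))
print(by_severity[0]["name"])  # Wrong time range scope
```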

The narrative omits the most important finding — prompt-fix, evaluator-needed, or system-fix?

1. You discover a failure mode where the AI generates correct SQL and accurate numbers, but the narrative summary omits the most important finding. Prompt-fix, evaluator-needed, or system-fix? Why?
2. After reading 25 traces, you have 7 categories and the last 5 traces all fell into existing categories. A colleague says "you've reached saturation." What would you check?
3. "Wrong time range scope" failures are rare (3%) but critical. "Verbose narrative" failures are common (15%) but minor. Which should you build a metric for first?

Five steps: read traces, freeform notes, cluster bottom-up, check saturation, triage — then three downstream outputs

Read traces
Open mind
Freeform notes
Your words
Cluster
Bottom-up
Saturation
New categories?
Triage
Classify each
What to log
Coverage gaps feed Week 2 logging design
What to measure
Evaluator-needed categories feed Week 3 metric design
What to fix now
Prompt-fix categories go into this sprint

Next: Distributional thinking for AI quality

You discovered what's failing. Next you'll learn to think about failure rates as distributions, not single numbers — the foundation for every metric you'll build in Week 3.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 3 | aianalystlab.ai