Failure discovery

Lesson 1.3 · Week 1 · Introduction to AI Product Evaluation
Shane Butler · AI Analyst Lab

In L1.2, you mapped where failures CAN happen — but which surfaces could you not evaluate with v0's limited logs?

What you built in L1.2
Functional failures
SQL errors, wrong outputs, latency
Adversarial vectors
Prompt injection, deliberate misuse, edge cases
Coverage gaps
What you can't evaluate yet
The gap you identified
Which failure modes landed in the "cannot evaluate" column?
What did you list as logging blind spots?

Fifty complaints in seven days, and no two are described the same way

"Gave me last quarter's data when I asked for this quarter"
"SQL failed but no error message"
"Narrative said 'slight decline' when revenue dropped 40%"
"Completely hallucinated a metric that doesn't exist"
50 complaints. No two described the same way. Where do you even start?

Predefined checklists miss the failures you haven't imagined yet

Checklist approach
  • Hallucination?
  • SQL syntax error?
  • Latency spike?
  • Missing data?
Misses failures that don't fit these buckets
Discovery approach
  • Correct SQL, wrong time range
  • Narrative minimizes a 40% drop
  • System pulled wrong background info for the question
Captures what actually happened

Read the traces first, let the categories emerge — don't start with a checklist

Top-down: Start with categories, force-fit traces
You see what you expect. You miss what you don't.
Bottom-up: Start with traces, let categories emerge
You see what's actually there. Categories come from the data.
Categories come from the data, not from prior assumptions. Read first. Name second.

Five steps from raw traces to actionable taxonomy: read, note, cluster, saturate, triage

Read traces
Open mind, no categories
Freeform notes
One note per trace, your words
Cluster bottom-up
Group similar notes
Check saturation
New categories still emerging?
Triage
Prompt-fix / Evaluator-needed / System-fix
Five steps. Read, note, cluster, saturate, triage. Simple to describe. The discipline is in not skipping ahead.
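The first steps above can be sketched as plain data. This is a minimal illustration, not a prescribed format: the record shapes are assumptions, and the notes mirror the demo traces later in this lesson.

```python
# Steps 1-2: one freeform note per trace, in the annotator's own words.
# No categories exist yet at this point.
annotations = [
    {"trace_id": 1, "note": "Correct SQL execution, wrong time range scope"},
    {"trace_id": 2, "note": "No visible failure"},
    {"trace_id": 3, "note": "SQL syntax error - failed to connect tables"},
    {"trace_id": 4, "note": "Narrative minimizes significant finding"},
    {"trace_id": 5, "note": "Wrong background info returned for the question"},
]

# Step 3: cluster bottom-up. Category names are invented only AFTER
# similar notes are grouped; "no visible failure" traces stay unclustered.
clusters = {
    "Scoping error": [1],
    "Generation error": [3],
    "Characterization error": [4],
    "Retrieval error": [5],
}
```

The point of the shape: the note field comes first and the category key comes later, never the other way around.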

If you discover a new category on trace 30, you haven't read enough

Category discovery curve
[Chart: categories found plotted against traces read (from 5 to 30 traces)]
Decision at trace 30: if a new category still appears, keep reading; if no new categories emerge, proceed to triage.
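The saturation check can be expressed as a small predicate: a fresh batch of traces should introduce no category you have not already seen. A minimal sketch; the function name and category strings are illustrative.

```python
def saturation_reached(categories_so_far, new_batch_categories):
    """True if a fresh batch of traces introduced no new categories."""
    return set(new_batch_categories) <= set(categories_so_far)

# At trace 30: compare the last batch's categories against what we have.
known = {"Scoping error", "Generation error", "Characterization error", "Retrieval error"}
batch = ["Scoping error", "Generation error", "Scoping error"]
print(saturation_reached(known, batch))  # True: proceed to triage

batch_with_new = batch + ["Tone error"]
print(saturation_reached(known, batch_with_new))  # False: keep reading
```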

Every failure category is either a prompt-fix, an evaluator-needed, or a system-fix

Prompt-fix
Improve with prompt engineering this sprint.
Example: Narrative minimizes findings — add severity language to prompt.
Evaluator-needed
Build a metric to monitor. This is Week 3's focus.
Example: Wrong time range — need an automated check comparing query intent to SQL filters.
System-fix
Requires code or architecture change.
Example: Silent SQL failures — need the system to surface errors to the user, not a prompt tweak.

You have 500 traces — how many distinct failure categories do you think exist?

You have 500 v0 traces from the AI Data Analyst.
Before reading any of them:
1. How many distinct failure categories do you think exist? Write a number.
2. List 3 failure types you expect to find.
Write your predictions now. We'll compare after the demo.

Five traces, four distinct failure notes — and none of them fit "hallucination" or "SQL error" cleanly

Trace | Observation | Freeform Note
1 | Revenue query, SQL used Q2 data for Q3 question | "Correct SQL execution, wrong time range scope"
2 | Straightforward query, output looks correct | "No visible failure"
3 | SQL failed to connect the right data tables | "SQL syntax error — failed to connect tables"
4 | Narrative says "slight decline" for 40% drop | "Narrative minimizes significant finding"
5 | Engagement query, system pulled revenue context | "Wrong background info returned for the question"
Five traces. Four distinct failure notes. One "no visible failure." The notes are messy. That's intentional.

Four traces produced four distinct categories — if this diversity holds, the full dataset will reveal many more

Scoping error
Correct SQL execution, wrong time range (Trace 1)
Generation error
SQL syntax failure — failed to connect tables (Trace 3)
Characterization error
Narrative minimizes a 40% drop as "slight decline" (Trace 4)
Retrieval error
Wrong background info pulled for the query (Trace 5)
4 traces, 4 distinct categories. A predefined checklist of "hallucination" and "SQL error" would have caught 1 of these 4.

The triage step turns a taxonomy into a sprint plan

Category | Severity | Triage | Next Action
Scoping error (wrong time range) | Critical | Evaluator-needed | Build a check that compares the time period asked vs. the time period queried
Generation error (SQL syntax) | Major | System-fix | Add a check that catches SQL errors before the query runs
Characterization error (misleading narrative) | Critical | Prompt-fix | Add a severity-accuracy instruction to the prompt
Retrieval error (wrong context) | Major | Evaluator-needed | Build a check that the right background info was pulled
The taxonomy tells you WHAT is failing. The triage tells you WHAT TO DO about it — and who owns the fix.
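Grouping categories by their triage label is one way to turn the taxonomy into that sprint plan. A minimal sketch using the example categories from this lesson; the tuple shape is an assumption.

```python
from collections import defaultdict

# Taxonomy rows from the triage step: (category, severity, triage label).
taxonomy = [
    ("Scoping error", "Critical", "Evaluator-needed"),
    ("Generation error", "Major", "System-fix"),
    ("Characterization error", "Critical", "Prompt-fix"),
    ("Retrieval error", "Major", "Evaluator-needed"),
]

# Group by triage label: each bucket becomes a different owner's work item.
plan = defaultdict(list)
for category, severity, triage in taxonomy:
    plan[triage].append(category)

# Prompt-fix items go into this sprint, Evaluator-needed feeds Week 3
# metric design, System-fix goes to the engineering backlog.
for triage, cats in plan.items():
    print(f"{triage}: {', '.join(cats)}")
```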

Build a failure taxonomy from 50 v0 traces: annotate, cluster, check saturation, triage

Base version (all students, 20-25 min)
  • Annotate traces 6-15, then continue through trace 30+ from a curated sample of 50
  • Cluster into named categories with example trace IDs
  • Assign severity and triage label per category
  • Run saturation check on 10 additional traces
Deliverable: Taxonomy table with 6+ categories. Complete your own annotation before comparing to any reference.
Extend version (DS/Eng, +10-15 min)
  • Compute approximate frequency per category
  • Assess how hard each failure is to detect automatically
  • Identify which categories need better logging in the next version
Deliverable: Extended taxonomy with frequency, detection difficulty, and logging gaps.
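The frequency computation in the extend version can be as simple as counting category labels over annotated traces. A sketch with made-up annotations; real counts come from your 50-trace sample.

```python
from collections import Counter

# Hypothetical annotated traces: (trace_id, category or None for no failure).
annotated = [
    (1, "Scoping error"), (2, None), (3, "Generation error"),
    (4, "Characterization error"), (5, "Retrieval error"),
    (6, "Scoping error"), (7, None), (8, "Scoping error"),
]

counts = Counter(cat for _, cat in annotated if cat is not None)
total = len(annotated)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} = {n / total:.0%}")
# e.g. Scoping error: 3/8 = 38%
```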

Your failure taxonomy identifies 6+ failure categories discovered bottom-up from raw traces, classified by severity and triage type

Portfolio Artifact
Category | Description | Example | Severity | Triage
Scoping error | Wrong time range in SQL | Trace 1 | Critical | Evaluator-needed
Generation error | SQL syntax failure | Trace 3 | Major | System-fix
Characterization error | Narrative misrepresents data | Trace 4 | Critical | Prompt-fix
Retrieval error | Wrong context returned | Trace 5 | Major | Evaluator-needed
N categories | M traces annotated | Saturation verified
What I built — A failure taxonomy for the AI Data Analyst with 6+ distinct failure categories discovered bottom-up from raw v0 traces, classified by severity and triage type.

Starting with predefined categories and force-fitting traces misses the failures that matter most

Failure Mode | What Happens | What To Do Instead
Premature categorization | Starting with predefined buckets and force-fitting traces misses unexpected failure types | Read 30+ traces with freeform notes BEFORE creating any categories
Insufficient reading depth | Reading 10 traces and declaring saturation misses 3+ categories that emerge after trace 20 | Run the saturation check: 10 more traces, zero new categories
Conflating severity with frequency | Rare-but-critical failures get marked minor because they are infrequent | Severity = impact per incident. Frequency = how often it happens. Keep the two separate
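Keeping severity and frequency as separate fields makes the distinction concrete: ranking by frequency alone buries rare-but-critical failures, while triaging on severity first surfaces them. A sketch with illustrative values.

```python
# Severity and frequency live in separate fields; values are illustrative.
categories = [
    {"name": "Wrong time range scope", "severity": "Critical", "frequency": 0.03},
    {"name": "Verbose narrative", "severity": "Minor", "frequency": 0.15},
]

# Ranking by frequency alone buries the rare-but-critical failure:
by_frequency = sorted(categories, key=lambda c: -c["frequency"])
print(by_frequency[0]["name"])  # Verbose narrative

# Triage on severity first, with frequency as a tiebreaker:
rank = {"Critical": 0, "Major": 1, "Minor": 2}
by_severity = sorted(categories, key=lambda c: (rank[c["severity"]], -c["frequency"]))
print(by_severity[0]["name"])  # Wrong time range scope
```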

The narrative omits the most important finding — prompt-fix, evaluator-needed, or system-fix?

1. You discover a failure mode where the AI generates correct SQL and accurate numbers, but the narrative summary omits the most important finding. Prompt-fix, evaluator-needed, or system-fix? Why?
2. After reading 25 traces, you have 7 categories and the last 5 traces all fell into existing categories. A colleague says "you've reached saturation." What would you check?
3. "Wrong time range scope" failures are rare (3%) but critical. "Verbose narrative" failures are common (15%) but minor. Which should you build a metric for first?

Five steps: read traces, freeform notes, cluster bottom-up, check saturation, triage — then three downstream outputs

Read traces
Open mind
Freeform notes
Your words
Cluster
Bottom-up
Saturation
New categories?
Triage
Classify each
What to log
Coverage gaps feed Week 2 logging design
What to measure
Evaluator-needed categories feed Week 3 metric design
What to fix now
Prompt-fix categories go into this sprint

Next: Distributional thinking for AI quality

You discovered what's failing. Next you'll learn to think about failure rates as distributions, not single numbers — the foundation for every metric you'll build in Week 3.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 3 | aianalystlab.ai