Lesson 2.3: Trace design + tooling

Week 2 · Instrumentation Design
Shane Butler · AI Analyst Lab

Which field was missing in v0 that made success vs failure invisible?

v0 trace fields (basic version from L2.1)
  • request_id
  • user_query
  • final_answer
  • latency_ms
  • success
  • timestamp
What's missing?

Think back to L2.1 — which field did v1 add that let you see SQL execution success or failure?

v0 traces tell you "it worked" but not where failures occurred

What PM asks
"User reports conversion rate of 4.2% but BI dashboard says 3.8%. What went wrong?"
What v0 trace shows
success: true
latency_ms: 2300
Cannot answer: No SQL query logged, no retrieval data, no stage-level results.

A trace is a structured record of one request's path through your AI system

User query
Retrieval
SQL gen
Chart render
Narrative
Final answer
One trace = one query's journey through the pipeline
A trace schema specifies which fields to capture at each stage — it's the blueprint for your evidence collection.

Each pipeline stage gets its own section in the trace (called a "span")

1
Retrieval span
retrieval_query (what question was sent to search), num_docs_retrieved (how many came back), top_doc_score (how relevant the best result was)
2
SQL span
generated_sql (the query that was written), sql_success (did it execute without errors?), sql_latency_ms (how long did it take?)
3
Chart span
chart_type (bar, line, etc.), chart_success (did it render?), chart_error (what broke if it failed?)
4
Narrative span
narrative_text (what was written), model_used (which LLM), prompt_tokens (how much context?)
Each span = one pipeline stage with its own fields
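One way to write this schema down is as plain Python data that the rest of your tooling can import. This is a sketch: the span and field names follow the slides above, but the `V1_SCHEMA` name and `expected_fields` helper are illustrative, not part of the lab's codebase.

```python
# Sketch of the v1 trace schema as plain Python data.
# Span/field names follow the slides; the structure itself is an assumption.
V1_SCHEMA = {
    "retrieval": ["retrieval_query", "num_docs_retrieved", "top_doc_score"],
    "sql": ["generated_sql", "sql_success", "sql_latency_ms"],
    "chart": ["chart_type", "chart_success", "chart_error"],
    "narrative": ["narrative_text", "model_used", "prompt_tokens"],
}

def expected_fields(schema):
    """Flatten the per-span spec into one list of expected field names."""
    return [f for fields in schema.values() for f in fields]
```

Keeping the schema in one place like this means your logger, your flattener, and your validator can all check against the same list of fields.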

Flatten trace data into rows so you can filter, sort, and compute metrics

Raw trace data (hard to analyze)
{
  "request_id": "req_0127",
  "spans": {
    "retrieval": {...},
    "sql": {...}
  }
}
Flattened data (easy to filter and query)
request_id: req_0127 | user_query: "mobile DAU..." | num_docs: 5 | sql_success: true
spans_to_row() converts nested traces to DataFrames — now you can filter, group, and compute metrics using standard pandas operations.
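The lab's `spans_to_row()` implementation isn't shown here; a minimal version, assuming one level of spans and a `span_field` naming convention for flattened columns, might look like:

```python
def spans_to_row(trace):
    """Flatten one nested trace into a single flat dict.

    Span fields are prefixed with the span name (e.g. "retrieval_num_docs")
    so column names stay unique. A sketch; the lab's helper may differ.
    """
    row = {k: v for k, v in trace.items() if k != "spans"}
    for span_name, fields in trace.get("spans", {}).items():
        for field, value in fields.items():
            row[f"{span_name}_{field}"] = value
    return row

traces = [
    {"request_id": "req_0127",
     "spans": {"retrieval": {"num_docs": 5}, "sql": {"success": True}}},
]
rows = [spans_to_row(t) for t in traces]
# pandas.DataFrame(rows) then gives a table you can filter, group,
# and aggregate with standard operations, e.g. df["sql_success"].mean().
```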

Completeness measures what percentage of fields are filled in

Metric            | v0                                  | v1
Fields populated  | 6                                   | 19
Fields empty      | 14                                  | 1
Completeness %    | 42%                                 | 95%
What's measurable | Success rate, latency, query volume | 28 metrics spanning all pipeline stages
+53 percentage points
53pp improvement = 10x more questions you can answer
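Completeness here is the share of expected fields that are actually populated, averaged over a sample of traces. A minimal sketch of the calculation (the function name and the "None or empty string counts as missing" rule are assumptions):

```python
def completeness(rows, expected_fields):
    """Average share of expected fields that are populated across rows.

    A field counts as populated if it is present and not None/empty-string.
    """
    def populated(row, field):
        value = row.get(field)
        return value is not None and value != ""

    scores = [
        sum(populated(row, f) for f in expected_fields) / len(expected_fields)
        for row in rows
    ]
    return sum(scores) / len(scores)

# Toy example: 2 of 4 expected fields populated -> 0.5
rows = [{"a": 1, "b": "", "c": None, "d": "x"}]
completeness(rows, ["a", "b", "c", "d"])
```

Run the same function over v0 and v1 trace samples with the same expected-field list and the gap between the two scores is the completeness improvement reported above.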

The Instrumentation Readiness Report documents what becomes measurable

  • Schema comparison table: field name, in v0?, in v1?, what becomes measurable
  • Completeness scores: v0 vs v1 (percentage and counts)
  • Per-stage field specifications with justifications
  • Validation results: does the logging actually capture the fields we designed? (tested on real data)
This report proves instrumentation investment is worth engineering time.

Which schema enables debugging retrieval failures?

Option A
  • num_docs_retrieved
  • retrieval_latency_ms
Option B
  • num_docs_retrieved
  • retrieval_latency_ms
  • retrieval_query
  • top_doc_score
  • retrieved_doc_ids
A user reports a factually incorrect metric definition. Which schema lets you prove retrieval was the failure point?
Pause and predict before advancing.

v0 has 6 fields, 42% completeness — analytically useless

request_id: req_0127
user_query: "What's our DAU trend for mobile app last 30 days?"
final_answer: "Your mobile app DAU averaged 47,320..."
latency_ms: 2300
success: true
timestamp: 2025-01-15T14:23:01Z
Missing: retrieval data, SQL query, SQL success/error, chart type, per-stage latencies.
Cannot compute SQL correctness, retrieval quality, or latency breakdown by stage.

v1 has 19 fields, 95% completeness — every failure type measurable

Retrieval span
retrieval_query: "mobile app DAU..."
num_docs: 5
top_score: 0.84
latency_ms: 180
SQL generation span
generated_sql: "SELECT date..."
sql_success: true
sql_latency_ms: 450
sql_row_count: 30
19 fields populated, 1 empty (sql_error — blank when SQL succeeds)
Now you can answer: Did retrieval get the right docs? Did SQL execute correctly? Which stage caused latency?

v0 enables 3 metrics, v1 enables 28 metrics

Field             | v0 | v1  | What becomes measurable
retrieval_query   | -- | Yes | Retrieval quality — how well the search system interprets user questions
retrieved_doc_ids | -- | Yes | Are we finding the right documents? Which docs get used most?
generated_sql     | -- | Yes | SQL correctness — does the query match the user's intent?
sql_success       | -- | Yes | SQL error rate — how often do queries fail? What are the failure patterns?
10x more questions you can answer
3 metrics versus 28. That's not incremental. That's a different category of capability.

Design complete trace schema and produce readiness report

Base (20-25 min)
  • Compute v0 completeness (sample 10 traces)
  • Compute v1 completeness (sample 10 traces)
  • Complete SQL generation schema
  • Design chart rendering schema from scratch
  • Assemble Instrumentation Readiness Report
Extend (DS/Eng, +15 min)
  • Write custom trace validator
  • Check if your traces match the fields you designed — test on 20 v0 + 20 v1 traces
  • Report how many traces match and common gaps
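For the extend task, a custom validator can be quite small. This sketch assumes flattened traces (dicts) and treats None or empty string as a gap; the function name and return shape are illustrative, not a required interface.

```python
from collections import Counter

def validate_traces(traces, expected_fields):
    """Check flattened traces against a list of expected fields.

    Returns (match_rate, gap_counts): the share of traces where every
    expected field is populated, plus a Counter of how often each field
    was missing or empty -- the "common gaps" for the report.
    """
    gaps = Counter()
    matches = 0
    for trace in traces:
        missing = [f for f in expected_fields
                   if trace.get(f) in (None, "")]
        gaps.update(missing)
        if not missing:
            matches += 1
    return matches / len(traces), gaps

traces = [
    {"request_id": "r1", "sql_success": True},
    {"request_id": "r2", "sql_success": None},
]
rate, gaps = validate_traces(traces, ["request_id", "sql_success"])
# rate == 0.5; gaps == Counter({"sql_success": 1})
```

Running this over 20 v0 and 20 v1 traces gives both numbers the report asks for: how many traces match the designed schema, and which fields most often come back empty.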

Logging everything without a design principle buries the useful data

Logging everything
Impact: Storage cost increases 40x, privacy violations from personal data in traces, 80% of fields never queried
How to avoid: Apply the blocking metric test — reject any field that isn't required for a specific, named metric
Not versioning your schema
Impact: Time-series analysis breaks because old traces lack new fields, historical debugging impossible
How to avoid: Include schema_version field, document changes, fill in missing fields with placeholder values like 'unknown'
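The versioning advice above can be sketched as a small upgrade step that backfills old traces. The `CURRENT_VERSION` and `V2_FIELDS` names are hypothetical; the point is the pattern of tagging a version and filling gaps with 'unknown' so old and new traces share one shape.

```python
CURRENT_VERSION = 2
V2_FIELDS = ["request_id", "sql_success", "retrieval_query"]  # illustrative

def upgrade_trace(trace):
    """Bring an older trace up to the current schema version.

    Missing fields are filled with the placeholder "unknown" so that
    time-series analysis over mixed-version traces does not break.
    """
    upgraded = dict(trace)
    for field in V2_FIELDS:
        upgraded.setdefault(field, "unknown")
    upgraded["schema_version"] = CURRENT_VERSION
    return upgraded

old = {"request_id": "req_0001", "sql_success": True}
upgrade_trace(old)
# retrieval_query becomes "unknown"; schema_version is set to 2
```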

Not testing your schema on real traces before deploying

Deploying untested instrumentation
Impact: Instrumentation bugs go undetected, an entire week of data missing critical fields, broken metrics
How to avoid: Deploy to staging first, generate 100 test traces, validate 95%+ of traces match your schema before going to production

Apply schema design thinking to real debugging scenarios

1
Scenario: Trace has generated_sql but sql_success is empty
What's impossible to measure? SQL correctness rate — you don't know if the query succeeded or failed
2
Scenario: Teammate wants to log the full prompt (2000 chars) in every trace
How do you decide? Cost-benefit: $13/year storage vs debugging value — log it, but sample in production
3
Scenario: v0 shows latency_ms: 2300 but no stage breakdown. PM says fix it.
What's the problem? You're optimizing blind — can't identify which stage is the bottleneck

v0 to v1: from 3 metrics to 28

v0 Trace
6 fields
42% completeness
3 metrics
v1 Trace
19 fields (4 spans)
95% completeness
28 metrics
Process: Identify stages → Specify fields per stage → Validate on real traces

Summary: 10x more questions you can answer

Fields: 6 → 19 (+217%)
Completeness: 42% → 95% (+53pp)
Metrics: 3 → 28 (9.3x)
Ship decisions: No evidence → Evidence-backed

Next: The regression safety net — automated evaluation before every code change ships

You have instrumentation. Now you'll automate evaluation so every code change gets tested against real traces before it ships.
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 3 | aianalystlab.ai