Lesson 2.3: Trace design + tooling

Week 2 · Instrumentation Design
Shane Butler · AI Analyst Lab

Which field was missing in v0 that made success vs failure invisible?

v0 trace fields (basic version from L2.1)
  • request_id
  • user_query
  • final_answer
  • latency_ms
  • success
  • timestamp
What's missing?

Think back to L2.1 — which field did v1 add that let you see SQL execution success or failure?

v0 traces tell you "it worked" but not where failures occurred

What PM asks
"User reports conversion rate of 4.2% but BI dashboard says 3.8%. What went wrong?"
What v0 trace shows
success: true
latency_ms: 2300
Cannot answer: No SQL query logged, no retrieval data, no stage-level results.

A trace is a structured record of one request's path through your AI system

User query
Retrieval
SQL gen
Chart render
Narrative
Final answer
One trace = one query's journey through the pipeline
A trace schema specifies which fields to capture at each stage — it's the blueprint for your evidence collection.

Each pipeline stage gets its own section in the trace (called a "span")

1
Retrieval span
retrieval_query (what question was sent to search), num_docs_retrieved (how many came back), top_doc_score (how relevant the best result was)
2
SQL span
generated_sql (the query that was written), sql_success (did it execute without errors?), sql_latency_ms (how long did it take?)
3
Chart span
chart_type (bar, line, etc.), chart_success (did it render?), chart_error (what broke if it failed?)
4
Narrative span
narrative_text (what was written), model_used (which LLM), prompt_tokens (how much context?)
Each span = one pipeline stage with its own fields
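One way to write this schema down is as plain Python data that the rest of your tooling can import. This is a sketch: the span and field names follow the slides above, but the `V1_SCHEMA` name and `expected_fields` helper are illustrative, not part of the lab's codebase.

```python
# Sketch of the v1 trace schema as plain Python data.
# Span/field names follow the slides; the structure itself is an assumption.
V1_SCHEMA = {
    "retrieval": ["retrieval_query", "num_docs_retrieved", "top_doc_score"],
    "sql": ["generated_sql", "sql_success", "sql_latency_ms"],
    "chart": ["chart_type", "chart_success", "chart_error"],
    "narrative": ["narrative_text", "model_used", "prompt_tokens"],
}

def expected_fields(schema):
    """Flatten the per-span spec into one list of expected field names."""
    return [f for fields in schema.values() for f in fields]
```

Keeping the schema in one place like this means your logger, your flattener, and your validator can all check against the same list of fields.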

Flatten trace data into rows so you can filter, sort, and compute metrics

Raw trace data (hard to analyze)
{
  "request_id": "req_0127",
  "spans": {
    "retrieval": {...},
    "sql": {...}
  }
}
Flattened data (easy to filter and query)
request_id: req_0127 | user_query: "mobile DAU..." | num_docs: 5 | sql_success: true
spans_to_row() converts nested traces to DataFrames — now you can filter, group, and compute metrics using standard pandas operations.
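The lab's `spans_to_row()` implementation isn't shown here; a minimal version, assuming one level of spans and a `span_field` naming convention for flattened columns, might look like:

```python
def spans_to_row(trace):
    """Flatten one nested trace into a single flat dict.

    Span fields are prefixed with the span name (e.g. "retrieval_num_docs")
    so column names stay unique. A sketch; the lab's helper may differ.
    """
    row = {k: v for k, v in trace.items() if k != "spans"}
    for span_name, fields in trace.get("spans", {}).items():
        for field, value in fields.items():
            row[f"{span_name}_{field}"] = value
    return row

traces = [
    {"request_id": "req_0127",
     "spans": {"retrieval": {"num_docs": 5}, "sql": {"success": True}}},
]
rows = [spans_to_row(t) for t in traces]
# pandas.DataFrame(rows) then gives a table you can filter, group,
# and aggregate with standard operations, e.g. df["sql_success"].mean().
```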

Completeness measures what percentage of fields are filled in

Metric            | v0                                  | v1
Fields populated  | 6                                   | 19
Fields empty      | 14                                  | 1
Completeness %    | 42%                                 | 95%
What's measurable | Success rate, latency, query volume | 28 metrics spanning all pipeline stages
+53 percentage points
53pp improvement = 10x more questions you can answer
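Completeness here is the share of expected fields that are actually populated, averaged over a sample of traces. A minimal sketch of the calculation (the function name and the "None or empty string counts as missing" rule are assumptions):

```python
def completeness(rows, expected_fields):
    """Average share of expected fields that are populated across rows.

    A field counts as populated if it is present and not None/empty-string.
    """
    def populated(row, field):
        value = row.get(field)
        return value is not None and value != ""

    scores = [
        sum(populated(row, f) for f in expected_fields) / len(expected_fields)
        for row in rows
    ]
    return sum(scores) / len(scores)

# Toy example: 2 of 4 expected fields populated -> 0.5
rows = [{"a": 1, "b": "", "c": None, "d": "x"}]
completeness(rows, ["a", "b", "c", "d"])
```

Run the same function over v0 and v1 trace samples with the same expected-field list and the gap between the two scores is the completeness improvement reported above.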

The Instrumentation Readiness Report documents what becomes measurable

  • Schema comparison table: field name, in v0?, in v1?, what becomes measurable
  • Completeness scores: v0 vs v1 (percentage and counts)
  • Per-stage field specifications with justifications
  • Validation results: does the logging actually capture the fields we designed? (tested on real data)
This report proves instrumentation investment is worth engineering time.

Which schema enables debugging retrieval failures?

Option A
  • num_docs_retrieved
  • retrieval_latency_ms
Option B
  • num_docs_retrieved
  • retrieval_latency_ms
  • retrieval_query
  • top_doc_score
  • retrieved_doc_ids
A user reports a factually incorrect metric definition. Which schema lets you prove retrieval was the failure point?
Pause and predict before advancing.

v0 has 6 fields, 42% completeness — analytically useless

request_id: req_0127
user_query: "What's our DAU trend for mobile app last 30 days?"
final_answer: "Your mobile app DAU averaged 47,320..."
latency_ms: 2300
success: true
timestamp: 2025-01-15T14:23:01Z
Missing: retrieval data, SQL query, SQL success/error, chart type, per-stage latencies.
Cannot compute SQL correctness, retrieval quality, or latency breakdown by stage.

v1 has 19 fields, 95% completeness — every failure type measurable

Retrieval span
retrieval_query: "mobile app DAU..."
num_docs: 5
top_score: 0.84
latency_ms: 180
SQL generation span
generated_sql: "SELECT date..."
sql_success: true
sql_latency_ms: 450
sql_row_count: 30
19 fields populated, 1 empty (sql_error — blank when SQL succeeds)
Now you can answer: Did retrieval get the right docs? Did SQL execute correctly? Which stage caused latency?

v0 enables 3 metrics, v1 enables 28 metrics

Field             | v0 | v1  | What becomes measurable
retrieval_query   | -- | Yes | Retrieval quality — how well the search system interprets user questions
retrieved_doc_ids | -- | Yes | Are we finding the right documents? Which docs get used most?
generated_sql     | -- | Yes | SQL correctness — does the query match the user's intent?
sql_success       | -- | Yes | SQL error rate — how often do queries fail? What are the failure patterns?
10x more questions you can answer
3 metrics versus 28. That's not incremental. That's a different category of capability.

Design complete trace schema and produce readiness report

Base (20-25 min)
  • Compute v0 completeness (sample 10 traces)
  • Compute v1 completeness (sample 10 traces)
  • Complete SQL generation schema
  • Design chart rendering schema from scratch
  • Assemble Instrumentation Readiness Report
Extend (DS/Eng, +15 min)
  • Write custom trace validator
  • Check if your traces match the fields you designed — test on 20 v0 + 20 v1 traces
  • Report how many traces match and common gaps
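For the extend task, a custom validator can be quite small. This sketch assumes flattened traces (dicts) and treats None or empty string as a gap; the function name and return shape are illustrative, not a required interface.

```python
from collections import Counter

def validate_traces(traces, expected_fields):
    """Check flattened traces against a list of expected fields.

    Returns (match_rate, gap_counts): the share of traces where every
    expected field is populated, plus a Counter of how often each field
    was missing or empty -- the "common gaps" for the report.
    """
    gaps = Counter()
    matches = 0
    for trace in traces:
        missing = [f for f in expected_fields
                   if trace.get(f) in (None, "")]
        gaps.update(missing)
        if not missing:
            matches += 1
    return matches / len(traces), gaps

traces = [
    {"request_id": "r1", "sql_success": True},
    {"request_id": "r2", "sql_success": None},
]
rate, gaps = validate_traces(traces, ["request_id", "sql_success"])
# rate == 0.5; gaps == Counter({"sql_success": 1})
```

Running this over 20 v0 and 20 v1 traces gives both numbers the report asks for: how many traces match the designed schema, and which fields most often come back empty.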

Logging everything without a design principle buries the useful data

Logging everything
Impact: Storage cost increases 40x, privacy violations from personal data in traces, 80% of fields never queried
How to avoid: Apply the blocking metric test — reject any field that isn't required for a specific, named metric
Not versioning your schema
Impact: Time-series analysis breaks because old traces lack new fields, historical debugging impossible
How to avoid: Include schema_version field, document changes, fill in missing fields with placeholder values like 'unknown'
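The versioning advice above can be sketched as a small upgrade step that backfills old traces. The `CURRENT_VERSION` and `V2_FIELDS` names are hypothetical; the point is the pattern of tagging a version and filling gaps with 'unknown' so old and new traces share one shape.

```python
CURRENT_VERSION = 2
V2_FIELDS = ["request_id", "sql_success", "retrieval_query"]  # illustrative

def upgrade_trace(trace):
    """Bring an older trace up to the current schema version.

    Missing fields are filled with the placeholder "unknown" so that
    time-series analysis over mixed-version traces does not break.
    """
    upgraded = dict(trace)
    for field in V2_FIELDS:
        upgraded.setdefault(field, "unknown")
    upgraded["schema_version"] = CURRENT_VERSION
    return upgraded

old = {"request_id": "req_0001", "sql_success": True}
upgrade_trace(old)
# retrieval_query becomes "unknown"; schema_version is set to 2
```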

Not testing your schema on real traces before deploying

Deploying untested instrumentation
Impact: Instrumentation bugs go undetected, an entire week of data missing critical fields, broken metrics
How to avoid: Deploy to staging first, generate 100 test traces, validate 95%+ of traces match your schema before going to production

Apply schema design thinking to real debugging scenarios

1
Scenario: Trace has generated_sql but sql_success is empty
What's impossible to measure? SQL correctness rate — you don't know if the query succeeded or failed
2
Scenario: Teammate wants to log the full prompt (2000 chars) in every trace
How do you decide? Cost-benefit: $13/year storage vs debugging value — log it, but sample in production
3
Scenario: v0 shows latency_ms: 2300 but no stage breakdown. PM says fix it.
What's the problem? You're optimizing blind — can't identify which stage is the bottleneck

v0 to v1: from 3 metrics to 28

v0 Trace
6 fields
42% completeness
3 metrics
v1 Trace
19 fields (4 spans)
95% completeness
28 metrics
Process: Identify stages → Specify fields per stage → Validate on real traces

Summary: 10x more questions you can answer

Fields: 6 → 19 (+217%)
Completeness: 42% → 95% (+53pp)
Metrics: 3 → 28 (9.3x)
Ship decisions: No evidence → Evidence-backed

Next: The regression safety net — automated evaluation before every code change ships

You have instrumentation. Now you'll automate evaluation so every code change gets tested against real traces before it ships.
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 3 | aianalystlab.ai