Lesson 2.5: System-type instrumentation

Week 2 Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which span fields let you separate SQL generation from execution failures?

  • Tool Selection: wrong tool chosen?
  • Argument Gen: malformed SQL?
  • Execution: database rejection?

V1 spans log that something failed, not why

V1 TRACE: the SQL generation span shows a failure, but no context
  status: "failed"
  stage: "sql_generation"
MISSING FIELDS: critical instrumentation gaps
  • tool_selection_alternatives: MISSING (which tools were considered)
  • schema_validation: MISSING (was the SQL valid?)
  • execution_error_type: MISSING (how the database rejected it)
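A minimal sketch of the gap, using the field names from this slide. The span dicts and the `missing_fields` helper are illustrative, not a real tracing SDK:

```python
# Sketch: a v1 span vs. an enriched span for the same SQL-generation failure.
# Field names follow this lesson; the values are invented for illustration.

v1_span = {
    "stage": "sql_generation",
    "status": "failed",  # tells you THAT it failed, not why
}

enriched_span = {
    "stage": "sql_generation",
    "status": "failed",
    "tool_selection_alternatives": ["run_sql", "lookup_docs"],  # which tools were considered
    "schema_validation": {"valid": False, "error": "unknown column 'revenu'"},
    "execution_error_type": None,  # never reached the database
}

def missing_fields(span, required):
    """Return the required fields absent from a span."""
    return [f for f in required if f not in span]

required = ["tool_selection_alternatives", "schema_validation", "execution_error_type"]
print(missing_fields(v1_span, required))
# → ['tool_selection_alternatives', 'schema_validation', 'execution_error_type']
print(missing_fields(enriched_span, required))
# → []
```

Running the same check against every span in a trace store is one way to audit instrumentation coverage before starting an eval.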

Three system types, three instrumentation patterns

  • LLM Apps (prompt → response). Failures: schema violation, token limit
  • RAG Systems (retrieval → generation). Failures: no candidates, low relevance
  • Agents (reasoning → tools → state). Failures: wrong tool, bad args, exec error

LLM apps log prompt config and response metadata

1. User request: input query or command
2. Prompt + Config: template version, model params (temp=0.7, max_tokens=500)
3. Response + Metadata: finish_reason (why the model stopped), token_counts, schema validation
Required fields
prompt template version · model config · schema validation · token counts · finish_reason (completed normally or hit limit?)
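A sketch of an LLM-app span that captures the required fields above. The `validate_schema` helper and all values are hypothetical, shown only to make the fields concrete:

```python
# Sketch: one LLM-app span with the required fields from this slide.
# The validate_schema helper and the field values are illustrative.
import json

def validate_schema(response_text, required_keys):
    """Check that the model's JSON output parses and has the expected keys."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required_keys)

response_text = '{"answer": "Q3 revenue rose 12%"}'
span = {
    "prompt_template_version": "v2.3",
    "model_config": {"temperature": 0.7, "max_tokens": 500},
    "token_counts": {"prompt": 812, "completion": 64},
    "finish_reason": "stop",  # completed normally; "length" would mean it hit max_tokens
    "schema_validation": validate_schema(response_text, ["answer"]),
}
print(span["schema_validation"])  # → True
```

With `finish_reason` and `schema_validation` logged per span, "did schema failures increase after the config change?" becomes a simple aggregation instead of a manual trace hunt.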

RAG systems track retrieval → ranking → generation separately

Retrieval (query transform) → Candidates (IDs + scores) → Reranking (score update) → Context (to generator)

Required fields by stage
  • Retrieval: retrieval_query, transformed_query (rewritten search query)
  • Candidates: candidate_doc_ids, retrieval_scores (relevance scores), rank_order
  • Reranking: reranking_scores (updated scores), ranker_used (which scoring method)
  • Context: context_provided, chunks_used, citations_in_output
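A sketch of a RAG trace using the stage fields above, plus one evaluation these fields unlock: checking that citations in the output actually point at chunks that were provided as context. Doc IDs and scores are invented:

```python
# Sketch: per-stage RAG spans with the field names from this slide, and a
# simple citation check they enable. All IDs and scores are illustrative.

trace = {
    "retrieval": {
        "retrieval_query": "revenue by region",
        "transformed_query": "regional revenue 2024",
    },
    "candidates": {
        "candidate_doc_ids": ["d17", "d03", "d42"],
        "retrieval_scores": [0.91, 0.72, 0.55],
        "rank_order": ["d17", "d03", "d42"],
    },
    "reranking": {
        "reranking_scores": {"d03": 0.88, "d17": 0.84, "d42": 0.31},
        "ranker_used": "cross-encoder",
    },
    "context": {
        "chunks_used": ["d03", "d17"],
        "citations_in_output": ["d03", "d99"],
    },
}

def uncited_sources(trace):
    """Citations that reference documents never provided as context."""
    used = set(trace["context"]["chunks_used"])
    return [c for c in trace["context"]["citations_in_output"] if c not in used]

print(uncited_sources(trace))  # → ['d99']
```

Without `chunks_used` and `citations_in_output` logged separately, this check is impossible, which is exactly the kind of blocked evaluation the missing-fields audit surfaces.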

Tool calling has 4 stages, each needs distinct fields

  • Selection: tools_considered, tool_selected, selection_rationale
  • Argument Gen: tool_params_generated, schema_validation_result
  • Execution: tool_call_success, execution_latency, tool_output
  • Output Handling: how_output_used, next_state
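With all four stages instrumented, a small helper can name the first stage where a tool call went wrong. This is a sketch; the span values are invented and the staging logic is one reasonable interpretation of the fields above:

```python
# Sketch: one tool-call span covering the four stages, plus a helper that
# names the first failing stage. Field names from this lesson; values invented.

def first_failed_stage(span):
    """Walk the four stages in order and return the first one that failed."""
    if span.get("tool_selected") not in span.get("tools_considered", []):
        return "selection"
    if not span.get("schema_validation_result", False):
        return "argument_gen"
    if not span.get("tool_call_success", False):
        return "execution"
    if span.get("how_output_used") is None:
        return "output_handling"
    return None  # all stages healthy

span = {
    "tools_considered": ["run_sql", "render_chart"],
    "tool_selected": "run_sql",
    "selection_rationale": "question requires aggregation",
    "tool_params_generated": {"query": "SELECT region, SUM(revenu) FROM sales"},
    "schema_validation_result": False,  # 'revenu' is not a valid column
    "tool_call_success": False,
    "execution_latency": None,
    "tool_output": None,
    "how_output_used": None,
    "next_state": "fallback_narrative",
}
print(first_failed_stage(span))  # → 'argument_gen'
```

Note how the v1 trace from earlier, with only `status` and `stage`, could not distinguish this argument-generation failure from an execution failure.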

Complex systems combine patterns — the AI Analyst uses all three

  • Query Parsing: LLM App
  • Context Retrieval: RAG
  • SQL Generation: Agent
  • SQL Execution: Agent
  • Chart Rendering: LLM App
  • Narrative Gen: LLM App

Which pipeline stage has the most missing instrumentation?

  • Query Parsing
  • Context Retrieval
  • SQL Generation
  • SQL Execution
  • Chart Rendering
  • Narrative Generation
Which stage has the most critical missing fields? Write your prediction and the specific fields you think are missing.

SQL generation is missing tool selection and schema validation logs

Tool stage audit
  • Selection: FAIL. tool_selection_alternatives MISSING (which tools were considered)
  • Argument Gen: FAIL. schema_validation_result MISSING (was the SQL valid?)
  • Execution: PASS. tool_call_success, execution_latency present
  • Output Handling: PARTIAL. next_state present, how_output_used missing

State transitions reveal skipped stages in 8% of traces

Stage Flow Analysis
Built a grid showing transitions between pipeline stages (think of it as 'from this stage, where did the pipeline go next?'). Most transitions follow the expected flow, but three anomalous patterns appear with high frequency.
8% of traces skip sql_executed → 5% skip context_retrieved → 3% skip SQL entirely
Unexpected transitions
  • sql_generated → narrative_written (8%): skipped execution
  • query_parsed → sql_generated (5%): skipped retrieval
  • context_retrieved → narrative_written (3%): skipped SQL

Build a Delta Instrumentation Spec: what's missing, what it enables

Base (18-22 min)
  • Classify 6 stages by system type
  • Inspect v1 traces for retrieval + SQL
  • Identify answerable vs blocked questions
  • Extract state transitions, build matrix
  • Identify top 3 failure hotspots
  • Write Delta Spec artifact
Extend (DS/Eng, +10-15 min)
  • Implement state extraction from scratch
  • Experiment with strict vs lenient failure defs
  • Propose span schema change for high-priority field

State transition matrix heatmap with failure hotspots

Artifact: Transition Matrix + Hypotheses
Grid showing transition frequencies (where each row is 'where the pipeline was' and each column is 'where it went next'). Three red cells mark failure hotspots (>15% failure rate). Each annotated with 2-3 sentence hypothesis based on trace inspection.
  • Hotspot 1 (8%), sql_generated → narrative_written: SQL validation failed silently, fell back to narrative
  • Hotspot 2 (5%), query_parsed → sql_generated: cached schema used, skipped retrieval stage
  • Hotspot 3 (3%), context_retrieved → narrative_written: question answerable without SQL execution

Two common failure modes when instrumenting systems

Over-Instrumenting
Trace with 40+ fields, most unused

Consequences: storage bloat, query slowdowns, silent schema drift (your logging format changes over time without anyone noticing)
Under-Instrumenting
Trace with only 5 fields

Consequences: blocked evaluations, manual debugging, ship blind
Balance principle
Instrument exactly what your system type requires for the evaluations you need to run. If you cannot name the evaluation question a field enables, do not log it yet.
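The balance principle can be made mechanical: keep an explicit map from each logged field to the evaluation question it enables, and flag fields with no entry. This is a sketch; the mapping and span are illustrative:

```python
# Sketch of the balance principle: every logged field should name the
# evaluation question it enables. The mapping below is illustrative.

FIELD_TO_EVAL_QUESTION = {
    "finish_reason": "How often do responses hit the token limit?",
    "retrieval_scores": "Is retrieval quality sufficient?",
    "schema_validation_result": "Did schema failures increase after the config change?",
}

def unjustified_fields(span):
    """Fields we log but cannot tie to an evaluation question."""
    return [f for f in span if f not in FIELD_TO_EVAL_QUESTION]

span = {
    "finish_reason": "stop",
    "retrieval_scores": [0.91, 0.72],
    "internal_debug_blob": "...",  # no eval question names this field
}
print(unjustified_fields(span))  # → ['internal_debug_blob']
```

Fields flagged here are candidates for removal (over-instrumenting); eval questions with no field behind them are the under-instrumenting gaps.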

Can you diagnose retrieval quality without candidate scores?

Scenario
Your team builds a RAG system. A PM asks: "Is our retrieval quality good enough to ship?" You inspect traces and find:
Present
  • retrieval_query
  • context_provided
Missing
  • candidate_documents (the documents your system found)
  • ranking_scores (how relevant each was scored)
Question
Can you answer the PM's question? Why or why not? What specific evaluation is blocked?

System-Type Instrumentation Framework: 3 patterns, 1 pipeline

  • LLM Apps (prompt → response): prompt template · model config · schema validation · token counts · finish_reason. Eval question: "Did schema failures increase after config change?"
  • RAG Systems (retrieval → generation): retrieval_query · candidate_docs · ranking_scores · context_provided · citations. Eval question: "Is retrieval quality sufficient?"
  • Agents (reasoning → tools → state): tool_selection · argument_gen · execution · output_handling · state_transitions. Eval question: "Are tool selection or exec errors causing failures?"

Pipeline: Query Parse (LLM) → Context Retrieval (RAG) → SQL Gen/Exec (Agent) → Chart + Narrative (LLM)

Next: Designing instrumentation that scales with the product

What happens when your AI system grows from 6 stages to 20, from 1 model to 5, from 100 queries/day to 10,000?
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 5 | aianalystlab.ai