What to log

Week 2 · Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which failure from the trace inspection exercise was hardest to diagnose with only input and output logged?

What you logged in that exercise
user_query
final_output
What intermediate data was missing?
Your answer here

Incomplete traces make every failure look the same

What engineering sees
A trace (log record) with minimal fields
user_query
final_output = "error"
latency_ms: 1847
What's actually happening
Pipeline with unknown failure point
intent → ?
retrieval → ?
SQL gen → ?
execution → ?
rendering → ?
narrative → ?

"What do you need us to log?" is the contract between product, evaluation, and engineering

Product
Diagnose user issues
Instrumentation
Contract
Engineering
Minimize storage cost and
implementation work for the team
Evaluation
Measure quality,
validate metrics
Log too little
Evaluation impossible
Log too much
Drown in unused data

The Evaluability Logging Contract defines minimum fields to enable evaluation and ship decisions

C
Correlation Identifiers
Link traces to users, sessions, outcomes
V
Version Metadata
Attribute behavior to specific model/prompt versions
P
Pipeline Observability
Diagnose failures at stage level
J
Join Keys for Outcome Validation
Connect traces to business metrics
Four categories. One sprint to implement. Foundation for Weeks 3-6.
trace_id
Unique per request
The main ID that connects everything
session_id
Links multi-turn
Conversation context
user_id
Enables user joins
Behavior signals
Enables
Multi-turn conversation tracking
Enables
User-level segmentation
Enables
Joins to downstream outcomes
Without these, you can't join traces to what users actually did.
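The three correlation identifiers above can be sketched as fields on a trace record. This is a minimal illustration, not a specific logging library's API: the outcome table and the `dashboard_saved` signal are assumptions for the example.

```python
# Sketch: traces carry trace_id / session_id / user_id so they can be
# joined to what users actually did. Sample data is illustrative.

traces = [
    {"trace_id": "t1", "session_id": "s1", "user_id": "u1", "final_output": "chart"},
    {"trace_id": "t2", "session_id": "s1", "user_id": "u1", "final_output": "error"},
]

# Downstream outcomes logged by a separate system, keyed by user_id
# (hypothetical outcome signal for this example).
outcomes = {"u1": {"dashboard_saved": True}}

def join_traces_to_outcomes(traces, outcomes):
    """Attach user-level outcome signals to each trace via user_id."""
    return [{**t, **outcomes.get(t["user_id"], {})} for t in traces]

joined = join_traces_to_outcomes(traces, outcomes)
```

Without `user_id` on the trace, this join is impossible no matter how rich the outcome data is.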

Version metadata makes behavior changes attributable to code changes

Timeline
T (regression point)
Before T: error rate 2%
After T: error rate 12%
model_id
Track model changes
before T: gpt-4
after T: gpt-4
✓ unchanged
<div style="background: rgba(217,119,6,0.08); border: 2px solid var(--dk-accent); border-radius: 10px; padding: 16px;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-accent-light); margin-bottom: 8px;">prompt_template_version</div>
  <div style="font-size: 16px; color: var(--dk-text-secondary); margin-bottom: 12px;">Track prompt changes</div>
  <div style="font-size: 14px; font-family: monospace; background: var(--dk-elevated); padding: 8px; border-radius: 6px;">
    <div style="color: var(--dk-text-muted);">before T: v2.3</div>
    <div style="color: var(--dk-accent-light); margin-top: 4px; font-weight: 700;">after T: v2.4</div>
  </div>
  <div style="margin-top: 8px; font-size: 16px; font-weight: 700; color: var(--dk-accent-light); text-align: center;">← CHANGED</div>
</div>

<div style="background: var(--dk-surface); border: 1px solid var(--dk-border); border-radius: 10px; padding: 16px;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-text); margin-bottom: 8px;">retrieval_config_version</div>
  <div style="font-size: 16px; color: var(--dk-text-secondary); margin-bottom: 12px;">Track retrieval config</div>
  <div style="font-size: 14px; font-family: monospace; background: var(--dk-elevated); padding: 8px; border-radius: 6px;">
    <div style="color: var(--dk-text-muted);">before T: v1.1</div>
    <div style="color: var(--dk-text-muted); margin-top: 4px;">after T: v1.1</div>
  </div>
  <div style="margin-top: 8px; font-size: 16px; color: var(--dk-text-secondary); text-align: center;">✓ unchanged</div>
</div>
Without version fields, you can't tell which change caused the regression.
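The attribution step above can be sketched in a few lines: group error rate by each version field and see which one changed alongside the regression. The sample traces mirror the timeline (v2.3 before T, v2.4 after); the data and field values are illustrative.

```python
from collections import defaultdict

# Sketch: attribute a regression by computing error rate per distinct
# value of a version field. Sample traces are illustrative.
traces = [
    {"prompt_template_version": "v2.3", "model_id": "gpt-4", "success": True},
    {"prompt_template_version": "v2.3", "model_id": "gpt-4", "success": True},
    {"prompt_template_version": "v2.4", "model_id": "gpt-4", "success": False},
    {"prompt_template_version": "v2.4", "model_id": "gpt-4", "success": True},
]

def error_rate_by(traces, field):
    """Error rate per distinct value of the given version field."""
    counts = defaultdict(lambda: [0, 0])  # value -> [errors, total]
    for t in traces:
        counts[t[field]][0] += 0 if t["success"] else 1
        counts[t[field]][1] += 1
    return {v: errs / total for v, (errs, total) in counts.items()}

rates = error_rate_by(traces, "prompt_template_version")
```

Here `model_id` is constant across the regression while `prompt_template_version` is not, so the prompt change is the prime suspect.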

Pipeline observability makes failures diagnosable at the stage level

intent
retrieval
SQL gen
execution
rendering
narrative
input
output
latency
status
Log these 4 fields per stage
Now you can see WHERE it broke. Diagnosis time drops from hours to seconds.
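A per-stage wrapper that captures the four fields above (input, output, latency, status) might look like the sketch below. The trace structure and stage functions are assumptions for illustration, not a specific observability library's API.

```python
import time

# Sketch: run one pipeline stage and append its observability record
# (input, output, latency, status) to the trace.
def run_stage(trace, name, fn, stage_input):
    start = time.perf_counter()
    record = {"stage": name, "input": stage_input}
    try:
        record["output"] = fn(stage_input)
        record["status"] = "ok"
    except Exception as exc:
        record["output"] = None
        record["status"] = f"error: {exc}"
    record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    trace.setdefault("stages", []).append(record)
    return record["output"]

def broken_sql_gen(docs):
    # Stand-in for a SQL generation stage that produces a bad query.
    raise ValueError("column regions.id does not exist")

trace = {"trace_id": "abc123"}
docs = run_stage(trace, "retrieval", lambda q: ["sales.csv", "regions.csv"],
                 "revenue by region")
sql = run_stage(trace, "sql_gen", broken_sql_gen, docs)
```

Reading `trace["stages"]` now shows retrieval succeeded and SQL generation failed: the WHERE is in the log, not in your head.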

Predict: Which schema supports which metric?

Schema A
request_id
user_query
final_answer
latency_ms
success
timestamp
Schema B
request_id, user_query,
final_answer, latency_ms,
session_id, retrieval_query,
num_docs, generated_sql,
sql_success, model_version,
prompt_template_version
Predict:
1. Which schema(s) support SQL correctness metric?
2. Which schema(s) support user role segmentation (breaking down results by user type)?
Write your predictions before seeing the demo.

v0 trace: opaque error with no diagnostic detail

request_id: abc123
user_query: "Show revenue by region"
final_answer: "error"
latency_ms: 1847
success: false
timestamp: 2025-01-15T14:23:01Z
Diagnosis attempt
Can't tell what went wrong. Retrieval? SQL? Chart rendering? The trace gives you nothing.

v1 trace: diagnosis in 10 seconds from two fields

trace_id: abc123 | session_id: sess_789 | user_id: u_456
model_version: gpt-4 | prompt_template_version: v2.4
user_query: "Show revenue by region"
Pipeline stages (step-by-step data for each piece):
retrieval_query: "revenue region sales data"
num_docs_retrieved: 3
<div style="margin-top: 12px; background: var(--dk-positive-bg); border-left: 3px solid var(--dk-positive); padding: 12px; border-radius: 4px;">
  <div style="color: var(--dk-positive); font-weight: 700; margin-bottom: 4px;">generated_sql:</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">SELECT revenue FROM sales s</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">INNER JOIN regions</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">ON s.region = regions.id  <span style="color: var(--dk-positive); font-weight: 700;">← broken query</span></div>
</div>

<div style="margin-top: 12px; background: var(--dk-positive-bg); border-left: 3px solid var(--dk-positive); padding: 12px; border-radius: 4px;">
  <div style="color: var(--dk-positive); font-weight: 700;">sql_error: <span style="font-weight: 400; color: var(--dk-text-secondary);">"column regions.id does not exist"</span></div>
</div>

<div style="color: var(--dk-negative); margin-top: 12px; font-weight: 700;"><span style="color: var(--dk-accent-light); font-weight: 400;">sql_success:</span> false</div>
Diagnosis: 10 seconds
The v1 schema made the failure visible. SQL generation stage produced broken JOIN syntax.

Build a Minimum Evaluability Logging Spec that engineering can implement

  • Review failure discovery exercise → identify fields needed for diagnosis
  • Fill spec template with 4 categories
  • Justify each field (one sentence)
  • Mark "minimum viable" vs "expansion"
  • Verify against 3 scenarios
Base track
20-25 min — produce the spec
Extend track
+10-15 min — write validation code
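For the extend track, validation code can be as simple as a required-field check. The field list below is illustrative; the real list comes from the spec template you fill in.

```python
# Sketch: validate that an emitted trace satisfies the minimum-viable
# fields of the logging spec. MINIMUM_FIELDS here is an example set.
MINIMUM_FIELDS = {
    "trace_id", "session_id", "user_id",          # correlation identifiers
    "model_version", "prompt_template_version",   # version metadata
    "stages",                                     # pipeline observability
}

def validate_trace(trace):
    """Return the sorted list of missing minimum fields (empty = valid)."""
    return sorted(MINIMUM_FIELDS - trace.keys())

missing = validate_trace({"trace_id": "abc123", "user_query": "Show revenue"})
```

Run this against a sample of real traces in CI so schema drift is caught before it silently breaks an evaluation.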

Side-by-side schema coverage unlocks Week 3 capability

v0 Schema
6 fields (30%)
<div>
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-accent-light); margin-bottom: 16px; text-align: center;">v1 Schema</div>
  <div style="background: var(--dk-surface); border: 2px solid var(--dk-accent); border-radius: 12px; padding: 20px;">
    <div style="display: flex; flex-direction: column; gap: 8px;">
      <div style="background: var(--dk-accent); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Correlation IDs →</div>
      <div style="background: var(--dk-accent-secondary); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Version metadata →</div>
      <div style="background: var(--dk-accent-light); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Pipeline observability →</div>
      <div style="background: #F0A060; height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Join keys →</div>
    </div>
    <div style="margin-top: 16px; text-align: center; font-size: 16px; color: var(--dk-accent-light); font-weight: 700;">15+ fields (100%)</div>
  </div>
</div>
Enables
Retrieval accuracy
Enables
User segmentation
Enables
Regression attribution

Logging outputs without intermediate steps makes all failures look identical

What you log
Only endpoints, no pipeline stages
user_query
final_output
What you can diagnose
All failures collapse into one category
"output doesn't match expected"
• Retrieval error?
• SQL generation error?
• Execution error?
• Hallucination?
All indistinguishable
Real-world evidence
When Datadog (a monitoring platform used by engineering teams) instrumented LLM chains, they discovered that logging only input and output made all failures look identical. After adding step-by-step logging for each piece of the pipeline, diagnosis time dropped from hours to minutes.

Engineering says "we can add fields later" — explain what capabilities are blocked and why that doesn't solve it

Past
Minimal schema
trace_id
user_query
final_output
<div style="text-align: center;">
  <div style="background: var(--dk-negative-bg); border: 2px solid var(--dk-negative); border-radius: 8px; padding: 12px 16px; white-space: nowrap; font-size: 16px; font-weight: 700; color: var(--dk-negative); margin-bottom: 8px;">
    Gap: old traces don't have new fields
  </div>
  <div style="font-size: 16px; color: var(--dk-text-muted);">
    Can't backfill → tracking trends over time breaks
  </div>
</div>

<div style="text-align: center;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-text-secondary); margin-bottom: 8px;">Present</div>
  <div style="background: var(--dk-surface); border: 1px solid var(--dk-accent); border-radius: 8px; padding: 16px; width: 240px;">
    <div style="font-size: 16px; color: var(--dk-accent-light); margin-bottom: 8px;">New fields added</div>
    <div style="font-family: monospace; font-size: 16px; color: var(--dk-text-secondary);">trace_id<br>user_query<br>final_output<br><span style="color: var(--dk-accent-light);">session_id<br>generated_sql</span></div>
  </div>
</div>
Knowledge check:
What Week 3 capabilities are blocked by the minimal schema? Why doesn't "add fields later" solve the problem?
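The backfill gap can be made concrete with a sketch: any metric that depends on a newly added field silently excludes every trace logged before the field existed. The sample traces and the session-count metric are illustrative.

```python
# Sketch: old traces (pre-change) lack session_id, so a multi-turn
# metric can't count them -- and backfilling is impossible because
# the data was never captured.
old_traces = [{"trace_id": "t1", "final_output": "error"}]
new_traces = [{"trace_id": "t2", "final_output": "ok",
               "session_id": "s9", "generated_sql": "SELECT 1"}]

def multi_turn_sessions(traces):
    """Count distinct sessions; traces without session_id are invisible."""
    return len({t["session_id"] for t in traces if "session_id" in t})

# The trend line effectively starts at the schema change.
```

"Add fields later" gives you the fields later, but the history is gone: trend comparisons only begin once the new schema ships.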

Four quadrants, one contract — without any quadrant, evaluation breaks in Week 3

Correlation Identifiers
trace_id
session_id
user_id
→ Multi-turn linking
→ User segmentation
→ Outcome joins
Version Metadata
model_version
prompt_template_version
retrieval_config_version
→ Regression attribution
→ Reproducibility
Pipeline Observability
6 stages x 4 fields each:
input
output
latency
status
→ Failure diagnosis
→ Stage-to-stage debugging
Join Keys
Trace fields link to:
user_behavior
business_outcomes
→ Validating that metrics reflect real outcomes
→ Enough data to make ship decisions
Without any quadrant, evaluation breaks in Week 3.
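As a recap, a single v1 trace record carrying all four quadrants might look like the sketch below. Field names follow the contract above; the values and the `dashboard_id` join key are illustrative assumptions.

```python
# Sketch: one complete v1 trace record covering all four quadrants.
v1_trace = {
    # Correlation identifiers
    "trace_id": "abc123", "session_id": "sess_789", "user_id": "u_456",
    # Version metadata
    "model_version": "gpt-4", "prompt_template_version": "v2.4",
    "retrieval_config_version": "v1.1",
    # Pipeline observability: one record per stage (two shown)
    "stages": [
        {"stage": "retrieval", "input": "revenue region sales data",
         "output": ["doc1", "doc2", "doc3"], "latency_ms": 212, "status": "ok"},
        {"stage": "sql_gen", "input": ["doc1", "doc2", "doc3"],
         "output": None, "latency_ms": 941, "status": "error"},
    ],
    # Join key a downstream system uses to link business outcomes
    "dashboard_id": "dash_42",
}

QUADRANTS = {
    "correlation": {"trace_id", "session_id", "user_id"},
    "versioning": {"model_version", "prompt_template_version",
                   "retrieval_config_version"},
    "pipeline": {"stages"},
}
# Empty sets everywhere means every quadrant is covered.
missing = {q: fields - v1_trace.keys() for q, fields in QUADRANTS.items()}
```

Drop any one group of fields and a Week 3 capability (segmentation, attribution, or diagnosis) disappears with it.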

Next: How little instrumentation is enough?

You have the contract. Now: what's the minimum viable implementation?
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 1 | aianalystlab.ai