What to log

Week 2 · Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which failure from the trace inspection exercise was hardest to diagnose with only input and output logged?

What you logged in that exercise
user_query
final_output
What intermediate data was missing?
Your answer here

Incomplete traces make every failure look the same

What engineering sees
A trace (log record) with minimal fields
user_query
final_output = "error"
latency_ms: 1847
What's actually happening
Pipeline with unknown failure point
intent → ?
retrieval → ?
SQL gen → ?
execution → ?
rendering → ?
narrative → ?

"What do you need us to log?" is the contract between product, evaluation, and engineering

Product
Diagnose user issues
Instrumentation
Contract
Engineering
Minimize storage cost and
implementation work for the team
Evaluation
Measure quality,
validate metrics
Log too little
Evaluation impossible
Log too much
Drown in unused data

The Evaluability Logging Contract defines minimum fields to enable evaluation and ship decisions

C
Correlation Identifiers
Link traces to users, sessions, outcomes
V
Version Metadata
Attribute behavior to specific model/prompt versions
P
Pipeline Observability
Diagnose failures at stage level
J
Join Keys for Outcome Validation
Connect traces to business metrics
Four categories. One sprint to implement. Foundation for Weeks 3-6.
trace_id
Unique per request
The main ID that connects everything
session_id
Links multi-turn
Conversation context
user_id
Enables user joins
Behavior signals
Enables
Multi-turn conversation tracking
Enables
User-level segmentation
Enables
Joins to downstream outcomes
Without these, you can't join traces to what users actually did.
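The three correlation identifiers above can be sketched as fields on a trace record. This is a minimal illustration, not a specific logging library's API: the outcome table and the `dashboard_saved` signal are assumptions for the example.

```python
# Sketch: traces carry trace_id / session_id / user_id so they can be
# joined to what users actually did. Sample data is illustrative.

traces = [
    {"trace_id": "t1", "session_id": "s1", "user_id": "u1", "final_output": "chart"},
    {"trace_id": "t2", "session_id": "s1", "user_id": "u1", "final_output": "error"},
]

# Downstream outcomes logged by a separate system, keyed by user_id
# (hypothetical outcome signal for this example).
outcomes = {"u1": {"dashboard_saved": True}}

def join_traces_to_outcomes(traces, outcomes):
    """Attach user-level outcome signals to each trace via user_id."""
    return [{**t, **outcomes.get(t["user_id"], {})} for t in traces]

joined = join_traces_to_outcomes(traces, outcomes)
```

Without `user_id` on the trace, this join is impossible no matter how rich the outcome data is.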

Version metadata makes behavior changes attributable to code changes

Timeline
T (regression point)
Before T: error rate 2%
After T: error rate 12%
model_id
Track model changes
before T: gpt-4
after T: gpt-4
✓ unchanged
<div style="background: rgba(217,119,6,0.08); border: 2px solid var(--dk-accent); border-radius: 10px; padding: 16px;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-accent-light); margin-bottom: 8px;">prompt_template_version</div>
  <div style="font-size: 16px; color: var(--dk-text-secondary); margin-bottom: 12px;">Track prompt changes</div>
  <div style="font-size: 14px; font-family: monospace; background: var(--dk-elevated); padding: 8px; border-radius: 6px;">
    <div style="color: var(--dk-text-muted);">before T: v2.3</div>
    <div style="color: var(--dk-accent-light); margin-top: 4px; font-weight: 700;">after T: v2.4</div>
  </div>
  <div style="margin-top: 8px; font-size: 16px; font-weight: 700; color: var(--dk-accent-light); text-align: center;">← CHANGED</div>
</div>

<div style="background: var(--dk-surface); border: 1px solid var(--dk-border); border-radius: 10px; padding: 16px;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-text); margin-bottom: 8px;">retrieval_config_version</div>
  <div style="font-size: 16px; color: var(--dk-text-secondary); margin-bottom: 12px;">Track retrieval config</div>
  <div style="font-size: 14px; font-family: monospace; background: var(--dk-elevated); padding: 8px; border-radius: 6px;">
    <div style="color: var(--dk-text-muted);">before T: v1.1</div>
    <div style="color: var(--dk-text-muted); margin-top: 4px;">after T: v1.1</div>
  </div>
  <div style="margin-top: 8px; font-size: 16px; color: var(--dk-text-secondary); text-align: center;">✓ unchanged</div>
</div>
Without version fields, you can't tell which change caused the regression.
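The attribution step above can be sketched in a few lines: group error rate by each version field and see which one changed alongside the regression. The sample traces mirror the timeline (v2.3 before T, v2.4 after); the data and field values are illustrative.

```python
from collections import defaultdict

# Sketch: attribute a regression by computing error rate per distinct
# value of a version field. Sample traces are illustrative.
traces = [
    {"prompt_template_version": "v2.3", "model_id": "gpt-4", "success": True},
    {"prompt_template_version": "v2.3", "model_id": "gpt-4", "success": True},
    {"prompt_template_version": "v2.4", "model_id": "gpt-4", "success": False},
    {"prompt_template_version": "v2.4", "model_id": "gpt-4", "success": True},
]

def error_rate_by(traces, field):
    """Error rate per distinct value of the given version field."""
    counts = defaultdict(lambda: [0, 0])  # value -> [errors, total]
    for t in traces:
        counts[t[field]][0] += 0 if t["success"] else 1
        counts[t[field]][1] += 1
    return {v: errs / total for v, (errs, total) in counts.items()}

rates = error_rate_by(traces, "prompt_template_version")
```

Here `model_id` is constant across the regression while `prompt_template_version` is not, so the prompt change is the prime suspect.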

Pipeline observability makes failures diagnosable at the stage level

intent
retrieval
SQL gen
execution
rendering
narrative
input
output
latency
status
Log these 4 fields per stage
Now you can see WHERE it broke. Diagnosis time drops from hours to seconds.
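A per-stage wrapper that captures the four fields above (input, output, latency, status) might look like the sketch below. The trace structure and stage functions are assumptions for illustration, not a specific observability library's API.

```python
import time

# Sketch: run one pipeline stage and append its observability record
# (input, output, latency, status) to the trace.
def run_stage(trace, name, fn, stage_input):
    start = time.perf_counter()
    record = {"stage": name, "input": stage_input}
    try:
        record["output"] = fn(stage_input)
        record["status"] = "ok"
    except Exception as exc:
        record["output"] = None
        record["status"] = f"error: {exc}"
    record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    trace.setdefault("stages", []).append(record)
    return record["output"]

def broken_sql_gen(docs):
    # Stand-in for a SQL generation stage that produces a bad query.
    raise ValueError("column regions.id does not exist")

trace = {"trace_id": "abc123"}
docs = run_stage(trace, "retrieval", lambda q: ["sales.csv", "regions.csv"],
                 "revenue by region")
sql = run_stage(trace, "sql_gen", broken_sql_gen, docs)
```

Reading `trace["stages"]` now shows retrieval succeeded and SQL generation failed: the WHERE is in the log, not in your head.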

Predict: Which schema supports which metric?

Schema A
request_id
user_query
final_answer
latency_ms
success
timestamp
Schema B
request_id, user_query,
final_answer, latency_ms,
session_id, retrieval_query,
num_docs, generated_sql,
sql_success, model_version,
prompt_template_version
Predict:
1. Which schema(s) support SQL correctness metric?
2. Which schema(s) support user role segmentation (breaking down results by user type)?
Write your predictions before seeing the demo.

v0 trace: opaque error with no diagnostic detail

request_id: abc123
user_query: "Show revenue by region"
final_answer: "error"
latency_ms: 1847
success: false
timestamp: 2025-01-15T14:23:01Z
Diagnosis attempt
Can't tell what went wrong. Retrieval? SQL? Chart rendering? The trace gives you nothing.

v1 trace: diagnosis in 10 seconds from two fields

trace_id: abc123 | session_id: sess_789 | user_id: u_456
model_version: gpt-4 | prompt_template_version: v2.4
user_query: "Show revenue by region"
Pipeline stages (step-by-step data for each piece):
retrieval_query: "revenue region sales data"
num_docs_retrieved: 3
<div style="margin-top: 12px; background: var(--dk-positive-bg); border-left: 3px solid var(--dk-positive); padding: 12px; border-radius: 4px;">
  <div style="color: var(--dk-positive); font-weight: 700; margin-bottom: 4px;">generated_sql:</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">SELECT revenue FROM sales s</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">INNER JOIN regions</div>
  <div style="color: var(--dk-text-secondary); font-size: 16px;">ON s.region = regions.id  <span style="color: var(--dk-positive); font-weight: 700;">← broken query</span></div>
</div>

<div style="margin-top: 12px; background: var(--dk-positive-bg); border-left: 3px solid var(--dk-positive); padding: 12px; border-radius: 4px;">
  <div style="color: var(--dk-positive); font-weight: 700;">sql_error: <span style="font-weight: 400; color: var(--dk-text-secondary);">"column regions.id does not exist"</span></div>
</div>

<div style="color: var(--dk-negative); margin-top: 12px; font-weight: 700;"><span style="color: var(--dk-accent-light); font-weight: 400;">sql_success:</span> false</div>
Diagnosis: 10 seconds
The v1 schema made the failure visible. SQL generation stage produced broken JOIN syntax.

Build a Minimum Evaluability Logging Spec that engineering can implement

  • Review failure discovery exercise → identify fields needed for diagnosis
  • Fill spec template with 4 categories
  • Justify each field (one sentence)
  • Mark "minimum viable" vs "expansion"
  • Verify against 3 scenarios
Base track
20-25 min — produce the spec
Extend track
+10-15 min — write validation code
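For the extend track, validation code can be as simple as a required-field check. The field list below is illustrative; the real list comes from the spec template you fill in.

```python
# Sketch: validate that an emitted trace satisfies the minimum-viable
# fields of the logging spec. MINIMUM_FIELDS here is an example set.
MINIMUM_FIELDS = {
    "trace_id", "session_id", "user_id",          # correlation identifiers
    "model_version", "prompt_template_version",   # version metadata
    "stages",                                     # pipeline observability
}

def validate_trace(trace):
    """Return the sorted list of missing minimum fields (empty = valid)."""
    return sorted(MINIMUM_FIELDS - trace.keys())

missing = validate_trace({"trace_id": "abc123", "user_query": "Show revenue"})
```

Run this against a sample of real traces in CI so schema drift is caught before it silently breaks an evaluation.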

Side-by-side schema coverage unlocks Week 3 capability

v0 Schema
6 fields (30%)
<div>
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-accent-light); margin-bottom: 16px; text-align: center;">v1 Schema</div>
  <div style="background: var(--dk-surface); border: 2px solid var(--dk-accent); border-radius: 12px; padding: 20px;">
    <div style="display: flex; flex-direction: column; gap: 8px;">
      <div style="background: var(--dk-accent); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Correlation IDs →</div>
      <div style="background: var(--dk-accent-secondary); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Version metadata →</div>
      <div style="background: var(--dk-accent-light); height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Pipeline observability →</div>
      <div style="background: #F0A060; height: 40px; border-radius: 6px; width: 100%; display: flex; align-items: center; padding: 0 12px; font-size: 16px; color: var(--dk-bg); font-weight: 700;">Join keys →</div>
    </div>
    <div style="margin-top: 16px; text-align: center; font-size: 16px; color: var(--dk-accent-light); font-weight: 700;">15+ fields (100%)</div>
  </div>
</div>
Enables
Retrieval accuracy
Enables
User segmentation
Enables
Regression attribution

Logging outputs without intermediate steps makes all failures look identical

What you log
Only endpoints, no pipeline stages
user_query
final_output
What you can diagnose
All failures collapse into one category
"output doesn't match expected"
• Retrieval error?
• SQL generation error?
• Execution error?
• Hallucination?
All indistinguishable
Real-world evidence
When Datadog (a monitoring platform used by engineering teams) instrumented LLM chains, they discovered that logging only input and output made all failures look identical. After adding step-by-step logging for each piece of the pipeline, diagnosis time dropped from hours to minutes.

Engineering says "we can add fields later" — explain what capabilities are blocked and why that doesn't solve it

Past
Minimal schema
trace_id
user_query
final_output
<div style="text-align: center;">
  <div style="background: var(--dk-negative-bg); border: 2px solid var(--dk-negative); border-radius: 8px; padding: 12px 16px; white-space: nowrap; font-size: 16px; font-weight: 700; color: var(--dk-negative); margin-bottom: 8px;">
    Gap: old traces don't have new fields
  </div>
  <div style="font-size: 16px; color: var(--dk-text-muted);">
    Can't backfill → tracking trends over time breaks
  </div>
</div>

<div style="text-align: center;">
  <div style="font-size: 16px; font-weight: 700; color: var(--dk-text-secondary); margin-bottom: 8px;">Present</div>
  <div style="background: var(--dk-surface); border: 1px solid var(--dk-accent); border-radius: 8px; padding: 16px; width: 240px;">
    <div style="font-size: 16px; color: var(--dk-accent-light); margin-bottom: 8px;">New fields added</div>
    <div style="font-family: monospace; font-size: 16px; color: var(--dk-text-secondary);">trace_id<br>user_query<br>final_output<br><span style="color: var(--dk-accent-light);">session_id<br>generated_sql</span></div>
  </div>
</div>
Knowledge check:
What Week 3 capabilities are blocked by the minimal schema? Why doesn't "add fields later" solve the problem?
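The backfill gap can be made concrete with a sketch: any metric that depends on a newly added field silently excludes every trace logged before the field existed. The sample traces and the session-count metric are illustrative.

```python
# Sketch: old traces (pre-change) lack session_id, so a multi-turn
# metric can't count them -- and backfilling is impossible because
# the data was never captured.
old_traces = [{"trace_id": "t1", "final_output": "error"}]
new_traces = [{"trace_id": "t2", "final_output": "ok",
               "session_id": "s9", "generated_sql": "SELECT 1"}]

def multi_turn_sessions(traces):
    """Count distinct sessions; traces without session_id are invisible."""
    return len({t["session_id"] for t in traces if "session_id" in t})

# The trend line effectively starts at the schema change.
```

"Add fields later" gives you the fields later, but the history is gone: trend comparisons only begin once the new schema ships.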

Four quadrants, one contract — without any quadrant, evaluation breaks in Week 3

Correlation Identifiers
trace_id
session_id
user_id
→ Multi-turn linking
→ User segmentation
→ Outcome joins
Version Metadata
model_version
prompt_template_version
retrieval_config_version
→ Regression attribution
→ Reproducibility
Pipeline Observability
6 stages x 4 fields each:
input
output
latency
status
→ Failure diagnosis
→ Stage-to-stage debugging
Join Keys
Trace fields link to:
user_behavior
business_outcomes
→ Validating that metrics reflect real outcomes
→ Enough data to make ship decisions
Without any quadrant, evaluation breaks in Week 3.
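As a recap, a single v1 trace record carrying all four quadrants might look like the sketch below. Field names follow the contract above; the values and the `dashboard_id` join key are illustrative assumptions.

```python
# Sketch: one complete v1 trace record covering all four quadrants.
v1_trace = {
    # Correlation identifiers
    "trace_id": "abc123", "session_id": "sess_789", "user_id": "u_456",
    # Version metadata
    "model_version": "gpt-4", "prompt_template_version": "v2.4",
    "retrieval_config_version": "v1.1",
    # Pipeline observability: one record per stage (two shown)
    "stages": [
        {"stage": "retrieval", "input": "revenue region sales data",
         "output": ["doc1", "doc2", "doc3"], "latency_ms": 212, "status": "ok"},
        {"stage": "sql_gen", "input": ["doc1", "doc2", "doc3"],
         "output": None, "latency_ms": 941, "status": "error"},
    ],
    # Join key a downstream system uses to link business outcomes
    "dashboard_id": "dash_42",
}

QUADRANTS = {
    "correlation": {"trace_id", "session_id", "user_id"},
    "versioning": {"model_version", "prompt_template_version",
                   "retrieval_config_version"},
    "pipeline": {"stages"},
}
# Empty sets everywhere means every quadrant is covered.
missing = {q: fields - v1_trace.keys() for q, fields in QUADRANTS.items()}
```

Drop any one group of fields and a Week 3 capability (segmentation, attribution, or diagnosis) disappears with it.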

Next: How little instrumentation is enough?

You have the contract. Now: what's the minimum viable implementation?
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 1 | aianalystlab.ai