Pipeline architecture

Week 5 Lesson 1 AI Evals for Product Dev
Shane Butler AI Analyst Lab

Which metrics did you designate as "blocking" (must pass for ship), and what thresholds did you set based on v1 baseline performance?

Blocking Metrics
Thresholds
Take 30 seconds to answer from memory.

Every time you run an evaluation, you lose the results

Monday: SQL correctness 87%
Posted in Slack, no metadata saved
Wednesday: SQL correctness 83%
Different person, different run, posted in Slack
Question: Is this a regression?
Can't answer — no dataset version, model version, judge config, or reproduction steps
Missing:
Dataset version
Judge config
Model version
Reproduction steps

An evaluation pipeline is persistent infrastructure, not a script you run once

Without Pipeline
One-off script
  • Lost results
  • No history
  • Can't compare runs
With Pipeline
Defined infrastructure
  • Stored results
  • Queryable history
  • Version-controlled

Six stages: sampling → judging → aggregation → storage → reporting → alerts

1. Sampling
traces.sample()
2. Judging
metrics.judge()
3. Aggregation
metrics.aggregate()
4. Storage
warehouse.write_run()
5. Reporting
reports.metric_card()
6. Alerts
alerts.check_threshold()
Each stage has a defined input/output contract; the aieval library provides a function for each
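The six stages can be chained end to end. A minimal self-contained sketch: the function names mirror the aieval calls above, but the bodies, the trace fields (`sql_ok`, `narrative_ok`), and the 80% blocking threshold are illustrative stand-ins, not the real library.

```python
import random

# Stand-ins for the six pipeline stages; names mirror the aieval
# functions above, but the bodies are illustrative.

def sample_traces(traces, n, seed=0):                   # 1. traces.sample()
    return random.Random(seed).sample(traces, min(n, len(traces)))

def judge(trace):                                       # 2. metrics.judge()
    return trace["sql_ok"] and trace["narrative_ok"]

def aggregate(verdicts):                                # 3. metrics.aggregate()
    return sum(verdicts) / len(verdicts)

def write_run(warehouse, run_id, pass_rate):            # 4. warehouse.write_run()
    warehouse[run_id] = {"pass_rate": pass_rate}

def metric_card(warehouse, run_id):                     # 5. reports.metric_card()
    return f"{run_id}: {warehouse[run_id]['pass_rate']:.0%} pass"

def check_threshold(warehouse, run_id, blocking=0.80):  # 6. alerts.check_threshold()
    return warehouse[run_id]["pass_rate"] >= blocking

# Toy traces: every fifth one fails the SQL check.
traces = [{"sql_ok": i % 5 != 0, "narrative_ok": True} for i in range(100)]
warehouse = {}
batch = sample_traces(traces, 50)
rate = aggregate([judge(t) for t in batch])
write_run(warehouse, "eval_demo_001", rate)
print(metric_card(warehouse, "eval_demo_001"))
print("ship ok" if check_threshold(warehouse, "eval_demo_001") else "blocked")
```

The point of the shape, not the bodies: each stage takes the previous stage's output, so any stage can be swapped (a different sampler, a different judge) without touching the rest.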

Run metadata makes evaluation results interpretable

Field | Example Value
run_id | eval_2024_03_15_001
dataset_version | traces_v1_full.jsonl
model_version | v1.3
judge_version | judge_v2.1
judge_config_hash | a3f5b8c2
timestamp | 2024-03-15 14:22:00
commit_sha | 7d3a91f
sample_size | 200
sampling_strategy | random
Change any single field → different evaluation
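The metadata record can be captured as a single typed object written with every run. A sketch: the field names follow the table above, but the dataclass and the judge_config_hash helper are assumptions about how you might implement it, not the aieval API.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def config_hash(config: dict) -> str:
    """Stable short hash of a judge config (illustrative derivation)."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]

@dataclass(frozen=True)  # frozen: metadata is immutable once written
class RunMetadata:
    run_id: str
    dataset_version: str
    model_version: str
    judge_version: str
    judge_config_hash: str
    timestamp: str
    commit_sha: str
    sample_size: int
    sampling_strategy: str

meta = RunMetadata(
    run_id="eval_2024_03_15_001",
    dataset_version="traces_v1_full.jsonl",
    model_version="v1.3",
    judge_version="judge_v2.1",
    judge_config_hash=config_hash({"temperature": 0, "rubric": "v2"}),
    timestamp="2024-03-15 14:22:00",
    commit_sha="7d3a91f",
    sample_size=200,
    sampling_strategy="random",
)
print(asdict(meta))
```

Hashing the judge config (rather than storing it inline) gives a compact equality check: two runs are comparable on judging only if their hashes match.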

Match evaluation depth to deployment stage

Stage | Metrics | Intensity
Offline (pre-prod) | All metrics, all segments, all traces | 100%
Shadow (parallel, no users) | Sampled, blocking metrics only | 50%
Beta | Targeted segments, optimization tracked | 30%
Experiment | Statistical comparison, decision metrics | 50%
Production | Alerts on blocking, cost-constrained | 15%
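The ladder can be encoded as configuration so each environment automatically gets the right evaluation depth. The stage names and intensities come from the table; the dict layout and the sample_budget helper are illustrative, not part of aieval.

```python
# Sampling intensity and metric scope per deployment stage (from the table).
EVAL_LADDER = {
    "offline":    {"intensity": 1.00, "metrics": "all"},
    "shadow":     {"intensity": 0.50, "metrics": "blocking"},
    "beta":       {"intensity": 0.30, "metrics": "targeted"},
    "experiment": {"intensity": 0.50, "metrics": "decision"},
    "production": {"intensity": 0.15, "metrics": "blocking"},
}

def sample_budget(stage: str, traces_available: int) -> int:
    """How many traces to evaluate at a given deployment stage."""
    return round(traces_available * EVAL_LADDER[stage]["intensity"])

print(sample_budget("production", 1000))  # 150
print(sample_budget("offline", 200))      # 200
```

Keeping the ladder in one config object means the cost/coverage trade-off is reviewed in one place instead of being re-decided ad hoc per run.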

Can you confidently say Run B is better? What run metadata would you need to know whether the 4-point improvement is real?

Run A: 78% pass rate
judge_v2.1
200 traces from traces_v1_full.jsonl
Run B: 82% pass rate
judge_v2.3
200 traces from traces_v1_full.jsonl
Question: Is Run B better?
Write down your prediction before running the next cell.

Walk through sampling to warehouse storage

Sample 50 traces
aieval.traces.sample(dataset="traces_v1_full.jsonl", n=50)
Apply oracle SQL check + LLM judge
Two-part scoring: execution correctness + narrative quality
Aggregate: 76% composite pass rate
Fraction of traces where both SQL and narrative pass
Package with run metadata + write to eval_runs
run_id, dataset_version, judge_version, sample_size, timestamp, commit_sha
Query back to verify
Confirm record exists with correct metadata in warehouse
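The aggregation step above can be made concrete: a trace passes the composite only if both the oracle SQL check and the LLM judge pass. A toy sketch with made-up verdicts chosen to reproduce the 76% figure; in the exercise the verdicts come from the actual oracle and judge.

```python
# Toy verdicts for 50 traces: (sql_pass, narrative_pass).
# 38 pass both, 7 fail narrative only, 5 fail SQL only -> 76% composite.
verdicts = [(True, True)] * 38 + [(True, False)] * 7 + [(False, True)] * 5

composite = sum(sql and narrative for sql, narrative in verdicts) / len(verdicts)
print(f"{composite:.0%}")  # 76%
```

Note the composite is stricter than either metric alone: SQL correctness here is 45/50 (90%) and narrative quality 43/50 (86%), yet only 76% pass both.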

Build the failure matrix — where the pipeline breaks

          query  intent  context  sql_gen  sql_exec  chart  narrative
query       -      -       -        -        -        -        -
intent      -      -       -        -        -        -        -
context     -      -       -        -        -        -        -
sql_gen     -      -       -        -       23%       -        -
sql_exec    -      -       -        -        -        -        -
chart       -      -       -        -        -        -       18%
narrative   -      -       -        -        -        -        -
Hottest cells = where to focus debugging
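One way to derive the matrix, sketched under the assumption that each trace records the last stage it completed (the real instrumentation from Lesson 2.5 may differ); the toy counts are chosen to reproduce the 23% and 18% hot cells above.

```python
from collections import Counter

STAGES = ["query", "intent", "context", "sql_gen", "sql_exec", "chart", "narrative"]

def failure_matrix(last_completed):
    """Failure rate for each transition a -> b: of the traces that
    completed stage a, the fraction that did not complete stage b."""
    counts = Counter(last_completed)
    # A trace that died at stage s completed every stage up to s,
    # so "reached at least stage i" sums all later-or-equal stages.
    at_least = [sum(n for s, n in counts.items() if STAGES.index(s) >= i)
                for i in range(len(STAGES))]
    return {
        (STAGES[i], STAGES[i + 1]): 1 - at_least[i + 1] / at_least[i]
        for i in range(len(STAGES) - 1) if at_least[i]
    }

# Toy data: each entry is the last stage a trace completed.
data = ["narrative"] * 63 + ["sql_gen"] * 23 + ["chart"] * 14
rates = failure_matrix(data)
print(f"sql_gen -> sql_exec: {rates[('sql_gen', 'sql_exec')]:.0%}")  # 23%
print(f"chart -> narrative: {rates[('chart', 'narrative')]:.0%}")    # 18%
```

Because rates are conditional on reaching the source stage, a hot late-pipeline cell can sit on few traces; check the denominator before prioritizing a cell.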

Design Run B, query both runs, interpret the difference

Field | Run A | Run B
Pass rate | 76% | 81%
Dataset | traces_v1_full.jsonl | traces_v1_full.jsonl
Sample | Random 50 | Filtered: PM users only
Judge version | judge_v2.1 | judge_v2.1
Sample size | 50 | 32
Interpretation
The 5-point improvement is attributable to filtering for PM users, who ask simpler queries with higher pass rates; it is a sampling change, not evidence the system got better
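Querying both runs back from the warehouse is what makes this interpretation possible. A sketch using an in-memory SQLite table as a stand-in for eval_runs; the columns follow the schema fields on the slide and the rows follow the Run A / Run B values above, but the actual warehouse and its client are whatever your stack provides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE eval_runs (
    run_id TEXT PRIMARY KEY,
    dataset_version TEXT,
    judge_version TEXT,
    sampling_strategy TEXT,
    sample_size INTEGER,
    pass_rate REAL)""")
con.executemany("INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?, ?)", [
    ("run_a", "traces_v1_full.jsonl", "judge_v2.1", "random", 50, 0.76),
    ("run_b", "traces_v1_full.jsonl", "judge_v2.1", "pm_users_only", 32, 0.81),
])

# Side-by-side comparison: any field that differs besides pass_rate
# is a candidate explanation for the delta.
rows = list(con.execute(
    "SELECT run_id, sampling_strategy, sample_size, pass_rate "
    "FROM eval_runs ORDER BY run_id"))
for row in rows:
    print(row)
```

Here the query surfaces the confound immediately: dataset and judge match, but sampling_strategy and sample_size do not, so the 5-point delta cannot be read as improvement.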

Run full pipeline + transition matrix + Run B comparison

  • Run the worked example (Run A)
    Sample 50 traces, apply oracle + judge, aggregate, write to warehouse
  • Build transition failure matrix
    Identify top 2 failure transitions, connect to instrumentation from Lesson 2.5
  • Design and execute Run B
    Choose one deliberate change (segment, sample, threshold), run full pipeline, write with new run_id, query both runs
  • Assemble Evaluation Pipeline Blueprint artifact
Base: 20-25 min | Extend: +10-15 min (lineage tracking, custom queries)

Evaluation Pipeline Blueprint with transition failure heatmap

Transition Failure Heatmap
7x7 grid with heat colors
sql_generated → sql_executed: 23% failure rate
eval_runs Schema
run_id, dataset_version, judge_version, timestamp, commit_sha, sample_size
Run A
Run B
Queryable Time Series
Portfolio-ready: infrastructure for repeatable evaluation

Four ways evaluation infrastructure breaks

Failure Mode | Consequence
Results live only in notebooks or Slack | "Has SQL correctness improved?" → no one knows
Run metadata is incomplete | Cannot tell if 87% → 83% is a regression or a methodology change
No lineage tracking | "How has our eval evolved?" → reconstruct from memory
Treating all environments the same | Full suite on prod = $40K/month in judge costs

Metadata decisions, heatmap interpretation, environment ladder

1. Metadata check
Run A: 87% with judge_v2.1. Run B: 91% with judge_v2.3. PM asks: "Should we ship based on this 4-point improvement?" What metadata would you check?
2. Heatmap diagnosis
Transition heatmap shows context_retrieved → sql_generated has 15% failure rate (highest in pipeline). What does this tell you?
3. Environment ladder
Colleague proposes: run full suite (500 traces, 8 metrics, 3 LLM judges) on every production request. Why is this infeasible? What's the alternative?

Pipeline stages + run metadata + transition matrix = decision evidence

Pipeline Stages
1. Sampling
2. Judging
3. Aggregation
4. Storage
5. Reporting
6. Alerts
Run Metadata
run_id
dataset_version
judge_version
timestamp
commit_sha
History tracking
Transition Matrix
7x7 heatmap
AI Data Analyst
pipeline states
Hottest cells
annotated
Decision Evidence: Run A vs. Run B comparison with full provenance

Next: Dataset Lifecycle

How evaluation datasets grow, version, and stay aligned to production without becoming stale
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 1 | aianalystlab.ai