Cost-Aware Evaluation

Week 4 Lesson 3 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which metrics did you classify as blocking vs. optimization, and why does that distinction matter for release decisions?

Blocking Metrics
Optimization Metrics
Why does this classification gate ship decisions?

You're spending 10x more on measurement than on running the system

Monthly inference costs: $3,000
Monthly evaluation costs: $33,000
  • LLM judges: $30,000
  • Other: $3,000
Leadership: "Cut eval costs by 90%, but don't lose visibility into quality."

Uniform sampling wastes budget on low-stakes evaluation while high-stakes metrics are under-covered

10% uniform sampling applied to all traffic
PII violations (0.1% rate)
→ 1,000 queries checked/day
→ Expect ~1 violation/day
→ 90% chance of missing any given violation
General traffic quality
→ Same 10% coverage
→ Budget wasted on low-stakes evaluation
Your budget is wasted on low-stakes evaluation while high-stakes metrics are under-covered.
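The arithmetic above can be checked in a few lines; the 10,000 queries/day figure is the traffic volume used later in the lesson.

```python
# Quick check of the uniform-sampling arithmetic, assuming the deck's
# figures: 10,000 queries/day and uniform 10% sampling.

DAILY_QUERIES = 10_000
SAMPLE_RATE = 0.10          # uniform 10% coverage
VIOLATION_RATE = 0.001      # 0.1% of queries contain a PII violation

checked_per_day = DAILY_QUERIES * SAMPLE_RATE        # 1,000 checked/day
expected_flagged = checked_per_day * VIOLATION_RATE  # ~1 violation surfaced/day
miss_prob = 1 - SAMPLE_RATE                          # any given violation: 90% missed

print(checked_per_day, expected_flagged, miss_prob)
```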

Evaluate more where the decisions are higher stakes

Safety-critical (10% of traffic)
100%
Business-critical (30% of traffic)
40%
General (50% of traffic)
10%
Tail (10% of traffic)
1%
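A minimal sketch of this stratified plan, again assuming the 10,000 queries/day traffic figure used later in the lesson:

```python
# Stratified coverage plan from the slide: (traffic share, coverage rate).
DAILY_QUERIES = 10_000

segments = {
    "safety-critical":   (0.10, 1.00),
    "business-critical": (0.30, 0.40),
    "general":           (0.50, 0.10),
    "tail":              (0.10, 0.01),
}

# Queries evaluated per segment per day.
evaluated = {
    name: DAILY_QUERIES * share * coverage
    for name, (share, coverage) in segments.items()
}

# Traffic-weighted average coverage across all segments.
avg_coverage = sum(share * cov for share, cov in segments.values())

print(evaluated)
print(f"average coverage: {avg_coverage:.1%}")
```

Note how the safety-critical 10% of traffic consumes far more evaluation volume than the tail, by design.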

Run cheap checks first, escalate to expensive checks only when necessary

Tier 1: Rule-based filters ($0)
100% of traffic → Filter obvious failures
Tier 2: Small AI judges ($0.001-0.003)
85% pass Tier 1 → Resolve clear cases
Tier 3: Frontier AI judges ($0.05)
~25% escalate → Handle ambiguous cases
A well-designed cascade reduces average cost per eval by 5-20x.
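In code, the escalation logic is just sequential short-circuiting. This is a minimal sketch: `rule_check`, `small_judge`, and `frontier_judge` are hypothetical stand-ins for your own evaluators, and the per-tier costs and confidence threshold are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.8  # below this, the small judge's verdict is ambiguous

def evaluate(trace, rule_check, small_judge, frontier_judge):
    """Score one trace, escalating to pricier judges only when needed.

    Returns (score, cost). The judge callables are placeholders:
    rule_check(trace) -> bool, small_judge(trace) -> (score, confidence),
    frontier_judge(trace) -> score.
    """
    # Tier 1: free rule-based filter catches obvious failures.
    if not rule_check(trace):
        return 0.0, 0.0
    # Tier 2: cheap small-model judge resolves clear cases (~$0.003).
    score, confidence = small_judge(trace)
    cost = 0.003
    if confidence >= CONFIDENCE_THRESHOLD:
        return score, cost
    # Tier 3: frontier judge handles the ambiguous remainder (~$0.05).
    return frontier_judge(trace), cost + 0.05
```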

Blocking metrics get higher budget allocation because they gate ship decisions

Blocking metrics
  Examples: SQL correctness, narrative faithfulness, PII detection
  Purpose: gate release decisions
  Budget priority: higher
Optimization metrics
  Examples: chart appropriateness, retrieval precision, response latency
  Purpose: track improvement trends
  Budget priority: lower
Blocking metrics gate ship/ramp/hold/rollback decisions → higher budget priority.

Work backward from the budget constraint to maximize decision quality

Monthly cost = Σ (cost_per_eval × daily_traffic × coverage_rate × 30 days)
Given
$3,000/month budget
Given
10,000 queries/day
Given
6 metrics with different costs
Solve for
Coverage rates that maximize decision quality within budget
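The identity above is easy to encode and test against candidate coverage rates. The plan below is one hypothetical candidate, not the worksheet answer.

```python
DAILY_TRAFFIC = 10_000
DAYS = 30
BUDGET = 3_000

def monthly_cost(plan):
    """plan maps metric name -> (cost_per_eval, coverage_rate)."""
    return sum(cost * DAILY_TRAFFIC * cov * DAYS for cost, cov in plan.values())

# One hypothetical candidate for the three priced metrics:
candidate = {
    "sql_correctness":        (0.001, 1.00),
    "retrieval_precision":    (0.010, 0.30),
    "narrative_faithfulness": (0.050, 0.10),
}

cost = monthly_cost(candidate)
print(f"${cost:,.0f}/mo, within budget: {cost <= BUDGET}")
```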

What coverage rate can you afford for each metric if you want to evaluate all three?

Metric Cost/Eval Your Predicted Coverage Monthly Cost
SQL correctness $0.001 ___% $____
Retrieval precision $0.01 ___% $____
Narrative faithfulness $0.05 ___% $____
TOTAL $____ (must be ≤ $3,000)
Write your prediction before running the next cell.

Two metrics account for 90% of the total cost

Metric Type Cost/Eval Daily Cost Monthly Cost
PII detection B $0.000 $0 $0
Response latency O $0.000 $0 $0
SQL correctness B $0.001 $10 $300
Retrieval precision O $0.010 $100 $3,000
Narrative faithfulness B $0.050 $500 $15,000
Chart appropriateness O $0.050 $500 $15,000
TOTAL $1,110 $33,300
LLM judges dominate the budget. You need a 90% reduction.
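The baseline table above can be reproduced in a few lines (100% coverage, 10,000 queries/day, 30 days):

```python
DAILY_TRAFFIC = 10_000
DAYS = 30

# Cost per evaluation for each metric, from the slide.
cost_per_eval = {
    "pii_detection":          0.000,
    "response_latency":       0.000,
    "sql_correctness":        0.001,
    "retrieval_precision":    0.010,
    "narrative_faithfulness": 0.050,
    "chart_appropriateness":  0.050,
}

monthly = {m: c * DAILY_TRAFFIC * DAYS for m, c in cost_per_eval.items()}
total = sum(monthly.values())

# The two $0.05 LLM-judge metrics carry ~90% of the bill.
judge_share = (monthly["narrative_faithfulness"]
               + monthly["chart_appropriateness"]) / total
print(f"total=${total:,.0f}/mo, LLM-judge share={judge_share:.0%}")
```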

Blocking metrics on high-risk segments get the highest coverage

Metric Safety-Critical Business-Critical General Tail
PII detection 100% 100% 100% 100%
SQL correctness 100% 50% 30% 10%
Narrative faithfulness 100% 50% 20% 5%
Retrieval precision 50% 20% 10% 5%
Chart appropriateness 20% 10% 5% 1%
Response latency 100% 100% 100% 100%
Notice: PII detection gets 100% everywhere (costs $0). SQL correctness (blocking) gets higher coverage than chart appropriateness (optimization). Safety-critical segments get 100% coverage for all blocking metrics.

Stratified sampling alone cuts costs by 82%, but you need 90%

Metric Type Avg Coverage Cost/Eval Monthly Cost
PII detection B 100% $0.000 $0
Response latency O 100% $0.000 $0
SQL correctness B 50% $0.001 $150
Retrieval precision O 14% $0.010 $420
Narrative faithfulness B 30% $0.050 $4,500
Chart appropriateness O 7% $0.050 $1,050
TOTAL $6,120
82% reduction ($33,300 → $6,120). Still over budget. Tiered cascade completes the reduction.
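Recomputing the stratified totals from the average coverage rates in the table above:

```python
DAILY_TRAFFIC = 10_000
DAYS = 30
BASELINE = 33_300  # baseline monthly cost at 100% coverage

# metric: (cost_per_eval, average coverage after stratification)
plan = {
    "pii_detection":          (0.000, 1.00),
    "response_latency":       (0.000, 1.00),
    "sql_correctness":        (0.001, 0.50),
    "retrieval_precision":    (0.010, 0.14),
    "narrative_faithfulness": (0.050, 0.30),
    "chart_appropriateness":  (0.050, 0.07),
}

total = sum(c * DAILY_TRAFFIC * cov * DAYS for c, cov in plan.values())
reduction = 1 - total / BASELINE
print(f"${total:,.0f}/mo ({reduction:.0%} reduction)")
```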

A well-designed cascade reduces average cost-per-eval by 71%

100% of sampled traces
Tier 1: Rule check ($0)
15% fail | 85% pass
Tier 2: Small AI judge ($0.003)
60% resolved | 25% ambiguous
Tier 3: Frontier AI judge ($0.05)
Final score
Weighted avg: (15% fail × $0) + (60% resolved × $0.003) + (25% escalated × $0.05) = $0.0143
71% reduction from $0.05/eval
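The weighted average above, encoded directly. (Note the slide's formula charges the small-judge fee only on cases it resolves; if escalated cases also pay that fee, the average rises to about $0.015, still roughly a 70% reduction.)

```python
FRONTIER_COST = 0.05   # cost if every eval went straight to the frontier judge

# (share of traffic, cost incurred) per cascade outcome, from the slide.
tiers = [
    (0.15, 0.000),   # failed the free Tier 1 rule check
    (0.60, 0.003),   # resolved by the small Tier 2 judge
    (0.25, 0.050),   # escalated to the Tier 3 frontier judge
]

avg_cost = sum(share * cost for share, cost in tiers)
reduction = 1 - avg_cost / FRONTIER_COST
print(f"${avg_cost:.4f}/eval ({reduction:.0%} reduction)")
```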

Implement the 3-tier cascade and assemble the Cost Allocation Plan

Base Version (20-25 min)
  • Implement the 3-tier cascade for narrative faithfulness
  • Tune the confidence threshold to eliminate 80%+ of frontier calls
  • Fill in Cost Allocation Plan template
Extend Version (10-15 min)
  • 2-tier cascade (rule → frontier) cost-quality comparison
  • Sensitivity analysis: 3 scenarios
  • Importance sampling: oversample disagreement cases

Cost Allocation Plan: 93% reduction, 100% safety coverage

Baseline
$33,300/mo (LLM judges dominate)
Allocated
$2,157/mo
  • Tiered LLM judges: $1,287 (narrative) + $300 (chart)
  • Remaining metric checks: $570
93% reduction
100% coverage maintained for safety-critical segments on all blocking metrics.

Uniform sampling misses 90% of safety-critical failures

Uniform Sampling
10% coverage across all segments
PII violation rate: 0.1%
1,000 queries checked/day · ~1 violation expected/day
10% chance of catching any given violation
Miss 27 of 30 violations per month
Stratified Sampling
100% coverage for safety-critical segments
All PII violations caught
Budget allocated to high-stakes decisions
General traffic sampled at lower rate
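The monthly comparison, assuming ~30 violations per month in safety-critical traffic (the deck's figure):

```python
violations_per_month = 30

uniform_coverage = 0.10      # same 10% sampling everywhere
stratified_coverage = 1.00   # 100% sampling on the safety-critical segment

caught_uniform = violations_per_month * uniform_coverage        # ~3 caught
caught_stratified = violations_per_month * stratified_coverage  # all 30 caught
missed_uniform = violations_per_month - caught_uniform          # ~27 missed

print(caught_uniform, caught_stratified, missed_uniform)
```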

How would you allocate 20% budget across three traffic segments?

Segment % of Traffic Your Allocated Coverage Reasoning
Onboarding 5% ___%
Core use case 70% ___%
Tail 25% ___%
Constraint
Total budget: 20% average coverage for the narrative faithfulness LLM judge.
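A small helper to sanity-check an allocation against the constraint before committing to it; the coverage values below are placeholders to fill in, not the answer.

```python
def avg_coverage(allocation):
    """allocation maps segment -> (traffic_share, coverage_rate)."""
    return sum(share * cov for share, cov in allocation.values())

your_plan = {
    "onboarding": (0.05, 0.0),   # <- fill in your coverage choices
    "core":       (0.70, 0.0),
    "tail":       (0.25, 0.0),
}

# The traffic-weighted average must stay within the 20% budget.
assert avg_coverage(your_plan) <= 0.20, "over the 20% coverage budget"
```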

Cost-Aware Evaluation Strategy

Traffic Segmentation (10,000 daily queries)
Safety 10%
Business 30%
General 50%
Tail 10%
Metric Coverage by Segment
PII detection
SQL correctness
Narrative faithfulness
Retrieval precision
Chart appropriateness
Cost Breakdown
Baseline
$33.3k
Allocated
$2.2k
Cascade: Narrative Faithfulness
Tier 1: Rules ($0)
100% → 15% fail
Tier 2: Small AI ($0.003)
85% → 60% resolved
Tier 3: Frontier ($0.05)
25% → final score
Avg cost: $0.0143 (71% reduction)
Stratified sampling + tiered cascades reduce costs 93% while preserving 100% coverage for safety-critical segments.

Next: Segmentation strategy for AI systems

You've allocated budget by segment risk. Next: how to define those segments so they capture the right failure modes.
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 3 | aianalystlab.ai