Cost-Aware Evaluation
Week 4 Lesson 3 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab
Which metrics did you classify as blocking vs. optimization, and why does that distinction matter for release decisions?
Blocking Metrics
Optimization Metrics
Why does this classification gate ship decisions?
You're spending 10x more on measurement than on running the system
Monthly inference costs: $3,000
Monthly evaluation costs: $33,000
LLM judges: $30,000 · Other checks: $3,000
Leadership: "Cut eval costs by 90%, but don't lose visibility into quality."
Uniform sampling wastes budget on low-stakes evaluation while high-stakes metrics are under-covered
10% uniform sampling applied to all traffic
PII violations (0.1% rate)
→ 1,000 of 10,000 daily queries checked
→ ~1 violation occurs per day
→ 90% chance each violation is missed
General traffic quality
→ Same 10% coverage
→ Budget wasted on low-stakes evaluation
Your budget is wasted on low-stakes evaluation while high-stakes metrics are under-covered.
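The miss rate can be sanity-checked in a few lines (a minimal sketch using the slide's numbers; variable names are mine):

```python
# Expected PII catches under uniform 10% sampling.
# Assumption from the slide: ~1 violation occurs per day (~30/month).
coverage = 0.10                  # uniform sampling rate
violations_per_month = 30
expected_caught = coverage * violations_per_month         # ~3 caught
expected_missed = violations_per_month - expected_caught  # ~27 missed
miss_prob_per_violation = 1 - coverage                    # 90% per violation
print(expected_missed, miss_prob_per_violation)
```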
Evaluate more where the decisions are higher stakes
Safety-critical (10% of traffic) → 100% coverage
Business-critical (30% of traffic) → 40% coverage
General (50% of traffic) → 10% coverage
Tail (10% of traffic) → 1% coverage
Run cheap checks first, escalate to expensive checks only when necessary
Tier 1: Rule-based filters ($0)
100% of traffic → Filter obvious failures
↓
Tier 2: Small AI judges ($0.001-0.003)
85% pass Tier 1 → Resolve clear cases
↓
Tier 3: Frontier AI judges ($0.05)
~25% escalate → Handle ambiguous cases
A well-designed cascade reduces average cost per eval by 5-20x.
Blocking metrics get higher budget allocation because they gate ship decisions
| | Blocking Metrics | Optimization Metrics |
|---|---|---|
| Examples | SQL correctness · PII detection · Narrative faithfulness | Retrieval precision · Chart appropriateness · Response latency |
| Purpose | Gate release decisions | Track improvement trends |
| Budget Priority | Higher | Lower |
Blocking metrics gate ship/ramp/hold/rollback decisions → higher budget priority.
Work backward from the budget constraint to maximize decision quality
Monthly cost = Σ (cost_per_eval × daily_traffic × coverage_rate × 30 days)
Given: $3,000/month budget · 10,000 queries/day · 6 metrics with different costs
↓
Solve for: coverage rates that maximize decision quality within the budget
What coverage rate can you afford for each metric if you want to evaluate all three?
| Metric | Cost/Eval | Your Predicted Coverage | Monthly Cost |
|---|---|---|---|
| SQL correctness | $0.001 | ___% | $____ |
| Retrieval precision | $0.01 | ___% | $____ |
| Narrative faithfulness | $0.05 | ___% | $____ |
| TOTAL | | | $____ (must be ≤ $3,000) |
Write your prediction before running the next cell.
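One way to check a prediction afterward: if all three metrics shared a single uniform coverage rate, the cost formula pins it down (a sketch; function and variable names are mine):

```python
# Working backward from the budget: what uniform coverage rate
# fits $3,000/month across the three priced metrics above?
DAILY_TRAFFIC = 10_000
DAYS = 30
BUDGET = 3_000.0

cost_per_eval = {
    "sql_correctness": 0.001,
    "retrieval_precision": 0.01,
    "narrative_faithfulness": 0.05,
}

def monthly_cost(coverage):
    """Monthly cost = sum(cost_per_eval * daily_traffic * coverage * days)."""
    return sum(c * DAILY_TRAFFIC * coverage[m] * DAYS
               for m, c in cost_per_eval.items())

full = monthly_cost({m: 1.0 for m in cost_per_eval})  # cost at 100% coverage
uniform = BUDGET / full
print(f"affordable uniform coverage: {uniform:.1%}")
```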
Two metrics account for 90% of the total cost
| Metric | Type | Cost/Eval | Daily Cost | Monthly Cost |
|---|---|---|---|---|
| PII detection | B | $0.000 | $0 | $0 |
| Response latency | O | $0.000 | $0 | $0 |
| SQL correctness | B | $0.001 | $10 | $300 |
| Retrieval precision | O | $0.010 | $100 | $3,000 |
| Narrative faithfulness | B | $0.050 | $500 | $15,000 |
| Chart appropriateness | O | $0.050 | $500 | $15,000 |
| TOTAL | | | $1,110 | $33,300 |

(B = blocking, O = optimization)
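The baseline totals can be reproduced directly (a sketch using the costs above; dictionary keys are mine):

```python
# Reproducing the baseline cost table at 100% coverage everywhere.
DAILY_TRAFFIC, DAYS = 10_000, 30

cost_per_eval = {                     # $ per evaluated query
    "pii_detection": 0.0,             # rule-based, free
    "response_latency": 0.0,          # logged, free
    "sql_correctness": 0.001,
    "retrieval_precision": 0.01,
    "narrative_faithfulness": 0.05,   # LLM judge
    "chart_appropriateness": 0.05,    # LLM judge
}

daily = {m: c * DAILY_TRAFFIC for m, c in cost_per_eval.items()}
monthly_total = sum(daily.values()) * DAYS
print(f"daily: ${sum(daily.values()):,.0f}  monthly: ${monthly_total:,.0f}")
```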
LLM judges dominate the budget. You need a 90% reduction.
Blocking metrics on high-risk segments get the highest coverage
| Metric | Safety-Critical | Business-Critical | General | Tail |
|---|---|---|---|---|
| PII detection | 100% | 100% | 100% | 100% |
| SQL correctness | 100% | 50% | 30% | 10% |
| Narrative faithfulness | 100% | 50% | 20% | 5% |
| Retrieval precision | 50% | 20% | 10% | 5% |
| Chart appropriateness | 20% | 10% | 5% | 1% |
| Response latency | 100% | 100% | 100% | 100% |
Notice: PII detection gets 100% everywhere (costs $0). SQL correctness (blocking) gets higher coverage than chart appropriateness (optimization). Safety-critical segments get 100% coverage for all blocking metrics.
Stratified sampling alone cuts costs by 82%, but you need 90%
| Metric | Type | Avg Coverage | Cost/Eval | Monthly Cost |
|---|---|---|---|---|
| PII detection | B | 100% | $0.000 | $0 |
| Response latency | O | 100% | $0.000 | $0 |
| SQL correctness | B | 50% | $0.001 | $150 |
| Retrieval precision | O | 14% | $0.010 | $420 |
| Narrative faithfulness | B | 30% | $0.050 | $4,500 |
| Chart appropriateness | O | 7% | $0.050 | $1,050 |
| TOTAL | | | | $6,120 |
82% reduction ($33,300 → $6,120). Still over budget. Tiered cascade completes the reduction.
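The stratified totals check out (a sketch using the average coverage rates above; names are mine):

```python
# Stratified-sampling cost check with per-metric average coverage.
DAILY_TRAFFIC, DAYS = 10_000, 30

plan = {  # metric: (cost per eval, average coverage)
    "pii_detection":          (0.0,   1.00),
    "response_latency":       (0.0,   1.00),
    "sql_correctness":        (0.001, 0.50),
    "retrieval_precision":    (0.01,  0.14),
    "narrative_faithfulness": (0.05,  0.30),
    "chart_appropriateness":  (0.05,  0.07),
}

monthly = {m: c * cov * DAILY_TRAFFIC * DAYS for m, (c, cov) in plan.items()}
total = sum(monthly.values())
reduction = 1 - total / 33_300
print(f"total: ${total:,.0f}  reduction: {reduction:.0%}")  # $6,120, 82%
```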
A well-designed cascade reduces average cost-per-eval by 71%
100% of sampled traces
↓
Tier 1: Rule check ($0)
15% fail | 85% pass
↓
Tier 2: Small AI judge ($0.003)
60% resolved | 25% ambiguous
↓
Tier 3: Frontier AI judge ($0.05)
Final score
Weighted avg: (15% fail × $0) + (60% resolved × $0.003) + (25% escalated × $0.05) =
$0.0143
71% reduction from $0.05/eval
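The weighted average can be verified in one expression (per the slide's accounting, where each trace is charged only the cost of the tier that resolves it):

```python
# Weighted-average cost per eval for the 3-tier cascade.
tiers = [  # (fraction of traces ending at this tier, cost charged)
    (0.15, 0.0),    # fails the Tier 1 rule check: free
    (0.60, 0.003),  # resolved by the small judge
    (0.25, 0.05),   # escalated to the frontier judge
]
avg = sum(frac * cost for frac, cost in tiers)
reduction = 1 - avg / 0.05
print(f"avg cost/eval: ${avg:.4f}  reduction: {reduction:.0%}")  # $0.0143, 71%
```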
Implement the 3-tier cascade and assemble the Cost Allocation Plan
Base Version (20-25 min)
Implement the 3-tier cascade for narrative faithfulness
Tune the confidence threshold to eliminate 80%+ frontier calls
Fill in Cost Allocation Plan template
Extend Version (10-15 min)
2-tier cascade (rule → frontier) cost-quality comparison
Sensitivity analysis: 3 scenarios
Importance sampling: oversample disagreement cases
Cost Allocation Plan: 93% reduction, 100% safety coverage
Baseline: $33,300/mo (LLM judges dominate)
Allocated: $2,157/mo (93% reduction)
Tiered LLM judges: narrative $1,287 + chart $300
Cheap checks: SQL correctness $150 + retrieval precision $420
100% coverage maintained for safety-critical segments on all blocking metrics.
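Combining stratified coverage with the cascade reproduces the allocated total (a sketch; it assumes the $0.0143 cascade cost applies to both LLM-judged metrics, as the figures imply):

```python
# Stratified coverage + tiered cascade, combined.
DAILY_TRAFFIC, DAYS = 10_000, 30
CASCADE_COST = 0.0143  # avg cost/eval after tiering

plan = {  # metric: (cost per eval, average coverage)
    "sql_correctness":        (0.001,        0.50),
    "retrieval_precision":    (0.01,         0.14),
    "narrative_faithfulness": (CASCADE_COST, 0.30),
    "chart_appropriateness":  (CASCADE_COST, 0.07),
}

monthly = {m: round(c * cov * DAILY_TRAFFIC * DAYS)
           for m, (c, cov) in plan.items()}
total = sum(monthly.values())
print(monthly, f"total: ${total:,}")  # total: $2,157
```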
Uniform sampling misses 90% of safety-critical failures
Uniform Sampling
10% coverage across all segments
PII violation rate: 0.1% → ~1 violation/day (~30/month)
1,000 queries checked/day (10% uniform sample)
Each violation has a 10% chance of being caught
Miss ~27 of 30 violations per month
Stratified Sampling
100% coverage for safety-critical segments
All PII violations caught
Budget allocated to high-stakes decisions
General traffic sampled at lower rate
How would you allocate 20% budget across three traffic segments?
| Segment | % of Traffic | Your Allocated Coverage | Reasoning |
|---|---|---|---|
| Onboarding | 5% | ___% | |
| Core use case | 70% | ___% | |
| Tail | 25% | ___% | |
Constraint: the total budget allows a 20% average coverage rate for the narrative-faithfulness LLM judge.
Cost-Aware Evaluation Strategy
Traffic Segmentation (10,000 daily queries)
Safety 10%
Business 30%
General 50%
Tail 10%
Metric Coverage by Segment
PII detection
SQL correctness
Narrative faith.
Retrieval prec.
Chart appropr.
Cost Breakdown
Baseline
$33.3k
Allocated
$2.2k
Cascade: Narrative Faithfulness
Tier 1: Rules ($0)
100% → 15% fail
↓
Tier 2: Small AI ($0.003)
85% → 60% resolved
↓
Tier 3: Frontier ($0.05)
25% → final score
Avg cost: $0.0143 (71% reduction)
Stratified sampling + tiered cascades reduce costs 93% while preserving 100% coverage for safety-critical segments.
Next: Segmentation strategy for AI systems
You've allocated budget by segment risk. Next: how to define those segments so they capture the right failure modes.