Cost-Aware Evaluation

Week 4 Lesson 3 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which metrics did you classify as blocking vs. optimization, and why does that distinction matter for release decisions?

Blocking Metrics
Optimization Metrics
Why does this classification gate ship decisions?

You're spending 10x more on measurement than on running the system

Monthly inference costs: $3,000
Monthly evaluation costs: $33,000
  • LLM judges: $30,000
  • Other: $3,000
Leadership: "Cut eval costs by 90%, but don't lose visibility into quality."

Uniform sampling wastes budget on low-stakes evaluation while high-stakes metrics are under-covered

10% uniform sampling applied to all traffic
PII violations (0.1% rate)
→ 1,000 queries checked/day
→ Expect ~1 violation/day
→ 90% chance of missing any given violation
General traffic quality
→ Same 10% coverage
→ Budget wasted on low-stakes evaluation
Your budget is wasted on low-stakes evaluation while high-stakes metrics are under-covered.
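The arithmetic above can be checked in a few lines; the 10,000 queries/day figure is the traffic volume used later in the lesson.

```python
# Quick check of the uniform-sampling arithmetic, assuming the deck's
# figures: 10,000 queries/day and uniform 10% sampling.

DAILY_QUERIES = 10_000
SAMPLE_RATE = 0.10          # uniform 10% coverage
VIOLATION_RATE = 0.001      # 0.1% of queries contain a PII violation

checked_per_day = DAILY_QUERIES * SAMPLE_RATE        # 1,000 checked/day
expected_flagged = checked_per_day * VIOLATION_RATE  # ~1 violation surfaced/day
miss_prob = 1 - SAMPLE_RATE                          # any given violation: 90% missed

print(checked_per_day, expected_flagged, miss_prob)
```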

Evaluate more where the decisions are higher stakes

Safety-critical (10% of traffic)
100%
Business-critical (30% of traffic)
40%
General (50% of traffic)
10%
Tail (10% of traffic)
1%
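A minimal sketch of this stratified plan, again assuming the 10,000 queries/day traffic figure used later in the lesson:

```python
# Stratified coverage plan from the slide: (traffic share, coverage rate).
DAILY_QUERIES = 10_000

segments = {
    "safety-critical":   (0.10, 1.00),
    "business-critical": (0.30, 0.40),
    "general":           (0.50, 0.10),
    "tail":              (0.10, 0.01),
}

# Queries evaluated per segment per day.
evaluated = {
    name: DAILY_QUERIES * share * coverage
    for name, (share, coverage) in segments.items()
}

# Traffic-weighted average coverage across all segments.
avg_coverage = sum(share * cov for share, cov in segments.values())

print(evaluated)
print(f"average coverage: {avg_coverage:.1%}")
```

Note how the safety-critical 10% of traffic consumes far more evaluation volume than the tail, by design.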

Run cheap checks first, escalate to expensive checks only when necessary

Tier 1: Rule-based filters ($0)
100% of traffic → Filter obvious failures
Tier 2: Small AI judges ($0.001-0.003)
85% pass Tier 1 → Resolve clear cases
Tier 3: Frontier AI judges ($0.05)
~25% escalate → Handle ambiguous cases
A well-designed cascade reduces average cost per eval by 5-20x.
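In code, the escalation logic is just sequential short-circuiting. This is a minimal sketch: `rule_check`, `small_judge`, and `frontier_judge` are hypothetical stand-ins for your own evaluators, and the per-tier costs and confidence threshold are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.8  # below this, the small judge's verdict is ambiguous

def evaluate(trace, rule_check, small_judge, frontier_judge):
    """Score one trace, escalating to pricier judges only when needed.

    Returns (score, cost). The judge callables are placeholders:
    rule_check(trace) -> bool, small_judge(trace) -> (score, confidence),
    frontier_judge(trace) -> score.
    """
    # Tier 1: free rule-based filter catches obvious failures.
    if not rule_check(trace):
        return 0.0, 0.0
    # Tier 2: cheap small-model judge resolves clear cases (~$0.003).
    score, confidence = small_judge(trace)
    cost = 0.003
    if confidence >= CONFIDENCE_THRESHOLD:
        return score, cost
    # Tier 3: frontier judge handles the ambiguous remainder (~$0.05).
    return frontier_judge(trace), cost + 0.05
```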

Blocking metrics get higher budget allocation because they gate ship decisions

Blocking metrics
  Examples: SQL correctness, narrative faithfulness, PII detection
  Purpose: gate release decisions
  Budget priority: higher
Optimization metrics
  Examples: chart appropriateness, retrieval precision, response latency
  Purpose: track improvement trends
  Budget priority: lower
Blocking metrics gate ship/ramp/hold/rollback decisions → higher budget priority.

Work backward from the budget constraint to maximize decision quality

Monthly cost = Σ (cost_per_eval × daily_traffic × coverage_rate × 30 days)
Given
$3,000/month budget
Given
10,000 queries/day
Given
6 metrics with different costs
Solve for
Coverage rates that maximize decision quality within budget
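The identity above is easy to encode and test against candidate coverage rates. The plan below is one hypothetical candidate, not the worksheet answer.

```python
DAILY_TRAFFIC = 10_000
DAYS = 30
BUDGET = 3_000

def monthly_cost(plan):
    """plan maps metric name -> (cost_per_eval, coverage_rate)."""
    return sum(cost * DAILY_TRAFFIC * cov * DAYS for cost, cov in plan.values())

# One hypothetical candidate for the three priced metrics:
candidate = {
    "sql_correctness":        (0.001, 1.00),
    "retrieval_precision":    (0.010, 0.30),
    "narrative_faithfulness": (0.050, 0.10),
}

cost = monthly_cost(candidate)
print(f"${cost:,.0f}/mo, within budget: {cost <= BUDGET}")
```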

What coverage rate can you afford for each metric if you want to evaluate all three?

Metric Cost/Eval Your Predicted Coverage Monthly Cost
SQL correctness $0.001 ___% $____
Retrieval precision $0.01 ___% $____
Narrative faithfulness $0.05 ___% $____
TOTAL $____ (must be ≤ $3,000)
Write your prediction before running the next cell.

Two metrics account for 90% of the total cost

Metric Type Cost/Eval Daily Cost Monthly Cost
PII detection B $0.000 $0 $0
Response latency O $0.000 $0 $0
SQL correctness B $0.001 $10 $300
Retrieval precision O $0.010 $100 $3,000
Narrative faithfulness B $0.050 $500 $15,000
Chart appropriateness O $0.050 $500 $15,000
TOTAL $1,110 $33,300
LLM judges dominate the budget. You need a 90% reduction.
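The baseline table above can be reproduced in a few lines (100% coverage, 10,000 queries/day, 30 days):

```python
DAILY_TRAFFIC = 10_000
DAYS = 30

# Cost per evaluation for each metric, from the slide.
cost_per_eval = {
    "pii_detection":          0.000,
    "response_latency":       0.000,
    "sql_correctness":        0.001,
    "retrieval_precision":    0.010,
    "narrative_faithfulness": 0.050,
    "chart_appropriateness":  0.050,
}

monthly = {m: c * DAILY_TRAFFIC * DAYS for m, c in cost_per_eval.items()}
total = sum(monthly.values())

# The two $0.05 LLM-judge metrics carry ~90% of the bill.
judge_share = (monthly["narrative_faithfulness"]
               + monthly["chart_appropriateness"]) / total
print(f"total=${total:,.0f}/mo, LLM-judge share={judge_share:.0%}")
```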

Blocking metrics on high-risk segments get the highest coverage

Metric Safety-Critical Business-Critical General Tail
PII detection 100% 100% 100% 100%
SQL correctness 100% 50% 30% 10%
Narrative faithfulness 100% 50% 20% 5%
Retrieval precision 50% 20% 10% 5%
Chart appropriateness 20% 10% 5% 1%
Response latency 100% 100% 100% 100%
Notice: PII detection gets 100% everywhere (costs $0). SQL correctness (blocking) gets higher coverage than chart appropriateness (optimization). Safety-critical segments get 100% coverage for all blocking metrics.

Stratified sampling alone cuts costs by 82%, but you need 90%

Metric Type Avg Coverage Cost/Eval Monthly Cost
PII detection B 100% $0.000 $0
Response latency O 100% $0.000 $0
SQL correctness B 50% $0.001 $150
Retrieval precision O 14% $0.010 $420
Narrative faithfulness B 30% $0.050 $4,500
Chart appropriateness O 7% $0.050 $1,050
TOTAL $6,120
82% reduction ($33,300 → $6,120). Still over budget. Tiered cascade completes the reduction.
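Recomputing the stratified totals from the average coverage rates in the table above:

```python
DAILY_TRAFFIC = 10_000
DAYS = 30
BASELINE = 33_300  # baseline monthly cost at 100% coverage

# metric: (cost_per_eval, average coverage after stratification)
plan = {
    "pii_detection":          (0.000, 1.00),
    "response_latency":       (0.000, 1.00),
    "sql_correctness":        (0.001, 0.50),
    "retrieval_precision":    (0.010, 0.14),
    "narrative_faithfulness": (0.050, 0.30),
    "chart_appropriateness":  (0.050, 0.07),
}

total = sum(c * DAILY_TRAFFIC * cov * DAYS for c, cov in plan.values())
reduction = 1 - total / BASELINE
print(f"${total:,.0f}/mo ({reduction:.0%} reduction)")
```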

A well-designed cascade reduces average cost-per-eval by 71%

100% of sampled traces
Tier 1: Rule check ($0)
15% fail | 85% pass
Tier 2: Small AI judge ($0.003)
60% resolved | 25% ambiguous
Tier 3: Frontier AI judge ($0.05)
Final score
Weighted avg: (15% fail × $0) + (60% resolved × $0.003) + (25% escalated × $0.05) = $0.0143
71% reduction from $0.05/eval
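The weighted average above, encoded directly. (Note the slide's formula charges the small-judge fee only on cases it resolves; if escalated cases also pay that fee, the average rises to about $0.015, still roughly a 70% reduction.)

```python
FRONTIER_COST = 0.05   # cost if every eval went straight to the frontier judge

# (share of traffic, cost incurred) per cascade outcome, from the slide.
tiers = [
    (0.15, 0.000),   # failed the free Tier 1 rule check
    (0.60, 0.003),   # resolved by the small Tier 2 judge
    (0.25, 0.050),   # escalated to the Tier 3 frontier judge
]

avg_cost = sum(share * cost for share, cost in tiers)
reduction = 1 - avg_cost / FRONTIER_COST
print(f"${avg_cost:.4f}/eval ({reduction:.0%} reduction)")
```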

Implement the 3-tier cascade and assemble the Cost Allocation Plan

Base Version (20-25 min)
  • Implement the 3-tier cascade for narrative faithfulness
  • Tune the confidence threshold to eliminate 80%+ of frontier calls
  • Fill in Cost Allocation Plan template
Extend Version (10-15 min)
  • 2-tier cascade (rule → frontier) cost-quality comparison
  • Sensitivity analysis: 3 scenarios
  • Importance sampling: oversample disagreement cases

Cost Allocation Plan: 93% reduction, 100% safety coverage

Baseline
$33,300/mo (LLM judges dominate)
Allocated
$2,157/mo
  • Tiered LLM judges: $1,287 (narrative) + $300 (chart)
  • Remaining metric checks: $570
93% reduction
100% coverage maintained for safety-critical segments on all blocking metrics.

Uniform sampling misses 90% of safety-critical failures

Uniform Sampling
10% coverage across all segments
PII violation rate: 0.1%
1,000 queries checked/day · ~1 violation expected/day
10% chance of catching any given violation
Miss 27 of 30 violations per month
Stratified Sampling
100% coverage for safety-critical segments
All PII violations caught
Budget allocated to high-stakes decisions
General traffic sampled at lower rate
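The monthly comparison, assuming ~30 violations per month in safety-critical traffic (the deck's figure):

```python
violations_per_month = 30

uniform_coverage = 0.10      # same 10% sampling everywhere
stratified_coverage = 1.00   # 100% sampling on the safety-critical segment

caught_uniform = violations_per_month * uniform_coverage        # ~3 caught
caught_stratified = violations_per_month * stratified_coverage  # all 30 caught
missed_uniform = violations_per_month - caught_uniform          # ~27 missed

print(caught_uniform, caught_stratified, missed_uniform)
```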

How would you allocate 20% budget across three traffic segments?

Segment % of Traffic Your Allocated Coverage Reasoning
Onboarding 5% ___%
Core use case 70% ___%
Tail 25% ___%
Constraint
Total budget: 20% average coverage for the narrative faithfulness LLM judge.
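A small helper to sanity-check an allocation against the constraint before committing to it; the coverage values below are placeholders to fill in, not the answer.

```python
def avg_coverage(allocation):
    """allocation maps segment -> (traffic_share, coverage_rate)."""
    return sum(share * cov for share, cov in allocation.values())

your_plan = {
    "onboarding": (0.05, 0.0),   # <- fill in your coverage choices
    "core":       (0.70, 0.0),
    "tail":       (0.25, 0.0),
}

# The traffic-weighted average must stay within the 20% budget.
assert avg_coverage(your_plan) <= 0.20, "over the 20% coverage budget"
```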

Cost-Aware Evaluation Strategy

Traffic Segmentation (10,000 daily queries)
Safety 10%
Business 30%
General 50%
Tail 10%
Metric Coverage by Segment
PII detection
SQL correctness
Narrative faithfulness
Retrieval precision
Chart appropriateness
Cost Breakdown
Baseline
$33.3k
Allocated
$2.2k
Cascade: Narrative Faithfulness
Tier 1: Rules ($0)
100% → 15% fail
Tier 2: Small AI ($0.003)
85% → 60% resolved
Tier 3: Frontier ($0.05)
25% → final score
Avg cost: $0.0143 (71% reduction)
Stratified sampling + tiered cascades reduce costs 93% while preserving 100% coverage for safety-critical segments.

Next: Segmentation strategy for AI systems

You've allocated budget by segment risk. Next: how to define those segments so they capture the right failure modes.
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 3 | aianalystlab.ai