Cost-latency-quality tradeoffs

Week 1 · Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

If pass@k is high but reliable@k is low, what does that tell you about shipping?

Capability (pass@k) = HIGH
The system can produce correct answers — at least 1 of k trials succeeds.
Consistency (reliable@k) = LOW
Users experience inconsistency — most sessions include at least one failure.
Now add cost and latency to the picture. The tradeoffs compound.
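The gap between the two metrics falls out of simple probability. A minimal sketch, assuming independent trials with a fixed per-trial success rate (the values of p and k below are illustrative, not the lesson's data):

```python
# pass@k vs reliable@k under independent trials with success rate p.

def pass_at_k(p: float, k: int) -> float:
    """P(at least 1 of k trials succeeds) -- capability."""
    return 1 - (1 - p) ** k

def reliable_at_k(p: float, k: int) -> float:
    """P(all k trials succeed) -- consistency; one failure breaks the session."""
    return p ** k

p, k = 0.7, 5
print(f"pass@{k}     = {pass_at_k(p, k):.3f}")      # ~0.998 (high)
print(f"reliable@{k} = {reliable_at_k(p, k):.3f}")  # ~0.168 (low)
```

At a 70% per-trial success rate, almost every batch of 5 contains a success, yet fewer than 1 in 5 sessions is failure-free — exactly the high-capability, low-consistency pattern described above.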

Three constraints, zero configurations that satisfy all of them

<2s
Latency SLA
95% of responses (p95) must be under 2 seconds.
$500
Budget Constraint
$500/month maximum at ~1K queries/day (~30K queries/month).
85%
Quality Floor
SQL correctness (does the AI write working queries?) must be at least 85%.
The AI Data Analyst v0 uses a frontier model for all operations. Cost: ~$0.08/query ($2,400/month at scale). Latency: p95 of 3.2s. Quality: 88% SQL correctness. It meets quality — but violates cost by 5x and latency by 60%.
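The three constraints reduce to a simple feasibility check. A sketch, using a query volume of ~30K/month inferred from the deck's monthly figures ($0.08/query → $2,400/month); the v0 numbers are from this slide:

```python
# Check one configuration against the three constraints.
QUERIES_PER_MONTH = 30_000   # inferred from $0.08/query -> $2,400/month
BUDGET = 500                 # $/month
LATENCY_SLA = 2.0            # p95 seconds
QUALITY_FLOOR = 0.85         # SQL correctness

def check(cost_per_query: float, p95_latency: float, quality: float) -> dict:
    monthly = cost_per_query * QUERIES_PER_MONTH
    return {
        "cost": monthly <= BUDGET,
        "latency": p95_latency <= LATENCY_SLA,
        "quality": quality >= QUALITY_FLOOR,
    }

# v0: all GPT-4o
print(check(0.08, 3.2, 0.88))
# -> {'cost': False, 'latency': False, 'quality': True}
```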

Picking the "best" model once and never revisiting is how teams overspend

Pick model
Most capable you can afford
->
Ship it
Move to next feature
->
Never revisit
Budget blown. Latency violated. Quality degraded.
Cost, latency, and quality are not independent dials. They form a tradeoff surface: improving one dimension often means sacrificing another.

Cost, latency, and quality form a surface — improving one means sacrificing another

Cost-Latency-Quality Frontier
The best possible tradeoffs. Every system lives on or below this frontier.
On the Frontier
Improving one dimension requires sacrificing another. These are true tradeoffs worth debating.
Below the Frontier
Dominated configurations — another config is better without being worse. Replace without debate.
The question is not 'which model is best?' — it is 'which position on the frontier satisfies my constraints?'

Quick check: Is GPT-4o-mini dominated by GPT-4o?

GPT-4o-mini is 10x cheaper and twice as fast — but scores 6 points lower on quality. Is mini dominated by GPT-4o? Think about it before advancing.
The answer depends on something you have not considered yet.

A dominated configuration is strictly worse — replace it without debate

DOMINATED
Configuration A is dominated by B if B is better on at least one dimension and no worse on the others.
Whether this is 'dominated' depends on whether the quality floor is a hard constraint.
TRUE TRADEOFF
Both sit on the Pareto frontier. Improving one dimension requires accepting a loss in another.
Constraints break the tie.
Switching from GPT-4o to mini cuts cost by 10x and latency by half — but drops quality 6 points, to 82%, which is 3 points below the 85% floor. Context determines whether a comparison is dominance or a true tradeoff.
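The dominance definition above translates directly to code. A minimal sketch (lower cost and latency are better, higher quality is better; the figures are the deck's GPT-4o vs mini numbers):

```python
# Dominance test between two configurations.
# a dominates b if a is at least as good on every dimension
# and strictly better on at least one.

def dominates(a: dict, b: dict) -> bool:
    at_least_as_good = (a["cost"] <= b["cost"]
                        and a["latency"] <= b["latency"]
                        and a["quality"] >= b["quality"])
    strictly_better = (a["cost"] < b["cost"]
                       or a["latency"] < b["latency"]
                       or a["quality"] > b["quality"])
    return at_least_as_good and strictly_better

gpt4o = {"cost": 0.08,  "latency": 3.2, "quality": 0.88}
mini  = {"cost": 0.008, "latency": 1.6, "quality": 0.82}

# Neither dominates: mini wins cost and latency, GPT-4o wins quality.
print(dominates(mini, gpt4o), dominates(gpt4o, mini))  # False False
```

On the raw dimensions this is a true tradeoff; only a hard quality floor turns it into a settled question.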

Route 80% of queries to cheap models, reserve expensive models for the 20% that need them

Query
Incoming
->
Complexity Filter
Simple or complex?
->
Simple (~80%)
GPT-4o-mini $0.008
or
Complex (~20%)
GPT-4o $0.08
->
Response
To user
Routing uses observable query features (query length, question type) — not an AI quality check per query.
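A minimal routing sketch. The word-count threshold and complexity markers below are illustrative assumptions, not the lesson's actual rule — the point is that the filter reads observable query features only, never a per-query AI quality check:

```python
# Route on observable features: query length and keyword markers.
CHEAP, EXPENSIVE = "gpt-4o-mini", "gpt-4o"

def route(query: str) -> str:
    words = query.lower().split()
    # Markers that suggest complex SQL (illustrative, not exhaustive).
    complex_markers = {"join", "window", "cohort", "percentile", "forecast"}
    if len(words) > 25 or complex_markers & set(words):
        return EXPENSIVE   # ~20% of traffic
    return CHEAP           # ~80% of traffic

print(route("total revenue last month"))                       # gpt-4o-mini
print(route("cohort retention with a 90th percentile cutoff")) # gpt-4o
```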

A $500/month feature cannot have a $5K/month evaluation pipeline

Evaluation Cost Blindness
Routing saves $500/month on inference but requires $2K/month in automated quality checks (judge calls). Net: +$1,500/month.
Constraint-Aware Evaluation
Route on observable features. Validate with a 10-20% sample. Evaluation cost stays inside budget.
Budget constraints affect how many automated quality checks you can afford, which determines metric coverage. Your evaluation strategy must match your product constraints.
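A sketch of how sampling keeps evaluation cost inside budget. The $0.07 judge-call price is an assumption (roughly a frontier-model call); at that price, judging every query costs about the $2K/month this slide warns against:

```python
# Evaluation cost under sampled judging.
QUERIES_PER_MONTH = 30_000
JUDGE_COST = 0.07   # $/judge call (assumed frontier-model price)

def eval_cost(sample_rate: float) -> float:
    return QUERIES_PER_MONTH * sample_rate * JUDGE_COST

print(f"judge every query: ${eval_cost(1.0):,.0f}/mo")   # $2,100/mo
print(f"10% sample:        ${eval_cost(0.10):,.0f}/mo")  # $210/mo
print(f"20% sample:        ${eval_cost(0.20):,.0f}/mo")  # $420/mo
```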

Will the hybrid cut cost by 10%, 50%, or 90%?

The AI Data Analyst v0 uses GPT-4o for all operations: SQL generation, narrative generation, chart decisions. You switch SQL generation only to GPT-4o-mini (10x cheaper). GPT-4o stays for narrative and charts.
  • A. Cost drops ~10% — SQL is a small fraction of total cost
  • B. Cost drops ~50% — SQL is roughly half the work
  • C. Cost drops ~90% — SQL is the dominant cost driver
Write your prediction and reasoning before the next slide.

v0 violates cost by 5x and latency by 60% — but meets the quality floor

Configuration | $/Query | Monthly | p95 Latency | Quality
All GPT-4o (v0) | $0.08 | $2,400 | 3.2s | 88%
All GPT-4o-mini | $0.008 | $240 | 1.6s | 82%
All Gemini Flash | $0.005 | $150 | 1.2s | 80%
No single-model configuration satisfies all three constraints. v0 meets quality but blows cost and latency. Cheaper models meet cost and latency but fail the quality floor.

No single model satisfies all three constraints

v0: ALL GPT-4o
Cost: $0.08/query — FAIL (5x over)
Latency: 3.2s p95 — FAIL (60% over)
Quality: 88% — PASS
GPT-4o-mini
Cost: $0.008/query — PASS
Latency: 1.6s p95 — PASS
Quality: 82% — FAIL (3pt below)
Mini beats v0 on cost and latency but loses on quality: on the raw dimensions that is a true tradeoff, not dominance. The quality floor changes the calculus. Context determines whether a comparison is dominance or a true tradeoff.

The hybrid saves ~40%, not 90% — SQL is only 40% of total compute

Predicted
~90%
Actual
~40%
Why?
SQL = 40%
SQL (switched to mini)
Narrative + charts (still GPT-4o)
10x cheaper per query does not mean 10x cheaper overall. New cost per query = 60% × $0.08 (narrative + charts, unchanged) + 40% × $0.08 ÷ 10 (SQL on mini) = $0.048 + $0.0032 ≈ $0.05. That is a 36% saving (roughly 40%), or ~$860/month, not ~$2,160. The savings only apply to the fraction of compute you switch.
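Working the arithmetic exactly:

```python
# Hybrid savings: switch only the SQL step (40% of compute)
# to a model 10x cheaper, keep the rest on the frontier model.
BASELINE = 0.08            # $/query, all GPT-4o
SQL_SHARE = 0.40           # fraction of compute that is SQL generation
QUERIES_PER_MONTH = 30_000

sql  = BASELINE * SQL_SHARE / 10     # SQL now 10x cheaper
rest = BASELINE * (1 - SQL_SHARE)    # narrative + charts unchanged
hybrid = sql + rest

print(f"hybrid cost:    ${hybrid:.4f}/query")                 # $0.0512/query
print(f"fraction saved: {1 - hybrid / BASELINE:.0%}")         # 36%
print(f"monthly saving: ${(BASELINE - hybrid) * QUERIES_PER_MONTH:,.0f}")  # $864
```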

Benchmark three configurations and fill the Decision Template

Base Version (all students, 18-22 min)
1. Calculate cost/query from v0 traces (many have missing data)
2. Calculate latency distribution (p50, p95, p99)
3. Calculate SQL execution success rate
4. Populate benchmark table (v0, mini, Flash)
5. Identify dominated configurations
6. Fill Model Selection Decision Template
Extend Version (DS/Eng, +10-15 min)
1. Analyze query distribution by question type
2. Define routing rule in pseudocode
3. Compute weighted cost/latency/quality
4. Add routing config row to benchmark table
5. Refine routing rule to meet all constraints
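Extend step 3 can be sketched as a traffic-weighted mix. One caveat worth noticing: the naive weighted quality (83.2% below) undershoots the ~86% a routed hybrid actually achieves, because mini scores higher than its overall 82% on the simple queries it receives. Weighted p95 latency needs the full latency distribution, not a weighted average, so it is omitted here:

```python
# Traffic-weighted cost and quality under an 80/20 routing split.
split   = {"gpt-4o-mini": 0.80,  "gpt-4o": 0.20}
cost    = {"gpt-4o-mini": 0.008, "gpt-4o": 0.08}   # $/query
quality = {"gpt-4o-mini": 0.82,  "gpt-4o": 0.88}   # overall SQL correctness

w_cost = sum(split[m] * cost[m] for m in split)
w_qual = sum(split[m] * quality[m] for m in split)

print(f"weighted cost:    ${w_cost:.4f}/query")  # $0.0224/query
print(f"weighted quality: {w_qual:.1%}")         # 83.2% (naive estimate)
```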

Your benchmark table and decision template are the evidence for Week 4

Model Selection Decision Template
Recommended
Hybrid routing / all mini / renegotiate floor
Constraint Checks
Cost PASS/FAIL · Latency PASS/FAIL · Quality PASS/FAIL
Tradeoff Reasoning
What you sacrifice and why it is acceptable
Risks
What could change this recommendation
What I built
A model selection decision with benchmarks and explicit tradeoff reasoning. This table is the input for Week 4 release criteria.

Three ways teams make bad model decisions

Failure Mode | What Teams Do | What Goes Wrong
Optimize one dimension | Choose cheapest: all mini saves $2,160/mo | Quality drops to 82%, below the 85% floor
Confuse averages with tails | Report "latency improved to 1.5s" (mean) | p95 regressed from 3.2s to 4.1s
Evaluation cost blindness | Routing saves $500/mo, quality checks cost $2K/mo | Net cost: +$1,500/month
All three failures share one root cause: treating dimensions as independent when they interact. The cost-latency-quality framework forces you to check all three simultaneously.
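Failure mode 2 in one sketch: the mean can improve while the tail regresses. The latency values are illustrative, not from the lesson's traces:

```python
import math

def p95(xs: list) -> float:
    """Nearest-rank 95th percentile."""
    xs = sorted(xs)
    rank = math.ceil(0.95 * len(xs))
    return xs[rank - 1]

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

before = [1.8] * 18 + [3.2] * 2   # mean 1.94s, p95 3.2s
after  = [1.3] * 18 + [4.1] * 2   # mean 1.58s, p95 4.1s

print(f"mean: {mean(before):.2f}s -> {mean(after):.2f}s (improved)")
print(f"p95:  {p95(before):.1f}s -> {p95(after):.1f}s  (regressed)")
```

Reporting the mean alone would call this change a win; the 10% of users in the tail got noticeably slower.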

Three tradeoff scenarios that require judgment, not lookup

Scenario 1
A model has mean latency 1.2s, p95 latency 2.8s. Your SLA is "<2s for 95% of queries." Does it meet the SLA?
Scenario 2
Switching from GPT-4o to mini saves $4K/month on inference but costs $1.5K/month in automated quality checks. Net savings? What tradeoff are you making?
Scenario 3
A colleague says "Use the cheapest model that meets our quality floor." What are two risks this heuristic misses?

Four configurations, three constraints, one feasible region

Configuration | Cost | Latency | Quality | Verdict
All GPT-4o | $0.08 | 3.2s | 88% | FAIL cost+lat
All GPT-4o-mini | $0.008 | 1.6s | 82% | FAIL quality
All Gemini Flash | $0.005 | 1.2s | 80% | FAIL quality
Hybrid routing | ~$0.03 | ~1.8s | ~86% | PASS (all 3)
Only hybrid routing lands in the feasible region. No single model satisfies all three constraints — the solution requires splitting your traffic.

Next: What to log and why it matters

You measured cost, latency, and quality from v0 traces — but those traces had missing data. Many were missing token counts, model names, and response times. Your benchmark table is built on incomplete evidence. Next lesson: what every trace needs, what schema to use, and how to design instrumentation that supports the evaluation system you are building.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 5 | aianalystlab.ai