Cost-latency-quality tradeoffs

Week 1 · Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

If pass@k is high but reliable@k is low, what does that tell you about shipping?

Capability (pass@k) = HIGH
The system can produce correct answers — at least 1 of k trials succeeds.
Consistency (reliable@k) = LOW
Users experience inconsistency — most sessions include at least one failure.
Now add cost and latency to the picture. The tradeoffs compound.
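The gap between the two metrics falls out of simple probability. A minimal sketch, assuming independent trials with a fixed per-trial success rate (the values of p and k below are illustrative, not the lesson's data):

```python
# pass@k vs reliable@k under independent trials with success rate p.

def pass_at_k(p: float, k: int) -> float:
    """P(at least 1 of k trials succeeds) -- capability."""
    return 1 - (1 - p) ** k

def reliable_at_k(p: float, k: int) -> float:
    """P(all k trials succeed) -- consistency; one failure breaks the session."""
    return p ** k

p, k = 0.7, 5
print(f"pass@{k}     = {pass_at_k(p, k):.3f}")      # ~0.998 (high)
print(f"reliable@{k} = {reliable_at_k(p, k):.3f}")  # ~0.168 (low)
```

At a 70% per-trial success rate, almost every batch of 5 contains a success, yet fewer than 1 in 5 sessions is failure-free — exactly the high-capability, low-consistency pattern described above.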

Three constraints, zero configurations that satisfy all of them

<2s
Latency SLA
95% of responses (p95) must be under 2 seconds.
$500
Budget Constraint
$500/month maximum at ~1K queries/day (~30K queries/month).
85%
Quality Floor
SQL correctness (does the AI write working queries?) must be at least 85%.
The AI Data Analyst v0 uses a frontier model for all operations. Cost: ~$0.08/query ($2,400/month at scale). Latency: p95 of 3.2s. Quality: 88% SQL correctness. It meets quality — but violates cost by 5x and latency by 60%.
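The three constraints reduce to a simple feasibility check. A sketch, using a query volume of ~30K/month inferred from the deck's monthly figures ($0.08/query → $2,400/month); the v0 numbers are from this slide:

```python
# Check one configuration against the three constraints.
QUERIES_PER_MONTH = 30_000   # inferred from $0.08/query -> $2,400/month
BUDGET = 500                 # $/month
LATENCY_SLA = 2.0            # p95 seconds
QUALITY_FLOOR = 0.85         # SQL correctness

def check(cost_per_query: float, p95_latency: float, quality: float) -> dict:
    monthly = cost_per_query * QUERIES_PER_MONTH
    return {
        "cost": monthly <= BUDGET,
        "latency": p95_latency <= LATENCY_SLA,
        "quality": quality >= QUALITY_FLOOR,
    }

# v0: all GPT-4o
print(check(0.08, 3.2, 0.88))
# -> {'cost': False, 'latency': False, 'quality': True}
```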

Picking the "best" model once and never revisiting is how teams overspend

Pick model
Most capable you can afford
->
Ship it
Move to next feature
->
Never revisit
Budget blown. Latency violated. Quality degraded.
Cost, latency, and quality are not independent dials. They form a tradeoff surface: improving one dimension often means sacrificing another.

Cost, latency, and quality form a surface — improving one means sacrificing another

Cost-Latency-Quality Frontier
The best possible tradeoffs. Every system lives on or below this frontier.
On the Frontier
Improving one dimension requires sacrificing another. These are true tradeoffs worth debating.
Below the Frontier
Dominated configurations — another config is better without being worse. Replace without debate.
The question is not 'which model is best?' — it is 'which position on the frontier satisfies my constraints?'

Quick check: Is GPT-4o-mini dominated by GPT-4o?

GPT-4o-mini is 10x cheaper and twice as fast — but scores 6 points lower on quality. Is mini dominated by GPT-4o? Think about it before advancing.
The answer depends on something you have not considered yet.

A dominated configuration is strictly worse — replace it without debate

DOMINATED
Configuration A is dominated by B if B is better on at least one dimension and no worse on the others.
Whether this is 'dominated' depends on whether the quality floor is a hard constraint.
TRUE TRADEOFF
Both sit on the Pareto frontier. Improving one dimension requires accepting a loss in another.
Constraints break the tie.
Switching from GPT-4o to mini cuts cost by 10x and latency by half — but drops quality 6 points, to 82%, which is 3 points below the 85% floor. Context determines whether a comparison is dominance or a true tradeoff.
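The dominance definition above translates directly to code. A minimal sketch (lower cost and latency are better, higher quality is better; the figures are the deck's GPT-4o vs mini numbers):

```python
# Dominance test between two configurations.
# a dominates b if a is at least as good on every dimension
# and strictly better on at least one.

def dominates(a: dict, b: dict) -> bool:
    at_least_as_good = (a["cost"] <= b["cost"]
                        and a["latency"] <= b["latency"]
                        and a["quality"] >= b["quality"])
    strictly_better = (a["cost"] < b["cost"]
                       or a["latency"] < b["latency"]
                       or a["quality"] > b["quality"])
    return at_least_as_good and strictly_better

gpt4o = {"cost": 0.08,  "latency": 3.2, "quality": 0.88}
mini  = {"cost": 0.008, "latency": 1.6, "quality": 0.82}

# Neither dominates: mini wins cost and latency, GPT-4o wins quality.
print(dominates(mini, gpt4o), dominates(gpt4o, mini))  # False False
```

On the raw dimensions this is a true tradeoff; only a hard quality floor turns it into a settled question.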

Route 80% of queries to cheap models, reserve expensive models for the 20% that need them

Query
Incoming
->
Complexity Filter
Simple or complex?
->
Simple (~80%)
GPT-4o-mini $0.008
or
Complex (~20%)
GPT-4o $0.08
->
Response
To user
Routing uses observable query features (query length, question type) — not an AI quality check per query.
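A minimal routing sketch. The word-count threshold and complexity markers below are illustrative assumptions, not the lesson's actual rule — the point is that the filter reads observable query features only, never a per-query AI quality check:

```python
# Route on observable features: query length and keyword markers.
CHEAP, EXPENSIVE = "gpt-4o-mini", "gpt-4o"

def route(query: str) -> str:
    words = query.lower().split()
    # Markers that suggest complex SQL (illustrative, not exhaustive).
    complex_markers = {"join", "window", "cohort", "percentile", "forecast"}
    if len(words) > 25 or complex_markers & set(words):
        return EXPENSIVE   # ~20% of traffic
    return CHEAP           # ~80% of traffic

print(route("total revenue last month"))                       # gpt-4o-mini
print(route("cohort retention with a 90th percentile cutoff")) # gpt-4o
```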

A $500/month feature cannot have a $5K/month evaluation pipeline

Evaluation Cost Blindness
Routing saves $500/month on inference but requires $2K/month in automated quality checks (judge calls). Net: +$1,500/month.
Constraint-Aware Evaluation
Route on observable features. Validate with a 10-20% sample. Evaluation cost stays inside budget.
Budget constraints affect how many automated quality checks you can afford, which determines metric coverage. Your evaluation strategy must match your product constraints.
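A sketch of how sampling keeps evaluation cost inside budget. The $0.07 judge-call price is an assumption (roughly a frontier-model call); at that price, judging every query costs about the $2K/month this slide warns against:

```python
# Evaluation cost under sampled judging.
QUERIES_PER_MONTH = 30_000
JUDGE_COST = 0.07   # $/judge call (assumed frontier-model price)

def eval_cost(sample_rate: float) -> float:
    return QUERIES_PER_MONTH * sample_rate * JUDGE_COST

print(f"judge every query: ${eval_cost(1.0):,.0f}/mo")   # $2,100/mo
print(f"10% sample:        ${eval_cost(0.10):,.0f}/mo")  # $210/mo
print(f"20% sample:        ${eval_cost(0.20):,.0f}/mo")  # $420/mo
```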

Will the hybrid cut cost by 10%, 50%, or 90%?

The AI Data Analyst v0 uses GPT-4o for all operations: SQL generation, narrative generation, chart decisions. You switch SQL generation only to GPT-4o-mini (10x cheaper). GPT-4o stays for narrative and charts.
  • A. Cost drops ~10% — SQL is a small fraction of total cost
  • B. Cost drops ~50% — SQL is roughly half the work
  • C. Cost drops ~90% — SQL is the dominant cost driver
Write your prediction and reasoning before the next slide.

v0 violates cost by 5x and latency by 60% — but meets the quality floor

Configuration | $/Query | Monthly | p95 Latency | Quality
All GPT-4o (v0) | $0.08 | $2,400 | 3.2s | 88%
All GPT-4o-mini | $0.008 | $240 | 1.6s | 82%
All Gemini Flash | $0.005 | $150 | 1.2s | 80%
No single-model configuration satisfies all three constraints. v0 meets quality but blows cost and latency. Cheaper models meet cost and latency but fail the quality floor.

No single model satisfies all three constraints

v0: ALL GPT-4o
Cost: $0.08/query — FAIL (5x over)
Latency: 3.2s p95 — FAIL (60% over)
Quality: 88% — PASS
GPT-4o-mini
Cost: $0.008/query — PASS
Latency: 1.6s p95 — PASS
Quality: 82% — FAIL (3pt below)
Mini beats v0 on cost and latency but loses on quality: on the raw dimensions that is a true tradeoff, not dominance. The quality floor changes the calculus. Context determines whether a comparison is dominance or a true tradeoff.

The hybrid saves ~40%, not 90% — SQL is only 40% of total compute

Predicted
~90%
Actual
~40%
Why?
SQL = 40%
SQL (switched to mini)
Narrative + charts (still GPT-4o)
10x cheaper per query does not mean 10x cheaper overall. New cost per query = 60% × $0.08 (narrative + charts, unchanged) + 40% × $0.08 ÷ 10 (SQL on mini) = $0.048 + $0.0032 ≈ $0.05. That is a 36% saving (roughly 40%), or ~$860/month, not ~$2,160. The savings only apply to the fraction of compute you switch.
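Working the arithmetic exactly:

```python
# Hybrid savings: switch only the SQL step (40% of compute)
# to a model 10x cheaper, keep the rest on the frontier model.
BASELINE = 0.08            # $/query, all GPT-4o
SQL_SHARE = 0.40           # fraction of compute that is SQL generation
QUERIES_PER_MONTH = 30_000

sql  = BASELINE * SQL_SHARE / 10     # SQL now 10x cheaper
rest = BASELINE * (1 - SQL_SHARE)    # narrative + charts unchanged
hybrid = sql + rest

print(f"hybrid cost:    ${hybrid:.4f}/query")                 # $0.0512/query
print(f"fraction saved: {1 - hybrid / BASELINE:.0%}")         # 36%
print(f"monthly saving: ${(BASELINE - hybrid) * QUERIES_PER_MONTH:,.0f}")  # $864
```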

Benchmark three configurations and fill the Decision Template

Base Version (all students, 18-22 min)
1. Calculate cost/query from v0 traces (many have missing data)
2. Calculate latency distribution (p50, p95, p99)
3. Calculate SQL execution success rate
4. Populate benchmark table (v0, mini, Flash)
5. Identify dominated configurations
6. Fill Model Selection Decision Template
Extend Version (DS/Eng, +10-15 min)
1. Analyze query distribution by question type
2. Define routing rule in pseudocode
3. Compute weighted cost/latency/quality
4. Add routing config row to benchmark table
5. Refine routing rule to meet all constraints
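Extend step 3 can be sketched as a traffic-weighted mix. One caveat worth noticing: the naive weighted quality (83.2% below) undershoots the ~86% a routed hybrid actually achieves, because mini scores higher than its overall 82% on the simple queries it receives. Weighted p95 latency needs the full latency distribution, not a weighted average, so it is omitted here:

```python
# Traffic-weighted cost and quality under an 80/20 routing split.
split   = {"gpt-4o-mini": 0.80,  "gpt-4o": 0.20}
cost    = {"gpt-4o-mini": 0.008, "gpt-4o": 0.08}   # $/query
quality = {"gpt-4o-mini": 0.82,  "gpt-4o": 0.88}   # overall SQL correctness

w_cost = sum(split[m] * cost[m] for m in split)
w_qual = sum(split[m] * quality[m] for m in split)

print(f"weighted cost:    ${w_cost:.4f}/query")  # $0.0224/query
print(f"weighted quality: {w_qual:.1%}")         # 83.2% (naive estimate)
```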

Your benchmark table and decision template are the evidence for Week 4

Model Selection Decision Template
Recommended
Hybrid routing / all mini / renegotiate floor
Constraint Checks
Cost PASS/FAIL · Latency PASS/FAIL · Quality PASS/FAIL
Tradeoff Reasoning
What you sacrifice and why it is acceptable
Risks
What could change this recommendation
What I built
A model selection decision with benchmarks and explicit tradeoff reasoning. This table is the input for Week 4 release criteria.

Three ways teams make bad model decisions

Failure Mode | What Teams Do | What Goes Wrong
Optimize one dimension | Choose cheapest: all mini saves $2,160/mo | Quality drops to 82%, below the 85% floor
Confuse averages with tails | Report "latency improved to 1.5s" (mean) | p95 regressed from 3.2s to 4.1s
Evaluation cost blindness | Routing saves $500/mo, quality checks cost $2K/mo | Net cost: +$1,500/month
All three failures share one root cause: treating dimensions as independent when they interact. The cost-latency-quality framework forces you to check all three simultaneously.
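Failure mode 2 in one sketch: the mean can improve while the tail regresses. The latency values are illustrative, not from the lesson's traces:

```python
import math

def p95(xs: list) -> float:
    """Nearest-rank 95th percentile."""
    xs = sorted(xs)
    rank = math.ceil(0.95 * len(xs))
    return xs[rank - 1]

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

before = [1.8] * 18 + [3.2] * 2   # mean 1.94s, p95 3.2s
after  = [1.3] * 18 + [4.1] * 2   # mean 1.58s, p95 4.1s

print(f"mean: {mean(before):.2f}s -> {mean(after):.2f}s (improved)")
print(f"p95:  {p95(before):.1f}s -> {p95(after):.1f}s  (regressed)")
```

Reporting the mean alone would call this change a win; the 10% of users in the tail got noticeably slower.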

Three tradeoff scenarios that require judgment, not lookup

Scenario 1
A model has mean latency 1.2s, p95 latency 2.8s. Your SLA is "<2s for 95% of queries." Does it meet the SLA?
Scenario 2
Switching from GPT-4o to mini saves $4K/month on inference but costs $1.5K/month in automated quality checks. Net savings? What tradeoff are you making?
Scenario 3
A colleague says "Use the cheapest model that meets our quality floor." What are two risks this heuristic misses?

Four configurations, three constraints, one feasible region

Configuration | Cost | Latency | Quality | Verdict
All GPT-4o | $0.08 | 3.2s | 88% | FAIL cost+lat
All GPT-4o-mini | $0.008 | 1.6s | 82% | FAIL quality
All Gemini Flash | $0.005 | 1.2s | 80% | FAIL quality
Hybrid routing | ~$0.03 | ~1.8s | ~86% | PASS (all 3)
Only hybrid routing lands in the feasible region. No single model satisfies all three constraints — the solution requires splitting your traffic.

Next: What to log and why it matters

You measured cost, latency, and quality from v0 traces — but those traces had missing data. Many were missing token counts, model names, and response times. Your benchmark table is built on incomplete evidence. Next lesson: what every trace needs, what schema to use, and how to design instrumentation that supports the evaluation system you are building.
AI Analyst Lab™ | AI Evals for Product Dev | Week 1 · Lesson 5 | aianalystlab.ai