Scaling instrumentation

Week 2 Lesson 6 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

If you deployed your L2.3 trace spec to 100K queries/day, what cost and operational challenges would you face?

Think about platform costs (ingestion, indexing, querying) and automated evaluators running on every trace.

100K queries/day × $0.02/trace × 30 days = $60,000/month before evaluators

Metric                                     Value
Queries per day                            100,000
Traces per month (full logging)            3,000,000
All-in cost per trace (managed platform)   $0.02
Monthly cost (full logging)                $60,000
Before adding: automated evaluators, retention overhead, and query costs on that data
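The headline figure can be checked with simple arithmetic; a minimal sketch using the assumed values from the table above:

```python
# Cost model using the assumed values from the table above.
queries_per_day = 100_000
days_per_month = 30
cost_per_trace = 0.02        # all-in managed-platform cost, $ per trace

traces_per_month = queries_per_day * days_per_month   # 3,000,000 at full logging
monthly_cost = traces_per_month * cost_per_trace      # $60,000 before evaluators
```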

The Sampling Strategy Design Triangle constrains what's feasible

• Evidence sufficiency — can you catch the failures that matter?
• Cost constraints — platform budget, retention policy
• Operational reliability — logging without slowing down users
A feasible strategy sits at the intersection of all three.

Detection requirements: rare failures drive minimum sample size

Common failures (5-15% base rate)
  • Retrieval errors: 15%
  • Hallucination: 15%
  • SQL syntax: 12%
  • Empty results: 8%
Easy to detect — show up in small samples
Rare failures (0.5-2% base rate)
  • Context loss (multi-turn): 2%
  • Security violations: 1%
  • Numeric overflows: 0.5%
Need 500+ samples for 90% confidence
The rarest type of failure is the one that determines your minimum sampling rate
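A standard back-of-envelope for sample counts: the smallest n with at least a 90% chance of observing one or more instances of a failure with base rate p satisfies 1 − (1 − p)^n ≥ 0.9. A minimal sketch of that formula (note the lesson's table quotes larger, more conservative counts — e.g. 500 rather than ~460 for a 0.5% failure — leaving margin to capture several instances, not just one):

```python
import math

def min_samples(base_rate: float, confidence: float = 0.90) -> int:
    """Smallest n such that P(>= 1 occurrence) = 1 - (1 - p)^n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - base_rate))
```

For a 0.5% failure this gives 460 samples; for a 15% failure, only 15.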

Cost constraints: budget determines max sampling rate

Observability budget: $250/month
÷ all-in cost per trace ($0.02) = 12,500 traces/month affordable
Total traces at full logging: 3,000,000/month
12,500 ÷ 3,000,000 = 0.42% maximum sampling rate
Budget caps sampling at 0.42% — detection needs 8%+ → tradeoff required
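The budget arithmetic above, as a sketch:

```python
budget_per_month = 250.0            # observability budget, $
cost_per_trace = 0.02               # all-in cost per stored trace, $
traces_at_full_logging = 3_000_000  # traces/month at 100% sampling

affordable_traces = budget_per_month / cost_per_trace           # 12,500 traces/month
max_sampling_rate = affordable_traces / traces_at_full_logging  # ~0.42%
```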

Operational constraints: logging can't degrade latency

User Request → Handler (non-blocking) → Response to User
Handler → Background Queue → Background Workers → Storage
The handler only enqueues and returns; the queue waits only at shutdown, draining to storage so everything is saved before exit.
Zero user-facing latency impact.
Logging that blocks users adds 50ms per request — at ~100 requests per user per day, that's 5s of waiting per user per day.
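A minimal sketch of the non-blocking pattern above, using only Python's standard library (all names here are hypothetical — `store` stands in for your trace platform's write call):

```python
import queue
import threading

trace_queue: queue.Queue = queue.Queue(maxsize=10_000)  # bounded queue = backpressure
stored_traces: list = []

def store(trace: dict) -> None:
    stored_traces.append(trace)    # stand-in for the real trace-platform write

def handle_request(request: dict) -> str:
    response = f"answer for {request['q']}"      # the actual user-facing work
    try:
        trace_queue.put_nowait({"request": request, "response": response})
    except queue.Full:
        pass                       # under pressure, drop the trace; never block the user
    return response

def worker() -> None:
    while True:
        trace = trace_queue.get()
        if trace is None:          # shutdown sentinel: queue is drained FIFO first
            break
        store(trace)
```

The bounded queue gives backpressure: when it fills, traces are dropped rather than making users wait.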

Dynamic sampling allocates budget to high-risk periods

• Pre-deploy: 10%
• Deploy (first 48h): 50%
• Steady state: 10%
• Error spike: 25%
• Q4 surge: 15%
Allocate budget where risk is highest
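The schedule above can be expressed as a simple rate selector; a sketch where the field names are assumptions and the rates are this lesson's examples:

```python
def sampling_rate(state: dict) -> float:
    # Conditions checked in priority order; first match wins.
    if state.get("hours_since_deploy", float("inf")) < 48:
        return 0.50    # deploy window
    if state.get("error_rate", 0.0) > 0.05:
        return 0.25    # error spike
    if state.get("seasonal_peak", False):
        return 0.15    # e.g. Q4 surge
    return 0.10        # pre-deploy / steady-state base rate
```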

Trace sampling and eval sampling are two separate decisions

Decision layer                                          Constraint      Example policy         Cost impact
Trace sampling (what to store)                          Platform cost   10% sampling           $6,000/month platform
Eval sampling (which stored traces get quality checks)  Compute cost    50% of stored traces   $5/day automated check calls
Coordinated, not conflated
If you judge 100% of stored traces, raising trace sampling from 10% to 20% silently doubles your automated-check costs too — the two rates must be set together.
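The coupling between the two decisions is easy to see in code (a sketch with assumed names):

```python
def monthly_eval_calls(queries_per_month: int, trace_rate: float, eval_rate: float) -> float:
    """Automated-check volume depends on BOTH sampling decisions."""
    return queries_per_month * trace_rate * eval_rate

# 10% traced, 100% of stored traces judged: 300K eval calls/month.
# Raising trace sampling to 20% doubles eval volume too,
# unless eval_rate is lowered to compensate.
```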

What minimum sampling rate detects a 0.5% failure with 90% confidence?

System: 100,000 queries per day
Failure modes: 10 categories
Rarest failure: 0.5% base rate (how often it occurs)
Target: 90% detection confidence (you'd catch it 9 out of 10 times)

Minimum sampling rate = ?
Write your prediction before running Cell 7

The rarest failure mode (0.5%) requires at least 8% sampling

Failure mode          Base rate   Samples needed (90% confidence)   Min % of queries to log
Retrieval errors      15%         50                                0.05%
Hallucination         15%         50                                0.05%
SQL syntax            12%         60                                0.06%
Context loss          2%          350                               0.35%
Security violations   1%          450                               0.45%
Numeric overflows     0.5%        500                               0.5%
Binding constraint: need 500 samples = 0.5% min sampling. Practical target: 8-10% with margin for variance.
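The last column follows directly from dividing samples needed by daily volume; a sketch using the table's counts:

```python
queries_per_day = 100_000
samples_needed = {        # 90%-confidence counts from the table above
    "retrieval_errors": 50, "hallucination": 50, "sql_syntax": 60,
    "context_loss": 350, "security_violations": 450, "numeric_overflows": 500,
}
min_log_fraction = {k: n / queries_per_day for k, n in samples_needed.items()}
binding_rate = max(min_log_fraction.values())   # 0.005 -> 0.5% minimum sampling
```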

Cost-volume-quality frontier shows 10% as the operating point

[Chart: platform cost (rising linearly, $0 to $60K) and detection coverage (climbing steeply, then saturating near 100%) plotted against sampling rate from 0% to 100%. Operating point marked at 10% sampling: $6,000/month, 95% coverage.]
Feasible: meets detection + cost constraints

Tiered retention (7d/90d/indefinite) cuts costs 60% with minimal loss

Retention strategy                                        Traces stored             Monthly cost   Information preserved
90-day full retention (10% sampled)                       300K traces/month         $6,000         100% — all sampled traces
Tiered: 7d full detail / 90d summaries / trends forever   ~120K trace-equivalents   $2,400         95% — forensic detail for 7d, metrics for 90d, trends indefinitely
60% cost savings, <5% information loss
Forensic debugging needs are highest in first 7 days — after that, aggregated metrics suffice
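The savings arithmetic, as a sketch using the table's assumed figures:

```python
cost_per_trace = 0.02                        # $ per stored trace-equivalent
full_retention = 300_000 * cost_per_trace    # 10% sampled, 90-day full: $6,000/month
tiered = 120_000 * cost_per_trace            # ~120K trace-equivalents: $2,400/month
savings = 1 - tiered / full_retention        # 0.6 -> 60% cost savings
```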

Design a dynamic sampling policy with 3 risk-triggered conditions

Condition 1: Deployment
48h at 50% after v2 deploy
Condition 2: Error Spike
24h at 25% when error rate >5%
Condition 3: Seasonal Peak
12d at 15% during Q4 surge (Dec 20-31)
Each condition specifies: trigger, temporary_rate, duration, rampdown, cost_estimate
Extend version (DS/Eng, +15-20 min)
Advanced version: log more from risky situations — like complex multi-step conversations or when executives are using the tool — and less from simple one-question lookups. This focuses your budget where failures are most likely and most expensive.
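A sketch of the advanced version; the stratum names and rates here are hypothetical examples, not prescribed values:

```python
import random

STRATUM_RATES = {
    "multi_turn": 0.30,       # complex conversations: failures likelier
    "executive_user": 0.50,   # failures here are most expensive
    "simple_lookup": 0.02,    # cheap, low-risk one-question queries
}

def stratum_rate(query: dict) -> float:
    return STRATUM_RATES.get(query.get("stratum"), 0.10)  # fall back to base rate

def should_log(query: dict, rng=random.random) -> bool:
    # rng is injectable so the decision is deterministic under test.
    return rng() < stratum_rate(query)
```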

Sampling Strategy Specification: production-ready config for your infra team

• Base sampling rate: 10%
• Detection coverage table (all modes 90%+)
• Cost breakdown: $6,000/month base + $4,500 dynamic
• Dynamic sampling rules (3 conditions)
• Tiered retention (7d/90d/indefinite)
• Operational requirements (async queue, backpressure)
• Review cadence (quarterly or after traffic changes)
This document controls your largest observability cost lever

Sampling without detection analysis leaves critical failures invisible

Without detection analysis
• Pick 5% sampling (arbitrary)
• 0.5% failure occurs 50 times/day
• Sample captures 2-3 instances
• Result: Not enough for 90% confidence
Critical failure invisible
With detection analysis
• Calculate 500 samples needed for 0.5% failure
• Set 8-10% sampling to capture 500+
• Sample captures 40-50 instances
• Result: 90%+ detection confidence
Failure detected, debuggable

0.3% failure + 5% budget: sample uniform, stratify, increase budget, or accept blindness?

Scenario:
• 10 failure modes (9 at 5-10%, one at 0.3%)
• Need 90% confidence to detect all
• Cost budget allows 5% sampling max
(a) Sample 5% uniformly — simple, but won't detect the 0.3% failure
(b) Stratified sampling — oversample the risky queries
(c) Increase budget — afford a higher sampling rate
(d) Accept the 0.3% failure as invisible — focus on the other nine
Which do you recommend? Justify.

The triangle: evidence, cost, operations — feasible strategy at center

Strategy              Evidence        Cost            Operations
Full logging (100%)   High            Fail ($60K+)    Simple
10% uniform           Medium (90%+)   Pass ($6K)      Simple
Dynamic 5-25%         Medium (90%+)   Pass ($10.5K)   Complex

Next: Grounding evaluation in user value

You've built metrics and instrumentation.
Now: map them to what users actually care about.
AI Analyst Lab | AI Evals for Product Dev | Week 2 Lesson 6 | aianalystlab.ai