Grounding in User Value
Week 3 Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab
Which v1 signals enabled quality measurement impossible in v0?
v0 Instrumentation
Query ID
SQL text
Timestamp
v1 Instrumentation
User edit occurred
Session abandoned
Satisfaction rating
Follow-up count
Which signals unlocked user value measurement?
Technical quality without a connection to user value is measurement theater
SQL Correctness
92%
Retrieval Accuracy
87%
Hallucination Rate
4.2%
PM: "Great — but do users actually find this useful?"
You do not know.
One-third of GenAI users hit incorrect answers despite high test accuracy
25-40%
of enterprise GenAI users report encountering incorrect or misleading outputs
Industry surveys, 2024-2025
Test accuracy: High
✓
No validated connection
Gap
User experience: Frustrated
✗
Leading metrics are proxies — lagging metrics are outcomes
L1
Leading Metrics
Fast · Offline · Proxies
L2
Lagging Metrics
Slow · Production · Outcomes
Examples
SQL correctness
Retrieval accuracy
Hallucination rate
Examples
Edit rate
Abandon rate
Satisfaction
Business metrics are the slowest, highest-confidence layer
L1
Leading Metrics
Fast · Offline · Proxies
L2
Lagging Metrics
Slow · Production · Outcomes
L3
Business Metrics
Slowest · Aggregate · Impact
Examples
SQL correctness
Retrieval accuracy
Hallucination rate
Examples
Edit rate
Abandon rate
Satisfaction
Examples
Analyst time saved
Decision confidence
Support costs
Validate the connection with correlation studies, agreement checks, signal tracking, and audits
Correlation Studies
Check whether a technical metric actually tracks with user behavior
Test-to-Production Agreement
Check if systems that score well in testing also perform well with real users
User Signal Tracking
Capture edits, abandons, follow-ups alongside eval metrics
Periodic Re-Validation
Quarterly checks as system and users evolve
Correlation strength determines decision authority
Correlation strength
r range
Decision authority
Example
Strong
r > 0.7
Blocking metric
(must pass to ship)
Narrative actionability (r=0.82) with satisfaction
Moderate
0.5 < r < 0.7
Optimization metric
(track and improve, don't gate)
Result completeness (r=0.64) with satisfaction
Weak
r < 0.5
Feedback-only
(useful for debugging, not ship decisions)
SQL success (r=0.31) with satisfaction
A blocking metric must have r > 0.7 with at least one lagging metric.
Note: r measures how strongly two metrics move together. 1.0 = perfect positive, 0 = no relationship, −1.0 = perfect inverse.
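The tiers in the table above translate directly into a small policy function. A minimal sketch (the function name is my own; the thresholds are the lesson's):

```python
def decision_authority(r: float) -> str:
    """Map a metric's correlation strength with a lagging metric
    to its decision authority tier."""
    r = abs(r)
    if r > 0.7:
        return "blocking"      # must pass to ship
    if r > 0.5:
        return "optimization"  # track and improve, don't gate
    return "feedback-only"     # useful for debugging, not ship decisions

print(decision_authority(0.82))  # narrative actionability
print(decision_authority(0.64))  # result completeness
print(decision_authority(0.31))  # SQL execution success
```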
Proxy theater: measurement rigor mistaken for decision relevance
"The metric improved. User satisfaction did not move."
Retrieval latency
150ms → 80ms
Dashboard: Green ✓
User satisfaction
3.2 → 3.2
No change
You optimized a proxy that doesn't predict value.
Which metric correlates most with satisfaction?
Metric A: SQL execution success rate
Does the query run without errors?
Metric B: Result set completeness
Does the query return all relevant rows?
Metric C: Narrative actionability
Does the summary help the user make a decision?
Write your predictions before looking at the data.
The metric graveyard: latency, length, and tokens predict nothing
Metric
Correlation with satisfaction
Confidence Range
Verdict
Retrieval latency (ms)
r = 0.08
[-0.06, 0.22]
Dead — no meaningful correlation
Narrative length (chars)
r = 0.12
[-0.02, 0.25]
Dead — longer ≠ better
Total tokens consumed
r = -0.03
[-0.17, 0.11]
Dead — cost has no user signal
These metrics aren't worthless — they inform infrastructure and budgets. But none predict whether users are satisfied.
Note: These r-values are all near zero — no relationship. The confidence ranges all cross zero, which means we can't even be sure the tiny correlations aren't just noise.
Narrative actionability (r=0.82) beats SQL success (r=0.31)
Narrative actionability · r = 0.82
[0.76, 0.87]
Result set completeness · r = 0.64
[0.54, 0.72]
SQL execution success · r = 0.31
[0.17, 0.44]
r = 0.7 — blocking metric threshold
User-facing signals predict satisfaction better than pipeline internals.
One trace shows why user-facing quality predicts satisfaction
Trace t_0042
SQL execution success
1 ✓
Result set completeness
0.70
Narrative actionability
0.90
User satisfaction
4/5 ★
The SQL ran without errors but missed some data. Despite the incomplete results, the narrative was clear and actionable. The user gave it 4/5.
Narrative actionability captures what the user experienced — a helpful answer despite imperfect underlying data.
Build correlation table + leading-to-lagging map with validation plan
Base version
1. Extract user signals (edit_occurred, session_abandoned) from 200 traces
2. Compute 4 correlations: retrieval accuracy vs edits/abandons, chart rendering vs edits/abandons
3. Identify better proxy + fill Leading-to-Lagging Metric Map
Extend version
• Implement Pearson r + bootstrap CI from scratch
• Add power calculations: sample size for 80% power at α=0.05
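For the extend version, all three pieces fit in a short standard-library script. This is one way to do it, not the lab's reference solution: Pearson r from its definition, a percentile bootstrap CI that resamples trace pairs with replacement, and a sample-size estimate via the Fisher z approximation (z critical values hardcoded for α=0.05 two-sided and 80% power):

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r: resample (x, y) pairs
    with replacement, recompute r each time, take the middle 95%."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:  # skip degenerate resamples (zero variance)
            continue
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    lo = rs[int((alpha / 2) * len(rs))]
    hi = rs[int((1 - alpha / 2) * len(rs)) - 1]
    return lo, hi

def sample_size_for_r(r, ):
    """Traces needed to detect correlation r with 80% power at
    two-sided alpha = 0.05, via the Fisher z approximation."""
    z_alpha, z_beta = 1.959964, 0.841621  # Phi^-1(0.975), Phi^-1(0.80)
    c = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z transform of r
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)
```

A useful sanity check: weaker expected correlations demand far more traces, which is why the 200-trace base version is comfortable for strong effects but marginal for weak ones.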
Validated metric map with r=0.82 correlation — release gate evidence
r = 0.82
95% CI [0.76, 0.87]
"I built a leading-to-lagging metric map that connects narrative actionability to user satisfaction with validated correlation r = 0.82, and identified SQL execution success (r = 0.31) as feedback-only — not a release gate."
Release gate threshold
r > 0.7
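One conservative gate policy (an assumption on my part, not a rule stated in the lesson) is to require the whole confidence interval, not just the point estimate, to clear the threshold. A minimal sketch:

```python
def gate_verdict(ci_low, ci_high, threshold=0.7):
    """Classify a metric's release-gate status from its CI
    against the blocking threshold."""
    if ci_low > threshold:
        return "gate"          # whole interval clears the bar
    if ci_high < threshold:
        return "no gate"       # whole interval falls below the bar
    return "inconclusive"      # interval straddles the bar: gather more traces
```

Under this policy, narrative actionability (CI [0.76, 0.87]) earns gate status, SQL execution success (CI [0.17, 0.44]) does not, and a metric whose interval straddles 0.7 stays inconclusive until more data arrives.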
Five grounding failure modes
✗
Assume validity without validation
Ship based on SQL correctness (r=0.31), users abandon at 25%
✗
Optimize proxy theater
Cut latency 150ms→80ms, satisfaction doesn't move
✗
Test-vs-production gap
Test set: 95%, production users: 70% satisfied
✗
Stale metrics
Q1: hallucination matters most. Q3: users now want depth, but you still gate on hallucination
✗
Confuse metric layers
Use satisfaction (lagging) for rapid iteration → 2-week feedback loops
Can you ship at 92% SQL correctness?
Question 1
A team reports: SQL correctness 92%, retrieval 88%, hallucination 3.8%. PM asks: Should we ship? What question must they answer first?
Question 2
The correlation between your best offline metric and user satisfaction is 0.68 — close to but below your 0.70 threshold for a release gate. The confidence range goes from 0.59 to 0.76 — meaning it could be above or below your threshold. Do you trust this for ship decisions?
Question 3
After launch: SQL correctness 92%, hallucination 4% (stable). But abandon rate increases 12%→19%. What does this divergence signal? What should the team do?
Leading-to-lagging chain with validation loop
Leading Metrics
Fast, offline
SQL correctness 92%
Retrieval accuracy 87%
Hallucination 4.2%
Lagging Metrics
Slow, production
Edit rate 22%
Abandon rate 12%
Satisfaction 3.8/5
Business Metrics
Slowest, aggregate
Analyst time saved
Decision confidence
Support costs
Narrative actionability → Satisfaction:
r = 0.82 (strong)
Result completeness → Abandon rate:
r = 0.64 (moderate)
SQL correctness → Edit rate:
r = 0.31 (weak)
r < 0.5 = feedback-only, not release gates
Next: Ground truth sources and regression suite construction
✓
Gold-standard datasets
⊞
Regression suite design
⚗
Synthetic data for edge cases
Metrics that strongly correlate with user value (r > 0.7) become blocking metrics. Once you know which metrics matter, you invest in gold-standard evaluation for those — not every dimension equally.