Grounding in User Value
Week 3 Lesson 1 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab
Which v1 signals enabled quality measurement impossible in v0?
v0 Instrumentation
Query ID
SQL text
Timestamp
v1 Instrumentation
User edit occurred
Session abandoned
Satisfaction rating
Follow-up count
Which signals unlocked user value measurement?
Technical quality without a connection to user value is measurement theater
SQL Correctness
92%
Retrieval Accuracy
87%
Hallucination Rate
4.2%
PM: "Great — but do users actually find this useful?"
You do not know.
One-third of GenAI users hit incorrect answers despite high test accuracy
25-40%
of enterprise GenAI users report encountering incorrect or misleading outputs
Industry surveys, 2024-2025
Test accuracy: High
✓
No validated connection
Gap
User experience: Frustrated
✗
Leading metrics are proxies — lagging metrics are outcomes
L1
Leading Metrics
Fast · Offline · Proxies
L2
Lagging Metrics
Slow · Production · Outcomes
Examples
SQL correctness
Retrieval accuracy
Hallucination rate
Examples
Edit rate
Abandon rate
Satisfaction
Business metrics are the slowest, highest-confidence layer
L1
Leading Metrics
Fast · Offline · Proxies
L2
Lagging Metrics
Slow · Production · Outcomes
L3
Business Metrics
Slowest · Aggregate · Impact
Examples
SQL correctness
Retrieval accuracy
Hallucination rate
Examples
Edit rate
Abandon rate
Satisfaction
Examples
Analyst time saved
Decision confidence
Support costs
Validate the connection with correlation studies, agreement checks, signal tracking, and audits
Correlation Studies
Check whether a technical metric actually tracks with user behavior
Test-to-Production Agreement
Check if systems that score well in testing also perform well with real users
User Signal Tracking
Capture edits, abandons, follow-ups alongside eval metrics
Periodic Re-Validation
Quarterly checks as system and users evolve
Correlation strength determines decision authority
Correlation strength
r range
Decision authority
Example
Strong
r > 0.7
Blocking metric
(must pass to ship)
Narrative actionability (r=0.82) with satisfaction
Moderate
0.5 < r < 0.7
Optimization metric
(track and improve, don't gate)
Result completeness (r=0.64) with satisfaction
Weak
r < 0.5
Feedback-only
(useful for debugging, not ship decisions)
SQL success (r=0.31) with satisfaction
A blocking metric must have r > 0.7 with at least one lagging metric.
Note: r measures how strongly two metrics move together. 1.0 = perfect positive, 0 = no relationship, −1.0 = perfect inverse.
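The tiers in the table above translate directly into a small policy function. A minimal sketch (the function name is my own; the thresholds are the lesson's):

```python
def decision_authority(r: float) -> str:
    """Map a metric's correlation strength with a lagging metric
    to its decision authority tier."""
    r = abs(r)
    if r > 0.7:
        return "blocking"      # must pass to ship
    if r > 0.5:
        return "optimization"  # track and improve, don't gate
    return "feedback-only"     # useful for debugging, not ship decisions

print(decision_authority(0.82))  # narrative actionability
print(decision_authority(0.64))  # result completeness
print(decision_authority(0.31))  # SQL execution success
```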
Proxy theater: measurement rigor mistaken for decision relevance
"The metric improved. User satisfaction did not move."
Retrieval latency
150ms → 80ms
Dashboard: Green ✓
User satisfaction
3.2 → 3.2
No change
You optimized a proxy that doesn't predict value.
Which metric correlates most with satisfaction?
Metric A: SQL execution success rate
Does the query run without errors?
Metric B: Result set completeness
Does the query return all relevant rows?
Metric C: Narrative actionability
Does the summary help the user make a decision?
Write your predictions before looking at the data.
The metric graveyard: latency, length, and tokens predict nothing
Metric
Correlation with satisfaction
Confidence Range
Verdict
Retrieval latency (ms)
r = 0.08
[-0.06, 0.22]
Dead — no meaningful correlation
Narrative length (chars)
r = 0.12
[-0.02, 0.25]
Dead — longer ≠ better
Total tokens consumed
r = -0.03
[-0.17, 0.11]
Dead — cost has no user signal
These metrics aren't worthless — they inform infrastructure and budgets. But none predict whether users are satisfied.
Note: These r-values are all near zero — no relationship. The confidence ranges all cross zero, which means we can't even be sure the tiny correlations aren't just noise.
Narrative actionability (r=0.82) beats SQL success (r=0.31)
Narrative actionability · r = 0.82
[0.76, 0.87]
Result set completeness · r = 0.64
[0.54, 0.72]
SQL execution success · r = 0.31
[0.17, 0.44]
r = 0.7 — blocking metric threshold
User-facing signals predict satisfaction better than pipeline internals.
One trace shows why user-facing quality predicts satisfaction
Trace t_0042
SQL execution success
1 ✓
Result set completeness
0.70
Narrative actionability
0.90
User satisfaction
4/5 ★
The SQL ran without errors but missed some data. Despite the incomplete results, the narrative was clear and actionable. The user gave it 4/5.
Narrative actionability captures what the user experienced — a helpful answer despite imperfect underlying data.
Build correlation table + leading-to-lagging map with validation plan
Base version
1. Extract user signals (edit_occurred, session_abandoned) from 200 traces
2. Compute 4 correlations: retrieval accuracy vs edits/abandons, chart rendering vs edits/abandons
3. Identify better proxy + fill Leading-to-Lagging Metric Map
Extend version
• Implement Pearson r + bootstrap CI from scratch
• Add power calculations: sample size for 80% power at α=0.05
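For the extend version, all three pieces fit in a short standard-library script. This is one way to do it, not the lab's reference solution: Pearson r from its definition, a percentile bootstrap CI that resamples trace pairs with replacement, and a sample-size estimate via the Fisher z approximation (z critical values hardcoded for α=0.05 two-sided and 80% power):

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r: resample (x, y) pairs
    with replacement, recompute r each time, take the middle 95%."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:  # skip degenerate resamples (zero variance)
            continue
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    lo = rs[int((alpha / 2) * len(rs))]
    hi = rs[int((1 - alpha / 2) * len(rs)) - 1]
    return lo, hi

def sample_size_for_r(r, ):
    """Traces needed to detect correlation r with 80% power at
    two-sided alpha = 0.05, via the Fisher z approximation."""
    z_alpha, z_beta = 1.959964, 0.841621  # Phi^-1(0.975), Phi^-1(0.80)
    c = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z transform of r
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)
```

A useful sanity check: weaker expected correlations demand far more traces, which is why the 200-trace base version is comfortable for strong effects but marginal for weak ones.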
Validated metric map with r=0.82 correlation — release gate evidence
r = 0.82
95% CI [0.76, 0.87]
"I built a leading-to-lagging metric map that connects narrative actionability to user satisfaction with validated correlation r = 0.82, and identified SQL execution success (r = 0.31) as feedback-only — not a release gate."
Release gate threshold
r > 0.7
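One conservative gate policy (an assumption on my part, not a rule stated in the lesson) is to require the whole confidence interval, not just the point estimate, to clear the threshold. A minimal sketch:

```python
def gate_verdict(ci_low, ci_high, threshold=0.7):
    """Classify a metric's release-gate status from its CI
    against the blocking threshold."""
    if ci_low > threshold:
        return "gate"          # whole interval clears the bar
    if ci_high < threshold:
        return "no gate"       # whole interval falls below the bar
    return "inconclusive"      # interval straddles the bar: gather more traces
```

Under this policy, narrative actionability (CI [0.76, 0.87]) earns gate status, SQL execution success (CI [0.17, 0.44]) does not, and a metric whose interval straddles 0.7 stays inconclusive until more data arrives.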
Five grounding failure modes
✗
Assume validity without validation
Ship based on SQL correctness (r=0.31), users abandon at 25%
✗
Optimize proxy theater
Cut latency 150ms→80ms, satisfaction doesn't move
✗
Test-vs-production gap
Test set: 95%, production users: 70% satisfied
✗
Stale metrics
Q1: hallucination matters most. Q3: users now want depth, but you still gate on hallucination
✗
Confuse metric layers
Use satisfaction (lagging) for rapid iteration → 2-week feedback loops
Can you ship at 92% SQL correctness?
Question 1
A team reports: SQL correctness 92%, retrieval 88%, hallucination 3.8%. PM asks: Should we ship? What question must they answer first?
Question 2
The correlation between your best offline metric and user satisfaction is 0.68 — close to but below your 0.70 threshold for a release gate. The confidence range goes from 0.59 to 0.76 — meaning it could be above or below your threshold. Do you trust this for ship decisions?
Question 3
After launch: SQL correctness 92%, hallucination 4% (stable). But abandon rate increases 12%→19%. What does this divergence signal? What should the team do?
Leading-to-lagging chain with validation loop
Leading Metrics
Fast, offline
SQL correctness 92%
Retrieval accuracy 87%
Hallucination 4.2%
Lagging Metrics
Slow, production
Edit rate 22%
Abandon rate 12%
Satisfaction 3.8/5
Business Metrics
Slowest, aggregate
Analyst time saved
Decision confidence
Support costs
Narrative actionability → Satisfaction:
r = 0.82 (strong)
Result completeness → Abandon rate:
r = 0.64 (moderate)
SQL correctness → Edit rate:
r = 0.31 (weak)
r < 0.5 = feedback-only, not release gates
Next: Ground truth sources and regression suite construction
✓
Gold-standard datasets
⊞
Regression suite design
⚗
Synthetic data for edge cases
Metrics that strongly correlate with user value (r > 0.7) become blocking metrics. Once you know which metrics matter, you invest in gold-standard evaluation for those — not every dimension equally.