Communicating Impact

Week 6, Lesson 6: AI Evals for Product Dev
Shane Butler, AI Analyst Lab

Which prioritization dimension was hardest to assess objectively?

Think back to L6.3 — your answer before reading.

Your evaluation work dies when communication fails

BEFORE
6 weeks of evaluation work → One-size-fits-all report → No one acts
AFTER
Same evaluation work → Audience-tailored briefs → Decisions move forward
This lesson closes the gap

Same evidence, different translation for each stakeholder

Shared evidence: retrieval quality +53%

Executive: Users complete complex queries in one turn instead of three.
Product: PM/DS users see 70% fewer clarification requests (they get the right answer on the first try); Executive users see 10% fewer.
Engineering: Re-ranker precision@5 (how often the top 5 results are relevant) improved from 0.64 to 0.82 via the new retrieval model; +400ms at p95.

Match claim strength to evidence strength

Overclaimed: "v2 improves quality by 53%"
Properly hedged: "v2 improved quality by 53% (95% CI 48-58% — we're 95% confident the true improvement is between 48% and 58%) for PM/DS users in our 10k-user experiment"

Overclaimed: "Hallucination is significantly reduced"
Properly hedged: "Hallucination rate decreased from 15% to 9% (p < 0.001, highly statistically significant; 5,000 users per group)"

Overclaimed: "Latency slightly increased"
Properly hedged: "Median latency +50ms (within SLA); p95 +400ms, breaching the 2s SLA for 12% of queries"

Overclaimed: "The change is positive overall"
Properly hedged: "Net quality improved for 85% of users; the remaining 15% experienced latency degradation with no quality benefit"

Report segments and tails, not only averages

Percentile Latency increase (ms) What it means
p10 (best 10%) +30ms Best-case users barely notice
p50 (median) +50ms Typical user sees small increase
p90 (worst 10%) +180ms 10% of users see meaningful increase
p95 (worst 5%) +400ms 5% of users see dramatic increase, blowing through SLA
Average: +110ms
Hides the real story
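The pattern above can be sketched in a few lines. This is a minimal illustration on hypothetical right-skewed latency data (the lognormal parameters are made up for demonstration, not taken from the experiment): the mean lands well above the median, which is exactly how an average hides the tail.

```python
import numpy as np

# Hypothetical per-query latency increases (ms) with a heavy right tail.
rng = np.random.default_rng(0)
deltas = rng.lognormal(mean=3.9, sigma=0.8, size=10_000)

mean = deltas.mean()
p10, p50, p90, p95 = np.percentile(deltas, [10, 50, 90, 95])

# The mean sits far above the median because the worst 5-10% dominate it.
print(f"mean +{mean:.0f}ms | p10 +{p10:.0f}ms | p50 +{p50:.0f}ms "
      f"| p90 +{p90:.0f}ms | p95 +{p95:.0f}ms")
```

Reporting p10/p50/p90/p95 per segment, rather than one mean, is what surfaces who actually regresses.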

Which evidence will you lead with in the exec brief?

You have the v2 experiment results: retrieval quality +53%, hallucination rate -40%, median latency +50ms, p95 latency +400ms for worst-case users, cost +10%.
Which piece of evidence will you lead with? Write the first sentence of your impact brief.

Treatment effects differ substantially by segment

Segment Treatment effect (percentage points) 95% CI
PM +12pp 10-14pp
DS +9pp 7-11pp
Engineering +11pp 9-13pp
Executive +2pp -1 to 5pp (not statistically significant)

SLA breach is concentrated in Executive segment

Segment p95 latency (slowest 5%) SLA status (2000ms threshold)
PM 1,800ms Within SLA
DS 1,850ms Within SLA
Engineering 1,750ms Within SLA
Executive 2,400ms BREACHED
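A classifier for this check is trivial to write down. The sketch below (the `sla_status` helper is a hypothetical name, not part of any existing codebase) reproduces the table's verdicts from the per-segment p95 values:

```python
SLA_MS = 2000  # p95 SLA threshold: 2 seconds

def sla_status(p95_ms: float, threshold_ms: float = SLA_MS) -> str:
    """Classify a segment's p95 latency against the SLA threshold."""
    return "BREACHED" if p95_ms > threshold_ms else "Within SLA"

# p95 latency per segment, from the table above (ms)
segments = {"PM": 1800, "DS": 1850, "Engineering": 1750, "Executive": 2400}
for name, p95 in segments.items():
    print(f"{name}: {sla_status(p95)}")
```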

Properly hedged claims with distribution detail

Executive Impact Brief — Key Results
  • Retrieval precision improved by 53% (95% CI 48-58% — we're 95% confident the true improvement is in this range) for PM/DS users (6,000 users). Executive users saw a 2pp improvement (95% CI -1 to 5pp, not statistically significant — the improvement could be zero or even negative).
  • Hallucination rate decreased from 15% to 9% in treatment group (p < 0.001, highly statistically significant; 5,000 users per group).
  • Median latency +50ms (within SLA). P95 latency (slowest 5% of queries) +400ms overall, breaching 2s SLA for 12% of queries — concentrated in Executive segment (p95=2.4s).
  • Cost per query increased ~10% (95% CI: 8-12%) due to re-ranker token overhead.

Three versions of the same evidence

EXEC
Executive
Our AI Data Analyst now handles complex analytical queries in fewer steps for most users, reducing time from question to insight. We are monitoring a latency tradeoff that affects our Executive user segment and expect to resolve it within two weeks.
PROD
Product
PM and DS users see 70% fewer clarification requests and complete tasks faster. Executive users see minimal quality improvement but experience noticeable latency increases on complex queries. Recommendation: prioritize latency optimization before expanding to Executive segment.
ENG
Engineering
The v2 re-ranker (cross-encoder-ms-marco) improved retrieval precision@5 from 0.64 to 0.82. Re-ranking adds 50ms median latency (400ms at p95). Worst case: complex multi-table join queries in Executive segment trigger sequential re-ranking across 8+ candidates, pushing p95 to 2.4s (SLA threshold: 2.0s).

Complete the impact brief and regression update brief

BASE VERSION (25-30 min)
  1. Complete partial impact brief with claim discipline on latency/cost
  2. Add distribution reporting for latency (percentiles by segment)
  3. Write complete regression update brief for engineering team
EXTEND VERSION (+10-15 min)
  1. Compute segment-specific bootstrap CIs
  2. Implement SLA classification function
  3. Write SQL-style repro queries for worst-case latency
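For the extend version, a percentile bootstrap is one standard way to get segment-specific CIs. This is a minimal sketch on synthetic per-user success indicators (the success rates, sample sizes, and the `bootstrap_ci` helper are all illustrative assumptions, not the experiment's actual data):

```python
import numpy as np

def bootstrap_ci(treatment, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a difference in success rates, in pp."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, record the rate difference.
        diffs[i] = (rng.choice(t, t.size).mean()
                    - rng.choice(c, c.size).mean()) * 100
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical per-user success indicators for one segment
rng = np.random.default_rng(1)
control = rng.binomial(1, 0.60, size=1500)    # assumed baseline rate
treatment = rng.binomial(1, 0.72, size=1500)  # assumed treatment rate
lo, hi = bootstrap_ci(treatment, control)
print(f"Effect CI: [{lo:+.1f}pp, {hi:+.1f}pp]")
```

Run once per segment to fill the CI column of the treatment-effect table.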

Executive impact brief and engineering regression update

IMPACT BRIEF
Success framing
Key Results section with hedged claims and CIs
REGRESSION UPDATE
Problem framing
Distribution-level latency detail with segment breakdown
Same evidence, different genre → Portfolio-ready artifacts for any shipped change

Overclaiming destroys credibility

What happens
You write 'v2 improves quality by 53%' → CTO asks 'For whom? Based on what? What's the margin of error?' → You lose credibility for this brief and future ones.
The fix
Use claim discipline template on every metric: past tense verb + specific numbers + scope + confidence interval.

Rewrite overclaimed statements with proper hedging

Scenario 1
Your DS colleague says 'v2 dramatically improves retrieval quality' is too strong. What specific changes would make it properly hedged? Your rewrite must include: verb tense (past vs present), specific numbers, which users, and how confident you are.
Scenario 2
Your experiment shows overall quality +8%, but Executive users (10% of population) experienced -3% quality regression. You're writing an impact brief for the CEO. Do you mention the Executive segment regression? How do you frame it?
Scenario 3
You shipped v2 three days ago. Monitoring shows p95 latency spiked from 1.8s to 2.4s, breaching SLA. Support tickets tripled. Your eng manager asks for a regression update. What are the 3 most critical pieces of information to include first? Rank them by urgency.

Three panels: claim discipline, distribution reporting, two templates

Panel 1: Claim Discipline Spectrum
Overclaimed: "v2 improves quality" (no evidence, no scope)
Partial: "v2 improved quality by 53% in our experiment" (evidence cited, scope missing)
Full discipline: "v2 improved quality by 53% (95% CI 48-58% — we're 95% confident the true improvement is in this range) for PM/DS users in our 10k-user experiment"
Panel 2: Distribution Reporting Pattern
❌ Latency +110ms → Hides who regresses
✓ p10 (best 10%): +30ms | p50 (median): +50ms | p90: +180ms | p95 (worst 5%): +400ms + Executive segment worst-hit → Shows best-case, typical, and worst-case instead of hiding the problem in the average
Panel 3: Two-Template Structure
Impact Brief (success framing) → Executives / Product
Regression Update Brief (problem framing) → Engineering / On-call
Same evidence, different genre
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 6 | aianalystlab.ai