Communicating Impact

Week 6, Lesson 6: AI Evals for Product Dev
Shane Butler, AI Analyst Lab

Which prioritization dimension was hardest to assess objectively?

Think back to L6.3 — your answer before reading.

Your evaluation work dies when communication fails

BEFORE
6 weeks of evaluation work → One-size-fits-all report → No one acts
AFTER
Same evaluation work → Audience-tailored briefs → Decisions move forward
This lesson closes the gap

Same evidence, different translation for each stakeholder

Shared evidence: retrieval quality +53%

Executive: Users complete complex queries in one turn instead of three.
Product: PM/DS users see 70% fewer clarification requests (they get the right answer on the first try); Executive users see 10% fewer.
Engineering: Re-ranker precision@5 (how often the top 5 results are relevant) improved from 0.64 to 0.82 via the new retrieval model; +400ms at p95.

Match claim strength to evidence strength

Overclaimed: "v2 improves quality by 53%"
Properly hedged: "v2 improved quality by 53% (95% CI 48-58% — we're 95% confident the true improvement is between 48% and 58%) for PM/DS users in our 10k-user experiment"

Overclaimed: "Hallucination is significantly reduced"
Properly hedged: "Hallucination rate decreased from 15% to 9% (p < 0.001, highly statistically significant; 5,000 users per group)"

Overclaimed: "Latency slightly increased"
Properly hedged: "Median latency +50ms (within SLA); p95 +400ms, breaching the 2s SLA for 12% of queries"

Overclaimed: "The change is positive overall"
Properly hedged: "Net quality improved for 85% of users; the remaining 15% experienced latency degradation with no quality benefit"

Report segments and tails, not only averages

Percentile Latency increase (ms) What it means
p10 (best 10%) +30ms Best-case users barely notice
p50 (median) +50ms Typical user sees small increase
p90 (worst 10%) +180ms 10% of users see meaningful increase
p95 (worst 5%) +400ms 5% of users see dramatic increase, blowing through SLA
Average: +110ms
Hides the real story
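The pattern above can be sketched in a few lines. This is a minimal illustration on hypothetical right-skewed latency data (the lognormal parameters are made up for demonstration, not taken from the experiment): the mean lands well above the median, which is exactly how an average hides the tail.

```python
import numpy as np

# Hypothetical per-query latency increases (ms) with a heavy right tail.
rng = np.random.default_rng(0)
deltas = rng.lognormal(mean=3.9, sigma=0.8, size=10_000)

mean = deltas.mean()
p10, p50, p90, p95 = np.percentile(deltas, [10, 50, 90, 95])

# The mean sits far above the median because the worst 5-10% dominate it.
print(f"mean +{mean:.0f}ms | p10 +{p10:.0f}ms | p50 +{p50:.0f}ms "
      f"| p90 +{p90:.0f}ms | p95 +{p95:.0f}ms")
```

Reporting p10/p50/p90/p95 per segment, rather than one mean, is what surfaces who actually regresses.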

Which evidence will you lead with in the exec brief?

You have the v2 experiment results: retrieval quality +53%, hallucination rate -40%, median latency +50ms, p95 latency +400ms for worst-case users, cost +10%.
Which piece of evidence will you lead with? Write the first sentence of your impact brief.

Treatment effects differ substantially by segment

Segment Treatment effect (percentage points) 95% CI
PM +12pp 10-14pp
DS +9pp 7-11pp
Engineering +11pp 9-13pp
Executive +2pp -1 to 5pp (not statistically significant)

SLA breach is concentrated in Executive segment

Segment p95 latency (slowest 5%) SLA status (2000ms threshold)
PM 1,800ms Within SLA
DS 1,850ms Within SLA
Engineering 1,750ms Within SLA
Executive 2,400ms BREACHED
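A classifier for this check is trivial to write down. The sketch below (the `sla_status` helper is a hypothetical name, not part of any existing codebase) reproduces the table's verdicts from the per-segment p95 values:

```python
SLA_MS = 2000  # p95 SLA threshold: 2 seconds

def sla_status(p95_ms: float, threshold_ms: float = SLA_MS) -> str:
    """Classify a segment's p95 latency against the SLA threshold."""
    return "BREACHED" if p95_ms > threshold_ms else "Within SLA"

# p95 latency per segment, from the table above (ms)
segments = {"PM": 1800, "DS": 1850, "Engineering": 1750, "Executive": 2400}
for name, p95 in segments.items():
    print(f"{name}: {sla_status(p95)}")
```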

Properly hedged claims with distribution detail

Executive Impact Brief — Key Results
  • Retrieval precision improved by 53% (95% CI 48-58% — we're 95% confident the true improvement is in this range) for PM/DS users (6,000 users). Executive users saw a 2pp improvement (95% CI -1 to 5pp, not statistically significant — the improvement could be zero or even negative).
  • Hallucination rate decreased from 15% to 9% in treatment group (p < 0.001, highly statistically significant; 5,000 users per group).
  • Median latency +50ms (within SLA). P95 latency (slowest 5% of queries) +400ms overall, breaching 2s SLA for 12% of queries — concentrated in Executive segment (p95=2.4s).
  • Cost per query increased ~10% (95% CI: 8-12%) due to re-ranker token overhead.

Three versions of the same evidence

EXEC
Executive
Our AI Data Analyst now handles complex analytical queries in fewer steps for most users, reducing time from question to insight. We are monitoring a latency tradeoff that affects our Executive user segment and expect to resolve it within two weeks.
PROD
Product
PM and DS users see 70% fewer clarification requests and complete tasks faster. Executive users see minimal quality improvement but experience noticeable latency increases on complex queries. Recommendation: prioritize latency optimization before expanding to Executive segment.
ENG
Engineering
The v2 re-ranker (cross-encoder-ms-marco) improved retrieval precision@5 from 0.64 to 0.82. Re-ranking adds 50ms median latency (400ms at p95). Worst case: complex multi-table join queries in Executive segment trigger sequential re-ranking across 8+ candidates, pushing p95 to 2.4s (SLA threshold: 2.0s).

Complete the impact brief and regression update brief

BASE VERSION (25-30 min)
  1. Complete partial impact brief with claim discipline on latency/cost
  2. Add distribution reporting for latency (percentiles by segment)
  3. Write complete regression update brief for engineering team
EXTEND VERSION (+10-15 min)
  1. Compute segment-specific bootstrap CIs
  2. Implement SLA classification function
  3. Write SQL-style repro queries for worst-case latency
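For the extend version, a percentile bootstrap is one standard way to get segment-specific CIs. This is a minimal sketch on synthetic per-user success indicators (the success rates, sample sizes, and the `bootstrap_ci` helper are all illustrative assumptions, not the experiment's actual data):

```python
import numpy as np

def bootstrap_ci(treatment, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a difference in success rates, in pp."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, record the rate difference.
        diffs[i] = (rng.choice(t, t.size).mean()
                    - rng.choice(c, c.size).mean()) * 100
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical per-user success indicators for one segment
rng = np.random.default_rng(1)
control = rng.binomial(1, 0.60, size=1500)    # assumed baseline rate
treatment = rng.binomial(1, 0.72, size=1500)  # assumed treatment rate
lo, hi = bootstrap_ci(treatment, control)
print(f"Effect CI: [{lo:+.1f}pp, {hi:+.1f}pp]")
```

Run once per segment to fill the CI column of the treatment-effect table.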

Executive impact brief and engineering regression update

IMPACT BRIEF
Success framing
Key Results section with hedged claims and CIs
REGRESSION UPDATE
Problem framing
Distribution-level latency detail with segment breakdown
Same evidence, different genre → Portfolio-ready artifacts for any shipped change

Overclaiming destroys credibility

What happens
You write 'v2 improves quality by 53%' → CTO asks 'For whom? Based on what? What's the margin of error?' → You lose credibility for this brief and future ones.
The fix
Use claim discipline template on every metric: past tense verb + specific numbers + scope + confidence interval.

Rewrite overclaimed statements with proper hedging

Scenario 1
Your DS colleague says 'v2 dramatically improves retrieval quality' is too strong. What specific changes would make it properly hedged? Your rewrite must include: verb tense (past vs present), specific numbers, which users, and how confident you are.
Scenario 2
Your experiment shows overall quality +8%, but Executive users (10% of population) experienced -3% quality regression. You're writing an impact brief for the CEO. Do you mention the Executive segment regression? How do you frame it?
Scenario 3
You shipped v2 three days ago. Monitoring shows p95 latency spiked from 1.8s to 2.4s, breaching SLA. Support tickets tripled. Your eng manager asks for a regression update. What are the 3 most critical pieces of information to include first? Rank them by urgency.

Three panels: claim discipline, distribution reporting, two templates

Panel 1: Claim Discipline Spectrum
Overclaimed: "v2 improves quality" (no evidence, no scope)
Partial: "v2 improved quality by 53% in our experiment" (evidence cited, scope missing)
Full discipline: "v2 improved quality by 53% (95% CI 48-58% — we're 95% confident the true improvement is in this range) for PM/DS users in our 10k-user experiment"
Panel 2: Distribution Reporting Pattern
❌ Latency +110ms → Hides who regresses
✓ p10 (best 10%): +30ms | p50 (median): +50ms | p90: +180ms | p95 (worst 5%): +400ms + Executive segment worst-hit → Shows best-case, typical, and worst-case instead of hiding the problem in the average
Panel 3: Two-Template Structure
Impact Brief (success framing) → Executives / Product
Regression Update Brief (problem framing) → Engineering / On-call
Same evidence, different genre
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 6 | aianalystlab.ai