Signals to actions

Week 6 Lesson 2 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What piece of evidence was weakest or missing that would have increased your confidence in the ship decision?

Users complain "the system takes forever now" but all blocking metrics are green

MONITORING DASHBOARD
SQL success rate: 79% (green)
Judge scores: 0.81 (green)
Latency: Within SLA (green)
Hallucination rate: Stable (green)
USER SIGNALS
Support tickets: ×3 (red)
User complaint: "System takes forever now"
User complaint: "Why does it keep asking me to clarify obvious queries?"
Signal-metric divergence
User behavioral signals indicate failure while evaluation metrics report success

When behavioral signals regress but eval metrics pass, trust the users first

Core Diagnostic Rule
When signals and metrics conflict, trust the users first. Then investigate why your metrics missed the problem.
Signal-metric divergence is primarily a metric problem, not a system problem.

Edit rate, reformulation rate, abandonment rate, and escalation rate reveal what metrics miss

Edit rate
Fraction of outputs users modify
Reformulation rate
Fraction of queries followed by clarifying follow-up
Abandonment rate
Fraction of sessions ending without task completion
Escalation rate
Fraction of interactions requesting human support
Instrument behavioral signals as first-class metrics, not afterthoughts.
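The four signals above can be computed directly from session logs. A minimal sketch, assuming each session record carries boolean flags (`edited`, `reformulated`, `completed`, `escalated`) — these field names are hypothetical and depend on your logging schema:

```python
# Sketch: aggregating per-session flags into the four behavioral rates.
# Field names (edited, reformulated, completed, escalated) are assumptions
# about your logging schema, not a fixed standard.

def behavioral_signals(sessions: list[dict]) -> dict[str, float]:
    """Compute edit, reformulation, abandonment, and escalation rates."""
    n = len(sessions)
    if n == 0:
        return {}
    return {
        "edit_rate": sum(s["edited"] for s in sessions) / n,
        "reformulation_rate": sum(s["reformulated"] for s in sessions) / n,
        "abandonment_rate": sum(not s["completed"] for s in sessions) / n,
        "escalation_rate": sum(s["escalated"] for s in sessions) / n,
    }

sessions = [
    {"edited": True,  "reformulated": False, "completed": True,  "escalated": False},
    {"edited": False, "reformulated": True,  "completed": False, "escalated": False},
    {"edited": True,  "reformulated": False, "completed": True,  "escalated": True},
    {"edited": False, "reformulated": False, "completed": True,  "escalated": False},
]
sig = behavioral_signals(sessions)
# edit_rate=0.5, reformulation_rate=0.25, abandonment_rate=0.25, escalation_rate=0.25
```

Emitting these from the same pipeline that logs judge scores is what makes them first-class metrics rather than ad hoc queries.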

The 2x2 matrix identifies when metrics are incomplete, not when the system is broken

Behavioral Signals: High + Eval Metrics: Pass → DIVERGENCE (Metrics Lagging)
Behavioral Signals: High + Eval Metrics: Fail → System Broken
Behavioral Signals: Low + Eval Metrics: Pass → System OK
Behavioral Signals: Low + Eval Metrics: Fail → Metrics Sensitive (false alarm)
→ In the DIVERGENCE quadrant: trust users, investigate metric gaps
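The 2×2 matrix can be encoded as a small triage function. A minimal sketch; the quadrant labels follow the matrix above, and the output strings are illustrative:

```python
# Sketch: classifying a (behavioral signals, eval metrics) pair into the
# 2x2 quadrants. How you decide signals_high / metrics_pass (thresholds,
# baselines) is up to your own calibration.

def quadrant(signals_high: bool, metrics_pass: bool) -> str:
    """Map a signal/metric state to its 2x2 quadrant and recommended action."""
    if signals_high and metrics_pass:
        return "DIVERGENCE: metrics lagging; trust users, investigate metric gaps"
    if signals_high:
        return "System broken: fix the system"
    if metrics_pass:
        return "System OK"
    return "Metrics sensitive: possible false alarm, recalibrate"

print(quadrant(signals_high=True, metrics_pass=True))
```

Note that only one quadrant (high signals, failing metrics) indicates a broken system; the other off-diagonal quadrant indicates a broken metric.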

Six categories of changes you can make to an AI system, not just prompts

Action Surface | What Changes | Example for AI Data Analyst
Prompts | System prompt, few-shot examples | Add conciseness constraints to narrative prompt
Retrieval (context fetching) | Query reformulation, ranking | Adjust ranking threshold to reduce latency
Model config | Model selection, temperature, max tokens | Switch to smaller model for simple queries
UX constraints | Input validation, output filtering | Add "Show SQL" toggle instead of auto-verbose explanations
Guardrails | Refusal logic, confidence thresholds | Ask for clarification below confidence threshold
Data quality | Rubric calibration, oracle expansion | Add readability dimension to narrative judge

"If we [change], we expect [metric] to move by [amount] because [mechanism]"

"If we [change], we expect [metric] to improve by [amount] because [mechanism], validated via [method]."
NOT A HYPOTHESIS
Rewrite the prompt.
HYPOTHESIS
If we add conciseness constraints to the narrative prompt, we expect edit_rate to decrease by 10-15% because less verbose outputs require fewer user edits, validated via re-running the system on 200 saved examples.

(Impact × Confidence) / Effort ranks interventions under resource constraints

Priority = (Impact × Confidence) / Effort
Impact (1-5)
How much does this fix the problem?
Confidence (1-5)
How sure are you this will work?
Effort (1-5)
How much time and complexity?
#1 Priority score: 16
#2 Priority score: 12
#3 Priority score: 9
Use learning value as tiebreaker.
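The priority formula and tiebreaker can be applied mechanically to a backlog of candidate interventions. A minimal sketch, using the three interventions from this lesson; the `learning` scores are illustrative assumptions:

```python
# Sketch: ranking interventions by (impact x confidence) / effort,
# with learning value as tiebreaker. All scores are on a 1-5 scale;
# the learning values here are illustrative.

interventions = [
    {"name": "Add conciseness constraints",        "impact": 4, "confidence": 4, "effort": 1, "learning": 3},
    {"name": "Add readability rubric dimension",   "impact": 4, "confidence": 3, "effort": 1, "learning": 4},
    {"name": "Smaller model for simple queries",   "impact": 3, "confidence": 3, "effort": 1, "learning": 2},
]

def priority(iv: dict) -> float:
    return iv["impact"] * iv["confidence"] / iv["effort"]

# Sort by priority score, then by learning value to break ties.
ranked = sorted(interventions, key=lambda iv: (priority(iv), iv["learning"]), reverse=True)
for i, iv in enumerate(ranked, 1):
    print(f"#{i} {iv['name']}: score {priority(iv):.0f}")
# #1 Add conciseness constraints: score 16
# #2 Add readability rubric dimension: score 12
# #3 Smaller model for simple queries: score 9
```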

Staged testing with pass criteria at each stage prevents "we tried 15 things"

Test set eval (before production)
metric X improves by Y, no regression >0.05
Shadow mode (live traffic, no user impact)
no latency spike >5%, no new failures
Limited experiment
10% exposure, primary metric improves
Full rollout
sustained improvement over 7 days
Rollback trigger
If any stage fails, roll back the change and investigate
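The four stages above form a gate sequence: each stage has a pass check, and the first failure triggers rollback. A minimal sketch; the check expressions mirror the pass criteria listed above, and the measurement keys are hypothetical:

```python
# Sketch: a stage-gated validation cascade. Each stage has a pass check;
# the first failure triggers rollback. Metric names and thresholds are
# illustrative, taken from this lesson's example criteria.

STAGES = [
    ("test_set_eval",      lambda m: m["edit_rate"] < 0.35 and m["regression"] <= 0.05),
    ("shadow_mode",        lambda m: m["latency_increase"] <= 0.05 and m["new_failures"] == 0),
    ("limited_experiment", lambda m: m["edit_rate_delta"] <= -0.10),
    ("full_rollout",       lambda m: m["days_sustained"] >= 7),
]

def run_cascade(measurements: dict[str, dict]) -> str:
    """Advance through stages in order; stop and roll back at the first failure."""
    for name, passes in STAGES:
        if not passes(measurements[name]):
            return f"ROLLBACK at {name}: investigate before retrying"
    return "SHIPPED: all stages passed"

measurements = {
    "test_set_eval":      {"edit_rate": 0.30, "regression": 0.02},
    "shadow_mode":        {"latency_increase": 0.03, "new_failures": 0},
    "limited_experiment": {"edit_rate_delta": -0.12},
    "full_rollout":       {"days_sustained": 7},
}
print(run_cascade(measurements))  # SHIPPED: all stages passed
```

Writing the checks as code forces the pass criteria to be stated before the results come in, which is the point of the cascade.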

Edit rate jumped from 35% to 51%, but SQL success rate held steady — what did the metrics miss?

You shipped v2. Within 48 hours:
• Edit rate: 35% → 51%
• Abandonment rate: 12% → 19%
• SQL success rate: steady at 79%
• Judge scores: unchanged at 0.81
What is the most likely explanation for this pattern?
What did the metrics miss?
Write your hypothesis before continuing.

18% of traces show signal-metric divergence — high edit rate, passing judge scores

NORMAL CLUSTER (82%)
edit_rate: 0.15 – 0.35
judge_score: 0.72 – 0.88
164 / 200 traces — metrics and users agree
vs
DIVERGENCE CLUSTER (18%)
edit_rate: 0.50 – 0.90
judge_score: 0.70 – 0.92
36 / 200 traces — judge says good, users disagree
Cross-tab of edit_rate vs judge_score across 200 v2 traces
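Detecting the divergence cluster is a filter: keep traces whose edit rate is high while the judge score still passes. A minimal sketch; the thresholds (0.50, 0.70) mirror the cluster boundaries above, and the trace fields are hypothetical:

```python
# Sketch: flagging divergence traces (high edit_rate, passing judge score).
# Thresholds mirror the cluster boundaries in this lesson; tune them
# against your own baseline distributions.

def divergence_traces(traces: list[dict],
                      edit_hi: float = 0.50,
                      judge_pass: float = 0.70) -> list[dict]:
    """Return traces where users edited heavily but the judge still passed."""
    return [t for t in traces
            if t["edit_rate"] >= edit_hi and t["judge_score"] >= judge_pass]

traces = [
    {"id": 1, "edit_rate": 0.22, "judge_score": 0.80},  # normal cluster
    {"id": 2, "edit_rate": 0.64, "judge_score": 0.85},  # divergence
    {"id": 3, "edit_rate": 0.71, "judge_score": 0.55},  # broken, judge caught it
]
flagged = divergence_traces(traces)  # only trace 2
```

Drilling into a handful of flagged traces by hand, as the next slide does, is what turns the cluster statistic into a concrete failure-mode hypothesis.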

One divergence trace reveals the gap: the narrative is verbose jargon, and the judge doesn't penalize poor readability

WHAT THE JUDGE SAW
☑ Factual content accurate
☑ All relevant data points included
☑ SQL and chart correct
Score: 0.85
WHAT THE USER SAW
"The mobile signup funnel exhibited a conversion rate trajectory with progressive attrition across sequential engagement touchpoints, culminating in a terminal conversion rate of 12.3%..."
87 words of jargon when user wanted:
"Mobile signup conversion was 12.3% last month."
Judge measures faithfulness and completeness but not readability. Metric is incomplete, not wrong.

Action mapping produces a testable hypothesis with validation plan and priority score

FAILURE MODE
Narrative verbosity
ACTION SURFACE
Prompts + Data quality
HYPOTHESIS
If we add conciseness constraints to the narrative prompt, we expect edit_rate to decrease by 10-15% because users will need fewer edits, validated via re-running the system on 200 saved examples.
VALIDATION PLAN
(1) Test set eval: edit_rate drops to <0.35
(2) Shadow mode: no latency spike
(3) Limited experiment: edit_rate 10-15% lower in treatment
(4) Full rollout: sustained improvement over 7 days
PRIORITY SCORE
(4 × 4) / 1 = 16

Detect divergence in v2 traces, map to interventions, prioritize, propose metric system changes

  • Compute behavioral signals for v2 traces — edit_rate, reformulation_rate, abandonment_rate
  • Detect divergence cases — filter to high signals + passing judge scores
  • Drill into 2-3 divergence examples — write one-sentence hypothesis per example
  • Propose interventions — hypothesis, impact/confidence/effort scores, validation plan
  • Prioritize — compute (impact × confidence) / effort, rank, identify top 3
  • Propose metric system changes — add new metric dimension or recalibrate judge rubric
  • Assemble Findings-to-Actions Plan artifact

Findings-to-Actions Plan with 3 prioritized interventions and validation cascades

Findings-to-Actions Plan
Divergence Summary
Edit rate spike (35% → 51%) revealed narrative verbosity missed by judges — 3 failure modes addressed
Top 3 Priorities
(1) Priority 16: Add conciseness constraints
(2) Priority 12: Add readability rubric dimension
(3) Priority 9: Switch to smaller model for simple queries

Validation Cascades
4 stages, each with pass criteria

Metric System Changes
Add edit_rate + readability dimension to judge rubric
Share-ready artifact: what you fix next, how you'll know it worked

Interventions without testable hypotheses waste engineering time on aimless experimentation

WITHOUT HYPOTHESIS
Timeline: 3 months
15 interventions deployed
→ "some number moved"
→ no basis for deciding whether it worked
WITH HYPOTHESIS
3 interventions
Each with validation cascade:
→ testable hypothesis
→ staged validation
→ evidence-based decision
Writing a hypothesis takes 2 minutes. It saves weeks of aimless experimentation.

What do you do when edit_rate spikes but metrics are steady?

Your monitoring dashboard shows SQL success rate at 79% (steady), judge scores at 0.81 (steady), but edit_rate jumped from 38% to 54% and support tickets doubled.
A colleague says: "The metrics are fine, users are just complaining more."
What is wrong with this reasoning?
What should you do?

Divergence detection → action mapping → prioritization → validation cascade

Divergence Detection
2×2 matrix
DIVERGENCE quadrant
High signals + passing metrics
Trust users, investigate metric gaps
Action Mapping
6 action surfaces:
• Prompts
• Retrieval
• Model config
• UX constraints
• Guardrails
• Data quality
Map failure modes to interventions
Prioritization & Validation
Ranked by priority score:
#1: Score 16
#2: Score 12
#3: Score 9
4-stage validation cascade
Rollback if any stage fails

Next: Prioritization and iteration using evaluation evidence

Map the full evaluation backlog, rank improvement work, and decide what to instrument next.
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 2 | aianalystlab.ai