Signals to actions

Week 6 Lesson 2 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What piece of evidence was weakest or missing that would have increased your confidence in the ship decision?

Users complain "the system takes forever now" but all blocking metrics are green

MONITORING DASHBOARD
SQL success rate: 79% (green)
Judge scores: 0.81 (green)
Latency: Within SLA (green)
Hallucination rate: Stable (green)
USER SIGNALS
Support tickets: ×3 (red)
User complaint: "System takes forever now"
User complaint: "Why does it keep asking me to clarify obvious queries?"
Signal-metric divergence
User behavioral signals indicate failure while evaluation metrics report success

When behavioral signals regress but eval metrics pass, trust the users first

Core Diagnostic Rule
When signals and metrics conflict, trust the users first. Then investigate why your metrics missed the problem.
Signal-metric divergence is primarily a metric problem, not a system problem.

Edit rate, reformulation rate, abandonment rate, and escalation rate reveal what metrics miss

Edit rate
Fraction of outputs users modify
Reformulation rate
Fraction of queries followed by clarifying follow-up
Abandonment rate
Fraction of sessions ending without task completion
Escalation rate
Fraction of interactions requesting human support
Instrument behavioral signals as first-class metrics, not afterthoughts.
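The four signals above can be computed directly from session logs. A minimal sketch, assuming each session record carries boolean flags (`edited`, `reformulated`, `completed`, `escalated`) — these field names are hypothetical and depend on your logging schema:

```python
# Sketch: aggregating per-session flags into the four behavioral rates.
# Field names (edited, reformulated, completed, escalated) are assumptions
# about your logging schema, not a fixed standard.

def behavioral_signals(sessions: list[dict]) -> dict[str, float]:
    """Compute edit, reformulation, abandonment, and escalation rates."""
    n = len(sessions)
    if n == 0:
        return {}
    return {
        "edit_rate": sum(s["edited"] for s in sessions) / n,
        "reformulation_rate": sum(s["reformulated"] for s in sessions) / n,
        "abandonment_rate": sum(not s["completed"] for s in sessions) / n,
        "escalation_rate": sum(s["escalated"] for s in sessions) / n,
    }

sessions = [
    {"edited": True,  "reformulated": False, "completed": True,  "escalated": False},
    {"edited": False, "reformulated": True,  "completed": False, "escalated": False},
    {"edited": True,  "reformulated": False, "completed": True,  "escalated": True},
    {"edited": False, "reformulated": False, "completed": True,  "escalated": False},
]
sig = behavioral_signals(sessions)
# edit_rate=0.5, reformulation_rate=0.25, abandonment_rate=0.25, escalation_rate=0.25
```

Emitting these from the same pipeline that logs judge scores is what makes them first-class metrics rather than ad hoc queries.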

The 2x2 matrix identifies when metrics are incomplete, not when the system is broken

Behavioral Signals: High + Eval Metrics: Pass → DIVERGENCE (Metrics Lagging)
Behavioral Signals: High + Eval Metrics: Fail → System Broken
Behavioral Signals: Low + Eval Metrics: Pass → System OK
Behavioral Signals: Low + Eval Metrics: Fail → Metrics Sensitive (false alarm)
→ In the DIVERGENCE quadrant: trust users, investigate metric gaps
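The 2×2 matrix can be encoded as a small triage function. A minimal sketch; the quadrant labels follow the matrix above, and the output strings are illustrative:

```python
# Sketch: classifying a (behavioral signals, eval metrics) pair into the
# 2x2 quadrants. How you decide signals_high / metrics_pass (thresholds,
# baselines) is up to your own calibration.

def quadrant(signals_high: bool, metrics_pass: bool) -> str:
    """Map a signal/metric state to its 2x2 quadrant and recommended action."""
    if signals_high and metrics_pass:
        return "DIVERGENCE: metrics lagging; trust users, investigate metric gaps"
    if signals_high:
        return "System broken: fix the system"
    if metrics_pass:
        return "System OK"
    return "Metrics sensitive: possible false alarm, recalibrate"

print(quadrant(signals_high=True, metrics_pass=True))
```

Note that only one quadrant (high signals, failing metrics) indicates a broken system; the other off-diagonal quadrant indicates a broken metric.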

Six categories of changes you can make to an AI system, not just prompts

Action Surface | What Changes | Example for AI Data Analyst
Prompts | System prompt, few-shot examples | Add conciseness constraints to narrative prompt
Retrieval (context fetching) | Query reformulation, ranking | Adjust ranking threshold to reduce latency
Model config | Model selection, temperature, max tokens | Switch to smaller model for simple queries
UX constraints | Input validation, output filtering | Add "Show SQL" toggle instead of auto-verbose explanations
Guardrails | Refusal logic, confidence thresholds | Ask for clarification below confidence threshold
Data quality | Rubric calibration, oracle expansion | Add readability dimension to narrative judge

"If we [change], we expect [metric] to move by [amount] because [mechanism]"

"If we [change], we expect [metric] to improve by [amount] because [mechanism], validated via [method]."
NOT A HYPOTHESIS
Rewrite the prompt.
HYPOTHESIS
If we add conciseness constraints to the narrative prompt, we expect edit_rate to decrease by 10-15% because less verbose outputs require fewer user edits, validated via re-running the system on 200 saved examples.

(Impact × Confidence) / Effort ranks interventions under resource constraints

Priority = (Impact × Confidence) / Effort
Impact (1-5)
How much does this fix the problem?
Confidence (1-5)
How sure are you this will work?
Effort (1-5)
How much time and complexity?
#1 Priority score: 16
#2 Priority score: 12
#3 Priority score: 9
Use learning value as tiebreaker.
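The priority formula and tiebreaker can be applied mechanically to a backlog of candidate interventions. A minimal sketch, using the three interventions from this lesson; the `learning` scores are illustrative assumptions:

```python
# Sketch: ranking interventions by (impact x confidence) / effort,
# with learning value as tiebreaker. All scores are on a 1-5 scale;
# the learning values here are illustrative.

interventions = [
    {"name": "Add conciseness constraints",        "impact": 4, "confidence": 4, "effort": 1, "learning": 3},
    {"name": "Add readability rubric dimension",   "impact": 4, "confidence": 3, "effort": 1, "learning": 4},
    {"name": "Smaller model for simple queries",   "impact": 3, "confidence": 3, "effort": 1, "learning": 2},
]

def priority(iv: dict) -> float:
    return iv["impact"] * iv["confidence"] / iv["effort"]

# Sort by priority score, then by learning value to break ties.
ranked = sorted(interventions, key=lambda iv: (priority(iv), iv["learning"]), reverse=True)
for i, iv in enumerate(ranked, 1):
    print(f"#{i} {iv['name']}: score {priority(iv):.0f}")
# #1 Add conciseness constraints: score 16
# #2 Add readability rubric dimension: score 12
# #3 Smaller model for simple queries: score 9
```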

Staged testing with pass criteria at each stage prevents "we tried 15 things"

Test set eval (before production)
metric X improves by Y, no regression >0.05
Shadow mode (live traffic, no user impact)
no latency spike >5%, no new failures
Limited experiment
10% exposure, primary metric improves
Full rollout
sustained improvement over 7 days
Rollback trigger
If any stage fails, roll back the change and investigate
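The four stages above form a gate sequence: each stage has a pass check, and the first failure triggers rollback. A minimal sketch; the check expressions mirror the pass criteria listed above, and the measurement keys are hypothetical:

```python
# Sketch: a stage-gated validation cascade. Each stage has a pass check;
# the first failure triggers rollback. Metric names and thresholds are
# illustrative, taken from this lesson's example criteria.

STAGES = [
    ("test_set_eval",      lambda m: m["edit_rate"] < 0.35 and m["regression"] <= 0.05),
    ("shadow_mode",        lambda m: m["latency_increase"] <= 0.05 and m["new_failures"] == 0),
    ("limited_experiment", lambda m: m["edit_rate_delta"] <= -0.10),
    ("full_rollout",       lambda m: m["days_sustained"] >= 7),
]

def run_cascade(measurements: dict[str, dict]) -> str:
    """Advance through stages in order; stop and roll back at the first failure."""
    for name, passes in STAGES:
        if not passes(measurements[name]):
            return f"ROLLBACK at {name}: investigate before retrying"
    return "SHIPPED: all stages passed"

measurements = {
    "test_set_eval":      {"edit_rate": 0.30, "regression": 0.02},
    "shadow_mode":        {"latency_increase": 0.03, "new_failures": 0},
    "limited_experiment": {"edit_rate_delta": -0.12},
    "full_rollout":       {"days_sustained": 7},
}
print(run_cascade(measurements))  # SHIPPED: all stages passed
```

Writing the checks as code forces the pass criteria to be stated before the results come in, which is the point of the cascade.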

Edit rate jumped from 35% to 51%, but SQL success rate held steady — what did the metrics miss?

You shipped v2. Within 48 hours:
• Edit rate: 35% → 51%
• Abandonment rate: 12% → 19%
• SQL success rate: steady at 79%
• Judge scores: unchanged at 0.81
What is the most likely explanation for this pattern?
What did the metrics miss?
Write your hypothesis before continuing.

18% of traces show signal-metric divergence — high edit rate, passing judge scores

NORMAL CLUSTER (82%)
edit_rate: 0.15 – 0.35
judge_score: 0.72 – 0.88
164 / 200 traces — metrics and users agree
vs
DIVERGENCE CLUSTER (18%)
edit_rate: 0.50 – 0.90
judge_score: 0.70 – 0.92
36 / 200 traces — judge says good, users disagree
Cross-tab of edit_rate vs judge_score across 200 v2 traces
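Detecting the divergence cluster is a filter: keep traces whose edit rate is high while the judge score still passes. A minimal sketch; the thresholds (0.50, 0.70) mirror the cluster boundaries above, and the trace fields are hypothetical:

```python
# Sketch: flagging divergence traces (high edit_rate, passing judge score).
# Thresholds mirror the cluster boundaries in this lesson; tune them
# against your own baseline distributions.

def divergence_traces(traces: list[dict],
                      edit_hi: float = 0.50,
                      judge_pass: float = 0.70) -> list[dict]:
    """Return traces where users edited heavily but the judge still passed."""
    return [t for t in traces
            if t["edit_rate"] >= edit_hi and t["judge_score"] >= judge_pass]

traces = [
    {"id": 1, "edit_rate": 0.22, "judge_score": 0.80},  # normal cluster
    {"id": 2, "edit_rate": 0.64, "judge_score": 0.85},  # divergence
    {"id": 3, "edit_rate": 0.71, "judge_score": 0.55},  # broken, judge caught it
]
flagged = divergence_traces(traces)  # only trace 2
```

Drilling into a handful of flagged traces by hand, as the next slide does, is what turns the cluster statistic into a concrete failure-mode hypothesis.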

One divergence trace reveals the gap: the narrative is verbose jargon, and the judge doesn't penalize poor readability

WHAT THE JUDGE SAW
☑ Factual content accurate
☑ All relevant data points included
☑ SQL and chart correct
Score: 0.85
WHAT THE USER SAW
"The mobile signup funnel exhibited a conversion rate trajectory with progressive attrition across sequential engagement touchpoints, culminating in a terminal conversion rate of 12.3%..."
87 words of jargon when user wanted:
"Mobile signup conversion was 12.3% last month."
Judge measures faithfulness and completeness but not readability. Metric is incomplete, not wrong.

Action mapping produces a testable hypothesis with validation plan and priority score

FAILURE MODE
Narrative verbosity
ACTION SURFACE
Prompts + Data quality
HYPOTHESIS
If we add conciseness constraints to the narrative prompt, we expect edit_rate to decrease by 10-15% because users will need fewer edits, validated via re-running the system on 200 saved examples.
VALIDATION PLAN
(1) Test set eval: edit_rate drops to <0.35
(2) Shadow mode: no latency spike
(3) Limited experiment: edit_rate 10-15% lower in treatment
(4) Full rollout: sustained improvement over 7 days
PRIORITY SCORE
(4 × 4) / 1 = 16

Detect divergence in v2 traces, map to interventions, prioritize, propose metric system changes

  • Compute behavioral signals for v2 traces — edit_rate, reformulation_rate, abandonment_rate
  • Detect divergence cases — filter to high signals + passing judge scores
  • Drill into 2-3 divergence examples — write one-sentence hypothesis per example
  • Propose interventions — hypothesis, impact/confidence/effort scores, validation plan
  • Prioritize — compute (impact × confidence) / effort, rank, identify top 3
  • Propose metric system changes — add new metric dimension or recalibrate judge rubric
  • Assemble Findings-to-Actions Plan artifact

Findings-to-Actions Plan with 3 prioritized interventions and validation cascades

Findings-to-Actions Plan
Divergence Summary
Edit rate spike (35% → 51%) revealed narrative verbosity missed by judges — 3 failure modes addressed
Top 3 Priorities
(1) Priority 16: Add conciseness constraints
(2) Priority 12: Add readability rubric dimension
(3) Priority 9: Switch to smaller model for simple queries

Validation Cascades
4 stages, each with pass criteria

Metric System Changes
Add edit_rate + readability dimension to judge rubric
Share-ready artifact: what you fix next, how you'll know it worked

Interventions without testable hypotheses waste engineering time on aimless experimentation

WITHOUT HYPOTHESIS
Timeline: 3 months
15 interventions deployed
→ "some number moved"
→ no basis for deciding whether it worked
WITH HYPOTHESIS
3 interventions
Each with validation cascade:
→ testable hypothesis
→ staged validation
→ evidence-based decision
Writing a hypothesis takes 2 minutes. It saves weeks of aimless experimentation.

What do you do when edit_rate spikes but metrics are steady?

Your monitoring dashboard shows SQL success rate at 79% (steady), judge scores at 0.81 (steady), but edit_rate jumped from 38% to 54% and support tickets doubled.
A colleague says: "The metrics are fine, users are just complaining more."
What is wrong with this reasoning?
What should you do?

Divergence detection → action mapping → prioritization → validation cascade

Divergence Detection
2×2 matrix
DIVERGENCE quadrant
High signals + passing metrics
Trust users, investigate metric gaps
Action Mapping
6 action surfaces:
• Prompts
• Retrieval
• Model config
• UX constraints
• Guardrails
• Data quality
Map failure modes to interventions
Prioritization & Validation
Ranked by priority score:
#1: Score 16
#2: Score 12
#3: Score 9
4-stage validation cascade
Rollback if any stage fails

Next: Prioritization and iteration using evaluation evidence

Map the full evaluation backlog, rank improvement work, and decide what to instrument next.
AI Analyst Lab | AI Evals for Product Dev | Week 6 Lesson 2 | aianalystlab.ai