Week 4, Lesson 5: Driver analysis

Week 4 · Lesson 5 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Which dimension did you prioritize highest for evaluation (user role, query complexity, or domain) and why?

User Role
analyst, executive, power_user
Priority: ___
Query Complexity
simple, complex
Priority: ___
Domain
sales, inventory, customer
Priority: ___

Which did you rank first in L4.4?

74% SQL correctness passes your 70% threshold — but should you ship or invest two more weeks?

Ship now
✓ 74% > 70% blocking threshold

? But is that good enough?
Wait 2 weeks
? Will improvement move the aggregate?

? Which segment should we fix?

Without decomposition, you either over-invest in tiny segments or ignore fixable problems in critical ones

Over-invest
Multi-hop reasoning: 40% correctness
2% of queries → impact score 0.7
Under-invest
Complex queries: 65% correctness
30% of queries → impact score 2.7
(4x higher)

Driver analysis decomposes overall quality into segment contributions to find which improvements move the needle

Before
Overall SQL correctness: 74%
?
After
Simple: 90%
Complex: 58%
→ Impact score: 8.0

Performance impact score = (baseline metric - segment metric) × segment frequency

impact_score = (baseline - segment_metric) × frequency
Segment | Correctness | Frequency | Impact Score
Simple queries | 90% | 50% | -8.0
Complex queries | 58% | 50% | +8.0

The complex segment accounts for 8 percentage points of drag on the aggregate.
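The formula can be turned into a small helper; a minimal Python sketch using the lesson's worked numbers (`impact_score` is a hypothetical helper that mirrors the pseudocode above):

```python
def impact_score(baseline, segment_metric, frequency):
    """Percentage points of aggregate drag attributable to one segment."""
    return round((baseline - segment_metric) * frequency * 100, 1)

baseline = 0.74                                # overall SQL correctness
simple = impact_score(baseline, 0.90, 0.50)    # -8.0: above baseline, pulls the aggregate up
complex_ = impact_score(baseline, 0.58, 0.50)  # +8.0: the drag to attack first
```

Impact scores across all segments sum to zero when the baseline is the frequency-weighted mean, which is a quick sanity check on your segmentation.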

Three-step workflow: segment and measure, rank by impact, generate intervention hypotheses

1. Segment & measure
Partition by dimension, compute metric per segment, flag N < 30
2. Rank by impact
Calculate impact scores, sort descending
3. Generate hypotheses
For top segments: root cause, intervention, validation
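Steps 1 and 2 can be sketched end-to-end; a minimal Python sketch, assuming traces are dicts with a boolean `correct` field and one label per segmentation dimension (the CI here is a rough normal approximation):

```python
import math
from collections import defaultdict

MIN_N = 30  # flag segments below this sample size as unreliable

def driver_analysis(records, key):
    """Steps 1-2: partition by `key`, compute the metric per segment with a
    rough 95% CI (normal approximation), then rank by impact score.
    `records` is a list of dicts with a boolean 'correct' field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["correct"])

    n_total = len(records)
    baseline = sum(r["correct"] for r in records) / n_total

    rows = []
    for seg, outcomes in groups.items():
        n = len(outcomes)
        p = sum(outcomes) / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        rows.append({
            "segment": seg,
            "n": n,
            "metric": p,
            "ci": (p - half, p + half),
            "impact": (baseline - p) * (n / n_total) * 100,  # percentage points
            "reliable": n >= MIN_N,
        })
    rows.sort(key=lambda r: r["impact"], reverse=True)  # biggest drag first
    return baseline, rows
```

Step 3 stays manual: the top-ranked rows are the segments for which you write root cause, intervention, and validation hypotheses.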

Every finding is a hypothesis that requires validation.

Driver analysis identifies correlation and association, not causation; every finding is a hypothesis

Week 5 experimentation validates the hypotheses you generate in Week 4.

You have 500 SQL queries (250 simple, 250 complex). Overall correctness is 74%. What will you find?

Field 1
Simple query correctness: ___%
Field 2
Complex query correctness: ___%
Field 3
Driver conclusion: [even distribution / complexity drives errors]

Write your prediction before advancing.

Simple queries: 90% correctness. Complex queries: 58% correctness. A 16-point gap from baseline drives the highest impact score

90%
Simple queries
95% CI: [86%, 93%]
N=250
58%
Complex queries
95% CI: [52%, 64%]
N=250
A 16-point gap from the 74% baseline × 50% frequency = 8.0 impact score (the simple-vs-complex gap itself is 32pp)
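The intervals above are consistent with a Wilson score interval; a sketch of that computation (`wilson_ci` is a hypothetical helper name):

```python
import math

def wilson_ci(p, n, z=1.96):
    """95% Wilson score interval for a proportion; more reliable than the
    plain normal approximation at small n or extreme p."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(0.58, 250)  # complex queries: roughly (0.52, 0.64)
```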

When segmented by domain, inventory queries underperform — but is that the real driver?

Segment | Correctness | N | Note
Sales | 78% | 170 |
Inventory | 65% | 160 | Lowest domain
Customer | 80% | 170 |

Single-dimension analysis suggests inventory is the problem. But is that because inventory queries are inherently harder, or because inventory users ask more complex questions?

Complexity hits harder in the inventory domain — that's an interaction effect

Segment | Correctness | N | Note
Sales+Simple | 88% | 85 |
Sales+Complex | 70% | 85 |
Inventory+Simple | 85% | 80 |
Inventory+Complex | 45% | 80 | Combined effect
Customer+Simple | 92% | 85 |
Customer+Complex | 68% | 85 |
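One way to check for the interaction numerically: compare the simple-vs-complex gap within each domain (a sketch using the rates from the table above):

```python
# Correctness by (domain, complexity), taken from the cross-segment table
rates = {
    ("sales", "simple"): 0.88,     ("sales", "complex"): 0.70,
    ("inventory", "simple"): 0.85, ("inventory", "complex"): 0.45,
    ("customer", "simple"): 0.92,  ("customer", "complex"): 0.68,
}

# If complexity affected every domain equally, these gaps would be similar.
gaps = {d: rates[(d, "simple")] - rates[(d, "complex")]
        for d in ("sales", "inventory", "customer")}
# sales ~18pp, customer ~24pp, inventory ~40pp: complexity hits inventory hardest
worst = max(gaps, key=gaps.get)
```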

Segment traces by domain, compute impact scores, check for interaction effects, and write intervention hypotheses

  • Run worked example on SQL correctness by complexity
  • Segment traces by domain (sales, inventory, customer)
  • Compute impact scores for each domain
  • Check whether complexity driver holds within each domain
  • Complete intervention hypothesis template for worst segment
  • Prioritize top 2 segments by expected impact
  • Run verify cell
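For the "compute impact scores for each domain" step, a sketch using the single-dimension domain table from earlier (rates and counts as given there):

```python
# Single-dimension impact scores by domain (rate, N) against the 74% baseline
baseline, n_total = 0.74, 500
domains = {"sales": (0.78, 170), "inventory": (0.65, 160), "customer": (0.80, 170)}

impacts = {d: round((baseline - p) * n / n_total * 100, 2)
           for d, (p, n) in domains.items()}
# inventory carries ~2.9pp of positive drag; sales and customer are negative
```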

Extend: Driver analysis on narrative_quality by user_role, generate 2+ intervention hypotheses, cross-metric comparison plot, time-series decomposition (advanced)

A driver analysis brief identifying complex inventory queries as the highest-impact improvement opportunity

AI Data Analyst v1 Driver Analysis — SQL Correctness
58%
Complex
impact: 8.0
45%
Complex+Inventory
impact: 4.6
Worst finding: Complex+Inventory at 45% (interaction effect)
Complex queries drive the most aggregate drag (8.0). The interaction with inventory domain (45%) pinpoints where to aim the fix.

Over-segmentation creates unreliable estimates. Flag segments below 30 samples as insufficient for action

Failure Mode | What Goes Wrong | Mitigation
Over-segmentation | 20+ slices, some N < 10; confidence intervals too wide to trust | Flag N < 30 as unreliable; merge sparse segments
Confusing correlation with causation | Executives show lower quality, but they ask harder questions | Check interaction effects; segment by dimension pairs
Ignoring segment frequency | Multi-hop at 40% but only 2% of queries | Rank by impact score (gap × frequency), not gap alone
Static analysis on shifting distributions | Month-old data; user behavior has changed | Time-series decomposition to check stability
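A minimal sketch of the over-segmentation mitigation: flag slices under 30 samples and pool them into a catch-all bucket (`flag_and_merge` is a hypothetical helper, not part of any lab code):

```python
MIN_N = 30  # the lesson's reliability threshold

def flag_and_merge(segments, min_n=MIN_N):
    """segments: dict of name -> (n_correct, n_total).
    Keeps segments with enough samples; pools the rest into one
    catch-all bucket so no decision rests on a sliver of data."""
    kept, pooled_correct, pooled_n = {}, 0, 0
    for name, (correct, n) in segments.items():
        if n >= min_n:
            kept[name] = (correct, n)
        else:
            pooled_correct += correct
            pooled_n += n
    if pooled_n:
        kept["other (merged, N<%d)" % min_n] = (pooled_correct, pooled_n)
    return kept
```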

A system hits 81% aggregate. Simple queries: 95% (N=300). Complex: 60% (N=200). Should you delay ship?

81%
Overall correctness
Segment | Correctness | N | Impact Score
Simple | 95% | 300 | -8.4
Complex | 60% | 200 | +8.4

PM asks: Should we delay ship to improve complex query handling?

What evidence supports your recommendation? What additional evidence do you need?
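One way to ground the recommendation: recompute the aggregate implied by the segment rates, and the impact scores from it (a sketch; the weighted aggregate works out to about 81%):

```python
# Segment rates and counts from the scenario: (rate, N)
segments = {"simple": (0.95, 300), "complex": (0.60, 200)}
n_total = sum(n for _, n in segments.values())

# Aggregate implied by the segments: 0.6*95% + 0.4*60% = 81%
baseline = sum(p * n for p, n in segments.values()) / n_total

impacts = {name: round((baseline - p) * n / n_total * 100, 1)
           for name, (p, n) in segments.items()}
# complex contributes ~+8.4pp of drag; its 60% rate (CI roughly +/-7pp at N=200)
# is what a delay would have to move
```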

Driver analysis decision tree: from segmented evaluation data to prioritized intervention list

Evaluation dataset
Segment labels + quality metric (SQL correctness = 74%)
Step 1: Segment and measure
Partition by dimension, compute metric, calculate 95% CI
Sample size >= 30?
No → Merge segments or collect more data
Step 2: Rank by impact
Calculate impact scores, sort descending
Step 3: Generate hypotheses
For top 3 segments: root cause, intervention, validation

Week 5 experimentation → Validate hypothesis → Implement intervention → Re-run driver analysis

Next: Metric specifications, thresholds, baselines, and release criteria

  • What counts as good enough?
  • What baseline should we compare against?
  • When do we block a release?
AI Analyst Lab | AI Evals for Product Dev | Week 4 Lesson 5 | aianalystlab.ai