Interference-aware design

Week 5 Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

What assumption does user-level randomization rely on, and when might that assumption break?

INDEPENDENT USERS
Each user's outcome depends only on their own treatment assignment
SHARED RESOURCE POOL
Users connect to a central cache node — one user's treatment affects others

Your cache helps performance but breaks your experiment

User A (v2)
t=0s, queries "Q4 revenue"
->
Cache populated
v2 schema, 250ms latency
->
User B (v1 control)
t=30s, "Q4 revenue by region"
->
Cache hit (v2)
50ms — control got v2 benefit
Control users got v2 benefits — effect looks smaller than it really is
Cache TTL: 60s. Control users benefit from treatment users' improved retrievals without doing new retrieval work.
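The contamination mechanism above can be sketched as a toy simulation. All rates, the spillover size, and the cache-hit probability below are illustrative assumptions, not the lesson's data: control users who hit a cache entry warmed by a v2 user pick up part of the treatment benefit, so the naive difference-in-means lands below the true effect.

```python
import random

random.seed(0)

TRUE_EFFECT = 0.05     # v2 lifts success rate by 5pp (assumed, illustrative)
BASE_RATE = 0.65       # v1 baseline success rate (assumed)
SPILLOVER = 0.03       # lift a control user gets from a warm v2 cache entry
CACHE_HIT_PROB = 0.4   # chance a control query lands inside the 60s TTL window

def simulate(n=100_000):
    """Naive user-level A/B difference-in-means under cache contamination."""
    treat, ctrl = [], []
    for _ in range(n):
        if random.random() < 0.5:                      # treatment: v2
            treat.append(random.random() < BASE_RATE + TRUE_EFFECT)
        else:                                          # control: v1...
            # ...but sometimes served from a cache a v2 user populated
            boost = SPILLOVER if random.random() < CACHE_HIT_PROB else 0.0
            ctrl.append(random.random() < BASE_RATE + boost)
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

naive = simulate()
print(f"naive A/B estimate: {naive:+.3f}  (true effect: {TRUE_EFFECT:+.3f})")
```

In expectation the naive estimate is roughly TRUE_EFFECT − CACHE_HIT_PROB × SPILLOVER — the contaminated control group drags the measured effect down, which is exactly the underestimation pattern this slide describes.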

SUTVA assumes one user's outcome depends only on their own treatment

SUTVA holds
User A's outcome depends on User A's treatment — not on anyone else's assignment
SUTVA violated
User A's outcome depends on whether Users B, C, D are in treatment or control
Stable Unit Treatment Value Assumption (SUTVA)
The assumption that makes standard A/B testing work. A 'unit' is one user in your experiment. SUTVA says each user's outcome depends only on their own treatment, not on what treatment other users got. When SUTVA breaks, standard A/B testing produces biased estimates.

Shared resources, network effects, temporal carryover violate SUTVA

Mechanism | Example | Interference direction
Shared resources | AI Data Analyst cache: v2 users populate cache, v1 users benefit | Control benefits from treatment → you underestimate the effect
Network effects | Recommendation system: treatment improves collaborative filtering for control | Control benefits from treatment → underestimation
Temporal carryover (effects persist over time) | Model retrains nightly using all users' data | Treatment leaks into control over time

Cache spillover underestimates v2's true impact by 1.9pp

Standard A/B
+2.9pp
CI: [-0.3, +6.1] (crosses zero — not significant)
Control contaminated by cache
Switchback
+4.8pp
CI: [+0.8, +8.8] (excludes zero — significant)
Control uncontaminated
Bias magnitude
1.9pp
Cache spillover cost
Decision changes from "hold for more data" to "ship with confidence"
The bias wasn't a rounding error — it flipped the ship decision. Standard A/B failed because SUTVA was violated.

Switchback eliminates shared-resource interference by switching everyone at once

Monday
All users: v1
Tuesday
All users: v2
Wednesday
All users: v1
Thursday
All users: v2
Friday
All users: v1
No simultaneous treatment and control → no cache contamination
On v2 days, everyone sees v2 and the cache reflects v2 retrievals. On v1 days, everyone sees v1 and the cache reflects v1 retrievals. Treatment effect estimate is unbiased.
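The core of a switchback estimate is a day-level comparison, sketched here with invented day labels and success rates (the course's `switchback_analysis()` presumably wraps this computation plus inference): the unit of analysis is the day, and the point estimate is the difference between v2-day and v1-day means.

```python
# Toy daily switchback log: (day, variant, sql_success_rate). All numbers
# are invented for illustration; the unit of analysis is the day, not the user.
days = [
    ("Mon", "v1", 0.683), ("Tue", "v2", 0.731), ("Wed", "v1", 0.690),
    ("Thu", "v2", 0.728), ("Fri", "v1", 0.686), ("Sat", "v2", 0.735),
]

v1_rates = [r for _, v, r in days if v == "v1"]
v2_rates = [r for _, v, r in days if v == "v2"]

# Switchback point estimate: difference between v2-day and v1-day means.
effect = sum(v2_rates) / len(v2_rates) - sum(v1_rates) / len(v1_rates)
print(f"switchback effect: {effect:+.3f}")  # +0.045 → +4.5pp on this toy data
```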

Day-to-day correlation inflates standard errors

30 DAYS OF DATA
Raw time periods observed
30
~18 INDEPENDENT OBSERVATIONS
After accounting for day-to-day correlation (0.42)
18
Why your sample size is smaller than it looks
Consecutive days are correlated (lag-1 autocorrelation ≈ 0.42) — they're not fully independent. This makes your confidence intervals wider. You need to run the experiment longer than a standard A/B test — but the estimate is valid.
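One way to see the shrinkage: simulate an AR(1) daily series with correlation ~0.42, estimate the lag-1 autocorrelation, and apply an AR(1)-based effective-sample-size formula. The formula below is one common choice, used here as an illustrative assumption — the lesson's ~18 figure may come from a different adjustment, but the qualitative point (positive autocorrelation shrinks the effective n below 30) is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 30 days of a daily metric as an AR(1) process with
# day-to-day correlation ~0.42 (both numbers from the slide;
# the unit noise scale is an arbitrary choice).
n, rho = 30, 0.42
y = np.empty(n)
y[0] = rng.normal()
for t in range(1, n):
    y[t] = rho * y[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

# Estimate lag-1 autocorrelation from the observed series.
r1 = float(np.corrcoef(y[:-1], y[1:])[0, 1])

# One common AR(1)-based formula: n_eff = n * (1 - r1) / (1 + r1).
# With positive autocorrelation, n_eff < n.
n_eff = n * (1 - r1) / (1 + r1)
print(f"lag-1 autocorrelation: {r1:.2f}, effective sample size: {n_eff:.1f} of {n}")
```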

Cluster by how users share resources, not randomly

RANDOM USER GROUPS
Clusters don't align with how users share resources
Doesn't reduce interference
ALIGN WITH SHARED RESOURCES
Clusters by geography capture regional inventory sharing
Example: SF, Chicago, Denver markets
Example: Marketplace with regional inventory → cluster by market
All SF users get v2, all Chicago users get v1. Requires 20-30 clusters for adequate power (DoorDash). If you only have 5 markets, cluster randomization won't work — standard errors too large.
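A sketch of cluster-level assignment — the market names beyond those on the slide and all user counts are hypothetical: randomize whole markets, not users, so everyone sharing a regional resource lands in the same arm.

```python
import random

# Hypothetical marketplace: market → user count (all invented numbers).
markets = {
    "SF": 12_000, "Chicago": 9_000, "Denver": 4_000, "Austin": 6_000,
    "Seattle": 8_000, "Boston": 7_000, "Miami": 5_000, "Portland": 3_000,
}  # in practice you'd want 20-30 clusters, not 8

random.seed(7)
names = sorted(markets)
random.shuffle(names)

# Randomize at the market level: every user in a market gets the same
# variant, so regional inventory sharing can't cross treatment arms.
assignment = {m: ("v2" if i % 2 == 0 else "v1") for i, m in enumerate(names)}

v1_users = sum(markets[m] for m, a in assignment.items() if a == "v1")
v2_users = sum(markets[m] for m, a in assignment.items() if a == "v2")
print(assignment)
print(f"v1 users: {v1_users:,}  v2 users: {v2_users:,}")
```

After assigning, check balance on user counts (and key pre-experiment metrics) across arms — with few, unequal clusters, a single large market can dominate one arm.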

If SUTVA holds, use standard A/B — it's simpler and more powerful

Can one user's treatment
affect another's outcome?
No → Standard A/B
Narrower CIs, easier to detect effects
Yes → Diagnose interference
Spatial, temporal, or both?
Spatial
Switchback or Cluster
Temporal
Switchback + washout
Both
Hybrid design

Will switchback show larger, smaller, or same effect as standard A/B?

Context
The v1-to-v2 change improved retrieval quality. You're about to see results from a switchback experiment with daily switching over 30 days. The standard A/B test from L5.3 showed +2.9pp [95% CI: -0.3, +6.1].
Write your prediction and reasoning before continuing.

Switchback reveals +4.8pp effect — standard A/B missed 1.9pp due to contamination

Design | Effect size | 95% CI | p-value | Valid?
Standard A/B (L5.3) | +2.9pp | [-0.3, +6.1] | 0.07 | No — SUTVA violated
Switchback (this lesson) | +4.8pp | [+0.8, +8.8] | 0.02 | Valid
Cache contamination created 1.9pp downward bias
The valid estimate changes the decision from 'hold for more data' to 'ship with confidence.' What changed is not the system, the users, or the metrics — what changed is the validity of the estimate.

v1 days following v2 show +1.2pp carryover — acceptable for daily switching

v1 days after v1 day
68.3%
SQL success rate
v1 days after v2 day
69.5%
SQL success rate
Carryover (leftover v2 benefit)
+1.2pp
Small vs +4.8pp effect
  • Carryover magnitude: +1.2pp (small relative to +4.8pp effect)
  • Carryover duration: <60s (cache time-to-live is 60 seconds)
  • Decision: Daily switching acceptable
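The carryover test on this slide — split v1 days by the previous day's variant and compare means — can be sketched like this. The daily rates are invented for illustration, chosen to show a small positive carryover:

```python
# Toy switchback log in day order: (variant, sql_success_rate).
# Rates are invented, loosely echoing the lesson's numbers.
log = [
    ("v1", 0.681), ("v2", 0.729), ("v1", 0.695), ("v1", 0.684),
    ("v2", 0.733), ("v1", 0.696), ("v1", 0.683), ("v2", 0.730),
    ("v1", 0.694),
]

# Split v1 days by what ran the day before: leftover v2 benefit
# (e.g., a still-warm cache) shows up as a gap between the groups.
after_v1 = [r for i, (v, r) in enumerate(log)
            if v == "v1" and i > 0 and log[i - 1][0] == "v1"]
after_v2 = [r for i, (v, r) in enumerate(log)
            if v == "v1" and i > 0 and log[i - 1][0] == "v2"]

carryover = sum(after_v2) / len(after_v2) - sum(after_v1) / len(after_v1)
print(f"carryover: {carryover:+.4f}")
```

If the carryover is small relative to the effect (here it is, by construction), daily switching is acceptable; if not, lengthen the switching period or add a washout.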

Run autocorrelation check, compute switchback effect, test carryover, design marketplace clusters

Day-to-day correlation check
Check if consecutive days are correlated (reduces effective sample size) (Base: 5 min)
📊
Switchback analysis
Run switchback_analysis(), record effect + CI (Base: 8 min)
🔄
Carryover test
Filter v1 days by prior treatment, compare means (Base: 7 min)
🗂
Cluster design
Define clusters for marketplace interference, validate quality (Base: 10 min)
Extend version
Implement a resampling method that respects time structure (blocked bootstrap), compute the effective sample size, and test whether carryover differs between v1→v2 and v2→v1 transitions (+15 min)
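For the extend version, a blocked bootstrap resamples contiguous runs of days instead of single days, so each bootstrap replicate keeps the day-to-day correlation. A minimal sketch — block length, replicate count, and the daily effect series are all illustrative choices:

```python
import random

random.seed(1)

def blocked_bootstrap_ci(series, block=3, n_boot=2000, alpha=0.05):
    """Percentile CI for the mean, resampling contiguous blocks of days
    so each bootstrap replicate preserves short-range time structure."""
    n = len(series)
    starts = list(range(n - block + 1))
    means = []
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:
            s = random.choice(starts)          # pick a random block start
            sample.extend(series[s:s + block])  # keep the block contiguous
        means.append(sum(sample[:n]) / n)
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]

# Invented daily effect series (v2 day minus neighboring v1 day, in pp).
daily_effects = [4.1, 5.6, 3.9, 6.2, 4.4, 5.1, 4.9, 3.7, 5.8, 4.5,
                 5.3, 4.0, 6.0, 4.7, 5.2]
lo, hi = blocked_bootstrap_ci(daily_effects)
print(f"blocked-bootstrap 95% CI for mean effect: [{lo:.2f}, {hi:.2f}]pp")
```

Compared with a naive bootstrap over individual days, the blocked version typically yields wider, more honest intervals when days are positively correlated.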

Interference-Aware Experiment Design Brief changes ship decision from hold to ship

📋
System: AI Data Analyst v1-to-v2
SUTVA Diagnostic: Shared cache violates independence
Recommended Design: Switchback (daily switching)
Treatment Effect: +4.8pp [+0.8, +8.8] | Decision: Ship with confidence
Portfolio-ready artifact — decision changes from hold to ship

Assuming SUTVA holds by default is the costliest mistake

DEFAULT ASSUMPTION
Users are independent — standard A/B without SUTVA check
1.9pp bias → wrong decision
SUTVA DIAGNOSTIC FIRST
System architecture → identify shared resources → choose design
Valid estimate → ship
Before designing any experiment: Can one user's treatment affect another's outcome?
Walk through the system architecture. Identify shared resources, shared models, shared caches, network connections, temporal dependencies. If you find any, SUTVA is suspect. Run the diagnostic. The valid estimate is worth the extra complexity.

Model retrains nightly using all users' data — why does this violate SUTVA?

Scenario
Your AI system has a recommendation model that retrains nightly using all users' click data from the previous day. You run a standard A/B test where treatment users get v2 (new ranking algorithm) and control users get v1. Why does this violate SUTVA?
A. Treatment users' improved clicks feed the model that control users see the next day ✓
B. The model is too complex for A/B testing
C. Control users see v2 rankings by mistake
D. The sample size is too small

2x2 matrix: interference type × randomization design

Interference type | Spatial (users share resources at the same time) | Temporal (effects carry over time)
User-level (standard A/B) | Biased estimate — e.g., marketplace with shared supply | Biased estimate — e.g., online learning model
Time/space-level (switchback, cluster) | Valid estimate — e.g., shared cache with daily switching, or regional supply clustered by geography | Valid estimate — e.g., nightly-retrained model with switchback + washout
SUTVA check first: Can one user's treatment affect another's outcome?
If yes, move to bottom row. Diagnose interference type, choose design that matches mechanism.

Next: Launch Readiness

Pre-launch checklist
Metric validation, guardrail thresholds, rollback triggers
📈
Progressive rollout
Ramp schedule, monitoring plan, decision gates
📊
Production monitoring
What to watch, how often, when to rollback
Build a launch readiness brief your team can use for every AI feature ship
AI Analyst Lab | AI Evals for Product Dev | Week 5 Lesson 4 | aianalystlab.ai