Similarity + retrieval metrics

Week 3 Lesson 4 · AI Evals for Product Dev
Shane Butler · AI Analyst Lab

Retrieval quality is a system signal — which signal category does it fall into, and why does that matter for when you measure it in the pipeline?

System signals: measure internal pipeline state before output generation
Output quality signals: measure generated output correctness and quality
User behavior signals: measure user interaction with outputs
Retrieval: system signal measured here
Generation: output quality measured here

When retrieval fails, every downstream component inherits that failure

Retrieval: returns wrong context (desktop vs mobile)
SQL Generation: valid SQL, real tables
Narrative: faithful to SQL results
Final output: chart looks plausible, numbers are real, but describes the wrong user segment
End-to-end metrics show SQL execution success = 0.91, narrative faithfulness = 0.78. System looks fine, but the problem is upstream.
Root cause: Retrieval failure masked by downstream success metrics

Similarity metrics measure closeness — retrieval metrics measure relevance

Similarity metrics: how close are two embeddings?
Examples: cosine similarity, Euclidean distance
Use cases: embedding model selection, threshold tuning
Retrieval ranking metrics: did the system surface the right documents?
Metrics: P@k, R@k, MRR, NDCG
Use cases: retrieval system evaluation, ranking quality assessment
Close does not mean relevant
A query about mobile conversion might have high similarity to a document about desktop conversion because both mention "conversion" — but the document is not relevant.
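The "close does not mean relevant" trap is easy to demonstrate. A minimal sketch, assuming toy 3-dimensional vectors invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: both texts mention "conversion", so they
# land close together even though one covers the wrong segment.
query_mobile = [0.9, 0.8, 0.1]  # "mobile conversion" query (made up)
doc_desktop = [0.8, 0.9, 0.2]   # desktop-conversion doc (made up)

print(round(cosine_similarity(query_mobile, doc_desktop), 2))  # ~0.99
```

The pair scores near-perfect similarity, which is exactly why a cosine threshold alone cannot certify relevance.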

Precision@k: Of the top k retrieved, what fraction are relevant?

Rank 1: doc_schema_users (✓ relevant)
Rank 2: doc_metrics_conversion (✓ relevant)
Rank 3: doc_schema_desktop (✗ irrelevant)
Rank 4: doc_schema_sessions (✓ relevant)
Rank 5: doc_metrics_engagement (✗ irrelevant)
Precision@5 = 3 / 5 = 0.60
High precision = low noise in context window
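The calculation above takes a handful of lines. A sketch using the doc IDs from the example (the helper name precision_at_k is my own, not the lab's API):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

retrieved = ["doc_schema_users", "doc_metrics_conversion",
             "doc_schema_desktop", "doc_schema_sessions",
             "doc_metrics_engagement"]
relevant = {"doc_schema_users", "doc_metrics_conversion",
            "doc_schema_sessions"}

print(precision_at_k(retrieved, relevant, 5))  # 3 relevant in top 5 -> 0.6
```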

Recall@k: Of all relevant documents, what fraction appear in the top k?

All relevant docs in corpus (4 total):
• doc_schema_users
• doc_schema_sessions
• doc_metrics_conversion
• doc_context_mobile
Top k=5 retrieved (3 of 4 relevant docs retrieved):
✓ doc_schema_users
✓ doc_metrics_conversion
✓ doc_schema_sessions
✗ doc_context_mobile (MISSING)
+ 2 irrelevant documents
Recall@5 = 3 / 4 = 0.75
High recall = comprehensive coverage
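Recall@k divides by the size of the full relevant set instead of k. A sketch with the four relevant docs from the example (the two irrelevant filler names are placeholders):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    found = set(retrieved[:k]) & set(relevant)
    return len(found) / len(relevant)

retrieved = ["doc_schema_users", "doc_metrics_conversion",
             "doc_schema_sessions", "doc_filler_a", "doc_filler_b"]
relevant = {"doc_schema_users", "doc_schema_sessions",
            "doc_metrics_conversion", "doc_context_mobile"}

print(recall_at_k(retrieved, relevant, 5))  # 3 of 4 relevant found -> 0.75
```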

MRR optimizes for "first good answer" scenarios

First relevant at rank 1 → MRR = 1.0 (perfect)
First relevant at rank 2 → MRR = 1/2 = 0.50
First relevant at rank 5 → MRR = 1/5 = 0.20 (poor)
Mean Reciprocal Rank (MRR)
Reciprocal rank = 1 / (rank of first relevant document); MRR is the average reciprocal rank across all queries, giving a system-level score. Use it for products where users need one good answer fast.
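Per-query reciprocal rank and its average can be sketched as follows (the three toy queries mirror the rank 1/2/5 cases above; doc IDs are made up):

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document; 0.0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in queries) / len(queries)

queries = [
    (["a", "x", "y"], {"a"}),            # first relevant at rank 1 -> 1.0
    (["x", "b", "y"], {"b"}),            # first relevant at rank 2 -> 0.5
    (["x", "y", "z", "w", "c"], {"c"}),  # first relevant at rank 5 -> 0.2
]
print(mean_reciprocal_rank(queries))  # (1.0 + 0.5 + 0.2) / 3
```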

NDCG ranks highly relevant above partially relevant

Actual ranking (system retrieval order)
Rank 1: relevance 2 (highly relevant)
Rank 2: relevance 0 (irrelevant)
Rank 3: relevance 1 (partial)
Rank 4: relevance 0 (irrelevant)
Rank 5: relevance 0 (irrelevant)
Ideal ranking (perfect ordering, NDCG = 1.0)
Rank 1: relevance 2 (highly relevant)
Rank 2: relevance 1 (partial)
Rank 3: relevance 0 (irrelevant)
Rank 4: relevance 0 (irrelevant)
Rank 5: relevance 0 (irrelevant)
NDCG uses graded relevance + position discounting
Normalized Discounted Cumulative Gain. Highly relevant documents at rank 1 contribute more than at rank 5. Compare actual DCG to ideal DCG, normalize to get NDCG.
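A sketch of the DCG/NDCG computation with a log2(rank + 1) position discount. Note that implementations differ on the gain function (linear rel vs exponential 2^rel - 1), so numbers from other libraries may not match this sketch:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Actual DCG normalized by the DCG of the ideal (sorted) ordering."""
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

# Graded labels from the slide: 2 = highly relevant, 1 = partial, 0 = irrelevant.
actual = [2, 0, 1, 0, 0]
print(round(ndcg(actual), 2))  # ideal order [2, 1, 0, 0, 0] scores exactly 1.0
```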

Choose the metric based on your product's retrieval objective

Product | Required metric | Reason
Legal search tool | High recall@k | Missing a precedent is dangerous
QA chatbot | High MRR | First answer must be right
LLM context window (3 docs) | High precision@k | Every irrelevant document wastes tokens
The metric must match your product requirement

What will precision@3 and recall@3 be for this retrieval result?

Retrieval results for query: "What was mobile checkout conversion in Q4?"
Rank 1: doc2 = doc_schema_mobile_conversion (relevant)
Rank 2: doc5 = doc_metrics_checkout (relevant)
Rank 3: doc1 = doc_schema_desktop (irrelevant)
Context: 3 total relevant documents exist in the corpus. Write your predictions before running the next cell.

Precision@3 and recall@3 both equal 0.667 — but measure different properties

Precision@3 = 2/3 = 0.667 (2 of the top 3 documents are relevant)
Recall@3 = 2/3 = 0.667 (2 of the 3 total relevant documents appear in the top 3)
Precision and recall measure different properties
They only equal each other when k = total relevant documents. Precision measures noise (are retrieved docs relevant?). Recall measures coverage (did you find all relevant docs?).
Increasing k to improve recall typically decreases precision (more irrelevant documents retrieved)
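The k sweep below makes the tradeoff concrete for one hypothetical ranking (doc names are placeholders) with relevant documents at ranks 1, 2, and 6:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one ranked list."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k, hits / len(relevant)

retrieved = ["r1", "r2", "x1", "x2", "x3", "r3", "x4", "x5"]
relevant = {"r1", "r2", "r3"}

for k in (2, 4, 6, 8):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Growing k from 2 to 8 lifts recall from 0.67 to 1.00 while precision falls from 1.00 to 0.38.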

Retrieval metrics computed for a single query across 5 ranked documents

Rank | Document ID | Relevance | Counts for P@3 | MRR (1/rank)
1 | doc_schema_users | 2 (relevant) | Yes | 1.000
2 | doc_schema_sessions | 0 (irrelevant) | No | --
3 | doc_metrics_conversion | 1 (partial) | Yes | --
4 | doc_schema_desktop | 0 (irrelevant) | -- | --
5 | doc_metrics_engagement | 0 (irrelevant) | -- | --
Precision@3 = 0.667
Recall@3 = 0.667
MRR = 1.000
NDCG@5 = 0.82

Build a Retrieval Quality Report for the AI Data Analyst

Base version (all students, 20-25 min)
  • Compute mean P@3, R@3, MRR, NDCG@5 across 300 queries
  • Interpret the precision/recall tradeoff
  • Fill the Retrieval Quality Report template
Extend version (DS/Eng, +10-15 min)
  • Filter to adversarial queries, recompute metrics
  • Identify top 5 queries with largest precision drop
  • Compute schema match rate for oracle queries
  • Show one retrieval miss leading to SQL failure
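Aggregating a per-query metric over an eval set might look like the sketch below; eval_set, the doc names, and the mean_metric helper are illustrative placeholders, not the lab's actual dataset or API:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_metric(eval_set, metric):
    """Average a per-query metric over (retrieved, relevant) pairs."""
    return sum(metric(ret, rel) for ret, rel in eval_set) / len(eval_set)

# Toy stand-in for the 300-query eval set.
eval_set = [
    (["a", "b", "x"], {"a", "b"}),
    (["x", "c", "y"], {"c"}),
    (["d", "x", "e"], {"d", "e"}),
]
mean_p3 = mean_metric(eval_set, lambda ret, rel: precision_at_k(ret, rel, 3))
print(round(mean_p3, 3))
```

The same mean_metric call works for R@3, MRR, or NDCG@5 by swapping the lambda.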

Retrieval Quality Report with aggregate metrics and adversarial segment analysis

Retrieval Metrics: Non-adversarial vs Adversarial Queries
Precision@3: non-adversarial = 0.88, adversarial = 0.63
Adversarial queries show a 25-point drop in precision@3 (0.88 to 0.63)
What you built: Retrieval quality report showing P@3 = 0.88 for non-adversarial queries, dropping to 0.63 for adversarial queries, and 72% of oracle queries missing correct schemas.
This artifact becomes part of your running metric suite alongside LLM judge metrics

Four ways teams misuse retrieval metrics

  • Conflating retrieval and generation failures — measure end-to-end only, can't isolate root cause
  • Choosing precision when you need recall — legal search optimizes for selectivity, misses 60% of relevant cases
  • Ignoring adversarial retrieval failures — aggregate metric masks segment-level degradation
  • Using cosine similarity thresholds without validation — 0.75 threshold is too high for some queries, too low for others

When precision and recall conflict, which metric wins?

Scenario 1
Retrieval system: P@5 = 0.90, R@5 = 0.45 — PM asks: Is this good to ship?
Scenario 2
Strong retrieval metrics (P@3 = 0.85, NDCG@5 = 0.78) but SQL generation still fails — how do you diagnose?
Scenario 3
Model A: higher cosine similarity but lower NDCG@5. Model B: lower cosine similarity but higher NDCG@5 — which metric drives your decision?
Think through these scenarios before the next lesson
What additional context do you need? What framework would you apply? How does product objective influence your answer?

Retrieval Evaluation: Metrics by Product Objective

Axes: relevance type (binary relevant/irrelevant vs graded) and retrieval objective (first good answer vs comprehensive coverage)
• MRR (binary, first good answer): QA systems, simple search
• Recall@k (binary, comprehensive coverage): legal search, research
• Precision@k (binary, limited context windows): every irrelevant document wastes tokens
• NDCG (graded relevance): recommendation, multi-doc synthesis
Two principles
Evaluate retrieval separately from generation. Choose the metric based on your product requirement, not by default.

Next: Semantic metrics with human and model judges

• When exact-match and retrieval metrics aren't enough
• Building LLM-as-judge scorers for semantic quality
• Human-judge baselines and calibration
AI Analyst Lab | AI Evals for Product Dev | Week 3 Lesson 4 | aianalystlab.ai