Final output: Chart looks plausible, numbers are real, but describes the wrong user segment
End-to-end metrics: SQL execution success = 0.91, narrative faithfulness = 0.78. Both are measured after retrieval, so the system looks fine even though the problem is upstream.
Root cause: Retrieval failure masked by downstream success metrics
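
One way to surface this kind of masking is to score retrieval separately from the downstream stages, then list the cases where the downstream checks pass but retrieval picked the wrong segment. The following is a minimal sketch, assuming each evaluated request has a labeled expected segment. `TraceRecord`, its fields, and both helper functions are hypothetical names chosen for illustration, not part of the system described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TraceRecord:
    """One evaluated request. All field names are hypothetical."""
    expected_segment: str     # segment the user actually asked about
    retrieved_segment: str    # segment of the context that retrieval returned
    sql_executed: bool        # generated SQL ran without error
    narrative_faithful: bool  # narrative matches the query result


def stage_metrics(records: List[TraceRecord]) -> Dict[str, float]:
    """Report one score per stage instead of only downstream scores."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    return {
        "retrieval_segment_match": sum(
            r.expected_segment == r.retrieved_segment for r in records
        ) / n,
        "sql_execution_success": sum(r.sql_executed for r in records) / n,
        "narrative_faithfulness": sum(r.narrative_faithful for r in records) / n,
    }


def masked_retrieval_failures(records: List[TraceRecord]) -> List[TraceRecord]:
    """Records where both downstream checks pass but the segment is wrong."""
    return [
        r for r in records
        if r.sql_executed
        and r.narrative_faithful
        and r.expected_segment != r.retrieved_segment
    ]
```

A record returned by `masked_retrieval_failures` matches the failure described above: the SQL runs and the narrative is faithful to the result, yet the result answers a question about a different segment. Reporting `retrieval_segment_match` next to the two downstream scores makes that gap visible.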