Monday: SQL correctness 87%
Posted in Slack, no metadata saved
Wednesday: SQL correctness 83%
Different person, different run, posted in Slack
Question: Is this a regression?
Can't answer: neither run recorded its dataset version, model version, judge config, or reproduction steps
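One way to make the question answerable is to save a small record next to every score. The sketch below is only an illustration: `EvalRun`, `changed_fields`, `to_json`, and every field value except the two scores are hypothetical placeholders, not something taken from the runs above.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class EvalRun:
    """Metadata stored with a score (hypothetical schema)."""
    metric: str
    score: float
    dataset_version: str
    model_version: str
    judge_config: dict
    repro_command: str


def changed_fields(a: EvalRun, b: EvalRun) -> List[str]:
    """List the recorded inputs that differ between two runs."""
    names = ("metric", "dataset_version", "model_version", "judge_config")
    return [n for n in names if getattr(a, n) != getattr(b, n)]


def to_json(run: EvalRun) -> str:
    """Serialize a run so it can be saved or attached to a message."""
    return json.dumps(asdict(run), sort_keys=True)


# Placeholder values: only the scores (0.87, 0.83) come from the notes above.
monday = EvalRun(
    metric="sql_correctness",
    score=0.87,
    dataset_version="dataset-v1",          # assumed
    model_version="model-a",               # assumed
    judge_config={"judge": "judge-x", "temperature": 0.0},  # assumed
    repro_command="run_eval --config eval.yaml",            # assumed
)
wednesday = EvalRun(
    metric="sql_correctness",
    score=0.83,
    dataset_version="dataset-v1",
    model_version="model-b",
    judge_config={"judge": "judge-x", "temperature": 0.0},
    repro_command="run_eval --config eval.yaml",
)

diff = changed_fields(monday, wednesday)
```

With records like these, the drop can be read against what changed. If only `model_version` differs, the 4-point drop is a candidate regression. If the dataset or judge config also differs, the two scores are not directly comparable.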