Guide, topic: gold set drift detection and the monthly relabel ritual

Your bake-off score moved 1.8 points and your users did not notice. The gold set drifted underneath you.

Most teams treat the labeled reference set as ground truth. It is not. It is an instrument that ages, and the rot shows up in three numbers: label flip rate when you blind relabel, Jensen Shannon divergence between gold and 14 days of production traffic, and judge kappa decay against the same pinned snapshot. This is the monthly ritual we ship into client repos to catch all three before they kill a model swap decision.

Matthew Diakonov, Written with AI

Published May 7, 2026 · 12 min read

Direct answer, verified 2026-05-07

Agent eval gold set drift is the slow rot of a labeled reference dataset relative to production reality. It has three measurable signals: label flip rate when the owning engineer blind relabels the same cases 30 days later (under 8 percent is healthy), Jensen Shannon divergence between gold set intents and 14 days of production intents (under 0.4 is healthy), and Cohen's kappa decay between the same pinned judge snapshot run twice on the same cases 30 days apart (under 0.05 is healthy). Two out of three over threshold means the gold set is no longer a trustworthy verdict on model quality and the next bake-off result has an unknown error bar.

8% Label flip ceiling on monthly relabel

0.4 Max gold to production KL divergence

0.05 Max judge kappa decay over 30 days

60 to 90 min Engineer time per monthly ritual

The failure mode this guide is about

The shape is recognizable. A new candidate model lands. The bake-off runs against the gold set that was assembled when the agent first hit production. The new model wins by 1.8 points aggregate, 2.4 on the high-stakes slice. The team ships it. Nothing happens. Tickets do not change. CSAT does not move. P50 latency moves a little, in the wrong direction. Three weeks later somebody asks why the swap mattered and the only answer is that the score went up.

The bake-off was not lying. The gold set was. It still graded cases the agent rarely sees and underweighted intents that have grown in production since the set was assembled. The new model was better at the past. The past is not where users are.

The fix is not a fancier rubric or a bigger judge. It is a recurring monthly check on the gold set itself, with three specific signals and a clear remediation for each. We call it the gold drift card. It runs in 60 to 90 minutes of one engineer's time, regenerates an evals/gold_drift_card.md in the repo, and is signed by the engineer who runs it. CI rejects any model swap PR whose card is more than 35 days old.

Three signals, in the order they fire

Each signal targets a different generator of drift. Run all three; the diagnosis lives in which one trips. A rolled-up single number smears the cause and forces you to reverse engineer the components on the fix side anyway.

Signal 1: label flip rate under 8 percent on a 30 day relabel pass

Hand the same engineer the same gold set 30 days after they last labeled it, blind. Count the cases where their label flips. Healthy gold set: under 8 percent flip rate. Between 8 and 15: the rubric drifted, the engineer's product opinion sharpened, or both, and you owe yourself a rubric rev. Above 15: the gold set is no longer self-consistent and grading model swaps against it is theater.

The flip set itself is the most useful artifact in the whole ritual. Read it case by case. About a third of the flips are usually rubric sharpening (the engineer now distinguishes a class they used to lump), a third are product changes (the upstream flow changed and the right answer is now different), and a third are honest disagreements with their past self. The first two get the rubric updated. The third is a signal that this engineer is too close to the work and a second labeler should sample-grade.
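A minimal sketch of the flip rate computation and the three bands described above. The function names and the dict-of-labels shape are assumptions for illustration, not the CLI that ships in client repos.

```python
def flip_rate(old_labels: dict[str, str], new_labels: dict[str, str]) -> float:
    """Fraction of cases present in both passes whose label changed."""
    shared = old_labels.keys() & new_labels.keys()
    if not shared:
        raise ValueError("no overlapping case IDs between the two label passes")
    flips = sum(1 for case_id in shared if old_labels[case_id] != new_labels[case_id])
    return flips / len(shared)


def flip_band(rate: float) -> str:
    """Map a flip rate onto the 8 and 15 percent bands (band names are hypothetical)."""
    if rate < 0.08:
        return "healthy"
    if rate <= 0.15:
        return "rubric-rev"
    return "not-self-consistent"


def flip_set(old_labels: dict[str, str], new_labels: dict[str, str]) -> list[str]:
    """Case IDs whose label flipped, sorted so the list can be read case by case."""
    shared = old_labels.keys() & new_labels.keys()
    return sorted(c for c in shared if old_labels[c] != new_labels[c])
```

Only cases present in both passes are counted, so additions and retirements between the two passes do not register as flips.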

Signal 2: KL divergence under 0.4 between gold set intents and the trailing 14 days of production

Tag every gold case with an intent label (refund, billing, address change, escalation, long-context legal, etc) and tag a 14 day production sample with the same labels. KL divergence between the two distributions is the staleness number. Under 0.2 is great. 0.2 to 0.4 is fine if the gold set covers the high-stakes tail densely. Over 0.4 means production has moved on without you and the gold set is grading flows users barely run anymore.

We switch to Jensen-Shannon when an intent appears in only one distribution, because raw KL goes to infinity there. Either measure works as long as the threshold is calibrated for the one you pick. The remediation is a remine pass: pull 60 to 100 fresh traces from the trailing 14 days, hand-label them, fold half into the gold set as additions, and retire half of the staler cases the deletion ledger flags. Net gold set size barely moves; the distribution does.

Signal 3: judge kappa decay under 0.05 across 30 days, same gold set, same rubric

Pin a judge snapshot. Run it on the gold set today and write down kappa against the human labels. Run the same pinned snapshot on the same gold set 30 days from now. If kappa drops by more than 0.05, something on the judge side moved that you did not opt into: the snapshot string was an alias the vendor rotated, the rubric prompt was edited mid-month, or the few-shot exemplars were swapped. Catch it before it shows up as a 2 point swing on a model swap bake-off.
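For reference, Cohen's kappa between judge labels and human labels, plus the decay check, can be sketched in a few lines of standard-library Python. The function names are illustrative, and the 0.05 default mirrors the threshold above.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two label sequences over the same cases, in the same order."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        human_counts[label] * judge_counts[label]
        for label in human_counts.keys() | judge_counts.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_decay_tripped(kappa_then: float, kappa_now: float, max_drop: float = 0.05) -> bool:
    """True when kappa dropped by more than max_drop between the two runs."""
    return (kappa_then - kappa_now) > max_drop
```

Only a drop counts as decay in this sketch; a kappa that rises between runs does not trip the check.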

The trick that makes this cheap: the judge run on the unchanged gold set is fully cacheable. Hash the judge model snapshot, the rubric prompt, the few-shot exemplars, and the gold case ID into a cache key. A normal monthly check costs the price of one gold case evaluation per actual change. We have caught two judge-side rotations this way, both vendor alias drift, neither publicly announced.
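One way to build such a cache key, as a sketch: serialize the four inputs deterministically and hash the result. The field names and the choice of SHA-256 over JSON are assumptions, not a description of the shipped harness.

```python
import hashlib
import json


def judge_cache_key(
    snapshot: str, rubric_prompt: str, exemplars: list[str], case_id: str
) -> str:
    """Deterministic key for one judge configuration applied to one gold case."""
    payload = json.dumps(
        {
            "snapshot": snapshot,
            "rubric": rubric_prompt,
            "exemplars": exemplars,  # order matters: reordering few-shots changes the key
            "case_id": case_id,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A key built from the snapshot string cannot see a vendor changing the model behind that string, so a key like this covers prompt and exemplar edits. Detecting alias rotation still depends on comparing a fresh judge run against the stored one.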

Gold set drift is not the same thing as harness drift

These two failures get mixed up online and they are not the same. Gold set drift is the labeled reference cases drifting. Harness drift is the machine that grades them drifting. Both can rot a bake-off in a hidden way; the symptoms and the fixes differ. The comparison below is the cleanest separation we have used to keep the two diagnoses straight.

Feature	Harness drift	Gold set drift
What rots	The harness around the reference cases	The labeled reference cases themselves
Surface where you notice it	Same model run on same cases produces different scores week over week	Bake-off score moves but production user satisfaction does not
Primary signal	Output diff between the same harness run twice on the same inputs	Label flip rate when the owner relabels their own gold set blind
Cadence	Continuous, ideally a CI step that fails fast on harness state changes	Monthly ritual, 60 to 90 minutes of one engineer
Failure mode if ignored	The eval pipeline reports phantom regressions and trust collapses	You ship a model swap that scored higher on cases nobody runs anymore
Where the fix lives	scripts/check-harness-drift.sh + judge prompt + sampler config	evals/gold_set/*.jsonl + evals/gold_drift_card.md

The monthly ritual: three commands

The ritual fits in 60 to 90 minutes for a 60 to 80 case gold set, once a month. Three commands. The names are the names we ship in client repos; rename them to fit your stack.

Mine: pnpm gold:mine

Pulls 60 to 100 fresh traces from the trailing 14 days, deduplicates against the existing gold set on a normalized hash, and writes the candidates to evals/mined/<date>.jsonl. Five minutes of waiting, then five minutes of the engineer skimming the output.
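The deduplication step might look like the sketch below. The normalization rule (lowercase, collapse whitespace) is an assumption; the right rule depends on what counts as "the same case" in your traffic.

```python
import hashlib
import re


def normalized_hash(text: str) -> str:
    """Hash of the input after lowercasing and collapsing whitespace (assumed rule)."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def dedupe_candidates(gold_inputs: list[str], mined_inputs: list[str]) -> list[str]:
    """Keep mined inputs that match neither a gold case nor an earlier mined input."""
    seen = {normalized_hash(text) for text in gold_inputs}
    fresh = []
    for text in mined_inputs:
        digest = normalized_hash(text)
        if digest not in seen:
            seen.add(digest)
            fresh.append(text)
    return fresh
```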

Relabel: pnpm gold:relabel

Opens the entire gold set in a small in-repo TUI, blind to past labels. The engineer relabels everything in one sitting (about 30 minutes for a 67 case set). The CLI computes flip rate at the end and writes evals/labels/<date>.jsonl. Yes, you relabel everything every month. The cost is the point: it forces you to confront how the set has aged.

Diff: pnpm gold:diff

Computes all three signals against the mined sample, the new labels, and a cached 30 day old judge run, then regenerates evals/gold_drift_card.md. Status is one of PASS, FLAGGED, FAIL. The card is signed by the engineer in the same commit. CI rejects model swap PRs whose drift card is more than 35 days old.
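The 35 day freshness gate reduces to a date comparison. A minimal sketch, assuming the card's generation date has already been parsed out of the file (how the date is stored in the card is not specified here):

```python
from datetime import date

MAX_CARD_AGE_DAYS = 35  # from the rule above; tune to your own cadence


def card_is_fresh(
    card_date: date, today: date, max_age_days: int = MAX_CARD_AGE_DAYS
) -> bool:
    """True when the card is at most max_age_days old and not dated in the future."""
    age_days = (today - card_date).days
    return 0 <= age_days <= max_age_days
```

A CI step would call this with the card date and the current date and fail the model swap PR when it returns False. Rejecting future-dated cards is an extra guard added in this sketch.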

What the ritual prints

The terminal output is short on purpose. Three signals, three statuses, one overall verdict. If the run takes more than five minutes of compute, the harness is doing too much and you can cache more aggressively against the past month's judge outputs.

The artifact: `evals/gold_drift_card.md`

One file in the repo, regenerated by the ritual, signed by the engineer. Numbers below are from a real engagement, slice names lightly anonymized.

evals/gold_drift_card.md

Why Jensen Shannon and not raw KL

Production traffic almost always contains an intent the gold set has zero of. A new flow ships, a marketing campaign opens a category, regulation introduces a new dispute type. Raw KL divergence then divides by a zero probability on the gold side and goes to infinity. That is correct as information theory and useless as an alarm threshold. Jensen-Shannon is symmetric, bounded between zero and log 2, and degrades gracefully when the support of one distribution does not cover the other.

The whole computation is small enough to read in one sitting. We keep it under 30 lines so any engineer on the team can audit what the alarm threshold is actually measuring.

evals/gold_drift/kl.py
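A minimal sketch of what such a computation can look like, not the file shipped in client repos. It assumes intents arrive as plain lists of label strings and uses natural logs, so the result lies between 0 and log 2 (about 0.693).

```python
import math
from collections import Counter


def distribution(intents: list[str]) -> dict[str, float]:
    """Empirical intent distribution from a list of intent labels."""
    if not intents:
        raise ValueError("need at least one intent label")
    counts = Counter(intents)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}


def _kl(p: dict[str, float], m: dict[str, float]) -> float:
    """KL(p || m) in nats; m must be positive wherever p is positive."""
    return sum(pv * math.log(pv / m[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q against their mixture."""
    keys = p.keys() | q.keys()
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * _kl(p, mixture) + 0.5 * _kl(q, mixture)
```

Because the mixture is positive wherever either input is, an intent missing from the gold set contributes a finite amount instead of an infinity. Switching `math.log` to `math.log2` rescales the range to 0 to 1, and the alarm threshold would need to move with it.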

What the three signals mean when they trip

The diagnosis follows from which signal fires. Most of the time it is one signal at a time, and the remediation is small and local.

Signal 1 trips alone: the rubric drifted or the engineer's product opinion sharpened. Read the flip set case by case, fold the new opinion into the rubric, regenerate the judge prompt exemplars, re-run signal 3 to confirm the rubric edit did not break judge alignment.
Signal 2 trips alone: production has moved on without you. Run a remine pass, fold 6 to 12 fresh cases on the new high-prevalence intents into the gold set, retire 6 to 12 stale cases via the deletion ledger, re-run signal 1 next month.
Signal 3 trips alone: the judge moved. Either the vendor rotated under the alias (check release notes), the judge prompt was edited mid-month (check git blame on the prompt file), or the few-shot exemplars changed. Pin the snapshot harder, hash the prompt into the cache key, run a one-time recalibration on the gold set with the human labels.
Two of three trip: do not ship a model swap until the card is green. The bake-off result is uninterpretable while two signals are red.
All three trip: the gold set is past its useful life. Schedule a half-day workshop with the engagement owner, rebuild a fresh gold set against the trailing 30 days of production, retire the old one to evals/gold_set/archive/<date>.jsonl, and start the ritual clock over.
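The cases above can be folded into a small status function. The mapping from tripped-signal count to PASS, FLAGGED, or FAIL below is an assumption made for illustration; the text only states that two or more tripped signals should block a model swap.

```python
def card_status(flip_tripped: bool, divergence_tripped: bool, kappa_tripped: bool) -> str:
    """Overall card status from the three per-signal results (assumed mapping)."""
    tripped = sum([flip_tripped, divergence_tripped, kappa_tripped])
    if tripped == 0:
        return "PASS"
    if tripped == 1:
        return "FLAGGED"  # one signal: small, local remediation
    return "FAIL"  # two or three signals: do not ship a model swap


def tripped_names(flip_tripped: bool, divergence_tripped: bool, kappa_tripped: bool) -> list[str]:
    """Names of the tripped signals, so the card shows the cause rather than a rollup."""
    flags = {
        "label_flip_rate": flip_tripped,
        "intent_divergence": divergence_tripped,
        "judge_kappa_decay": kappa_tripped,
    }
    return [name for name, is_tripped in flags.items() if is_tripped]
```

Taking the per-signal booleans as inputs keeps each threshold comparison next to the signal it belongs to.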

Where the thresholds came from

None of these numbers fell out of a paper. They are calibrated against six client engagements through 2025 where we ran the ritual monthly and recorded the score against whether a shipped model swap survived two weeks in production without a rollback. The healthy band is the band where swaps survived; the unhealthy band is the band where they did not.

The thresholds are working numbers, not laws. Tighten them for safety-critical domains, loosen them where the gold set is unusually small or unusually large. The discipline is picking a number, writing it into the card, and noticing when it crosses. The number can be wrong; writing it down is what stops the drift from being invisible.

What this guide is not

It is not a substitute for a trace pipeline. The gold drift card grades the gold set; the trace pipeline grades production. The two together are the trust system. The card alone tells you the eval set is healthy and the agent could still be failing in flows the gold set never sampled. The pipeline alone tells you the agent is failing and gives you no leading indicator. The downstream trust card and the related regression eval set ratchet cover the rest of the system the gold drift card slots into.

It is also not the same problem as harness drift. The harness drift article covers the symmetric failure where the grading machine moves and the cases stay still. Both rituals are cheap. Run both.

Where fde10x fits

We are a forward deployed ML engineering studio. Named senior engineers go inside the client's GitHub, Slack, and standup in week one, ship a working agent prototype in client staging by end of week two, and hand off the runbook, the eval harness, the gold drift card, and the trust card on week six. The card lives in the client repo from the first commit. The client owns the gold set, the rubric, the thresholds, and the deletion ledger. We are model vendor neutral, so the judge can be any combination of Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open weight. No platform license, no vendor attached runtime.

Plenty of teams build this themselves. The embed is the right call when the team is currently arguing about model swaps in three Slack threads, has rolled back a swap inside a quarter, or has a senior MLE hiring gap that is blocking the bake-off cadence from running monthly.

Gold set drift, the three signals, and the monthly ritual, answered

What does agent eval gold set drift actually mean?

It is the slow rot of the labeled reference cases your eval harness uses as ground truth. The model has not changed, the harness has not changed, but the gold set is no longer a faithful representation of what the agent has to handle. Three things move underneath you: production traffic shifts so the gold set covers flows users barely run, the engineer's own product opinion sharpens so they would label some cases differently today, and the judge model rotates under a vendor alias so the same rubric grades the same cases differently. Any of the three breaks the link between the bake-off score and the truth.

How is this different from data drift, concept drift, or model drift?

Those terms are about the world or the model moving. Data drift means production inputs no longer match training distribution. Concept drift means the relationship between inputs and correct outputs changed. Model drift means the deployed model is now stale relative to a newer baseline. Gold set drift is the labeled reference dataset that grades all of that getting stale itself. None of the standard data drift detectors look at it because it sits behind them, in the eval harness. The gold set is your measuring stick, and a measuring stick has its own decay curve.

Why three signals instead of one?

Because the three drifts have different generators and need different remediations. Label flip rate measures rubric sharpening and product change. KL divergence against production measures coverage staleness. Judge kappa decay measures judge-side rotation. A single number rolling all three up would smear the diagnosis and tell you the gold set is bad without telling you which knob to turn. We tried the rolled-up version on two engagements in 2025 and the engineers ended up reverse engineering the components anyway, so we just shipped the components.

Why blind relabel and not a diff against the existing labels?

Anchoring. Anyone who can see the previous label converges on it within seconds. Blind relabel forces the engineer to grade the case as if they had never seen it, and the divergence between past self and present self is exactly the rubric drift you are trying to measure. Yes, it costs more time. About 30 minutes for a 67 case set, once a month. The cost is the gate against the failure mode where the gold set silently drifts because nobody made themselves regrade it.

Why 8 percent on the flip rate threshold? Where does that number come from?

From relabel runs on six client gold sets in the back half of 2025, all in the 50 to 90 case range, all owned by the engineer who originally labeled them. Healthy ones came in at 3 to 7 percent. The two that crossed 10 percent both turned out to have rubric edits in the previous month that the engineer had not internalized. The number is a working ceiling, not a law. Tune it for your domain. Customer support tends to come in lower, anything legal or medical comes in higher because the rubric has more discretion built in.

Why 0.4 on KL divergence and why Jensen-Shannon?

Production routinely contains an intent the gold set has none of. Raw KL goes to infinity in that case, which is correct on paper and useless as an alarm threshold. Jensen-Shannon is symmetric, bounded, and degrades gracefully. The 0.4 number is calibrated against the same six engagements: a healthy gold set running JS at 0.15 to 0.30 against trailing 14 day production. Above 0.4 we have always found at least one new high-traffic intent the gold set was missing. If your domain has a long tail of rare intents, you can lift the threshold to 0.5, but you also commit to a tighter sampling cadence on the high-stakes intents to compensate.

Why is judge kappa decay measured against the same snapshot string?

Because a real judge change should be deliberate. If you pinned to claude-sonnet-4-5-2026-04-15 and a month later the same snapshot string grades the same cases differently, something off-protocol moved. Either the vendor rotated under the alias (it happens, two of our 2025 engagements caught it this way), the judge prompt got edited without bumping its hash, or the few-shot exemplars were swapped. The 0.05 threshold is small enough to catch real rotations and large enough to ignore the noise of a sampled judge call where temperature is not zero. If you are running judge at temperature 0 the threshold can drop to 0.02 and the alarm gets sharper.

What if the gold set is small, say 25 cases?

All three signals get noisier and harder to act on. Flip rate at 25 cases has a sample variance high enough that one or two flips look like an alarm. KL divergence against an 11 dimensional intent distribution is undefined for several intents. Kappa at n equals 25 has a wide confidence interval. You can still run the ritual, but you read it as a directional signal not a gate. Better is to spend a sprint hand grading another 30 cases and lift the gold set to 55 to 70. Below 50 the bake-off itself is a vibe call, drift card or no.
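To see how noisy the flip rate is at small n, a Wilson score interval is one standard way to put an error bar on it. This is a sketch using a technique swapped in for illustration; it is not part of the ritual as described.

```python
import math


def wilson_interval(flips: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a flip rate of flips/n (z=1.96 is roughly 95 percent)."""
    if n <= 0 or not 0 <= flips <= n:
        raise ValueError("need n > 0 and 0 <= flips <= n")
    p = flips / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For example, `wilson_interval(2, 25)` spans roughly 2 to 25 percent, which covers both the healthy band and the above-15 band. That is why a 25 case set reads as directional rather than as a gate. The interval for 5 flips out of 67 is noticeably narrower.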

What does the deletion ledger have to do with gold set drift?

Cases removed from the gold set without a recorded reason are the most common silent driver of drift. Engineer A removes a case in week three because they think it is redundant, engineer B sees the same case fail in production six weeks later, neither knows. The deletion ledger is one append only file, evals/gold_set/deletions.jsonl, with case ID, removal date, removing engineer, and one line of rationale. Lint refuses any PR that drops a case ID without writing to the ledger. Existing playbooks treat the gold set as immutable and treat additions as the only edit; the deletion ledger is what makes deletions safe.
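The core of such a lint is a set difference. A minimal sketch, assuming each ledger line is a JSON object with a `case_id` field (the field name is hypothetical; the text only lists what each entry records):

```python
import json


def unledgered_removals(
    old_ids: set[str], new_ids: set[str], ledger_lines: list[str]
) -> set[str]:
    """Case IDs dropped from the gold set with no matching deletion ledger entry."""
    ledgered = set()
    for line in ledger_lines:
        if line.strip():  # tolerate blank lines in the JSONL file
            ledgered.add(json.loads(line)["case_id"])
    return (old_ids - new_ids) - ledgered
```

A lint step would compare the gold set case IDs on the base branch and the PR branch, read `evals/gold_set/deletions.jsonl` from the PR branch, and fail when the returned set is non-empty.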

How does this fit with the model swap trust card?

The gold drift card runs upstream of the trust card. The trust card asks: is this eval set trustworthy enough to grade a model swap. The gold drift card answers part of that question: is the gold set itself trustworthy enough to be the eval set's spine. If the drift card is FAIL, the trust card is automatically NO without running. If the drift card is PASS or FLAGGED, the trust card runs its own three gates on top. We see the two cards regenerated together in CI on most engagements once the harness is stable.

Does this need a special harness or can I add it to the one I have?

It bolts onto whatever harness you already run. The three commands wrap small Python scripts: about 60 lines for mine, 200 for the relabel TUI, 120 for diff. Total under 400 lines. The only stateful piece is the cache of past labels and past judge runs, which lives in the repo as JSONL. We have shipped it on top of homegrown harnesses, ragas, and one client's pytest-based eval suite. The point is the ritual, not the framework.

Where does fde10x sit in this?

We are a forward deployed ML engineering studio. Senior engineers go inside your repo on week one, ship a working agent prototype in client staging by end of week two, and hand off the runbook, the eval harness, the gold drift card, and the trust card on week six. The drift card and the gold set live in your repo from day one, regenerated by your CI, signed by the named engineer. We are model vendor neutral, so the judge can be anything from Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open weight. No platform license, no vendor attached runtime.

The trust card, harness drift, and regression set contracts the gold drift card plugs into

Related guides

Trust card

Agent eval set, model swap, trust: the two number trust card

The downstream calibration card the gold drift card feeds. Cohen's kappa over 0.85, eval to production Spearman over 0.7, three pinned snapshots, one signed page before any model swap is read.

11 min read

Harness

ML engineering eval harness drift: four ways the harness rots

The sister failure mode. Where the gold set drift article asks whether your reference cases are still right, the harness drift article asks whether the machine grading them is still grading the same way.

10 min read

Eval set

Agent regression eval set: the ratchet rule and the deletion ledger

How the gold set is allowed to grow and shrink without rotting backward. The 80 line lint that refuses to merge a PR that removes a case ID without a recorded rationale.

14 min read