ML engineering eval harness drift: four ways the harness itself rots while the model stays still

Most teams worry about model drift and data drift. The thing that quietly invalidates a quarter's worth of eval results is the harness drifting underneath them. The model can be perfectly stable, the corpus can be perfectly stable, and the score can still stop meaning what it meant last quarter because the harness changed. Four shapes of harness drift, four files that catch them, and one cron that gates merges.

Matthew Diakonov, Written with AI

Published May 6, 2026 · 11 min read

Direct answer (verified 2026-05-06)

Why does an ML eval harness drift?

Four reasons, and they are different from model drift and data drift. One, the rubric is edited without a changelog so historical scores stop being comparable. Two, the judge model rotates under you when the snapshot is not pinned (aliases like claude-sonnet, gpt-4o, gemini-1.5-pro are not snapshot-stable; the provider deprecates and rotates underneath them). Three, the fixture set was drawn from old traffic and stops representing today's intent distribution. Four, pass thresholds creep down to whatever the team can hit instead of staying anchored to what production needs. Each one has its own detection script and its own file in the repo.

Verified against the Anthropic engineering post "Demystifying evals for AI agents" and the OpenAI model deprecations page on 2026-05-06.

The four drifts, side by side

Most guides treat eval drift as a single phenomenon. It is four phenomena. Different files, different scripts, different fixes. Treating them as one is the reason teams keep getting blindsided by the next one after they thought they fixed the harness.

Aspect	A harness without the four files	A harness with the four files in the repo
What 'drift' means	One word, 'drift,' usually meaning the model or the data shifted. The harness itself is treated as a fixed measuring stick. It is not.	Four distinct drifts inside the harness: rubric drift, judge-model drift, fixture drift, threshold drift. Each one is detected by a different script and fixed by a different file in the repo.
When it shows up	After the launch demo, when staging eval and prod outcomes diverge and the team starts looking at training data and inference paths instead of the harness.	Months 2 to 12 of operating the system. The model and the corpus can be perfectly stable while the harness silently changes meaning under you.
How rubric changes get tracked	Someone edits the rubric in Slack or a Notion doc. Old scores get re-quoted next to new scores in the same deck. The PM compares 0.87 last quarter to 0.91 this quarter and reports a 4-point lift that does not exist.	eval/rubric.yaml is version controlled. Every axis change has an entry in eval/rubric_changelog.md with old weight, new weight, date, owner, and the reason. Historical scores are stamped with the rubric SHA they were graded under.
How the judge model is pinned	The judge is configured as 'claude-sonnet' or 'gpt-4o' with no snapshot. The cloud provider rotates the underlying snapshot quietly. Scores drift by 3 to 6 points and nobody can tell whether the agent got worse or the judge got stricter.	eval/judge_pin.yaml stores the exact snapshot string for the judge model (e.g. claude-sonnet-4-6 or gpt-4o-2024-11-20), the snapshot string of any backup judge, and the model card hash. Score runs read from this file. A judge change is a PR.
How fixtures stay representative	The eval set was built in week 2 and never rotated. By month six the agent is graded on questions production users do not ask anymore. The score is high and the support queue is full.	eval/fixture_calendar.yaml tracks the trailing-90-day intent distribution and flags any fixture intent that has fallen below half its share. Fixture rotation is on a quarterly calendar with a re-baseline rule.
How thresholds avoid creep	When the eval starts failing, somebody lowers the threshold to make the gate pass. The next quarter, the gate fails again. Two years in, the threshold is 0.55 on what used to be a 0.80 axis and the team has forgotten why.	eval/threshold_log.yaml records every threshold change with the production incident or rubric change that justified it. Thresholds can move in either direction; they cannot move silently.

Walk through each drift, with the file that fixes it

Each drift below is a real failure mode we have seen on shipped systems. The fix is always a file the client owns in their own repo, never a managed service we keep control of. The Week 6 handoff includes all four plus the cron that ties them together.

Rubric drift

The rubric file gets edited and old scores stop being comparable to new ones. The fix is a changelog and a SHA stamp on every score record.

Six months after launch the rubric has six edits in it. Faithfulness floor moved from 0.74 to 0.78 because of an incident. Helpfulness was split into two axes (helpfulness and completeness) because the original axis was scoring two things at once. Tone was demoted from a 0.10 weight to 0.05 because nobody trusted the grader.

Every one of those edits is the right call. The mistake is making them invisible. A score from before the split is graded under a different rubric than a score from after the split, and pasting them next to each other in a quarterly review deck is how someone ends up reporting a phantom regression in the meeting.

The fix is mechanical. Every score record carries the SHA of the rubric file it was graded under. eval/rubric_changelog.md is appended to (never rewritten). When a scorecard renders a comparison across rubric SHAs, it shows a banner that says "rubric changed between these runs" and a link to the changelog entries that landed in between.
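A minimal sketch of the stamping step, assuming score records are plain dicts and the rubric is hashed by content with SHA-256. The article does not say which hash is used (a git blob SHA would work equally well), and the names rubric_sha, stamp_score, and comparison_banner are hypothetical.

```python
import hashlib
from typing import Optional


def rubric_sha(rubric_text: str) -> str:
    """Content hash of the rubric file (SHA-256 is an assumption here)."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def stamp_score(score: float, rubric_text: str) -> dict:
    """Attach the rubric hash to a score record at grading time."""
    return {"score": score, "rubric_sha": rubric_sha(rubric_text)}


def comparison_banner(old: dict, new: dict) -> Optional[str]:
    """Return a warning when two records were graded under different rubrics."""
    if old["rubric_sha"] != new["rubric_sha"]:
        return "rubric changed between these runs"
    return None


# Illustrative rubric contents; only the fact that they differ matters.
rubric_v1 = "faithfulness: {floor: 0.74}\n"
rubric_v2 = "faithfulness: {floor: 0.78}\n"

q1 = stamp_score(0.87, rubric_v1)
q2 = stamp_score(0.91, rubric_v2)
print(comparison_banner(q1, q2))
```

A scorecard renderer would call comparison_banner before drawing any delta, so the 0.87 versus 0.91 comparison never appears without the warning attached.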

Judge-model drift

The judge model is unpinned. The cloud provider rotates the snapshot. Scores move 3 to 6 points and nobody knows whether the agent or the judge changed.

On every engagement we have shipped, this is the one that bites first. The eval was wired up against a model alias like claude-sonnet, gpt-4o, or gemini-1.5-pro. Aliases are convenient, and aliases are exactly the lever the cloud provider uses to roll the underlying snapshot forward. OpenAI publishes a deprecation calendar for snapshot strings. Anthropic publishes a snapshot list for Sonnet, Haiku, and Opus across versions. The alias names are stable; the model behind them is not.

When the snapshot rotates, the LLM-as-judge starts grading 3 to 6 points harder or softer depending on the change. Your agent's faithfulness score moves. The agent did not change. The corpus did not change. The judge did. The team spends a week looking for a code regression that does not exist.

The fix has two parts. First, the rubric grader reads the judge snapshot string from eval/judge_pin.yaml, not from an environment variable or a default. Second, the harness drift cron runs once a week and asserts that the snapshot string in eval/judge_pin.yaml is still in the provider's "active" list. If the provider has marked it deprecated, the cron opens a PR with a candidate replacement and a re-calibration plan.
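The pin check can be sketched roughly as follows. This assumes the pin file has already been parsed into a dict with a snapshot key and that the provider's active list has been fetched separately. The function name, the dict shape, and the contents of the active set are illustrative assumptions, not the shipped script.

```python
from typing import List

# Aliases the article names as not snapshot-stable; a pin must never be one of these.
BANNED_ALIASES = {"claude-sonnet", "gpt-4o", "gemini-1.5-pro"}


def check_judge_pin(pin: dict, active_snapshots: set) -> List[str]:
    """Return a list of problems with the pinned judge; empty means the pin is healthy."""
    problems = []
    snapshot = pin.get("snapshot")
    if not snapshot:
        problems.append("no judge snapshot pinned")
    elif snapshot in BANNED_ALIASES:
        problems.append(f"'{snapshot}' is an alias, not a snapshot")
    elif snapshot not in active_snapshots:
        problems.append(f"'{snapshot}' is no longer in the provider's active list")
    return problems


# Hypothetical parsed contents of eval/judge_pin.yaml.
pin = {"snapshot": "gpt-4o-2024-11-20"}
# Illustrative active list; in practice this comes from the provider, out of band.
active = {"gpt-4o-2024-11-20", "gpt-4o-2024-08-06"}

print(check_judge_pin(pin, active))
```

Returning a list of problems rather than raising keeps the check easy to combine with the other three in a single gate.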

Fixture drift (also called fixture rot)

The eval fixtures were drawn from week 2 traffic. By month nine, production traffic has shifted and the fixture set grades a distribution the agent no longer serves.

A typical agent picks up new intents every quarter as adjacent teams plug in their use cases. The fixture set, frozen at launch for repeatability, slowly stops representing the user. The score is still high because the questions the eval asks are still the questions the agent was originally good at. The support queue tells a different story.

The detection logic is straightforward. Sample a thousand traces from the trailing 90 days. Cluster them by intent. Compare the intent distribution to the fixture distribution. Any intent that has a 2x or more distribution shift, or any new cluster representing more than 5 percent of traffic that has zero fixtures, is a flag.
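The comparison described above can be sketched in a few lines, assuming the clustering step has already assigned an intent label to every trace and every fixture. The function names and the sample intents are hypothetical; the 2x and 5 percent bounds come from the text.

```python
from collections import Counter


def share(labels):
    """Map each intent label to its fraction of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}


def fixture_drift_flags(traffic_intents, fixture_intents,
                        max_ratio=2.0, new_cluster_floor=0.05):
    """Flag intents whose share shifted 2x or more, or that have traffic but no fixtures."""
    traffic = share(traffic_intents)
    fixtures = share(fixture_intents)
    flags = []
    for intent in sorted(set(traffic) | set(fixtures)):
        t = traffic.get(intent, 0.0)
        f = fixtures.get(intent, 0.0)
        if f == 0.0:
            # New cluster: only a flag once it exceeds the traffic floor.
            if t > new_cluster_floor:
                flags.append((intent, "uncovered"))
        elif t == 0.0 or max(t / f, f / t) >= max_ratio:
            # Shift in either direction counts, including intents that vanished.
            flags.append((intent, "shifted"))
    return flags


# Illustrative sample: 1000 labeled traces and a 100-case fixture set.
traffic = ["billing"] * 500 + ["onboarding"] * 300 + ["refunds"] * 100 + ["api_keys"] * 100
fixtures = ["billing"] * 50 + ["refunds"] * 40 + ["onboarding"] * 10

for intent, reason in fixture_drift_flags(traffic, fixtures):
    print(intent, reason)
```

In this sample, billing stays quiet because its share is the same on both sides, while api_keys is flagged as a cluster with traffic and no fixtures.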

The fix is not "regenerate the fixture set." That destroys the historical comparison. The fix is fixture rotation. Each quarter, a fraction of fixtures are retired and replaced with new ones drawn from current traffic, with the rotation logged so a long-running score chart can be split into the eras of fixture sets it ran against.

Threshold creep

When eval starts failing, someone lowers the threshold to make the gate pass. Two years in, the gate stops gating.

The shape is always the same. A new model lands, eval drops below the floor, the launch is on the line, the threshold gets adjusted "just for this release" with a plan to fix it next quarter. It does not get fixed next quarter. The threshold sits at the new lower value forever and the next time the gate fails, the same story plays out.

The fix is a log file with a hard rule: no threshold change ships without an entry in eval/threshold_log.yaml that names the date, the owner, the old value, the new value, the production incident or rubric change that justified it, and a sunset condition. Sunset means "this threshold returns to the old value when X" where X is a measurable thing, not a vibe.

The CI gate parses the log. A threshold that has no entry is a hard build failure. A threshold whose sunset condition has been met but has not been bumped back is a soft warning that turns into a hard failure after 30 days.
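A sketch of that gate logic, under a few assumptions: the log has been parsed into a list of dicts, the current thresholds into a dict keyed by axis, and each entry carries a hypothetical sunset_met_on field recording the date the sunset condition was first observed to hold. How that date gets recorded is outside this sketch, and the axis names and values are made up.

```python
from datetime import date


def check_thresholds(current: dict, log: list, today: date, grace_days: int = 30):
    """Return (failures, warnings) for the thresholds currently configured."""
    failures, warnings = [], []
    latest = {}
    for entry in sorted(log, key=lambda e: e["date"]):
        latest[entry["axis"]] = entry  # later entries overwrite earlier ones
    for axis, value in current.items():
        entry = latest.get(axis)
        if entry is None or entry["new"] != value:
            # The configured value is not what the log says it should be.
            failures.append(f"{axis}: threshold {value} has no log entry")
            continue
        met_on = entry.get("sunset_met_on")
        if met_on is not None:
            overdue = (today - met_on).days
            msg = f"{axis}: sunset met {overdue} days ago, threshold not restored"
            (failures if overdue > grace_days else warnings).append(msg)
    return failures, warnings


# Illustrative log and thresholds.
log = [
    {"axis": "completeness", "date": date(2026, 1, 10), "old": 0.80, "new": 0.75,
     "sunset_met_on": date(2026, 4, 20)},
    {"axis": "tone", "date": date(2026, 2, 1), "old": 0.60, "new": 0.55,
     "sunset_met_on": None},
]
current = {"completeness": 0.75, "tone": 0.55, "helpfulness": 0.70}

failures, warnings = check_thresholds(current, log, today=date(2026, 5, 6))
print("failures:", failures)
print("warnings:", warnings)
```

The same call with a later date moves the completeness warning into the failure list once the 30-day grace period has passed.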

scripts/check-harness-drift.sh

One shell script. Four checks. Runs nightly and on every PR that touches eval/. Exits non-zero on any drift the team has not signed off on in writing. This is the file that turns the four-file pattern into a gate instead of a wall poster.


What it looks like when fixture drift trips

Three of the four checks are quiet most days. Fixture drift is the one that fires loudest, because intent distributions move faster than rubrics or judge snapshots.

cron output, 2026-05-06 03:14 UTC

The four files, in detail

These are the actual shapes we drop in client repos on a Week 6 handoff. They are intentionally boring, intentionally readable, intentionally diff-friendly. The team that inherits the harness in month nine has to be able to read these files without us in the room.

eval/judge_pin.yaml

The single file we wish every team had on day one. Pin the snapshot, ban the alias, hash the model card, demand re-calibration on every change.


eval/fixture_calendar.yaml

What every fixture covers, what fraction of trailing-90 traffic it represents, and when it rotates. The check_fixture_distribution.py script reads this file. Fixture drift is the most common drift and the loudest to detect.


eval/threshold_log.yaml

Every threshold change is in here, with a justification and a sunset condition that is a measurable assertion, not a vibe. The CI parses it. Lowering a gate without an entry is a hard build failure.


The pattern, restated for the senior engineer who inherits this

Score records are stamped with three SHAs

Rubric SHA, judge_pin SHA, fixture_calendar SHA. Any score chart that spans more than one of these SHAs draws a banner showing which SHAs were active when. There is no "score across releases" without an explicit acknowledgement that the harness moved.
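One way to sketch the era split a chart would need, assuming score records are dicts in chronological order that carry the three SHA fields under the hypothetical key names below. The short SHA values are placeholders.

```python
from itertools import groupby

SHA_KEYS = ("rubric_sha", "judge_pin_sha", "fixture_calendar_sha")


def harness_key(record):
    """The triple of SHAs that identifies which harness graded this record."""
    return tuple(record[k] for k in SHA_KEYS)


def split_into_eras(records):
    """Group chronologically ordered records into runs with an identical harness."""
    return [list(group) for _, group in groupby(records, key=harness_key)]


def changed_between(prev, curr):
    """Name the harness files that changed between two records, for the banner."""
    return [k for k in SHA_KEYS if prev[k] != curr[k]]


# Illustrative history: the judge pin changes between the second and third run.
runs = [
    {"score": 0.81, "rubric_sha": "r1", "judge_pin_sha": "j1", "fixture_calendar_sha": "f1"},
    {"score": 0.82, "rubric_sha": "r1", "judge_pin_sha": "j1", "fixture_calendar_sha": "f1"},
    {"score": 0.77, "rubric_sha": "r1", "judge_pin_sha": "j2", "fixture_calendar_sha": "f1"},
    {"score": 0.78, "rubric_sha": "r1", "judge_pin_sha": "j2", "fixture_calendar_sha": "f1"},
]

eras = split_into_eras(runs)
print(len(eras), "eras;", "changed:", changed_between(runs[1], runs[2]))
```

The drop from 0.82 to 0.77 lands exactly on an era boundary, which is the case the banner exists for.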

The cron fails closed

A drift the team has not acknowledged is a non-zero exit on scripts/check-harness-drift.sh. Non-zero exit blocks the next merge. The team can override (it is their repo) but the override has to be explicit and goes in eval/overrides.md with a date.
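The fail-closed behaviour might look like the sketch below, written in Python for illustration even though the article's gate is a shell script. It assumes each check is a callable returning a list of drift descriptions and that overrides have been parsed into a set of acknowledged descriptions. Treating a crashed check as a failure is our reading of "fails closed", not something the text spells out.

```python
def run_gate(checks, overrides):
    """Return (exit_code, unacknowledged); the code is 0 only when nothing is left over."""
    unacknowledged = []
    for name, check in checks.items():
        try:
            drifts = check()
        except Exception as exc:
            # A check that cannot run counts as a failure rather than a pass.
            drifts = [f"{name}: check crashed ({exc})"]
        unacknowledged += [d for d in drifts if d not in overrides]
    return (1 if unacknowledged else 0), unacknowledged


def broken_check():
    raise RuntimeError("log unreadable")


# Hypothetical check results for one run.
checks = {
    "rubric": lambda: [],
    "judge": lambda: ["judge: snapshot deprecated"],
    "fixtures": lambda: ["fixtures: refunds shifted 4x"],
    "thresholds": broken_check,
}
# Hypothetical acknowledged drifts, as they might be parsed from eval/overrides.md.
overrides = {"fixtures: refunds shifted 4x"}

code, remaining = run_gate(checks, overrides)
print(code, remaining)  # a real script would pass `code` to sys.exit
```

Only the fixture drift is waved through here, because it is the only one with a written override.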

Nothing depends on us

Every script runs in the client's CI on the client's infrastructure. The judge can be Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight. There is no platform license, no vendor-attached runtime, no fde10x service that has to stay paid for the harness to keep working. After the Week 6 transfer session, the harness is the client's outright.

When a senior engineer would not bother with all four

Worth saying plainly. If you are still in week one of the agent's life, with a single judge run on a 50-case eval set and no production traffic, you do not need fixture_calendar.yaml or threshold_log.yaml yet. The rubric has had no time to drift, the judge has not rotated, the fixtures match traffic by definition because there is no traffic. The order we add them in on engagements: judge_pin.yaml goes in week 2 with the prototype. rubric_changelog.md goes in week 3 the first time we edit the rubric. fixture_calendar.yaml goes in week 5 once we have ingested enough production traces to compute a baseline distribution. threshold_log.yaml goes in week 6 with the production gate.

By the time the agent has been in production for two months, all four are doing real work. By month nine, three of the four have caught a drift the team would have otherwise found from a customer escalation.


ML eval harness drift, FAQ

Is harness drift different from model drift or data drift?

Yes, and missing the difference is the costly mistake. Model drift means the model's outputs change. Data drift means the inputs change. Harness drift means the measuring instrument itself changes meaning. Rubric edits, an unpinned judge snapshot, fixtures drawn from old traffic, and threshold creep all leave the model and the corpus alone while quietly invalidating the comparison between today's score and last quarter's. A team that monitors only model drift and data drift can spend a week looking for a regression that does not exist because the harness moved.

Why does pinning the judge model matter when the cloud provider says aliases are stable?

Aliases like claude-sonnet, gpt-4o, and gemini-1.5-pro are interface-stable, not behavior-stable. The provider rotates the snapshot behind the alias as new versions ship and old ones deprecate. The interface is the same, the API parameters are the same, and the scoring distribution shifts by 3 to 6 points. Pinning to a specific snapshot string (e.g. claude-sonnet-4-6, gpt-4o-2024-11-20) is the only way to guarantee the judge that scored Q1 results is the judge that scores Q3 results. When the snapshot is deprecated, the change is a tracked PR with a re-calibration on the gold set, not a quiet drift.

How often should fixtures be rotated?

Quarterly is a defensible default for an agent serving a fairly stable user base. The right cadence is the one that keeps the trailing-90-day intent distribution within a 2x shift bound on every covered intent. Rotation is partial, not total. We retire the oldest 20 percent of fixtures and redraw new ones from current traffic. Regression fixtures, the cases that exist because a prior bug shipped to users, never rotate out: they are the institutional memory of what the agent has already failed at. The rotation gets logged in eval/fixture_changelog.md so a score chart that spans the rotation is split into eras and not compared apples-to-oranges across them.
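A small sketch of the retirement rule, assuming each fixture is a dict with an id, an ISO-format added date, and a regression flag (all hypothetical field names). It reads "the oldest 20 percent" as 20 percent of the whole set, taken from the non-regression fixtures; the text leaves that detail open.

```python
import math


def pick_retirements(fixtures, fraction=0.20):
    """Pick the oldest `fraction` of fixtures to retire, never touching regression cases."""
    quota = math.floor(len(fixtures) * fraction)
    rotatable = [f for f in fixtures if not f["regression"]]
    # ISO dates sort correctly as strings, so no date parsing is needed.
    oldest_first = sorted(rotatable, key=lambda f: f["added"])
    return [f["id"] for f in oldest_first[:quota]]


# Illustrative set of ten fixtures; the two oldest are regression cases.
fixtures = [
    {"id": f"fx-{i:02d}", "added": f"2025-{i:02d}-01", "regression": i in (1, 2)}
    for i in range(1, 11)
]

print(pick_retirements(fixtures))
```

The two oldest cases survive because they are regression fixtures, so the quota falls on the next two oldest instead.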

What is the actual file shape that catches threshold creep?

eval/threshold_log.yaml lives in the repo. Every threshold change has an entry naming the axis, the old value, the new value, the direction, the date, the owner, the justification (an incident link or a rubric_changelog reference), and a sunset condition. Sunset is a measurable assertion: 'revert when upstream auth p95 returns under 250ms for 14 days,' not 'we will revisit this.' The CI parses the log. A threshold in rubric.yaml without an entry fails the build. A sunset condition that has been met but not bumped back is a warning for 30 days, then a build failure. The system is mechanical. The team cannot lower a gate without a paper trail and a plan to raise it again.

Where does this fit in a Week 6 handoff?

On every engagement, the Week 6 handoff includes a runbook, an eval harness, and an architecture doc. The four files in this guide (rubric.yaml + rubric_changelog.md, judge_pin.yaml, fixture_calendar.yaml, threshold_log.yaml) plus scripts/check-harness-drift.sh are part of the eval harness deliverable. They live in the client's repo, run on the client's CI, and rely on no fde10x infrastructure. The 90-minute transfer session at the end of Week 6 walks the client's senior engineer through what each file does and what failure modes the cron catches. The point is that in month nine, after we are gone, the harness still tells the truth.

Does ragas help with any of this?

Ragas helps with the per-axis grading: faithfulness, context recall, context precision, answer relevance, helpfulness. It does not solve harness drift. Ragas itself runs against a judge model whose snapshot you have to pin, against a fixture set you have to keep current, against thresholds you have to keep honest. Adopting ragas without the four files in this guide gives you a nicer measurement instrument that drifts in exactly the same way a hand-rolled judge would. The two practices are stacked, not substitutes.

Why not regenerate the fixture set on every rotation instead of partial rotation?

Because it destroys the historical score. A full regenerate means today's 0.84 is uncomparable to last quarter's 0.81. The team loses the ability to claim or measure improvement across rotations. Partial rotation keeps a stable backbone of cases (typically 60 to 80 percent) while refreshing the surface that has shifted. The retired fixtures are archived, not deleted, so a one-off comparison against the old set is still possible if a stakeholder asks. The trade-off is intentional: representativeness improves slowly, comparability stays.

Can a team run all four checks without an external service?

Yes. The whole pattern is four YAML files, one shell script, and three short Python scripts under eval/. There is no platform license, no vendor-attached runtime, no external grading service required. The judge model can be any provider the client picks, including open-weight running on their own infrastructure. The point of dropping these into the client repo, rather than running them as a managed service, is that the client owns the harness and the runbook outright after Week 6 and the drift detection keeps working whether or not we are still in the standup.

Other guides on the eval harness shape we ship on engagements

Adjacent reading

Eval harness

Agent eval set, model swap, trust

Two trust gates (kappa over 0.85, eval-to-prod Spearman over 0.7) and three pinned artifacts (judge, dataset, candidate) before any bake-off score is allowed to weigh in on a model swap.


Engagement shape

FDE Week 2 prototype rubric

What the rubric looks like on day fourteen, how it gates the prototype demo, and why it is the same file that becomes the production gate by Week 6.


Rubric design

Long horizon agents: the week 3 rubric stall

Why a rubric that scored 0.86 on the prototype slice flatlines at 0.45 on full multi-turn traces, and the eight-axis horizon-aware rubric that fixes it.

