Guide, topic: agent eval set trust calibration before a model swap

Before you trust the eval set's verdict on a model swap, prove the eval set itself is trustworthy.

Most teams treat the agent eval set as ground truth. The new model scores higher, the swap goes on the roadmap, and three weeks later a customer files a ticket about a refund flow that regressed. The bake-off was not wrong. The eval set was. This guide is the calibration protocol we ship into client repos before any model swap is allowed: a 50-case gold set, two specific numbers, three pinned versions, and a one-page trust card the engagement owner signs.

Matthew Diakonov
11 min read

The thing nobody asks about a bake-off

Here is the question that does not get asked when a model swap is on the table: how confident are we that the eval set we are grading against is still measuring the thing we care about? Everyone asks how the new model scored. Almost nobody asks whether the score itself is trustworthy. The eval set is implicitly treated as ground truth, when in fact it is an instrument that drifts, ages, and gets graded by a judge that also drifts.

Two specific drifts kill swaps. First, the judge model rotates (a vendor pushes a new snapshot under the alias you pinned, or you swap to a cheaper judge to save tokens) and the same rubric grades the same cases differently. Second, the eval set ages: production traffic shifts, new intents emerge, old intents get resolved upstream, and the eval set no longer reflects what users are actually doing. Either drift makes the bake-off score a number with an unknown error bar. With an unknown error bar, a 1.5-point delta is meaningless.

The fix is a calibration step that runs before the bake-off. Two numbers, three pins, one signed page. We call it the trust card. The model swap decision rule changes from "is the new model better" to "is the new model better, and is the eval set trustworthy enough that the answer matters."

The three trust gates, in the order they run

Each gate produces a number and a status. The bake-off result is not opened until all three have a status. A failed gate does not mean abandon the swap; it means fix the gate first and run the bake-off second.


Gate 1: judge-human agreement on a 50-case gold set

Cohen's kappa between the LLM judge and a hand-graded gold set has to clear 0.85 before the eval set is allowed to weigh in on a model swap. Below 0.85 the judge is grading something other than what your humans care about, and the bake-off is a thermometer pointed at the wrong room.

50 cases is the floor. 100 is better. The gold set is hand-labeled by the engineer who owns the agent, not an annotation vendor, because the labels carry product opinion that vendors do not have. Re-label the gold set every two weeks. If kappa drops, the judge prompt has drifted or the judge model itself has changed snapshots; both happen quietly.
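
Gate 1 reduces to a single comparison once the human and judge labels sit side by side. A minimal sketch, assuming a gold_set.jsonl where each case carries a human_label and a judge_label (the file name and field names are illustrative, not a fixed schema):

```python
# Gate 1 sketch: Cohen's kappa between the pinned LLM judge and the hand-graded gold set.
import json
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.85  # below this, the judge is not measuring what the humans measure

def gate1_kappa(gold_path="evals/gold_set.jsonl"):
    human, judge = [], []
    with open(gold_path) as f:
        for line in f:
            case = json.loads(line)
            human.append(case["human_label"])   # label from the engineer who owns the agent
            judge.append(case["judge_label"])   # label from the pinned judge model
    kappa = cohen_kappa_score(human, judge)
    return kappa, ("PASS" if kappa >= KAPPA_FLOOR else "FAIL")

if __name__ == "__main__":
    kappa, status = gate1_kappa()
    print(f"gate 1 | cohen_kappa={kappa:.2f} (floor {KAPPA_FLOOR}) | {status}")
```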

Gate 2: eval-to-production correlation over the trailing 14 days

Spearman rho between per-case eval scores and per-case production outcomes for the same intents has to clear 0.7. If the eval set ranks a slice as healthy and production grades the same slice as failing, the eval set is measuring the wrong thing and no bake-off result built on it is allowed to ship.

The production grade is a small ongoing process: 30 to 80 traces a day sampled from real traffic, run through the same rubric the eval set uses, with the same judge model. The correlation is computed across intents (head, tail, high-stakes, adversarial, long-context). One slice that decouples from production is a kill-switch on the slice, not a kill-switch on the entire bake-off, but it has to be flagged on the trust card.
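
The per-slice correlation is a few lines once eval scores and production grades are joined on a case or intent id. A minimal sketch with illustrative data shapes; the 0.7 floor is the gate described above:

```python
# Gate 2 sketch: Spearman rho between eval scores and production grades, computed per slice.
from collections import defaultdict
from scipy.stats import spearmanr

RHO_FLOOR = 0.70

def gate2_by_slice(cases):
    """cases: iterable of (slice_name, eval_score, prod_grade), already joined on case id."""
    by_slice = defaultdict(lambda: ([], []))
    for slice_name, eval_score, prod_grade in cases:
        evals, prods = by_slice[slice_name]
        evals.append(eval_score)
        prods.append(prod_grade)
    results = {}
    for slice_name, (evals, prods) in by_slice.items():
        rho, _ = spearmanr(evals, prods)
        results[slice_name] = (rho, "PASS" if rho >= RHO_FLOOR else "FLAGGED")
    return results

# A FLAGGED slice blocks the swap on that slice only, and gets named on the trust card:
# for name, (rho, status) in gate2_by_slice(joined_cases).items():
#     print(f"gate 2 | {name:<14} | spearman_rho={rho:.2f} | {status}")
```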

Gate 3: pinned judge, pinned dataset, pinned candidate set

Three things get hashed and recorded before the bake-off runs: the judge model snapshot, the eval set commit SHA, and the candidate models with their exact snapshot strings. Anything that floats invalidates the comparison and is treated as a failed gate. This is how you catch the silent retest where someone re-ran with a different judge and the numbers shifted by 4 points.

We pin to a snapshot string the vendor publishes (claude-sonnet-4-5 pinned to a dated snapshot, gpt-4.1-2025-04-14, gemini-2.5-pro-05-06). We do not pin to an alias like "latest" because aliases change underneath you. The eval set is committed to the client repo and the SHA is in the trust card; if the next bake-off uses a different SHA, the rubric flags the comparison as not-strictly-comparable and the shipping owner has to acknowledge that on the page.
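
A minimal sketch of the pin record, assuming the pins live in a small JSON file next to the eval set (the file name, field names, and git invocation are illustrative). The point is that every entry is an exact string or SHA, never an alias:

```python
# Gate 3 sketch: record the three pins before the bake-off runs, and fail the gate on drift.
import hashlib
import json
import subprocess

PIN_PATH = "evals/pins.json"  # illustrative location, committed next to the eval set

def current_pins(judge_snapshot, candidate_snapshots):
    return {
        "judge_snapshot": judge_snapshot,              # exact dated string, never "latest"
        "eval_set_sha": subprocess.check_output(       # last commit that touched the eval set
            ["git", "log", "-1", "--format=%H", "--", "evals/"], text=True).strip(),
        "candidate_snapshots": sorted(candidate_snapshots),
    }

def write_pins(pins, pin_path=PIN_PATH):
    pins["fingerprint"] = hashlib.sha256(              # one short hash for the trust card
        json.dumps(pins, sort_keys=True).encode()).hexdigest()[:12]
    with open(pin_path, "w") as f:
        json.dump(pins, f, indent=2)

def gate3_check(pins, pin_path=PIN_PATH):
    with open(pin_path) as f:
        recorded = json.load(f)
    drifted = [k for k in ("judge_snapshot", "eval_set_sha", "candidate_snapshots")
               if recorded.get(k) != pins.get(k)]
    return "PASS" if not drifted else "FAIL (drifted: " + ", ".join(drifted) + ")"
```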

The artifact: eval_trust_card.md

The trust card is one markdown file in the client's repo, checked in alongside the eval set. It regenerates from one command, and the engagement owner signs it before the bake-off result is read. The shape we ship is sketched below; on a real engagement the numbers come from the client's own gold set and trace sample, with slice names lightly anonymized.

evals/trust_card.md
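
A minimal sketch of the shape; every number, date, and slice name below is an illustrative placeholder, not a client value:

```markdown
# Eval trust card: model swap gate

Signed: <named engineer>, <timestamp>
Regenerated: <date>, from one command

## Gate 1: judge-human agreement
- Gold set: 67 cases, hand-labeled by the owning engineer, last regraded <date>
- Judge: <pinned snapshot string>
- Cohen's kappa: 0.89 (floor 0.85) -> PASS

## Gate 2: eval-to-production correlation, trailing 14 days
- Production sample: ~40 traces/day, same rubric, same judge
- Spearman rho by slice:
  - head intents: 0.81 -> PASS
  - long-context: 0.76 -> PASS
  - high-stakes: 0.58 -> FLAGGED (swap blocked on this slice until correlation recovers)

## Gate 3: pins
- Judge snapshot: <exact snapshot string>
- Eval set commit SHA: <sha>
- Candidates: <exact snapshot strings>

## Verdict
CONDITIONAL PASS (swap may proceed on every slice except high-stakes)
```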

What the regen actually prints

The trust card is regenerated by a single command. The output is short on purpose: three gate results, an overall status, and a one-line note when a gate flags. If the run takes more than three minutes for a 67-case gold set and 14 days of trace samples, the harness is doing too much; trim it.

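A sketch of the regen output, assuming a make trust-card style entry point; the command name and exact layout are illustrative, while the three gate lines and the overall status are the contract:

```bash
$ make trust-card   # illustrative entry point; a small script works just as well
gate 1 | cohen_kappa=0.89 (floor 0.85)                      | PASS
gate 2 | spearman_rho, trailing 14 days, worst slice shown  | FLAGGED
         high-stakes rho=0.58 (floor 0.70), swap blocked on this slice
gate 3 | judge / eval-set SHA / candidate pins unchanged    | PASS
overall: CONDITIONAL PASS, signed page at evals/trust_card.md
```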

What the trust card catches that a bake-off alone does not

The trust card sits upstream of the bake-off. It is a calibration step, not a comparison step. The comparison below is between a swap process gated by a trust card and the more common swap process where the eval set is treated as ground truth and the bake-off score is taken at face value.

What gets read first when a new model lands
  • Bake-off only swap: The bake-off summary slide. Aggregate score, per-slice scores, recommendation. The eval set's trustworthiness is assumed because no one is asking the question.
  • Trust card gated swap: The trust card. Two numbers and three pinned versions. If any gate fails, the bake-off score is not opened. The shipping owner reads the trust card before the bake-off.

What happens when the judge has drifted
  • Bake-off only swap: Nothing catches it. The bake-off runs with a quietly worse judge, the score numbers shift by 3 to 5 points, and the team interprets the shift as model behavior instead of judge drift.
  • Trust card gated swap: Gate 1 catches it. Cohen's kappa drops below 0.85 on the next regen of the gold set, the trust card flips to FAIL, and the bake-off is paused until the judge is repinned or the prompt is recalibrated.

What happens when production decouples from the eval
  • Bake-off only swap: The decoupling is invisible until a customer-facing failure surfaces it. The bake-off was green on a slice that was no longer measuring production reality.
  • Trust card gated swap: Gate 2 catches it. Spearman rho drops on the affected slice, the trust card flips to FLAGGED with the slice named, and the swap is allowed for other slices but blocked on the decoupled one.

How model snapshot drift gets handled
  • Bake-off only swap: The team aliases to whatever the vendor calls "latest", the snapshot rotates, and the next bake-off compares apples to a slightly different apple.
  • Trust card gated swap: Gate 3 pins the snapshot string for both judge and candidates. Any rerun against a different snapshot flags the comparison as not-strictly-comparable and forces an explicit acknowledgement.

Who signs off
  • Bake-off only swap: Nobody signs anything. The bake-off result is in a Slack thread. Three months later nobody can reconstruct which judge or which eval SHA the call was made on.
  • Trust card gated swap: The named engineer who built the trust card signs it, with timestamp, before the bake-off result is read. The signature is on the page in the repo. The postmortem in three months has a record.

What the engagement leaves behind
  • Bake-off only swap: A spreadsheet of bake-off scores and a Slack message. The methodology lives in someone's head and leaves with them.
  • Trust card gated swap: The trust card is part of the eval harness handoff in week 6. It runs from one command, regenerates monthly, and the client owns it. No platform license, no vendor-attached runtime.

CONDITIONAL PASS, and why it is the most common verdict

Each regen of the trust card produces one of three per-slice statuses: PASS, FLAGGED, or FAIL. PASS is the easy case. FAIL is the easy case in the other direction (the swap does not happen, the gate gets fixed, the swap runs again next week). The interesting case, and in our experience by far the most common, is FLAGGED on one or two slices, which rolls the card as a whole up to CONDITIONAL PASS.

FLAGGED on the high-stakes slice means the eval set's grades on refunds, billing, legal, or medical have decoupled from production grades on the same intents. The bake-off number for that slice is no longer trustworthy. The shipping decision is to swap on every other slice and keep the incumbent on the flagged one until the slice's eval set is re-mined and the correlation recovers. This is the hybrid deployment pattern. The trust card is what makes the case for it concretely instead of as a vibe call.

Decision rule, written down

  • All three gates PASS: bake-off result is read, swap proceeds if the per-slice rubric is green.
  • Any slice FLAGGED on gate 2: swap is allowed on every non-flagged slice; flagged slice stays on incumbent until correlation recovers.
  • Gate 1 FAIL or gate 3 FAIL: bake-off result is not read. Repin or recalibrate, regenerate the trust card, run the bake-off again. Total downtime is usually one to two days.
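
The same rule as a minimal code sketch, with gate statuses in and a per-slice shipping decision out (status strings follow the gates above; the data shapes are illustrative):

```python
# Decision-rule sketch: the trust card gates decide whether the bake-off result is even read.
def swap_decision(gate1, gate3, gate2_by_slice):
    """gate1 / gate3: 'PASS' or 'FAIL'. gate2_by_slice: {slice_name: 'PASS' | 'FLAGGED'}."""
    if gate1 == "FAIL" or gate3 == "FAIL":
        # Bake-off result is not read at all; repin or recalibrate, then regenerate the card.
        return {slice_name: "BLOCKED" for slice_name in gate2_by_slice}
    return {
        slice_name: ("SWAP_ALLOWED" if status == "PASS" else "KEEP_INCUMBENT")
        for slice_name, status in gate2_by_slice.items()
    }
```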

What the trust card is not

It is not a substitute for production observability. The cleanest trust card and the cleanest bake-off in the world still miss failure modes the eval set never sampled. The trust card validates the eval set as a leading indicator; the trace pipeline is the lagging indicator that catches anything the eval set could not. The two together are the shipping system. The trust card without the trace pipeline ships confident swaps that are still half blind. The trace pipeline without the trust card catches regressions in production after they hit users.

It is also not a license to skip the bake-off itself. The trust card answers "is the eval set trustworthy." The bake-off answers "is the new model better." Both questions have to be answered, in that order. Skipping the bake-off because the trust card was green is the symmetric error to running the bake-off without the trust card.

Where fde10x sits in this

We are a forward deployed ML engineering studio. Named senior engineers go inside the client's GitHub, Slack, and standup in week 1, ship a working agent prototype in client staging by end of week 2, and hand off the runbook, the eval harness, and the trust card in week 6. The client owns the repo, the eval set, the rubric, and the trust card. We are model vendor neutral, so the candidate set on the trust card can be any combination of Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight. No platform license, no vendor-attached runtime.

Plenty of teams build this themselves. The embed is the right call when the team is currently picking models by gut, rolling them back at a measurable rate, or arguing about the same swap decision in three different Slack threads. Once the trust card exists in the repo, the arguments stop, because there is one signed page that decides.

Want a senior engineer to ship the trust card and the eval harness inside your repo?

60-minute scoping call with the engineer who would own the build. You leave with a draft of the gold set we would label, the judge we would pin, the trace pipeline we would wire up, and a fixed weekly rate to ship the harness and the first signed trust card in two to six weeks.

Eval set trust, model swaps, and the trust card, answered

Why is judge-human agreement the first gate and not the bake-off itself?

Because the bake-off is a measurement and the judge is the measuring instrument. If the instrument is wrong, the measurement is decoration. Cohen's kappa on a 50-case gold set is a cheap, repeatable way to confirm the judge agrees with the engineer who actually owns the product. We have seen kappa drop from 0.91 to 0.74 in a single week after a vendor rotated the judge model under us. Without gate 1 we would have shipped a model swap on the bad numbers. With gate 1 we caught it, repinned the judge, and the bake-off ran on the clean numbers two days later.

Why Spearman correlation instead of just running the eval more often?

Because frequency does not solve drift. The eval set can be a perfect snapshot of production from three months ago and still be a bad predictor of production now. Spearman rho between eval scores and production grades on the same intents tells you whether the eval is still a leading indicator. If rho drops, the eval set has aged and needs a re-mining pass. The exact threshold matters less than the trend; what kills swaps is rho dropping by 0.15 across two weeks and nobody noticing.

Where do the production grades come from?

A small daily sample of real production traces, 30 to 80 a day, run through the same rubric and judge as the eval set. This is the trace pipeline the eval harness is paired with. The two together form the full trust system: the eval set is what you grade before shipping, the trace pipeline is what you grade after, and the correlation between them is what tells you the eval set is still doing its job. Either alone is half the answer.

What if I cannot get to kappa 0.85?

Two common causes. First, the judge prompt grades at the wrong granularity (it asks for a binary pass when humans are grading on three buckets, or vice versa). Rewrite the rubric so the judge is forced into the same shape as the human label. Second, the gold set is internally inconsistent because two humans graded it differently. Re-grade with one engineer doing all 50 cases, write down the edge cases, and use those edge cases as judge prompt exemplars. Most kappa misses we see are rubric problems, not judge problems.

How big does the gold set need to be?

50 to 80 cases is the working range. Smaller and the kappa estimate is too noisy to act on. Bigger and the regrade cost starts crowding out other engineering work. The right number is wherever you can afford to re-grade end-to-end every two weeks without it becoming an annoying chore. If it becomes an annoying chore the regrade gets skipped and the gate becomes theater.
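
One way to check whether the gold set is big enough is to bootstrap a confidence interval on kappa; if the interval straddles the 0.85 floor, the gate verdict is noise rather than signal. A sketch under that assumption (the resample count is illustrative):

```python
# Sketch: bootstrap a 95% confidence interval on Cohen's kappa over the gold set.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci(human, judge, n_boot=2000, seed=0):
    human, judge = np.asarray(human), np.asarray(judge)
    rng = np.random.default_rng(seed)
    n = len(human)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample cases with replacement
        k = cohen_kappa_score(human[idx], judge[idx])
        if not np.isnan(k):                            # skip degenerate resamples
            samples.append(k)
    return tuple(np.percentile(samples, [2.5, 97.5]))  # lower and upper bound

# If the interval straddles the 0.85 floor, grow the gold set before acting on the gate.
```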

Why pin to a snapshot string instead of a model alias?

Because aliases change underneath you. 'gpt-4.1' on April 1st and 'gpt-4.1' on April 28th can be different model snapshots if the vendor rotated. The shift is real and it is enough to move per-slice scores by 2 to 5 points. Pinning to gpt-4.1-2025-04-14 means the bake-off is comparing the same instrument across runs. When the vendor publishes a new snapshot, you opt into it deliberately by changing the pin in the trust card and re-running gate 1 to confirm the new snapshot still grades like the old one.

What does CONDITIONAL PASS mean and is it ever ok to ship on it?

CONDITIONAL PASS means one slice failed the trust gates while others passed. The bake-off result is allowed to drive a swap on the slices where the trust gates passed, and is blocked on the slices where they failed. The most common shape is a hybrid deployment: ship the new model for long-context flows where gates 1, 2, and 3 are all green, keep the incumbent for high-stakes flows where gate 2 flagged. The shipping owner signs off on the carve-out explicitly. CONDITIONAL PASS is never an excuse to ship the failed slice; it is a way to take wins where the trust is real.

How does this differ from the bake-off methodology pages already online?

The pages that currently rank teach you how to construct slices, how to write rubrics, how to pick a judge. They assume the eval set, once built, is the ground truth. This guide treats the eval set as a measurement instrument that itself needs a calibration check before its verdict on a model swap is taken seriously. The trust card is the calibration artifact. Without it the bake-off result is a number whose error bar is unknown.

Where does fde10x fit on this?

We are an embedded ML engineering studio. Senior engineers go inside your repo for two to six weeks to build the eval set, the trust card, and the bake-off harness, then leave the harness and the runbook with you in week 6. The trust card is part of the leave-behind IP: it lives in your repo, regenerates from one command, and the engineer who built it is named in the document. We are model vendor neutral, so the candidate set on the trust card can include any combination of Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight models. There is no platform license and no vendor-attached runtime.

Smallest version I can ship next week without an embed?

Three things. First, hand-grade 50 production traces against the rubric you would use for a bake-off; this is your gold set. Second, run the same judge over the same 50 cases and compute Cohen's kappa; if it is below 0.85, fix the rubric until it is not. Third, write a one-page markdown file in your repo with the kappa number, the judge snapshot, the eval set commit SHA, and your candidate models with their snapshot strings, and require it to be signed before any model swap. That single page is the smallest viable trust card. The Spearman gate can be added in week two once you have a trace pipeline running.