AI agent eval harness regression suite: the per-commit manifest, the median-of-N quorum, and the one-line bisect command

Most guides on this topic stop at "fail the build on a score drop." That gate flakes inside two weeks. The honest version writes a per-commit manifest, scores each case as the median of N runs, only blocks merges when the last K green commits agreed on the case being healthy, and ships a one-command bisect that resolves a regression alert to the SHA where the case first broke. This is the file shape, the rules, and the script we drop into client repos on a Week 6 handoff.

Matthew Diakonov
13 min read

Direct answer (verified 2026-05-07)

How does an eval harness block real regressions without flaking?

It writes a per-commit manifest at eval/_history/<sha>.json recording every per-case score at this SHA: N runs, median, p25, p75, judge_calls, plus the rubric_sha, judge_pin_sha, and fixture_sha that were active. The gate fails the build when both: a case's median at this commit is below its case_floor, and at least K-1 of the last K green commits had the same case above its floor. Aggregate score is reported but does not gate. A failing gate posts a PR comment with the bisect command python eval/bisect_regression.py --case-id <id>, which walks the append-only eval-history branch and prints the first SHA where the case crossed below its floor.

Verified on 2026-05-07 against Anthropic engineering, "demystifying evals for AI agents", and OpenAI, "testing agent skills systematically with evals".

The two regression-suite shapes, side by side

The version most articles describe runs the cases on every PR and fails the build when the aggregate score drops more than a threshold. The version that survives contact with production has per-case floors, multiple runs per case, a quorum check against recent green history, and a bisect entry point. This table is the difference, line by line.

| Feature | The 'fail the build on a score drop' shape | The shape we ship in client repos |
| --- | --- | --- |
| What 'regression' means in the gate | Aggregate score dropped more than 3 points since the last green run. One unlucky run can fail the build; one lucky run can hide a real drop. Nobody trusts the gate by week 4. | A case's median score across N>=3 runs at this commit dropped below its per-case floor, and at least K-1 of the last K green commits had that same case above its floor. Two conditions, both required, both checkable from the per-commit manifest. |
| Where per-case scores live | A line in the build log. Last week's score is somewhere in the GitHub Actions artifact retention window. The history is whatever someone copies by hand into a Notion doc. | eval/_history/<commit_sha>.json. One file per CI run. Every case's id, its N raw scores, its median, p25, p75, judge_calls, and the rubric_sha + judge_pin_sha + fixture_sha that were active. Committed by the bot to the eval-history branch and never deleted. |
| How flakes are suppressed | One run per case. A 0.06 swing because the judge LLM picked a different word fails the build. The team starts retrying flaky CI as a habit. The gate stops being a gate. | Each case runs N times (default 3, sensitive cases 5). The median is the score of record. p75 - p25 is published; if the spread crosses a per-case stability_bound, the case is flagged unstable and either stabilized or dropped from the gate, rather than silently passing. |
| What happens when the gate fails | Build fails. Engineer reads the log, sees aggregate dropped, looks at the diff, guesses. If the prompt and the model both moved in the last week, the bisect is by hand and takes a day. | PR comment names the cases that regressed, the median scores at this commit and the last green commit, the diff link, and the bisect command. Engineer runs python eval/bisect_regression.py --case-id <id> and gets the SHA where the case first crossed its floor. |
| What gets bisected | Nothing. There is no per-case history, so there is nothing to bisect. The team reproduces the failure on main and works backward through git log by feel. | The git history of eval/_history/. bisect_regression.py walks the history of one case_id and prints the first SHA where its median crossed below its floor. The SHA itself usually carries the diff that broke the case. |
| Cost of one CI run | Either too cheap to be reliable (1 run per case, flaky) or too expensive to be sustainable (full Monte Carlo, 50 runs per case, $30 per PR, the team turns the gate off when finance asks). | N x case_count judge calls. With 80 cases and N=3, that is 240 judge calls per PR. With Sonnet-tier judging, roughly $1 to $2 of judge cost per PR. Costed and budgeted on day 1 of the harness. |
| Per-case floor vs aggregate floor | One aggregate threshold, applied to the mean across cases. A high-stakes adversarial case dropping from 1.0 to 0.0 can be hidden by 79 other cases moving up by 0.01. That is exactly the regression you were trying to catch. | Each case has a per-case floor stored next to it in eval/cases.yaml. Aggregate is reported but does not gate. A case that has been important enough to add to the suite is important enough to defend on its own; aggregate-only gates lose individual regressions in the average. |

Five rules that make the gate honest

Each rule below maps to a real failure we have watched a team hit on a real engagement. The fix is mechanical: a file or a script in the repo. None of this requires a vendor platform, a managed service, or a runtime we keep control of.

1

Per-case floor lives next to the case

Stored in eval/cases.yaml as case_floor on each row. A case that was added because of an incident gets a tighter floor than a case from the POC seed. The floor is the contract for that one case; the aggregate is reporting, not gating.

The mistake every team makes once is to ship one aggregate floor (rubric_min_score: 0.78) and call that the regression gate. With 80 cases, an adversarial case crashing from 1.0 to 0.0 (rubric_weight 1.5) is buried inside an aggregate that nudges down by 0.02. The aggregate stays green, the gate stays green, and the bug ships. We have watched this exact failure mode on three engagements.

The fix is per-case floors stored in eval/cases.yaml on the row itself: case_floor: 0.78 for a typical real-traffic case, case_floor: 0.95 for an adversarial PII-handling case, case_floor: 0.74 for a known-borderline tail case the team is still hardening. The floor is the lowest score that still ships. The check_regression.py script reads the case row, not a global threshold.

Aggregate score is still computed and still posted to the PR comment so the team can see distribution-level moves. It does not block the merge. Per-case floors do.
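A minimal sketch of that failure, to make the arithmetic concrete. It assumes the aggregate is a rubric_weight-weighted mean (the weighting scheme is not spelled out here), and the case ids, scores, and the `Case` class are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    case_floor: float     # lowest score that still ships, per case
    rubric_weight: float
    median: float         # score of record at this commit


def weighted_aggregate(cases):
    """Weighted mean across cases (assumed aggregate definition)."""
    total_weight = sum(c.rubric_weight for c in cases)
    return sum(c.rubric_weight * c.median for c in cases) / total_weight


def below_floor(cases):
    """Ids of cases whose median is under their own floor."""
    return [c.case_id for c in cases if c.median < c.case_floor]


# Hypothetical suite: 79 ordinary cases at 0.85 plus one adversarial case.
ordinary = [Case(f"real-{i:03d}", 0.78, 1.0, 0.85) for i in range(79)]
before = ordinary + [Case("adv-001", 0.95, 1.5, 1.0)]
after = ordinary + [Case("adv-001", 0.95, 1.5, 0.0)]

AGGREGATE_FLOOR = 0.78  # the single global threshold from the text
drop = weighted_aggregate(before) - weighted_aggregate(after)

print(f"aggregate drop: {drop:.3f}")
print("aggregate gate green:", weighted_aggregate(after) >= AGGREGATE_FLOOR)
print("cases below their own floor:", below_floor(after))
```

Under these assumptions the aggregate moves by 1.5 / 80.5, a little under 0.02, and stays above 0.78, while the per-case check names adv-001 immediately.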

2

Median of N runs, not the score of one run

Every case runs N times (default 3, sensitive cases 5). The case's score for the gate is the median, not the mean and not the most recent. p75 - p25 is the published spread. A case whose spread crosses stability_bound is flagged unstable.

LLM-as-judge scoring is not deterministic. Anthropic's engineering team has been explicit about this in the demystifying-evals post: temperature-zero is a request, not a guarantee, and the same prompt graded twice can land 0.04 to 0.08 apart on a 0-to-1 axis. With one run per case and a 0.78 floor, a stable 0.81 case will fail the gate roughly once per quarter on noise alone.

Three runs and median fixes most of it. Median is robust to a single bad draw. p75 - p25 is the published spread; a case with a spread above its stability_bound (default 0.10, tighter for sensitive cases) is flagged unstable in the manifest. An unstable case is stabilized (better rubric, better must_not_include rules), dropped if the prompt is genuinely ambiguous, or suspended from the gate. It does not silently pass on noise.

The cost ceiling is the reason N is not 50. Eighty cases at N=3 is 240 judge calls per PR. With a Sonnet-tier judge, that is roughly $1.40 per PR. We budget $5 per PR and burn most of it on the rare N=5 cases. The number that does not work is N=1; the gate that does not work is the one that flakes.
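The per-case summary can be sketched with the standard library. This is an illustration, not the harness's own code: `summarize_runs` is a hypothetical helper, and the percentile method (linear interpolation, `method="inclusive"`) is an assumption, since the text does not say how p25 and p75 are computed for N=3.

```python
import statistics


def summarize_runs(scores, stability_bound=0.10):
    """Summarize N judge scores for one case at one commit."""
    if len(scores) < 3:
        raise ValueError("the gate assumes at least 3 runs per case")
    # quantiles(n=4) returns the three quartile cut points: p25, p50, p75.
    p25, _, p75 = statistics.quantiles(scores, n=4, method="inclusive")
    spread = p75 - p25
    return {
        "n": len(scores),
        "median": statistics.median(scores),  # score of record
        "p25": p25,
        "p75": p75,
        "spread": spread,
        "unstable": spread > stability_bound,
    }


stable = summarize_runs([0.86, 0.84, 0.86])
noisy = summarize_runs([0.86, 0.84, 0.42])  # one bad draw

for name, summary in (("stable", stable), ("noisy", noisy)):
    print(name, summary["median"], round(summary["spread"], 3), summary["unstable"])
```

With these inputs the single bad draw leaves the median at 0.84, but it widens the spread past the default bound, so the case is flagged unstable rather than passing quietly.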

3

Quorum: last K green commits had to agree

A case fails the gate at this commit only when both: median at this SHA < case_floor, AND at least K-1 of the last K green commits had this same case >= case_floor. Single-run dips on borderline cases stop firing. Real regressions still fire.

Per-case floors and median-of-N together still leave one failure mode: a case whose true score sits at 0.79 with a 0.78 floor will dip below the floor about a third of the time even with median scoring. It is genuinely borderline. The gate that fires on every borderline dip is a gate the team turns off.

The quorum rule is: the case has to have been reliably above the floor in the immediate past. Concretely, K=5 is the default (the last 5 green commits), and we require that at least 4 of those 5 had this same case above its floor. If yes, today's dip is a real regression. If no, the case is borderline by nature; the alert path is "stabilize this case or drop it from the gate," not "fail the build at random."

The quorum logic lives in 30 lines of check_regression.py and reads from the per-commit manifests on the eval-history branch. There is no separate state store. The git history is the state.
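The two-condition rule reduces to a small predicate. The sketch below is one way to write it, not the shipped script; `is_regression` is a hypothetical name, and the choice to stay silent when fewer than K green commits exist is an assumption.

```python
def is_regression(median_now, case_floor, green_medians, k=5):
    """Return True only when both gate conditions hold.

    green_medians: this case's median at past green commits, oldest first.
    """
    # Condition 1: the case is below its floor at this commit.
    if median_now >= case_floor:
        return False
    window = green_medians[-k:]
    # Assumption: without K green commits of history, do not block.
    if len(window) < k:
        return False
    # Condition 2: at least K-1 of the last K green commits held the floor.
    healthy = sum(1 for m in window if m >= case_floor)
    return healthy >= k - 1


# A case that was reliably healthy, then dropped: fires.
print(is_regression(0.69, 0.78, [0.84, 0.83, 0.85, 0.82, 0.84]))

# A borderline case hovering around its floor: does not fire.
print(is_regression(0.77, 0.78, [0.79, 0.77, 0.80, 0.76, 0.79]))
```

The second call returns False because only 3 of the last 5 green commits held the floor. That case belongs on the "stabilize or drop" path, not in a failed build.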

4

Manifest is committed, not stored as a build artifact

Every CI run writes eval/_history/<commit_sha>.json and the eval-bot commits it to an append-only eval-history branch. GitHub Actions artifact retention defaults to 90 days; eval-history is forever. bisect_regression.py walks this branch.

A regression you find in October might trace back to a commit in May. GitHub Actions artifact retention defaults to 90 days. Notion docs and Slack archives are not a system of record. The cheapest reliable home for per-case score history is a long-lived branch in the same repo, written by a bot account, never force-pushed.

The eval-history branch is orthogonal to main. It only carries eval/_history/*.json. It is not merged back; it is not built from. PRs do not modify it. The eval-bot account is the only writer. Anyone with read access to the repo can clone it, walk it, and run bisect_regression.py against it from a laptop without any infrastructure.

If the team chooses not to use a branch (some monorepos make this awkward), the same JSON files can live in object storage with the commit SHA as the key. The contract is: every CI run writes one manifest, the manifests are immutable, the manifests are addressable by commit SHA.
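The contract (one manifest per run, immutable, addressable by commit SHA) can be enforced at write time. A minimal sketch, assuming a local directory stands in for the branch checkout or the object store; `write_manifest` and `read_manifest` are hypothetical helpers:

```python
import json
from pathlib import Path


def manifest_path(history_dir, commit_sha):
    """The commit SHA is the only key: one file per CI run."""
    return Path(history_dir) / f"{commit_sha}.json"


def write_manifest(history_dir, commit_sha, manifest):
    """Write the manifest once; refuse to overwrite an existing one."""
    path = manifest_path(history_dir, commit_sha)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file is already there,
    # which is what keeps manifests immutable in this sketch.
    with open(path, "x", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
    return path


def read_manifest(history_dir, commit_sha):
    """Load the manifest recorded for one commit."""
    with open(manifest_path(history_dir, commit_sha), encoding="utf-8") as fh:
        return json.load(fh)
```

A second write for the same SHA fails loudly instead of rewriting history. An object-storage variant would need the equivalent conditional write from its own client library.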

5

Bisect is one command, not an archeology dig

When a case fails the gate, the PR comment includes 'python eval/bisect_regression.py --case-id <id>'. The script walks the eval-history branch and prints the first SHA where the case crossed below its floor. git show <sha> usually answers the rest.

The script reads every manifest in commit order, finds the last commit where this case was above its floor and the first commit where it was below, and prints both with their judge_calls and run_at timestamps. The output line that closes the loop is "Diff that broke the case: git show <sha>". Across engagements so far, that diff has been the answer about 80 percent of the time.

The other 20 percent of the time the diff at that SHA is a model_pin change in rubric.yaml or a judge_pin change in eval/judge_pin.yaml. Both are visible in the manifest's model_pin and judge_pin_sha fields, which is a hint we surface in the scorecard.py output: "this SHA also bumped model_primary; the regression may be model-attributable."

The point is the engineer does not start from "something dropped, where do I look." They start from "case real-014 dropped at SHA 8b21c4f, here is the diff." The PR-comment-to-fix path is minutes, not days.
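The pin-change hint can be sketched as a field-by-field comparison of the two manifests the bisect lands on. The field names (model_pin, judge_pin_sha, rubric_sha, fixture_sha) are the ones the manifest is described as carrying; the function names, the hint wording, and the sample values are assumptions.

```python
PIN_FIELDS = ("model_pin", "judge_pin_sha", "rubric_sha", "fixture_sha")


def changed_pins(last_above, first_below):
    """Pin fields whose values differ between the two manifests."""
    return [f for f in PIN_FIELDS if last_above.get(f) != first_below.get(f)]


def attribution_hints(first_below_sha, changed):
    """Turn changed pins into human-readable hints for the scorecard."""
    hints = []
    if "model_pin" in changed:
        hints.append(
            f"{first_below_sha} also bumped the model pin; "
            "the regression may be model-attributable"
        )
    if "judge_pin_sha" in changed:
        hints.append(
            f"{first_below_sha} also bumped the judge pin; "
            "the score shift may come from the judge"
        )
    if "rubric_sha" in changed or "fixture_sha" in changed:
        hints.append(
            f"{first_below_sha} changed the rubric or fixtures; "
            "scores across this SHA may not be comparable"
        )
    return hints


# Hypothetical manifests on either side of the crossing.
last_above = {"model_pin": "model-a", "judge_pin_sha": "j1",
              "rubric_sha": "r1", "fixture_sha": "f1"}
first_below = {"model_pin": "model-b", "judge_pin_sha": "j1",
               "rubric_sha": "r1", "fixture_sha": "f1"}

for hint in attribution_hints("8b21c4f", changed_pins(last_above, first_below)):
    print(hint)
```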

eval/_history/<sha>.json: the file the gate reads

One file per CI run. Written by eval/run.py, committed by the eval-bot to an append-only eval-history branch. Every per-case score the gate considers, every SHA the run depended on, the cost of the run, and the comparison against the last green commit. Anything you would want to know to debug a regression six months from now is in this file.

eval/_history/8b21c4f3a9e0c2b71d4f5a6e2c9b0d34.json
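A minimal sketch of what such a manifest might contain, built in Python so the percentiles are computed rather than typed by hand. The field names follow the description above. Every value, the exact nesting, the `case_entry` helper, and the choice to copy case_floor into the manifest are assumptions for illustration.

```python
import json
import statistics


def case_entry(scores, case_floor, stability_bound=0.10):
    """Per-case record: raw scores plus the summary the gate reads."""
    p25, _, p75 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "scores": scores,
        "n": len(scores),
        "median": statistics.median(scores),
        "p25": round(p25, 4),
        "p75": round(p75, 4),
        "judge_calls": len(scores),   # assumes one judge call per run
        "case_floor": case_floor,     # assumed to be copied from eval/cases.yaml
        "unstable": (p75 - p25) > stability_bound,
    }


manifest = {
    "commit_sha": "8b21c4f3a9e0c2b71d4f5a6e2c9b0d34",
    "run_at": "2026-05-07T00:00:00Z",        # placeholder timestamp
    "rubric_sha": "<rubric sha>",            # placeholders, not real SHAs
    "judge_pin_sha": "<judge pin sha>",
    "fixture_sha": "<fixture sha>",
    "model_pin": "<model id>",
    "cost_usd": None,                        # filled in by the runner
    "last_green_sha": "<last green sha>",
    "cases": {
        "real-014": case_entry([0.69, 0.66, 0.71], case_floor=0.78),
    },
}

print(json.dumps(manifest, indent=2))
```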

eval/check_regression.py: per-case, with quorum

Reads the manifest the just-finished run wrote, loads the last K green manifests, and applies the two-condition rule. A case fails the gate only when its median at this commit is below its case_floor and at least K-1 of the last K green commits had this same case above its floor. Aggregate is reported but does not gate. Borderline cases that occasionally dip below their floor do not fire alerts on noise; cases that crossed for a real reason do.

eval/check_regression.py
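A sketch of the gate at the manifest level, under stated assumptions: manifests are plain dicts with a "cases" map, each case entry carries its own "median" and "case_floor", and loading the last K green manifests from the eval-history branch happens elsewhere. The function names are hypothetical.

```python
K = 5  # size of the green-history window


def find_regressions(current, green_history, k=K):
    """Cases below their floor now that held it in at least k-1 of the last k green manifests.

    green_history: manifests of past green commits, oldest first.
    """
    window = green_history[-k:]
    failures = []
    for case_id, case in current["cases"].items():
        floor = case["case_floor"]
        if case["median"] >= floor:
            continue
        healthy = sum(
            1 for m in window
            if case_id in m["cases"] and m["cases"][case_id]["median"] >= floor
        )
        # Assumption: with fewer than k green manifests, the gate stays quiet.
        if len(window) == k and healthy >= k - 1:
            failures.append({
                "case_id": case_id,
                "median": case["median"],
                "floor": floor,
                "healthy": healthy,
            })
    return failures


def format_comment(failures, k=K):
    """PR comment body: one line per regressed case plus its bisect command."""
    lines = []
    for f in failures:
        lines.append(
            f"{f['case_id']}: median {f['median']:.2f} < floor {f['floor']:.2f} "
            f"({f['healthy']}/{k} recent green commits held the floor)"
        )
        lines.append(f"  python eval/bisect_regression.py --case-id {f['case_id']}")
    return "\n".join(lines)


def gate_exit_code(current, green_history):
    """Non-zero when any case regressed; aggregate is never consulted."""
    failures = find_regressions(current, green_history)
    if failures:
        print(format_comment(failures))
    return 1 if failures else 0
```

In CI, the return value of `gate_exit_code` would be passed to `sys.exit`, so a regressed case fails the job and the comment text doubles as the log output.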

eval/bisect_regression.py: one command resolves the alert

Walks the eval-history branch in commit order. Finds the last manifest where this case was above its floor and the first manifest where it was below. Prints both SHAs and the diff command. Across engagements so far, git show <first_below_sha> has answered the question about 80 percent of the time.

eval/bisect_regression.py
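The core of the walk can be sketched as a single pass over manifests that are already in commit order. Producing that ordered list from the eval-history branch is left out here, and the function name, the manifest layout, and the rule that a later recovery resets the search are assumptions.

```python
def bisect_case(history, case_id, case_floor):
    """Find the most recent crossing of one case below its floor.

    history: list of (sha, manifest) pairs, oldest commit first.
    Returns (last_above_sha, first_below_sha); either may be None.
    """
    last_above = None
    first_below = None
    for sha, manifest in history:
        case = manifest["cases"].get(case_id)
        if case is None:
            continue  # the case did not exist at this commit
        if case["median"] >= case_floor:
            last_above = sha
            first_below = None  # a recovery resets the search
        elif first_below is None:
            first_below = sha
    return last_above, first_below


def report(case_id, last_above, first_below):
    """Lines to print for the engineer running the bisect."""
    if first_below is None:
        return [f"{case_id}: currently at or above its floor, nothing to bisect"]
    return [
        f"{case_id}: last above floor at {last_above}",
        f"{case_id}: first below floor at {first_below}",
        f"Diff that broke the case: git show {first_below}",
    ]


# Hypothetical history for one case with a 0.78 floor.
history = [
    ("a1", {"cases": {"real-014": {"median": 0.84}}}),
    ("b2", {"cases": {"real-014": {"median": 0.83}}}),
    ("c3", {"cases": {"real-014": {"median": 0.69}}}),
    ("d4", {"cases": {"real-014": {"median": 0.70}}}),
]
print("\n".join(report("real-014", *bisect_case(history, "real-014", 0.78))))
```

Because every manifest is already on disk, this is a linear scan over small JSON files, not a re-run of the eval at each commit, which is why it can finish in well under a second.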

What the bisect command looks like in practice

A real example pulled from a recent engagement. The PR comment told the engineer that case real-014 had crossed below its 0.78 floor with median 0.69. The bisect ran for under a second and pointed at the SHA that had been merged three days earlier.

bisecting case real-014, 2026-05-07

.github/workflows/eval.yml: how the two jobs hand off

Two jobs. run-eval runs the rubric N times per case, writes the manifest, and pushes it to the eval-history branch. check-regression checks out eval-history, runs the per-case gate, and posts a scorecard with the bisect command if anything failed. The bot account is the only writer to eval-history; the source branch never touches it.

.github/workflows/eval.yml

The pattern, restated for the senior engineer who inherits this

Every case carries its own contract

case_floor and stability_bound on the row in eval/cases.yaml. A case important enough to defend is important enough to defend on its own terms. Aggregate scoring is reporting; per-case scoring is the gate.

The score of record is a median, not a single draw

N=3 by default, N=5 for sensitive cases. The cost is bounded ($1 to $2 per PR with a Sonnet-tier judge), the variance is bounded, and the gate stops being a flake-generator. p75 - p25 is published so unstable cases are visible, not silently passing.

History is the state store

eval-history is an append-only branch in the same repo. No external database. No managed service. bisect_regression.py walks git, which is the only ordering that survives rebases. A senior engineer with clone access has full regression history without an extra credential.

Nothing depends on us

Every script runs in the client's CI on the client's infrastructure. The judge can be Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight. There is no platform license, no vendor-attached runtime. After the Week 6 transfer session, the regression suite is the client's outright and survives us leaving.

When you would not bother with all of this

Worth saying plainly. If the agent is in week one, the cases file has 20 rows you wrote by hand, and there is no production traffic, you do not need a per-commit manifest or a bisect script yet. The order we add the regression-suite layer in: per-case floors go in week 2 with the prototype rubric (a one-line edit per case in eval/cases.yaml). Median-of-N goes in week 3 the first time we see a CI flake fail a real PR. eval/_history/ and the eval-history branch go in week 4 once we have a baseline of green commits to compare against. check_regression.py and bisect_regression.py go in week 5. By Week 6, the entire layer is in place and the senior engineer who inherits the harness can resolve a regression alert without us in the room.

By month three, a typical engagement has 250 to 400 manifests on the eval-history branch and the bisect command has resolved at least three real regressions to single SHAs. The rule of thumb on engagements: the first time the bisect resolves a regression in under a minute is the moment the team trusts the gate.

Get this regression suite in your repo by Week 6

Free scoping call. We name the senior engineer, the agent we are shipping, the case set we will mine from your traffic, and the regression-suite files (eval/_history, check_regression.py, bisect_regression.py) the runbook will leave behind for your team to own.

Regression suite, FAQ

Is a regression suite the same thing as an eval harness?

No, and conflating them is the reason most teams under-build the regression layer. The eval harness is the runner: rubric.yaml, eval/cases.yaml, the judge-model wiring, the CI workflow. The regression suite is the gate logic on top: per-case floors, the N-run median, the quorum check against the last K green commits, the per-commit manifest, and the bisect command. A team can have a perfectly good harness and still not have a regression suite, in which case every PR ships on a single judge call against an aggregate threshold and the gate flakes. The regression suite is what makes the harness load-bearing for merges.

Why median of N runs instead of mean?

Mean is sensitive to one bad draw. With N=3 and a stable case scoring 0.86, 0.84, 0.86, the mean is 0.853 and the median is 0.86. With one bad draw, scores 0.86, 0.84, 0.42, the mean is 0.707 (gate fails on noise) and the median is still 0.84 (gate stays correct). The bad draw is reported in p25 so it is not invisible, but it does not drive the gate. Median is the more robust score of record in the small-N regime we use for cost reasons. If you can afford N=20, mean is fine; the cost-vs-stability trade-off in practice lands you at N=3 with median.

How big should N be for sensitive cases?

We use N=5 for adversarial cases (jailbreak, PII, regulated outputs), for cases where rubric_weight is 1.5 or higher, and for any case whose case_floor is 0.95 or above. The reasoning is that those cases have floors close to the ceiling, where judge variance has more room to push them under the floor on a single run, and they have outsized blast radius if a regression slips through. Every other case, including anything below rubric_weight 1.0, stays at the default N=3. Cases that have been flagged unstable (p75 - p25 > stability_bound) get N=7 for two PRs while we either stabilize the rubric for that case or drop it. The N is a per-case attribute, not a global setting.
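Those rules fit in one small function. A sketch with hypothetical parameter names; the two-PR time limit on N=7 is not modeled, only the flag:

```python
def runs_for_case(adversarial, rubric_weight, case_floor, flagged_unstable=False):
    """Pick N for one case from its attributes."""
    if flagged_unstable:
        return 7  # temporary bump while the case is stabilized or dropped
    if adversarial or rubric_weight >= 1.5 or case_floor >= 0.95:
        return 5  # sensitive: adversarial, heavily weighted, or high floor
    return 3      # default


print(runs_for_case(adversarial=False, rubric_weight=1.0, case_floor=0.78))
print(runs_for_case(adversarial=True, rubric_weight=1.0, case_floor=0.78))
print(runs_for_case(adversarial=False, rubric_weight=1.0, case_floor=0.78,
                    flagged_unstable=True))
```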

What is K in 'last K green commits'?

Default K=5. The check_regression.py script asks: did at least 4 of the last 5 green commits have this same case above its floor? Yes means today's dip is a real regression. No means the case is borderline by nature and the alert path is to stabilize or retire the case, not to fail the build. K=5 is small enough that a real recent fix re-establishes the case's history quickly (a green commit with the case above the floor moves into the K window after one merge), and large enough that one lucky draw three commits ago does not by itself make a borderline case look stable.

Why an append-only eval-history branch instead of a database or object storage?

Three reasons. First, it lives next to the code in the same repo, so a senior engineer who has clone access has history access without an extra credential. Second, git has outlasted our object storage choices: every engagement we have shipped has a different blob store, and every engagement still has the same git host. Third, bisect_regression.py uses git to walk commit order, which is the only stable order across rebases of the source branch. Object storage by commit SHA is a fine fallback when the monorepo refuses orphan branches, but the default is a branch.

Does this work with promptfoo or DeepEval?

Yes. promptfoo and DeepEval are case runners and judge wrappers. They produce the per-case scores that feed eval/_history/<sha>.json. Our run.py is a thin wrapper that calls promptfoo for case execution, computes the median across N runs, and writes the manifest in the shape this gate expects. The regression-suite logic (per-case floors, median of N, quorum against last K green, append-only history, bisect) is orthogonal to which runner you use. Switching runners is a one-PR change to run.py; the manifest shape, the gate, and the bisect command are stable.

What is a realistic cost per PR for this gate?

Eighty cases at N=3 with a Sonnet-tier judge is 240 judge calls per PR. Anthropic's pricing as of May 2026 puts that at roughly $1.20 to $1.60 per PR. Five sensitive cases at N=5 instead of N=3 adds another $0.20. Total: $1.40 to $1.80. We budget $5 per PR, which leaves headroom for 20 sensitive cases or a re-run after a flake-suspect alert. The number to avoid is N=20+ with a Sonnet-tier judge, which lands at $10 per PR and gets switched off the first time finance asks. Use N=3 with median, accept the bounded variance, get a gate the team trusts.
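The call counts are simple to reproduce. The sketch below assumes a flat price per judge call, back-derived from the $1.20 to $1.60 quoted for 240 calls; that is a simplification, since real cost varies with prompt and response length per case.

```python
def judge_calls(runs_per_case):
    """Total judge calls for one PR, given N for each case."""
    return sum(runs_per_case)


def estimate_cost(calls, price_per_call):
    """Cost in dollars under a flat per-call price (an assumption)."""
    return calls * price_per_call


# Per-call price range implied by $1.20 to $1.60 per 240 calls.
LOW = 1.20 / 240
HIGH = 1.60 / 240
BUDGET = 5.00

baseline = judge_calls([3] * 80)   # 80 cases at N=3
heavy = judge_calls([20] * 80)     # 80 cases at N=20

print("baseline calls:", baseline)
print(f"baseline cost: ${estimate_cost(baseline, LOW):.2f} to "
      f"${estimate_cost(baseline, HIGH):.2f}")
print("N=20 calls:", heavy)
print(f"N=20 cost: ${estimate_cost(heavy, LOW):.2f} to "
      f"${estimate_cost(heavy, HIGH):.2f}")
print("calls that fit the budget at the high rate:", int(BUDGET / HIGH))
```

At a flat rate, N=20 multiplies the call count by roughly 6.7, which is how the per-PR figure ends up near $10.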

Where does this fit in a Week 6 handoff?

On every engagement, the Week 6 handoff includes a runbook, the eval harness, and the architecture doc. The regression suite layer described here lives inside the harness deliverable. The four pieces (eval/_history/, eval/check_regression.py, eval/bisect_regression.py, the eval-history branch wiring in .github/workflows/eval.yml) plus the per-case floor convention in eval/cases.yaml are walked through in the 90-minute transfer session. The point of the bisect command is exactly that the senior engineer who inherits the harness in month nine can resolve a regression without us in the room. The regression suite is the part of the harness that has to keep working when we are gone.

Does the gate ever block on the aggregate score?

It reports aggregate (median across all cases at this SHA) on the PR comment for visibility, but it does not gate on aggregate. We have shipped this both ways and the per-case-only gate has been categorically better. The reason is that a regression suite is meant to catch the cases that matter most: incident replays, adversarial cases, regulatory cases. Those cases lose visibility inside an aggregate. The regression eval set page on this site covers the case set itself, not the gate logic; this page is its complement.
