Guide, topic: agent regression eval set, 2026

Your agent regression eval set is shrinking backward and the lint that catches it is 80 lines.

Most articles on this topic walk you through a cases file, a rubric file, and a CI job. All three are correct. None of them describe the failure mode that eats production agents over months: the eval set itself losing cases. An engineer cleans up old POC rows. Another drops cases tied to a churned account. A third deletes whatever the rubric refactor made awkward. Six months later the set is 30 percent smaller and nobody noticed. This guide is the inverse. It documents the per-case provenance schema we ship in eval/cases.yaml, the eval/cases_removed.yaml deletion ledger, the 80-line scripts/lint-cases.py pre-commit hook that fails the build when a case ID disappears without a recorded rationale, and the CODEOWNERS lines that put a human on every removal. Pulled from five named production agents.

Matthew Diakonov, PIAS, forward-deployed ML engineering

Published April 27, 202614 min read

4.9from Same ratchet shipped across 5 named production agents

Five enumerated provenance sources: poc, tail, inc, cust, sec

Four enumerated removal rationales; no free-text 'no longer relevant'

scripts/lint-cases.py runs as a pre-commit hook AND a required CI gate

Agent regression eval set, with a ratchet rule

The half most articles never name

Your eval set should never shrink silently

Every case carries source + incident_url

eval/cases_removed.yaml: the deletion ledger

scripts/lint-cases.py: the ratchet hook

Case IDs greppable to a real production failure

0:00 / 0:08

Same eval-set contract across Pydantic AI, LangGraph, custom orchestration, an automated ML pipeline, and a multi-model DAG.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Audit your regression eval set with an engineer

eval/cases.yamleval/cases_removed.yamlscripts/lint-cases.py.github/CODEOWNERSprovenance.sourceprovenance.incident_urladded_in_prremoved_in_prrationale: supersededrationale: rubric_changerationale: false_positivecase_id: poc-001case_id: tail-2026W17-004case_id: inc-2026-04-21-002case_id: cust-monetizy-2026-W11-003case_id: sec-2026-Q1-001

What every other guide on this misses

Open the guides on this topic. They cover three artifacts: a YAML cases file with input plus expected pairs, a rubric file with thresholds, and a CI job that grades the agent against the cases on every PR. All three are fine, none of them are wrong, and zero of them tell you what stops the cases file from shrinking when nobody is looking. Eval sets do not rot by gaining bad cases; they rot by losing good ones.

The pattern we see, across every engagement we walk into around month four, is the same. Someone cleaned up the POC cases that "feel old". Someone dropped the customer cases tied to a churned account. The rubric got refactored and the cases that were hard to migrate quietly disappeared in the same PR. Each removal felt small. The net effect is a regression set that has lost 20 to 40 percent of its coverage with no audit trail. The next model swap regresses on something the original set caught a year ago, and the team rediscovers a known failure mode in production.

The fix is not more cases. It is a ratchet rule. The set is allowed to grow on every PR. The set is allowed to shrink only via a row in a separate ledger file, with one of four enumerated rationales, approved by the engineering lead. A pre-commit hook and a CI gate enforce both halves. The set never has to be remembered; the file is the memory.

Anchor: the ID schema and the five provenance sources

Every case in eval/cases.yaml carries a provenance.source set to one of five values. The ID prefix matches that value, so the source is visible at every grep, every PR diff, every dashboard. The ID schema is enforced by a regex in scripts/lint-cases.py, so it does not erode under three engineers and eighteen months.

POC seed cases

The first 60 to 120 cases. Hand-written during the week-2 scoping call with the product lead. Cover the golden path and the named near-misses we already know about. ID prefix: poc-NNN. provenance.source: poc. provenance.incident_url: null. These are the cases the rubric was actually designed against.

Tail-mined cases

Cases the weekly tail miner cluster-extracted from the last 7 days of low-score production traces. ID prefix: tail-YYYYWWW-NNN. provenance.source: tail_miner. provenance.incident_url: a link to the trace cluster. They land in eval/cases.yaml only after the engineering lead reviews and merges the tail PR.

Incident cases

One case per real production incident. ID prefix: inc-YYYY-MM-DD-NNN. provenance.source: incident. provenance.incident_url: the post-mortem doc. The rule on the team is: no incident is closed until at least one regression case is in this file. The incident link is in the YAML, not in tribal memory.

Customer-reported cases

Cases pulled from a customer support ticket where the agent got the answer wrong. ID prefix: cust-<account>-YYYY-WWW-NNN. provenance.source: customer. provenance.incident_url: the support ticket. The case is added the day the ticket lands, not the day the next eval cycle starts.

Security and red-team cases

Cases that came out of a security review or red-team pass. ID prefix: sec-YYYY-Qx-NNN. provenance.source: security. Some are public (prompt injection patterns); some are private and live in eval/cases_security.yaml under a separate CODEOWNERS rule. Either way they ratchet the same way: they cannot disappear without a row in eval/cases_removed.yaml.

eval/cases.yaml, with provenance per row

One row per case. The id encodes the source. The provenance block encodes the origin event. The rubric_axes list names which rubric lines this case asserts on. severity flags critical and blocker cases so the rubric weights them above standard cases. Five real cases shown below, one per provenance source.

eval/cases.yaml

eval/cases_removed.yaml, the deletion ledger

Every case ever removed from eval/cases.yaml has a row here. rationale is one of four enumerated values. removed_in_pr links the PR that dropped the case. approved_by records the engineering lead. The note field is freeform context. Free text alone without the enum is rejected by lint-cases.py.

eval/cases_removed.yaml

scripts/lint-cases.py, the ratchet hook

Eighty lines of Python. Reads cases.yaml and cases_removed.yaml from HEAD and from the merge base with origin/main. Diffs the case-ID set. If any ID disappeared from cases.yaml without a matching row appearing in cases_removed.yaml, exits non-zero. Also enforces the ID regex and the per-source required fields. Runs as a pre-commit hook AND as a required GitHub Actions check, so the merge gate cannot be skipped by an engineer who disabled the local hook.

scripts/lint-cases.py

.github/CODEOWNERS, the human in the loop

The lint without CODEOWNERS is a guard a tired engineer disables with --no-verify or a force push. CODEOWNERS pulls the engineering lead into the review automatically. The lint and the human review together are the ratchet, not the lint alone.

.github/CODEOWNERS

The numbers that govern the set

Four parameters. Each is a deliberate choice; we have moved each of them on at least one engagement and re-pinned at the new value with a one-line note in the runbook.

0Distinct provenance sources, each with its own ID prefix and required fields

0Enumerated rationale values for any removal: superseded, rubric_change, false_positive, out_of_scope

0Lines of Python in scripts/lint-cases.py; runs as pre-commit and as a CI gate

0Cases that should leave the set without a row in eval/cases_removed.yaml

provenance sources

removal rationales

lines of lint

silent removals tolerated

How the schema is structured, in one card

The ID is the user interface. The provenance schema is the contract underneath. The removal enum is the audit trail. CODEOWNERS is the human gate. Each piece is small; together they are the only thing that prevents the regression set from rotting.

ID schema

<provenance>-<YYYY>W<week>-<NNN> for time-bound sources. <provenance>-<account>-<YYYY-WWW>-<NNN> for customer cases. <provenance>-<YYYY>-Qx-<NNN> for quarterly security passes. The prefix is the search key; the date is the sort order; the number is the within-period index. Greppable in 200ms across years of history.

Provenance schema, one of five values

poc, tail, inc, cust, sec. Each value implies a different incident-link shape and a different required-field set. The lint enforces it. The IDs are the user interface; the schema is the contract underneath.

Why removals get an enumerated rationale, not free text

We learned from one engagement that 'no longer relevant' was hiding three different things: a sharper case had replaced this one (superseded), the rubric axis itself had changed (rubric_change), or the case was simply wrong (false_positive). Different futures for each. The enum forces the team to name which one.

Why CODEOWNERS, not just a lint

scripts/lint-cases.py without CODEOWNERS is a guard a tired engineer disables with --no-verify or a force push. CODEOWNERS pulls the engineering lead into the review automatically; the lint and the human review together are the ratchet, not the lint alone.

What the lint does when an engineer tries to skip the ledger

A real session. The engineer drops a poc-052 row from eval/cases.yaml because it "feels old" and forgets to add the corresponding ledger entry. The hook blocks the commit. The error message names the violation, names the missing rationales the engineer must pick from, and exits with a non-zero code.

$ git commit -m 'cleanup' -- pre-commit lint blocks the deletion

Where every case in the set comes from

Five inputs feed eval/cases.yaml on the left. Three outputs come out the right. The set in the middle is CODEOWNERS-pinned and ratchet-protected; nothing leaves it without showing up on the bottom-right output.

five sources -> eval/cases.yaml -> rubric runs + audit trail

The lifecycle of one case, from incident to ledger

Six steps. Same shape on every engagement. The thing that changes is which provenance prefix the case carries; the contract underneath is identical.

A real input fails. Where the case enters the set is determined by who saw it first.

If a customer raises it, it is a cust- case and the support engineer files it that day with the ticket URL. If the tail miner clustered it, it is a tail- case and lands via the Monday PR. If on-call paged on it, it is an inc- case and is required before the post-mortem closes. If red team probed it, it is a sec- case. Origin is encoded in the ID, not lost to memory.

The provenance contract is filled in BEFORE the case lands.

scripts/lint-cases.py refuses to merge if a case is missing the per-source required fields: account + incident_url for cust-, cluster_size + incident_url for tail-, incident_url for inc-, reviewer for sec-. The PR cannot be opened with TODOs. The author writes the URL, the lint enforces it.

The case runs against model_primary on every PR until removed.

Once merged, the case is a row in the regression set. Every PR's CI run grades the agent against it. Every model qualification PR re-grades it against the candidate model. The case will outlive the engineer who wrote it; the URL pinned in provenance.incident_url is what makes that survivable.

Removal is a deliberate act with a recorded rationale, not a cleanup.

When the team decides a case no longer belongs (the rubric axis changed; the product froze a feature; a sharper case covers the same assertion), an engineer opens a PR that drops the row from eval/cases.yaml AND adds a row to eval/cases_removed.yaml with one of the four enumerated rationales. The CODEOWNERS rule pulls the engineering lead into the review automatically.

Reconstruction is one grep away.

Six months later, when a customer asks 'do you still test for X', the on-call engineer runs grep on cases.yaml and cases_removed.yaml. If X is in cases.yaml, the answer is yes and here is the case; if X is in cases_removed.yaml, the answer is 'we used to, here is when we stopped, here is why, here is the PR'. Memory is in YAML, not in the engineer who left.

The set ratchets up, never down. That is the whole point.

Net case count over time should be monotonically non-decreasing on a long enough window. A scheduled job posts a chart of |cases.yaml| - |cases_removed.yaml| each Monday in the engineering Slack. If the curve ever bends down two weeks in a row, the engagement owner is paged. An eval set that loses cases is a bigger smell than an eval set that grows.

The audit, before any model swap

Before pinning a new model_primary, the engineering lead walks this checklist against the repo. Every line is a one-shot grep or an artifact you can show in a PR comment. If any row fails, the model qualification PR cannot merge until it is repaired.

eval/cases.yaml audit, run before every model swap

Every case in eval/cases.yaml has provenance.source set to one of poc, tail, inc, cust, sec
Non-poc cases include provenance.incident_url that resolves to a real document
Case IDs match scripts/lint-cases.py's regex; no ad hoc names
eval/cases_removed.yaml contains a row for every case ID that was ever in cases.yaml on a previous main commit
Rationale on each removal is one of the four enumerated values, never free text only
.github/CODEOWNERS pins eval/cases.yaml and eval/cases_removed.yaml to the engineering lead
scripts/lint-cases.py runs as a pre-commit hook AND a required GitHub Action
Critical and blocker severity cases are tagged in YAML and the rubric weights them above standard cases
Customer-reported cases are added the day the ticket lands, not at the next eval cycle
Security cases live in cases_security.yaml with a stricter CODEOWNERS rule when their inputs are sensitive

Why this matters on the next model swap

The set you ship is the set you re-grade against, three months from now, against a model that did not exist when you wrote it.

Every cust- case is a customer who once had a bad day. Every inc- case is a production incident the team paid for. Every sec- case is a red-team finding that used to work. If those rows are silently absent when claude-4-8 ships, your qualification PR is grading the new model against a smaller, gentler set than the one your previous model passed.

The ratchet is not a process nicety. It is the thing that makes the next model swap safe. The line of Python that prints "RATCHET VIOLATION" on a PR is the line of Python that prevents a known regression from re-entering production.

0 silent removals tolerated

“The eval set lost cases between v1 and v3. We did not notice until the qualification PR for the next model passed and a customer hit a regression we used to test for.”

anonymized engagement intake, 2026

Side by side: ratcheted set vs the typical "directory of YAML cases"

Left: the shape we ship as a 6-week leave-behind. Right: the shape we walk into on month-four engagements where the eval set was assembled organically. Neither is wrong on day one. The ratchet is what makes the left column survive past month six.

Feature	Directory-of-YAML-cases (the typical shape)	Ratcheted regression eval set (PIAS shape)
Per-case provenance	A flat directory of YAML cases with input + expected. No source field. No way to tell whether a case came from a real customer ticket or from the engineer guessing what users might say.	Five enumerated source values (poc, tail, inc, cust, sec). Each requires specific fields, and scripts/lint-cases.py fails the build if they are missing. The ID prefix tells you the source on sight.
What happens when a case is deleted	Cases come and go in commits. Six months later nobody knows whether the agent is still tested for the regression that triggered the case. The set silently loses coverage.	scripts/lint-cases.py diffs cases.yaml against origin/main, and refuses to merge if any case ID disappeared without a matching row in eval/cases_removed.yaml. Every deletion has an enumerated rationale and an engineering-lead approval.
Linkage to real production failures	Cases reference a Jira ticket in a code comment, if that. The ticket eventually 404s. The case becomes orphan documentation.	provenance.incident_url is a required field on tail-, inc-, cust-, and sec- cases. It points to the trace cluster, the post-mortem, the support ticket, or the red-team report. Two clicks from the case to the originating event.
Where the eval set lives	Inside a vendor dashboard (LangSmith, Braintrust, Galileo) or a private spreadsheet. The harness is portable; the set is not. The set leaves with the vendor renewal.	Plain YAML in your repo. CODEOWNERS-pinned to the engineering lead. PR-reviewed. Survives the engineer leaving, the SaaS contract churning, the framework swap.
Customer-reported failures	An engineer copy-pastes the failing input into a Slack message and a manual prompt-tweak ships next sprint. The case never enters the set; the next model regression hits the same customer.	Filed as cust-<account>-YYYY-WWW-NNN the day the ticket lands. provenance.account names the affected account. provenance.incident_url is the ticket. Re-grades against every model swap forever.
Audit on 'do you still test for X'	An engineer 'thinks so', pulls up an old branch, hunts through Jira, asks the previous PM. The honest answer is 'we do not know'.	grep eval/cases.yaml eval/cases_removed.yaml. Either X is currently tested (here is the case), or it was tested and we stopped (here is the rationale, here is the PR, here is who approved). One command.
Ratchet rule	No ratchet. Eval sets shrink quietly during quiet weeks. Coverage erodes over months, not days. The team only notices after a regression slips past in production.	Net case count is monotonically non-decreasing on a 4-week window. A Monday cron posts \|cases.yaml\| - \|cases_removed.yaml\| in Slack and pages the engagement owner if the curve bends down two weeks in a row.
Engagement leave-behind	A configured workspace inside a SaaS dashboard with a renewable license. The 'eval set' is two API calls and a UI. When the contract lapses, so does the set.	Five files: eval/cases.yaml, eval/cases_removed.yaml, scripts/lint-cases.py, .github/CODEOWNERS lines, .github/workflows/eval.yml. All in your repo on main. Model-vendor neutral. No platform license. The senior engineer leaves; the contract stays.

Want a senior engineer to ratchet your regression set?

Twenty minutes. We walk your eval/ folder, name the silent removals on origin/main, and sketch the lint hook against your repo's actual shape.

Frequently asked questions

What is an agent regression eval set, and how is it different from a generic test set?

A regression eval set is the YAML file (or files) of input + expected pairs that you run the agent against on every PR, every model swap, and every prompt change. The 'regression' modifier matters: every case in the set is here because something once failed, or because something is known to fail in a class we want to catch. It is not 'tests we wrote because tests are good'; it is 'tests we wrote because reality already burned us once and we want the next swap to surface that same failure mode before it ships'. The shape we ship is one cases.yaml, one cases_removed.yaml, one rubric.yaml, one runner, and one CI gate. The differentiator from generic test sets is per-case provenance: every row carries a source field (poc, tail, inc, cust, sec) and an incident_url, so 'why is this case here' is one click away forever.

Why is the deletion ledger (eval/cases_removed.yaml) load-bearing?

Because eval sets rot in a specific direction: backward. They quietly lose cases over months. An engineer cleans up 'old POC cases' before a refactor; another deletes a row that the rubric change made hard to score; a third drops customer cases tied to a churned account. Six months later the set is 30 percent smaller and nobody noticed. The deletion ledger inverts that: every removal has a row, an enumerated rationale, an approved_by line, and a PR number. scripts/lint-cases.py refuses to merge a PR that drops a case ID without the matching ledger row. The set still loses cases when the team decides it should, but the loss is auditable, justified, and reversible. Six months later 'do we still test for X' is a one-line grep.

What does the case-ID schema look like, and why does it matter?

The schema is <provenance>-<period>-<NNN>. Concrete examples: poc-001, tail-2026W17-004, inc-2026-04-21-002, cust-monetizy-2026-W11-003, sec-2026-Q1-001. The prefix is one of five enumerated values, the period is either an ISO week, an ISO date, or a quarter depending on the source, and the NNN is the index within the period. The point is that the ID is the index. You can grep eval/ for poc- and see every hand-written POC case. You can grep for inc-2026-04 and see every incident in April. You can grep for cust-monetizy- and see every case Monetizy.ai's tickets contributed. The schema is enforced by an 80-line regex in scripts/lint-cases.py, so the indexability does not erode over time.

Why five provenance sources specifically? Why not just a free 'origin' string?

Free strings drift. After three engineers and 18 months you have 'PoC' and 'poc' and 'POC' and 'product-call' and 'workshop', and grep stops working. The five values (poc, tail, inc, cust, sec) cover every real source we have seen across five named production agents on Pydantic AI, LangGraph, custom orchestration, an automated ML pipeline, and a multi-model DAG. POC cases come from the week-2 scoping call. Tail cases come from the weekly miner. Incident cases come from production post-mortems. Customer cases come from support tickets. Security cases come from red-team passes. If a sixth source ever appears in practice, it is a deliberate change to the lint regex, not an ambient drift in the YAML.

How does scripts/lint-cases.py actually catch a silent deletion?

It is roughly 80 lines of Python. It runs git show origin/main:eval/cases.yaml and HEAD:eval/cases.yaml, parses both with yaml.safe_load, takes the symmetric difference of the case-ID sets, and computes silently_dropped = (base_cases - head_cases) - (head_removed - base_removed). If silently_dropped is non-empty, it prints the offending IDs and exits with code 2. The hook also enforces the case-ID regex on every row in HEAD's cases.yaml and the per-source required-field schema (incident_url required for non-poc sources, account required for cust, reviewer required for sec). It runs both as a pre-commit hook and as a required GitHub Actions check, so a developer who skips the local hook still gets blocked at the merge gate. CODEOWNERS pins eval/cases.yaml to the engineering lead, so the hook cannot be turned off in a branch without that review.

What are the four enumerated rationales for removal, and what do they each mean?

superseded means a sharper case covers the same assertion (eg, a hand-written POC case got replaced by a tail-mined real production input that asserts the same thing more concretely). rubric_change means the rubric axis the case asserted on no longer exists in its old shape, and the case cannot be mechanically migrated; the new equivalent is filed under a fresh ID. false_positive means the original case's expected field was wrong; the regression we thought we were preventing was not actually a regression, and the case is removed AND replaced with a corrected one under a -b suffix so the audit trail survives. out_of_scope means a deliberate product decision narrowed the agent's scope (eg, multilingual was deferred to next year), and the case is preserved in the ledger so when scope expands again the regression history is reconstructable. The four are exhaustive on every engagement we have shipped; if a fifth ever lands, that is an explicit lint-cases.py change reviewed by the engineering lead.

How does this fit with the harness and the tail miner?

The harness reads eval/cases.yaml and eval/judge_prompts.yaml and runs both deterministic and LLM-judge axes against every case (covered in detail on the LLM agent eval harness page). The tail miner clusters low-score production traces every Monday into eval/tail/<ISO-week>.yaml and opens a PR (covered on the AI POC to production regression tail page). This page is the layer between them: the contract that says cases coming in from any of those sources land in eval/cases.yaml with provenance, never disappear without a ledger row, and stay greppable forever. The harness grades. The miner discovers. The set is the durable record of what we promise to never regress on.

Does the same shape work for non-text agents (vision, audio, code)?

Yes. The cases.yaml schema is mode-agnostic; only the input and expected fields change shape. Code outputs add cases like 'this prompt should produce code that compiles + passes unit_test_X.py', filed under poc-/tail-/inc- the same way. Vision adds cases like 'this prompt should produce an image whose hash is in this set' or 'this prompt should produce an image where the face is in the lower-left third'. Audio adds transcript-based and feature-based cases. The provenance schema is unchanged. scripts/lint-cases.py is unchanged. The deletion ledger is unchanged. The OpenArt multi-scene video agent we ship runs the same lint hook against eval/cases.yaml as a text-only legal-compliance agent does on Upstate Remedial.

What is the leave-behind on a 6-week engagement that includes the regression eval set?

Five files in your repo on main: eval/cases.yaml (the set itself, with provenance per case), eval/cases_removed.yaml (the deletion ledger), scripts/lint-cases.py (80 lines, no deps beyond pyyaml), the .github/CODEOWNERS lines that pin all three to the engineering lead, and the .github/workflows/eval.yml that runs lint-cases.py + the rubric on every PR. Plus a paragraph in ops/failure_playbook.md describing what the four removal rationales mean and how to add a new provenance source. The named senior engineer leaves; the lint hook stays; the ratchet rule keeps the set growing past their tenure. Model-vendor neutral, no platform license, no PIAS-hosted runtime; you can swap us out and the set keeps doing its job.

What happens if the team genuinely needs to mass-delete cases (eg, a major rubric refactor)?

It is one PR, not a hundred. The PR drops the cases from eval/cases.yaml in a single commit and adds a corresponding bulk entry to eval/cases_removed.yaml: a list of removed IDs with rationale: rubric_change and a single rubric_change_pr field pointing at the rubric refactor PR. CODEOWNERS pulls the engineering lead in, the lint passes (every dropped ID has a ledger row), and the set updates as one auditable move. The new equivalent cases are filed under fresh IDs with rationale: rubric_change in their note field. Two months later anyone can reconstruct exactly what was deleted, why, and where the replacements live. We have run this exactly twice across five named production agents; both times the audit trail was the only reason the next model qualification PR did not silently regress on the dropped cases.

The harness, the miner, and the two-clock cadence

Related guides

Eval harness

LLM agent eval harness, the judge layer most guides skip

The 70/30 deterministic vs LLM-judge split, the eval/judge_prompts.yaml file pinned by SHA, and the calibration cron that keeps the scoreboard honest as the judge model itself moves.

15 min readRead

Regression tail

AI POC to production: the regression tail and the weekly PR that absorbs it

The shape of a tail miner that turns the last 7 days of low-score production traces into a Monday PR against eval/cases.yaml, with the file names, the cron, and the thresholds that make it load-bearing.

13 min readRead

Production evals

Production AI agent evals: the two-clock system

Pre-merge clock and post-merge clock, why both are required, and how the Friday loop keeps the eval set honest about what production traffic is actually doing to the agent.

12 min readRead