Guide, topic: building and gating on a failure dataset for production agents

Your aggregate eval set passes at 95 percent. Your users keep hitting the same five percent.

Almost every guide on agent eval describes failure cases in the abstract: "test for them, log them, retest." Almost none of them ship a file. The failure dataset is the file. It is a deliberately failure-only YAML corpus, scored on its own 100 percent gate, separate from the broader eval set. This is the shape we leave behind at week 6.

Matthew Diakonov

Direct answer, verified 2026-05-08

An AI agent failure dataset for evaluation is a curated corpus of agent runs that already failed in production, kept in a dedicated file (we ship it as evals/failures/cases.yaml), with each row tagged by failure_mode, severity, blast_radius, and a remediation history. It is scored separately from the broader eval set on a hard 100 percent CI gate: a single regression against any case blocks the merge. The dataset stays small (60 to 200 cases), grows by promotion from a trace-mining intake pipeline, and shrinks only via a recorded deletion ledger. It is the institutional memory of every prior incident, in a form CI can read.

7 failure modes in the taxonomy
100% CI pass rate required on the failure dataset
1 regression on any case = blocked merge
7-day trace ingestion window per cycle

Why the broader eval set is not enough

The aggregate eval set is the right tool for bake-offs, model swap qualification, and trend tracking. It is the wrong tool for gating production deploys against known prior incidents. Two reasons.

First, distribution. The aggregate set is meant to mirror traffic, which means common intents dominate it and tail failures get one or two cases out of hundreds. A regression on a P0 tail case hides inside two new passes on common cases and the aggregate score does not move enough to fail a 0.94 floor. We have seen this exact pattern ship: aggregate score steady at 0.95, P0 case AGTFAIL-0042 quietly red, customer escalation two days later asking why the same refund flow broke again.
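
To make the arithmetic concrete, here is a small sketch with invented numbers: a hypothetical 200-case aggregate set in which one P0 tail case regresses while two common cases start passing. Only the 0.94 floor comes from this guide; the case counts are assumptions chosen for illustration.

```python
# Hypothetical numbers, chosen only to illustrate the masking effect.
TOTAL_CASES = 200
FLOOR = 0.94  # the soft floor used as an example in this guide

passing_before = 190                      # 190 / 200 = 0.95
passing_after = passing_before - 1 + 2    # one P0 regression, two new easy passes

score_before = passing_before / TOTAL_CASES
score_after = passing_after / TOTAL_CASES

# The aggregate gate only compares the score with the floor,
# so the P0 regression is invisible to it.
gate_passes = score_after >= FLOOR

print(f"before={score_before:.3f} after={score_after:.3f} gate_passes={gate_passes}")
```

The score goes up while a P0 case goes red, which is the pattern described above.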

Second, intent. The aggregate set asks "is this version broadly good?" The failure dataset asks "is this version still safe against every failure we have already paid for?" Both questions are real; neither answers the other. Running them as one gate forces a tradeoff that should never be on the table: a regression against a curated prior failure is not negotiable, even if the rest of the system got better.

The failure dataset exists because the aggregate set statistically smooths over the cases that hurt most. We split them into two files, two gates, one judge.

The file shape we ship

The whole thing is one YAML file with a header, a fixed failure_mode taxonomy, and an append-only list of cases. The two cases below are anonymized but structurally identical to cases we have shipped into client repos. Read the schema.

evals/failures/cases.yaml

The seven failure modes, and why the taxonomy is closed

Every case carries exactly one failure_mode tag from a fixed list of seven: tool_invocation, retrieval_miss, hallucination, multi_turn_drift, refusal_overreach, latency_breach, blast_radius_breach. The taxonomy is closed on purpose. A free-form tag soup makes the per-mode pass rate unreadable, and the per-mode pass rate is the diagnostic that turns a red CI run into a fix.
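
One way to keep the taxonomy closed in code is to model it as an enum and reject anything off-list at load time. The sketch below is an illustration, not the shipped validator; the class and function names are hypothetical, and only the seven tag values come from this guide.

```python
from enum import Enum


class FailureMode(str, Enum):
    """The seven failure_mode tags listed in this guide."""
    TOOL_INVOCATION = "tool_invocation"
    RETRIEVAL_MISS = "retrieval_miss"
    HALLUCINATION = "hallucination"
    MULTI_TURN_DRIFT = "multi_turn_drift"
    REFUSAL_OVERREACH = "refusal_overreach"
    LATENCY_BREACH = "latency_breach"
    BLAST_RADIUS_BREACH = "blast_radius_breach"


def parse_failure_mode(tag: str) -> FailureMode:
    """Return the enum member for a tag, or raise if the tag is off-list."""
    try:
        return FailureMode(tag)
    except ValueError:
        allowed = ", ".join(mode.value for mode in FailureMode)
        raise ValueError(f"unknown failure_mode {tag!r}; allowed: {allowed}") from None
```

With this shape, extending the taxonomy means adding an enum member in a reviewed change, which lines up with the schema_version bump described later.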

When a CI run fails, the engineer on the fix sees this:

ci.log

The shape of the matrix tells the engineer where to look before they open the diff. Two regressions in retrieval_miss point at the index, the chunker, or a stale embedding cache. One regression in blast_radius_breach points at a tool gateway change. With free-form tags none of this signal survives the rollup.
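
The per-mode rollup itself is a small aggregation. The following is a minimal sketch, assuming the runner yields (failure_mode, passed) pairs; the function names and the text layout of the matrix are invented for illustration and are not the format of the CI log.

```python
from collections import defaultdict


def per_mode_pass_rate(results):
    """Map each failure_mode to (passed, total) from (mode, passed) pairs."""
    tally = defaultdict(lambda: [0, 0])  # mode -> [passed, total]
    for mode, passed in results:
        tally[mode][1] += 1
        if passed:
            tally[mode][0] += 1
    return {mode: (p, t) for mode, (p, t) in tally.items()}


def format_matrix(rollup):
    """Render one line per mode and flag any mode that is below 100 percent."""
    lines = []
    for mode in sorted(rollup):
        passed, total = rollup[mode]
        flag = "" if passed == total else "  <-- regressed"
        lines.append(f"{mode:<22} {passed}/{total}{flag}")
    return "\n".join(lines)
```

Because the tags come from a closed list, the number of rows stays fixed from run to run, so a flagged row is easy to spot.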

How a failure gets into the dataset (the intake pipeline)

Five steps, run on a cron and reviewed by a named owner. Nothing auto-merges. The whole pipeline takes about 45 minutes of one engineer's time per cycle, weekly.

Step 1: sample fails out of the trace store, do not sample successes

We pull the last 7 days of traces, score them with the same rules and LLM judge the harness uses, and keep only the runs that failed. Successes are eval-set fodder. The failure dataset is a deliberately biased corpus. The whole point is to overweight the failures because aggregate pass rate already overweights the easy cases.

Stratify by intent before you sample. A retrieval product where 90 percent of traffic is one intent will dominate the failure dataset with that intent unless you bucket first. We typically cap any one intent at 30 percent of the failure set and force the rest to come from tail intents, even if it means upsampling rare failures. Aggregate pass rate already smooths over the tail. The failure dataset exists to stop doing that.
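
The bucketing step can be sketched as follows. This is a simplified illustration, assuming each failed trace is a dict with an "intent" key; the function name and parameters are hypothetical. It seats the rarest intents first and caps each intent at 30 percent of the target size. It does not duplicate rare failures, so the result can come in under the target when the tail is thin.

```python
import random
from collections import defaultdict


def stratified_failure_sample(failed_traces, target_size, max_share_percent=30, seed=0):
    """Pick failed traces so that no single intent exceeds the cap."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for trace in failed_traces:
        by_intent[trace["intent"]].append(trace)

    # Integer math keeps the cap exact (30 percent of 20 is 6).
    cap = max(1, target_size * max_share_percent // 100)
    sample = []
    # Rarest intents first, so tail failures are seated before common ones.
    for intent in sorted(by_intent, key=lambda name: len(by_intent[name])):
        room = target_size - len(sample)
        if room <= 0:
            break
        bucket = by_intent[intent]
        rng.shuffle(bucket)
        sample.extend(bucket[:min(cap, room)])
    return sample
```

The cap here is relative to the target size, not to the final sample, so a short sample can still leave one intent above 30 percent of what was actually picked; a stricter version would recompute the cap after the first pass.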

Step 2: dedupe by input_hash, then by (failure_mode, observed embedding)

Two cases that hash to the same redactor-stable input are the same case; keep the older one. Two cases with the same failure_mode whose observed_behavior embeddings have a cosine similarity above 0.92 are likely the same shape. We keep one and link the other into the prior case's discovered_by ledger as a recurrence. The failure dataset is small on purpose.

A 60 to 200 case failure dataset is enough to catch the regressions that matter; 1,200 cases is a sign the dedupe pass is broken or someone is conflating the failure dataset with the full eval set. Each case must earn its slot. If you cannot say which prior incident or trace it represents, drop it.
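
A rough sketch of the two-pass dedupe is shown below. It assumes each candidate is a dict with id, input (already redacted), failure_mode, discovered_at (ISO date string), and a precomputed embedding of the observed behavior; those key names, SHA-256 as the hash, and the function names are assumptions for illustration. Only the 0.92 threshold and the keep-the-older rule come from the text above.

```python
import hashlib
import math


def input_hash(redacted_input):
    """Stable hash of the redacted input text."""
    return hashlib.sha256(redacted_input.encode("utf-8")).hexdigest()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(candidates, threshold=0.92):
    """Return (kept_cases, recurrences) where recurrences are (dup_id, kept_id)."""
    kept, recurrences, seen_hashes = [], [], set()
    # Oldest first, so the older case always wins.
    for case in sorted(candidates, key=lambda c: c["discovered_at"]):
        digest = input_hash(case["input"])
        if digest in seen_hashes:
            continue  # same input: the older case is already kept
        match = next(
            (k for k in kept
             if k["failure_mode"] == case["failure_mode"]
             and cosine(k["embedding"], case["embedding"]) > threshold),
            None,
        )
        if match is not None:
            recurrences.append((case["id"], match["id"]))  # same shape: link it
            continue
        seen_hashes.add(digest)
        kept.append(case)
    return kept, recurrences
```

The pairwise comparison is quadratic in the number of kept cases, which is acceptable at 60 to 200 cases and would need an index at larger sizes.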

Step 3: classify into the seven failure modes (or extend the taxonomy)

Every promoted case gets exactly one failure_mode tag. If a case does not fit one of the seven, the named engineer either re-reads the case (most do fit, with effort) or proposes an eighth mode in a PR that updates the schema_version. Adding a mode is rare; we have done it twice in 2025 and once in 2026 to date. The taxonomy is load bearing.

Why exactly seven, and not a free-form tag soup: the per-mode pass rate is the diagnostic that turns a red CI run into a fix. If 78 of 80 cases pass and the two failures are both in retrieval_miss, the engineer fixing the regression knows the right file to open before reading the diff. With free-form tags this signal collapses into noise.

Step 4: assign severity and blast_radius from a fixed list, not narrative

Severity is P0, P1, P2, or P3. Blast_radius is one of: customer_visible, internal_visible, irreversible_write, reversible_write, no_external_effect. These two fields decide whether a regression on this case is hot enough to page versus chat versus log. Free-text severity ("high", "critical", "important") drifts within a quarter and the on-call who has to act on a 3am alert cannot tell the difference.

The blast_radius field is the one most teams skip and regret. A retrieval miss that recommends a slightly worse article is not the same incident as a tool call that sent a real email to a real customer. The first is a P2; the second is a P0 even if the agent was "mostly right" otherwise. Encoding this on the case, not on the alert, means the gate fires consistently across teams.
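
The two fixed lists translate directly into enums. The routing function below is a hypothetical policy, added only to show how structured fields can drive the page, chat, or log decision; this guide does not specify the actual mapping, so treat the rules in route_regression as an assumption.

```python
from enum import Enum


class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


class BlastRadius(str, Enum):
    CUSTOMER_VISIBLE = "customer_visible"
    INTERNAL_VISIBLE = "internal_visible"
    IRREVERSIBLE_WRITE = "irreversible_write"
    REVERSIBLE_WRITE = "reversible_write"
    NO_EXTERNAL_EFFECT = "no_external_effect"


def route_regression(severity, blast_radius):
    """Hypothetical routing policy: returns 'page', 'chat', or 'log'."""
    sev = Severity(severity)            # raises ValueError on free text like "high"
    radius = BlastRadius(blast_radius)  # raises ValueError on off-list values
    if sev is Severity.P0 or radius is BlastRadius.IRREVERSIBLE_WRITE:
        return "page"
    if sev is Severity.P1 or radius is BlastRadius.CUSTOMER_VISIBLE:
        return "chat"
    return "log"
```

Because both fields are validated before routing, a free-text severity fails loudly at load time instead of silently falling through to the lowest alert level.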

Step 5: record the remediation history on the case itself, not in tickets

Every fix that touches a case writes a dated bullet under remediation. The case carries its own history: when it failed, what we changed to make it pass, when it last passed in CI, and the regressed_count. This is the institutional memory that makes the same incident not happen twice. Tickets close. Slack threads scroll. The YAML stays.

The regressed_count field is a quiet tell. A case that has regressed three times in six months is structurally fragile in a way the per-fix patch never fully addressed. We treat regressed_count > 1 as a flag for a deeper architectural fix, not another point patch. Ten of these flags, taken together, usually point at one or two real systemic weaknesses worth a week of rearchitecture.
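
As an illustration of keeping the history on the case, the helpers below append dated remediation notes, bump regressed_count, and list the cases past the threshold. The dict layout and function names are assumptions; only the remediation and regressed_count field names and the "greater than 1" rule come from the text.

```python
from datetime import date


def record_fix(case, note, today=None):
    """Append a dated bullet to the case's remediation history."""
    stamp = (today or date.today()).isoformat()
    case.setdefault("remediation", []).append(f"{stamp}: {note}")
    return case


def record_regression(case):
    """Increment regressed_count when a previously passing case fails again."""
    case["regressed_count"] = case.get("regressed_count", 0) + 1
    return case


def needs_architectural_review(cases, threshold=1):
    """Return IDs of cases whose regressed_count exceeds the threshold."""
    return [c["id"] for c in cases if c.get("regressed_count", 0) > threshold]
```

A periodic run of needs_architectural_review over the whole file gives the list of flags that the paragraph above suggests reading together.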

What the intake script actually does

The intake step pulls failed traces out of the trace store, grades them with the same judge the harness uses, dedupes against existing failure cases, and writes a review file the named owner reads by hand. Nothing is automatic past the grading step.

failures:ingest

Promotion is the only way in

A candidate case lives in evals/failures/_intake/ until a named engineer reads it, fills in the missing fields, and promotes it. The promotion script validates the schema, runs the dedupe pass one more time, runs the case against current main to confirm the failure is reproducible, and appends both the case and a history record.
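
The promotion checks can be sketched as a single function. This is not the promotion script itself: the required-field list is inferred from the per-row metadata named in this guide, the dedupe pass is reduced to an ID collision check, reproduces_on_main is a caller-supplied callable, and the history event layout (ts, case_id, event) is an assumption.

```python
import json

# Inferred from the structured fields this guide lists for each case.
REQUIRED_FIELDS = (
    "id", "failure_mode", "severity", "blast_radius",
    "discovered_at", "discovered_by", "owner",
)


def promote(candidate, existing_ids, reproduces_on_main, now):
    """Validate a candidate and return (case, history_line) ready to append."""
    missing = [name for name in REQUIRED_FIELDS if not candidate.get(name)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if candidate["id"] in existing_ids:
        raise ValueError(f"{candidate['id']} is already in the dataset")
    if not reproduces_on_main(candidate):
        raise ValueError(f"{candidate['id']} does not reproduce on main")
    case = {**candidate, "regressed_count": 0, "remediation": []}
    # One JSON object per line, matching an append-only .jsonl log.
    history_line = json.dumps({"ts": now, "case_id": case["id"], "event": "promoted"})
    return case, history_line
```

Returning the case and the history line together makes it harder to append one without the other.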

failures:promote

The 100 percent CI gate, and why two gates beat one

CI runs both gates on every PR. Different sets, different thresholds, same judge snapshot.

.github/workflows/eval-gates.yml

The aggregate gate has a soft floor (0.94 in this excerpt; tune to your domain). The failure dataset gate is a hard 1.00 with one extra rule: zero regressions against history. A merge is blocked when any case has flipped from pass to fail since the last main commit, even if the absolute pass rate is still 1.00, because the history file caught the flip. That second rule is the one most teams skip and most teams need.
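
The two rules of the failure gate can be expressed in a few lines. The sketch below reads one interpretation into the flip rule: any pass-to-fail event recorded in history.jsonl since the last main commit blocks the merge, even if the case passes again on the final run. The event keys (ts, case_id, from, to) and the function name are assumptions, and timestamps are assumed to be ISO strings in one format so that string comparison orders them correctly.

```python
import json


def failure_gate(results, history_lines, since_ts):
    """Return (ok, failing_ids, flipped_ids) for the failure dataset gate.

    results: {case_id: passed} from the current run.
    history_lines: iterable of JSON strings, one event per line.
    since_ts: ISO timestamp of the last main commit.
    """
    # Rule 1: hard 1.00, so any failing case blocks.
    failing = sorted(cid for cid, passed in results.items() if not passed)

    # Rule 2: zero pass-to-fail flips recorded since the last main commit.
    flipped = set()
    for line in history_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if (event.get("ts", "") >= since_ts
                and event.get("from") == "pass"
                and event.get("to") == "fail"):
            flipped.add(event["case_id"])

    ok = not failing and not flipped
    return ok, failing, sorted(flipped)
```

Under this reading, a case that went red and then green again within the same PR still blocks, which is what lets the gate catch flaky fixes.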

Failure dataset vs the broader eval set, side by side

They share a directory. They share a judge. They serve different jobs. Conflating them is the most common cause of shipped regressions on cases the team had already "tested for."

Selection bias
  • Aggregate eval set: Mixed. The set is a snapshot of "representative traffic" or "things the team thought to write down," with successes and failures intermixed. Aggregate pass rate dominated by easy cases.
  • Failure dataset: Deliberately failure-only. The dataset is a curated corpus of inputs that we know broke the agent at least once. Nothing in it should ever pass and stay added.

CI threshold
  • Aggregate eval set: Soft floor (typically 92 to 96 percent aggregate pass rate). A regression on a P0 case can hide behind two new passes on easy cases.
  • Failure dataset: Hard 100 percent. Any single regression on any case blocks merge, no exceptions. The bar is not a number to optimize, it is a contract.

Per-row metadata
  • Aggregate eval set: Free-text description plus expected_output. No tagging, no incident link, no remediation log on the case itself.
  • Failure dataset: failure_mode, severity, blast_radius, discovered_at, discovered_by, remediation history, regressed_count, owner. Every field is structured.

Growth pattern
  • Aggregate eval set: Hand-edited. Cases come and go silently. The set five quarters in often does not include the cases the team needed it to include.
  • Failure dataset: Append-only by promotion. A case enters via the ingestion pipeline and a named owner. Removal requires a recorded rationale in cases_removed.yaml.

Diagnosis on red
  • Aggregate eval set: Aggregate score moved. The engineer reads logs to figure out which case failed, opens the case, and reverse-engineers the failure mode each time.
  • Failure dataset: Per-failure-mode pass rate. The engineer fixing a red CI run sees which mode regressed (retrieval_miss, blast_radius_breach, etc.) before opening the diff.

Coverage of the long tail
  • Aggregate eval set: Low. The aggregate set's distribution roughly matches usage, which means rare-but-painful failure modes get one or two cases out of hundreds.
  • Failure dataset: High. Most rows are tail-intent failures that aggregate eval barely sampled. The dataset is structurally biased toward the cases the eval set undercounts.

What it is good for
  • Aggregate eval set: Bake-off comparisons, dashboard tracking, model selection on aggregate. Less suited to gating because a P0 regression hides in the noise.
  • Failure dataset: Pre-deploy gate, model swap qualification, on-call diagnosis, post-incident learning loop closure.

3 silent regressions in 90 days

We had 200 cases in the aggregate eval set. Our P0 refund flow regressed three times in a quarter and we missed each one until a customer told us. The aggregate score barely moved.

anonymized engagement intake, 2026

What lives next to the dataset (the rest of the leave-behind)

The failure dataset is one file in a small directory. The other files exist to keep it honest:

  • evals/failures/cases.yaml — the dataset itself.
  • evals/failures/cases_removed.yaml — append-only deletion ledger. Removing a case requires a line here with a reason and the engineer who removed it.
  • evals/failures/history.jsonl — append-only event log of every status change, every promotion, every regression. The CI gate reads this file to detect flips.
  • evals/failures/_intake/ — directory of pending review files dropped by the cron. Empty most of the week.
  • scripts/run_failure_eval.py — the runner CI calls. About 200 lines.
  • scripts/lint-failures.py — pre-commit hook. Fails the build if a case ID disappears from cases.yaml without a matching entry in cases_removed.yaml.
  • .github/CODEOWNERS — entry that puts the engineering lead on every PR that touches anything under evals/failures/.

None of these files is fancy. The discipline is in the structure, not the tooling. A team that treats the directory as load bearing will keep it healthy with a few hundred lines of Python; a team that treats it as a YAML graveyard will rot it in a quarter regardless of which framework they installed.
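
The core of the lint check described in the list above is a set difference. A minimal sketch, assuming the hook can obtain the case IDs from the previous and current versions of cases.yaml and the parsed entries of cases_removed.yaml; the function name and the ledger keys (id, reason, removed_by) are hypothetical.

```python
def missing_without_ledger(previous_ids, current_ids, removed_ledger):
    """Return case IDs that vanished without a complete ledger entry.

    removed_ledger: list of dicts, each expected to carry id, reason, removed_by.
    """
    ledgered = {
        entry["id"]
        for entry in removed_ledger
        if entry.get("reason") and entry.get("removed_by")
    }
    return sorted(set(previous_ids) - set(current_ids) - ledgered)


def lint_exit_code(previous_ids, current_ids, removed_ledger):
    """Return 1 (fail the build) if any case disappeared silently, else 0."""
    return 1 if missing_without_ledger(previous_ids, current_ids, removed_ledger) else 0
```

An entry without a reason or a named engineer does not count as ledgered here, so an empty placeholder line cannot be used to get a deletion past the hook.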

FAQ

Failure datasets, the 100 percent gate, and the intake pipeline, answered

How is a failure dataset different from a regression eval set or a gold set?

A regression eval set is the broader corpus of cases used for bake-offs and trend tracking; cases in it can pass or fail and the threshold is aggregate. A gold set is the labeled reference subset used to grade judges and detect drift. A failure dataset is the narrowest of the three: every row is a known historical failure, the gate is 100 percent, and the only purpose is to prevent the same incident from happening twice. They live in the same repo (evals/), they share a judge, and they run in the same CI; the gates are deliberately different.

How big should the failure dataset get before it stops being useful?

Most engagements stabilize between 60 and 200 cases. Past 200, dedupe is usually broken or the team has stopped pruning. Cases that have not failed in 9 months and whose underlying flow has been rewritten are good removal candidates: move them to cases_removed.yaml with a reason; do not silently delete them. The metric to watch is regressed_count per case: cases with regressed_count above 1 are doing real work, while cases with regressed_count = 0 for two quarters are candidates for review.

What goes in the seven failure modes and why exactly seven?

tool_invocation, retrieval_miss, hallucination, multi_turn_drift, refusal_overreach, latency_breach, blast_radius_breach. Seven because that is what naturally clustered across our shipped engagements; nothing magical about the count. We have extended it twice in 2025 and once in 2026 with new modes (one was tool_argument_truncation, one was retrieval_index_stale, one was prompt_injection_uncaught). Adding a mode requires a schema_version bump and a one-paragraph rationale in the PR. Removing one is rare; we have not removed any.

Does the failure dataset replace the broader eval set?

No. The aggregate eval set is still the right tool for bake-offs, model swaps, and trend tracking. The failure dataset gates deploys against the specific cases that have already burned the team. They are complementary. A typical CI run executes both: aggregate set on a 0.94 floor for bake-off and trend, failure dataset on a 1.0 floor for shipping. Two gates, one judge.

How do customer-reported failures get into the dataset?

Same intake pipeline as production traces, with one extra step. The escalation ticket gets pulled into evals/failures/_intake/, the on-call who triaged the incident drafts the case YAML and tags themselves as discovered_by, and the named owner promotes it. We require the customer-reported case to be reproducible from input alone before it can gate CI; if reproduction needs a specific store state, we capture that as a fixture and reference it under input. Cases that cannot be made deterministic stay in a watchlist file but do not gate.

Why is the gate 100 percent and not 99 percent?

Because the dataset is small (60 to 200 cases) and curated. A 99 percent gate on 100 cases means you can ship with one regression on a known prior incident. There is no scenario where shipping a regression on a case the team specifically curated as gating is a sound tradeoff; if a case is not gating, the right move is to mark it status: deprecated and let it fall out of the gate, not to ship around it. The 100 percent rule forces that decision to be deliberate and recorded, not opportunistic.

How do you keep this from rotting backward like every other YAML file in a repo?

The same lint and history protocol we ship with the broader regression eval set: a pre-commit hook that fails the build if a case ID disappears without a recorded rationale in cases_removed.yaml, a CODEOWNERS rule that puts the engineering lead on every deletion, and an append-only history.jsonl that records every status change. The failure dataset shrinks only with a paper trail. Most teams that adopt this pattern find their evals/ directory becomes the single most stable file structure in the codebase.

What does the failure dataset look like at week 2 of an engagement vs week 6?

At week 2 we have a prototype shape: schema, intake script, the first 8 to 20 cases lifted from the client's recent incident logs and customer tickets. The CI gate is wired but advisory, not blocking, because the team has not aligned on which cases are truly P0 yet. At week 6 the gate is hard, the dataset has grown to 40 to 80 cases, and the team has run at least one model swap qualification through it. The runbook leaves with the team and the named engineer hands off the intake cron and the dedupe pipeline as part of the week 6 transfer session.
