Guide, topic: agent eval harness, prod-trace ingest, 2026

Your handwritten eval cases will not catch the failures users actually hit.

Eval cases that the team brainstormed in a sprint reflect what the team imagined the agent would have to handle. Production traces reflect what the agent actually has to handle. The gap between them is where every interesting failure lives, because the failures the team can imagine are the ones the team has already mostly solved. This guide walks through the graded-trace ingest pipeline we ship into client repos: how to sample traces, how to grade them, how to promote the graded ones into the eval harness, and the cron job that keeps the harness in sync with the agent's actual users.

Matthew Diakonov
12 min read

Why the eval set the team wrote at launch goes stale

The eval set the team writes at launch reflects three things: the scenarios in the spec, the demo flows the founders showed investors, and a handful of edge cases an engineer thought of in the sprint before launch. None of these things are production traffic. Production traffic is whatever the agent's actual users send it, which after launch starts diverging from the spec immediately. New intents emerge. Old intents change shape. Edge cases that the team thought were rare turn out to be 8 percent of traffic. Edge cases the team built whole subsystems for turn out to be 0.1 percent of traffic.

The eval set, meanwhile, does not move. It is a YAML file that sits in the repo. Engineers do not update it because there is no forcing function. Product does not update it because product does not own it. The team ships the agent twelve times in a quarter and the eval set grows by maybe 30 cases, all of them written in response to specific incidents that escalated all the way to engineering. The result is an eval set that increasingly measures the agent's performance on a slice of traffic that no longer exists.

The fix is to flip the source of truth. The eval set's primary source is no longer the team's imagination; it is production. The team still curates adversarial cases by hand, and still keeps a small smoke-test slice for fast feedback. But the bulk of the eval set is mined from real traces, graded by a rubric, and promoted on a cron. The handwritten cases supplement; the production traces anchor.

The shape of the trace ingest pipeline

The pipeline has five stages and runs daily for high-volume agents, weekly for lower-volume ones. Stage one is sampling: pull traces from the trailing 24 or 168 hours, stratified by intent so the sample reflects actual distribution. Stage two is redaction: every sampled trace runs through a PII redactor before anything else touches it. Stage three is grading: a rules layer checks hard constraints (right tool called, real source cited, policy honored), then an LLM-as-judge layer grades the response on a per-axis rubric. Stage four is categorization: failed traces get bucketed by failure mode. Stage five is promotion: the failures that match an existing failure mode get appended to that case_id's history; new failure modes get a new case_id and land in eval/cases_from_prod.yaml.
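A minimal sketch of one run, in Python, under the assumption that sample_traces, redact_trace, run_rules, run_judge, categorize, and promote_case are hypothetical helpers wrapping whatever observability client, redactor, and judge tooling the team already has:

```python
# Hypothetical pipeline skeleton; every helper named below is a stand-in, not a real API.
from dataclasses import dataclass, field

@dataclass
class RunSummary:
    graded: int = 0
    failed_by_axis: dict = field(default_factory=dict)
    promoted: int = 0
    new_failure_modes: list = field(default_factory=list)

def run_pipeline(window_hours: int = 24) -> RunSummary:
    summary = RunSummary()
    # Stage one: sample, stratified by intent so the slice mirrors real traffic.
    for trace in sample_traces(window_hours=window_hours, stratify_by="intent"):
        trace = redact_trace(trace)          # stage two: PII out before anything else touches the trace
        failed_axes = run_rules(trace)       # stage three (a): cheap deterministic hard constraints
        if not failed_axes:
            failed_axes = run_judge(trace)   # stage three (b): the judge only sees rule survivors
        summary.graded += 1
        if not failed_axes:
            continue
        for axis in failed_axes:
            summary.failed_by_axis[axis] = summary.failed_by_axis.get(axis, 0) + 1
        mode, is_new = categorize(trace, failed_axes)   # stage four: bucket by failure mode
        promote_case(trace, mode)            # stage five: append to eval/cases_from_prod.yaml
        summary.promoted += 1
        if is_new:
            summary.new_failure_modes.append(mode)
    return summary
```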

The pipeline writes a one-page summary at the end of every run: how many traces graded, how many failed each axis, how many got promoted, what the top three new failure modes look like. That summary is the artifact the team reads at the next standup. The promoted cases live in the same eval harness the CI runs on every PR, so any regression on a promoted case is caught before the next merge.
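The summary itself can be as plain as a formatted string. A sketch, reusing the RunSummary shape from the pipeline skeleton above:

```python
# Renders the end-of-run summary into the one-page artifact the team reads at standup.
def render_summary(summary: RunSummary, top_n: int = 3) -> str:
    lines = [
        f"traces graded: {summary.graded}",
        f"cases promoted: {summary.promoted}",
        "failures by axis: " + ", ".join(
            f"{axis}={count}" for axis, count in sorted(summary.failed_by_axis.items())
        ),
        "top new failure modes: " + (", ".join(summary.new_failure_modes[:top_n]) or "none"),
    ]
    return "\n".join(lines)
```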

The cost of running the pipeline is roughly the cost of grading the failed traces with the judge model, which depends on the agent's failure rate and trace volume. For a typical mid-volume agent (10k requests a day, 5 percent failure rate), the cost is on the order of $20 to $50 a day in judge calls. That is a rounding error next to the engineering time saved by not re-triaging the same incident every two weeks.

Six failure modes only the trace pipeline finds

Each of these is a real shape of failure that is invisible to a handwritten eval set and immediately legible in graded production traces. They are the reason the trace pipeline pays for itself inside the first month.

The intent the team did not know existed

A logistics agent shipped with eval cases for shipment lookup, returns, and address change. Production traces from week three contained a steady 4 percent of conversations that were actually about customs declarations on international orders. The team did not know this intent existed at launch. The handwritten eval set never would have caught it. The graded trace pipeline picked it up in the second week of operation and the team shipped a fix before it became a support volume problem.

Multi-turn drift the eval cases never simulated

Hand-written eval cases are almost always single-turn or two-turn. Production conversations are seven turns on average for support agents and twenty for coding agents. The failure modes that emerge in turn 6 (context window overflow, summary distortion, persona drift) are invisible to a single-turn eval suite. Promoting graded multi-turn traces into the eval set is the only way to catch them before users do.

Tool calls that succeed but return the wrong thing

A function returns a status of 200 with an empty result set because the query was malformed in a way that still passes schema validation. The agent treats the empty result as ground truth and confidently tells the user the record does not exist. The handwritten eval cases never test for this because they assume tool outputs are correct when status is 200. The graded trace pipeline catches it because the user's next turn is a complaint that obviously contradicts the agent's confident answer.

Retrieval misses on entities the corpus only mentions once

The eval set has cases for the top 50 entities in the knowledge base. Production traffic asks about entities ranked 800 through 1500 with a long tail that compounds. The retriever returns nothing for those queries and the agent hallucinates a plausible answer. The handwritten eval set scores the agent at 89 percent on retrieval. The graded trace pipeline reveals the real number is closer to 62 percent on the long tail.

Prompt injection patterns nobody on the team thought of

A new injection pattern shows up in production a week after a public LLM red team writeup gets shared in some adversarial Discord. The handwritten eval set has the patterns from the original blog post, not the variant the attackers actually use. The graded trace pipeline catches it because the agent's response no longer matches the expected behavior on traces that look adversarial.

Latency-induced failures the lab never sees

An external tool times out under production load in a way the agent's retry logic does not handle correctly, leaving the model to answer without the tool's output. The handwritten eval cases run against a fast local mock and never trigger this path. The graded trace pipeline surfaces it because the response shape is recognizably the no-tool fallback shape.

The grading layer: rules first, then judge

The grading layer is two stacked filters. The first is a rules engine that checks hard constraints: did the agent call the right tool for the intent, did it cite a real source from the corpus, did its response conform to the policy schema, did it answer in the right language. These rules are deterministic and cheap. They catch about 40 to 60 percent of failures without ever calling a judge model.
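A minimal sketch of that rules layer, assuming an illustrative trace dict and intent-to-tool mapping rather than any particular production schema:

```python
# Illustrative hard-constraint checks; the mappings and field names are assumptions.
TOOL_FOR_INTENT = {"shipment_lookup": "get_shipment", "returns": "create_return"}
KNOWN_SOURCES = {"kb://returns-policy", "kb://shipping-faq"}

def run_rules(trace: dict) -> list[str]:
    """Return the list of violated constraints; an empty list means the trace passes."""
    violations = []
    expected_tool = TOOL_FOR_INTENT.get(trace.get("intent"))
    called_tools = {call["name"] for call in trace.get("tool_calls", [])}
    if expected_tool and expected_tool not in called_tools:
        violations.append("wrong_or_missing_tool")
    cited = set(trace.get("cited_sources", []))
    if cited and not cited.issubset(KNOWN_SOURCES):
        violations.append("cited_nonexistent_source")
    if trace.get("response_language") != trace.get("user_language"):
        violations.append("wrong_language")
    return violations
```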

The second filter is an LLM-as-judge that grades the response on a per-axis rubric: helpfulness, faithfulness, completeness, tone. The judge is pinned (model and prompt SHA), and it is calibrated against a small human-labeled gold set every two weeks to track judge drift. If the agreement rate between judge and human labels drops below 0.85, the judge prompt gets revised or the judge model gets re-pinned to a different snapshot. The judge layer catches the ambiguous failures the rules layer cannot articulate.
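A sketch of that biweekly calibration check, where judge stands in for a callable wrapping the pinned judge model and prompt:

```python
# Hypothetical calibration check: compare judge verdicts against a human-labeled gold set.
def judge_agreement(gold_set: list[dict], judge) -> float:
    """gold_set items look like {"trace": ..., "human_verdict": "pass" | "fail"}."""
    matches = sum(1 for item in gold_set if judge(item["trace"]) == item["human_verdict"])
    return matches / len(gold_set)

def check_judge_drift(gold_set: list[dict], judge, threshold: float = 0.85) -> None:
    rate = judge_agreement(gold_set, judge)
    if rate < threshold:
        raise RuntimeError(
            f"judge agreement {rate:.2f} is below {threshold}; "
            "revise the judge prompt or re-pin the judge model"
        )
```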

Splitting the layer this way is the difference between a grading pipeline that costs $20 a day and one that costs $400. The rules layer is the cheap filter; the judge layer only sees the cases that survived the rules. Without the rules layer the judge gets called on every trace and most of those calls are wasted on traces that pass any reasonable rubric.

Trace-fed harness vs handwritten harness, side by side

One side of the comparison, labeled handwritten-only, is what most teams have on day one: a handwritten YAML the original engineer wrote and nobody else has the energy to maintain. The other, labeled trace-fed, is the harness shape we ship: trace-anchored, graded, promoted, regression-tracked.

Where eval cases come from
Handwritten-only: 100 to 300 cases the team brainstormed in a sprint two quarters ago. Distribution is whatever the team thought to write down.
Trace-fed: Sampled from production traces in the trailing 30 days. Stratified by intent so the eval distribution matches actual usage. Handwritten cases supplement, they do not anchor.

How traces become graded eval cases
Handwritten-only: Failed traces sit in the observability tool. Nobody promotes them. Three weeks later the same failure recurs in production and the same alert fires.
Trace-fed: Every sampled trace runs through a grading rubric (LLM-as-judge plus a rules layer for hard constraints). Traces that fail get categorized by failure mode and promoted into eval/cases_from_prod.yaml with a stable case_id.

What the eval set looks like after a quarter
Handwritten-only: Same 200 cases the team wrote at the start, slowly going stale. The agent has shipped twelve times, the eval set has grown by maybe 30 cases.
Trace-fed: Roughly 800 to 2,000 cases, half mined from prod traces, the rest curated. Stratified by intent and stakes. Each case has a known failure mode and a regression history. The eval set is the institutional memory of every prior failure.

Coverage of failure modes the team did not anticipate
Handwritten-only: Near zero. Handwritten cases cover the failures the author already knows about. The interesting failures are the ones nobody thought to write.
Trace-fed: High. Prod traces surface failure modes the team would not have thought to write: weird input shapes, edge-case tool failures, multi-turn drift, retrieval misses on uncommon entities.

How regressions get caught before production
Handwritten-only: The team finds out about regressions from a customer ticket or a Slack message from a coworker who happened to retry the agent.
Trace-fed: The promoted cases run as part of the standard eval harness on every PR and on every model snapshot. A regression on a previously-failed case is a hard fail.

Privacy and PII handling on traces
Handwritten-only: Traces with PII land in a screenshot in a JIRA ticket. Six months later an auditor asks what happened to that customer's record.
Trace-fed: Traces run through a redactor before they enter the eval store. PII surfaces are masked. The graded case keeps the structural shape of the trace without the sensitive payload.

Cost of a failure mode showing up twice
Handwritten-only: Full incident cost. New ticket, new triage, new repro, new fix, new rollout. Same engineering cycle as the first time, every time.
Trace-fed: Near zero engineering cost. The case is already in the eval set, the regression is caught on the next PR, the conversation in the standup is about the fix.

What this looks like at the six-week mark

By week six the eval set has roughly 800 cases promoted from production traces, more or less depending on the agent's volume. The cases are stratified by intent and tagged by failure mode. The CI runs the full eval on every PR. New failure modes get categorized and surfaced in the daily summary. The team's on-call rotation has stopped seeing the same recurring incidents because the regression test for each one is in the harness.

The behavior change in the team is the actual deliverable. The standup conversation about "the agent is being weird in production again" goes away because the trace summary already named the failure mode the previous morning. The roadmap conversation about "what should we work on next" gets honest because the failure-mode bucket counts are on the screen. Engineers stop arguing about what users want because the trace ingest is showing them.

Where fde10x fits

fde10x is one option for teams that want a senior engineer to embed and ship the trace sampling, redactor, grading layer, promotion script, and the daily summary into the client repo in two to six weeks. The leave-behind is the pipeline running in the team's CI, on the team's traces, writing into the team's eval set, owned by the team. We are not the only path to this; teams build it themselves all the time. The embed is the right call when the eval set has been static for a quarter and the same failure modes keep returning in production.

The work is unglamorous. It is not a research breakthrough. It is the infrastructure that makes everything else (model bake-offs, regression detection, shipping decisions) work against real data instead of the team's memory of last quarter.

Want a senior engineer to wire prod traces into your eval harness?

A 60-minute scoping call with the engineer who would own the build. You leave with a draft of the sampling cadence, the rules layer for your stack, the judge calibration plan, and a fixed weekly rate to ship the pipeline, the promotion script, and the first graded run inside your repo.

Prod-trace ingest into the eval harness, answered

Why are handwritten eval cases so limited?

Because the author is the bottleneck on what failures get tested. Engineers write the failures they have seen or can imagine. The interesting failures in production are the ones nobody on the team has seen yet, by definition. A handwritten eval set tests the team's mental model of the agent. A trace-fed eval set tests the agent against reality. The two are different and neither subsumes the other; the handwritten cases stay useful for adversarial testing and stakeholder demos. The trace-fed cases are what catch regressions on the failure modes that actually occur.

How do I sample traces without drowning the team in cases?

Stratify by intent and stakes, then sample within strata. The right sample size depends on the agent: for a high-volume customer support agent, 500 traces a week stratified across 20 intents is plenty to feed the eval set; for a low-volume internal agent, you may want to grade every trace. The grading pipeline does the filtering: traces that pass the rubric do not become cases, only the failures and the borderline ones do. So the cost is a function of the failure rate, not the trace volume.
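A minimal sketch of proportional stratified sampling, assuming each trace record carries an intent field:

```python
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], total: int = 500, seed: int = 0) -> list[dict]:
    """Sample within each intent in proportion to its share of traffic."""
    if not traces:
        return []
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for trace in traces:
        by_intent[trace["intent"]].append(trace)
    sample = []
    for bucket in by_intent.values():
        k = max(1, round(total * len(bucket) / len(traces)))  # proportional, at least one per intent
        sample.extend(rng.sample(bucket, min(k, len(bucket))))
    return sample
```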

What does the grading rubric look like?

Two layers. A rules layer that checks hard constraints: did the agent call the right tool, did it cite a real source, did it stay within the policy guardrails. Then an LLM-as-judge layer that grades the response on a per-axis rubric (helpfulness, faithfulness, completeness, tone). The judge is pinned by model and by prompt SHA, calibrated against a 50-case human-labeled gold set every two weeks. The rules layer catches the unambiguous failures; the judge layer grades the ambiguous ones. Both write into the trace's grade record.

How does a graded trace become an eval case?

A trace that fails the rules layer or scores below threshold on the judge gets promoted to eval/cases_from_prod.yaml with a stable case_id, the original input, the expected behavior (synthesized from the failure mode), and a category. The promotion is automatic for clear-cut failures and queued for human review for borderline ones. The case_id is stable across runs so regressions on the same case are tracked over time, not counted as new failures.
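A sketch of the promotion step, assuming PyYAML and an illustrative case schema; the stable case_id is derived by hashing the redacted input together with the failure mode, so re-running the pipeline updates the same case instead of duplicating it:

```python
import hashlib
import yaml  # PyYAML

def promote_case(trace: dict, failure_mode: str,
                 path: str = "eval/cases_from_prod.yaml") -> str:
    """Append a graded failure as an eval case; the field names here are illustrative."""
    case_id = "prod-" + hashlib.sha256(
        (failure_mode + "|" + trace["redacted_input"]).encode()
    ).hexdigest()[:12]
    case = {
        "case_id": case_id,
        "category": failure_mode,
        "input": trace["redacted_input"],
        "expected_behavior": trace["expected_behavior"],  # synthesized from the failure mode
    }
    try:
        with open(path) as f:
            cases = yaml.safe_load(f) or []
    except FileNotFoundError:
        cases = []
    if not any(existing["case_id"] == case_id for existing in cases):
        cases.append(case)
        with open(path, "w") as f:
            yaml.safe_dump(cases, f, sort_keys=False)
    return case_id
```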

How do I handle PII in promoted traces?

A redactor runs before the trace enters the eval store. It masks names, emails, phone numbers, account numbers, addresses, and any custom PII patterns the team defines. The redactor is itself tested against a corpus of known PII patterns and the test fails the build if the redactor regresses. The promoted case keeps the structural shape of the trace (the intent, the failure mode, the tool sequence) without the sensitive payload, so the case is reproducible without exposing customer data.
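A minimal sketch of the regex portion of such a redactor; the patterns are illustrative, and a production redactor typically adds a named-entity recognizer for names and street addresses, which plain regex does not cover:

```python
import re

# Illustrative structured-PII patterns; extend with the team's own custom shapes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ACCOUNT": re.compile(r"\b\d{8,16}\b"),
}

def redact_text(text: str) -> str:
    """Replace each matched PII surface with a typed placeholder like [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```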

What is the failure if I do not promote graded traces?

Recurrence. The same failure mode shows up in production every two or three weeks because the eval harness has no memory of it. The team treats each occurrence as a new incident, opens a new ticket, writes a new fix, ships it, and waits for the next occurrence. The cost compounds because the team is doing the same triage work every cycle. Promoting the case once is the difference between a one-time fix and a permanent regression test.

How often does the trace ingest pipeline run?

Daily for high-volume agents, weekly for lower-volume ones. The cadence is set by how fast production traffic shifts and how fast the team can absorb new cases. Daily is right when the agent is in active iteration; weekly is fine once the agent stabilizes. The pipeline writes a summary at the end of each run: how many traces graded, how many failed, how many got promoted, what the new failure modes look like.

Does this replace the bake-off process?

No. The trace-fed cases live in the same eval harness the bake-off uses, and they make the bake-off more meaningful because the slices reflect real production distribution. The bake-off is the pre-shipping gate; the trace-fed cases are the corpus the gate runs against. The two together are the shipping system: bake-offs decide which model ships, trace ingest decides what the bake-off is actually measuring.

Where does fde10x fit?

We are one option for teams that want a senior engineer to embed for two to six weeks and ship the trace ingest pipeline, the rules-plus-judge grading layer, the redactor, and the promotion script into the client repo. The work is not exotic; it is unglamorous infrastructure that is the difference between an eval set that ages well and one that goes stale in a quarter. Teams build this themselves all the time; the embed is the right call when the eval set has not grown in three months and the same failure modes keep recurring in production.

What is the smallest version I can ship next week?

Three things. First, a script that pulls the last 7 days of failed traces from your observability tool. Second, a manual grading pass over them to assign failure modes. Third, a YAML file in the eval directory with one case per failure mode, using the trace input as the case input and the expected behavior as the grader's notes. That is a real prod-fed eval slice. The pipeline (automated grading, redaction, promotion, cron) is the version-two work that scales it.
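A sketch of that third step, writing one case per failure mode chosen in the manual grading pass; the field names are illustrative, not a fixed schema:

```python
import yaml  # PyYAML

def build_seed_cases(cases_by_mode: dict[str, dict],
                     path: str = "eval/cases_from_prod.yaml") -> None:
    """cases_by_mode maps each failure mode to one representative graded trace
    (with its original input and the grader's notes on expected behavior)."""
    cases = [
        {
            "case_id": f"prod-seed-{i:03d}",
            "category": mode,
            "input": trace["input"],
            "expected_behavior": trace["grader_notes"],
        }
        for i, (mode, trace) in enumerate(sorted(cases_by_mode.items()))
    ]
    with open(path, "w") as f:
        yaml.safe_dump(cases, f, sort_keys=False)
```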