Guide, topic: agentic RAG, eval harness, corpus drift, 2026
Agentic RAG: the architecture diagram is the easy part.
Every agentic RAG project ships a clean architecture diagram and then runs into the same three failure modes in production: corpus drift, staging-to-prod metric drops, and regressions on previously-fixed retrieval misses. None of these are architecture problems. They are operational problems that the diagram does not represent. This guide is about the rubric, the regression set, and the corpus-drift cron that turn an architecture diagram into a system that still works in quarter two.
What the diagram leaves out
A typical agentic RAG architecture diagram has six or seven boxes: the document ingester, the embedder, the vector store, the retriever, the reranker, the agent, and the model. Arrows connect them in a way that looks plausible. The diagram is approved at the kickoff meeting, the team starts building, and for the first month everything tracks the diagram. By month three the system is in production and the bulk of the team's attention is on things that the diagram does not represent: the corpus is changing weekly because content owners are updating documents, the embedder upgraded a minor version on Tuesday and nobody told the agent team, the staging eval scores look fine but prod traces are scoring 13 points lower and nobody knows why, and the same retrieval miss the team fixed in February is back because a different engineer cleaned up the synonym list in April.
None of those problems are architecture problems. The diagram could be perfect and they would still happen, because they are operational problems that emerge from how the system runs over time, not from how its boxes connect. The team that ships only the diagram has built a system that will work for one quarter and degrade for the next four. The team that ships the diagram plus the operational artifacts (rubric, regression set, corpus-drift cron, prod-trace ingest) has built a system that stays honest as the world around it moves.
The operational artifacts are not exotic. They are YAML files, bash scripts, and crons. They are the parts of the project that do not photograph well in a slide deck and therefore do not get prioritized in the original scope. They are also the parts that determine whether the agentic RAG system survives a quarter or gets rewritten.
Six ways agentic RAG breaks after the diagram is approved
Each of these is a real shape of failure that recurs across agentic RAG projects. None of them is fixed by changing the architecture; each one is fixed by an operational artifact that the diagram did not include.
Corpus drift: the ingester re-ran on Tuesday and the agent has been wrong since
An ingester job re-ran with new content and accidentally dropped a category of documents because of a glob pattern change. The vector store no longer has the dropped docs. The agent answers questions about them with confident hallucinations because the retriever returns nothing and the system prompt does not require an explicit 'no source found' response. The team finds out three weeks later from a customer who escalates. A nightly corpus-drift check would have flagged the document count drop the next morning.
Staging-to-prod metric drop: the eval passed, the users are unhappy
Staging eval scored 0.91 on faithfulness. Production traces score 0.78. The team spends a week looking for a model regression. The actual cause: the staging eval ran against a curated 200-document corpus and prod is running against 47,000 documents with much higher noise. The retriever's similarity threshold was tuned for the smaller corpus. Same agent, same model, completely different retrieval quality.
Missing regression test on a fixed retrieval miss
A query that used to miss got fixed two months ago by adding a synonym to the query expansion layer. Three weeks later a different engineer cleaned up the synonym list and the original miss came back. There was no regression test for it. The same customer escalated the same complaint. The team's response: rebuild the synonym list and add a regression test, two months too late.
Multi-turn drift: turn 6 forgets the context from turn 2
The agent answers turn 1 and turn 2 correctly. By turn 6 the rolling summary has compressed away the user's original constraint and the agent gives an answer that contradicts what it said in turn 2. The eval set is single-turn so this never gets tested. Production conversations average five to seven turns and this is the single most common quality complaint after launch.
Tool oscillation in the agent loop
The agent calls a search tool, gets results, decides they are not good enough, calls a different search tool, gets different results, decides those are not good enough either, calls the first tool again. The loop terminates after the max iteration count and the agent answers without good grounding. The handwritten eval cases never trigger this because they have inputs that succeed on the first tool call.
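For teams that want to grade this explicitly rather than demo around it, here is a minimal sketch of an oscillation check over an agent trace. It assumes the trace exposes the ordered list of tool names the loop called; the tool names and the window size are illustrative, not a real tool registry.

```python
# Sketch: flag the A-B-A tool oscillation pattern in an agent trace.
# Assumes the trace is the ordered list of tool names the loop called;
# the tool names below are illustrative.
def oscillates(tool_calls, window=3):
    """True if the agent returns to a tool it already abandoned while trying
    others in between (A, B, A), as opposed to a plain retry (A, A, A)."""
    for i in range(max(len(tool_calls) - 2, 0)):
        recent = tool_calls[i + 1 : i + 1 + window]
        if tool_calls[i] in recent and len(set(recent)) > 1:
            return True
    return False

assert oscillates(["search_docs", "search_web", "search_docs"])
assert not oscillates(["search_docs", "search_docs"])           # plain retry
assert not oscillates(["search_docs", "search_web", "rerank"])  # normal progression
```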
Citation drift: the source URL is correct, the snippet is not
The retriever returns the right document but the chunker split the document such that the snippet the agent cites does not actually contain the answer; the answer is in the next chunk. The agent confidently cites a URL that, when the user clicks it, contains content that does not support the claim. The single-axis rubric scores this as 'cited a source' equals pass. The per-axis rubric grades citation accuracy separately and catches it.
The corpus-drift cron: the part nobody puts on the diagram
The corpus is the part of the system that is most likely to change without a PR. Content owners update documents. Ingesters re-run on schedules nobody remembers. The embedder ships a minor version. The chunker config gets tweaked because someone wanted to fix one bad answer. None of these changes go through code review on the agent team's repo, and most of them are invisible to the eval harness because the harness only sees the model output, not the corpus state.
The corpus-drift cron is the artifact that makes the corpus visible. It runs nightly. It hashes the indexed corpus (document count, per-doc hash, per-doc-type counts, a distribution summary of the embedding vectors). It compares against the previous run's hash and against the pin in eval/corpus_pin.yaml. Small drift gets logged. Large drift triggers a re-run of the eval set and opens a PR with the diff. The team reviews the change deliberately instead of discovering it from a customer escalation.
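As a concrete illustration, here is a minimal sketch of the comparison step, assuming a corpus_pin.yaml that stores the document count, the corpus hash, and per-doc-type counts, and assuming the caller can hand it the currently indexed documents. The field names and the 2 percent threshold are illustrative, not a fixed recommendation.

```python
# corpus_drift_check.py -- minimal sketch of the nightly comparison step.
# Assumes eval/corpus_pin.yaml stores doc_count, corpus_hash, and doc_type_counts,
# and that `docs` is an iterable of (doc_id, doc_type, text) pulled from the index.
import hashlib
import yaml

LARGE_DRIFT = 0.02  # illustrative: >2% document-count change triggers re-eval + PR

def snapshot(docs):
    docs = list(docs)
    per_doc = {d_id: hashlib.sha256(text.encode()).hexdigest() for d_id, _, text in docs}
    type_counts = {}
    for _, d_type, _ in docs:
        type_counts[d_type] = type_counts.get(d_type, 0) + 1
    corpus_hash = hashlib.sha256("".join(sorted(per_doc.values())).encode()).hexdigest()
    return {"doc_count": len(per_doc), "corpus_hash": corpus_hash,
            "doc_type_counts": type_counts}

def check_drift(docs, pin_path="eval/corpus_pin.yaml"):
    with open(pin_path) as f:
        pin = yaml.safe_load(f)
    now = snapshot(docs)
    if now["corpus_hash"] == pin["corpus_hash"]:
        return "no drift"
    delta = abs(now["doc_count"] - pin["doc_count"]) / max(pin["doc_count"], 1)
    if delta > LARGE_DRIFT:
        return f"large drift: {pin['doc_count']} -> {now['doc_count']} docs; re-run eval, open PR"
    return "small drift: logged"
```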
The cron is unglamorous. It is roughly 80 lines of bash and 200 lines of Python. It catches the kind of failure that otherwise takes a week of triage to attribute. The first time it catches a botched ingester run, the team understands why it exists. The second time it catches an embedder rollover, it pays for the embed engagement that built it.
The regression set: the cases the team has already paid to fix
Every retrieval miss the team fixes is a case the team has already paid for. The fix took an engineer's day, the customer who reported it spent time on a support ticket, and the triage took the on-call engineer's morning. Throwing that case away after the fix lands is the most expensive thing the team can do because the case is exactly the kind of thing that silently regresses six weeks later when an unrelated change ripples through the retrieval pipeline.
The regression set is the YAML file in the repo that captures every previously-fixed case as a permanent eval case. It grows, never shrinks. Each case has a stable case_id, the original failing query, the expected behavior, and the failure mode it represents. The CI runs the full regression set on every PR. A previously-passing case that newly fails is a hard fail; the merge does not land. The regression set after six months is typically 200 to 800 cases. After two years it is several thousand. That is fine; the cost of running it is small and the value of catching the regression before the customer does is large.
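To make the shape concrete, here is a sketch of one case and of the hard-fail rule in CI. The field names mirror the description above; the example case content is made up, and run_agent and grade_case are hypothetical stand-ins for the team's own agent entry point and grader.

```python
# regression_gate.py -- sketch of the CI hard-fail rule over eval/regression_set.yaml.
# Every case in the set was passing when it was added, so any failure is a regression.
import sys
import yaml

EXAMPLE_CASE = {  # illustrative shape of one entry in regression_set.yaml
    "case_id": "reg-20260214-0031",
    "query": "does the tier-2 plan include SSO?",
    "expected_behavior": "answers from the current pricing doc and cites it",
    "failure_mode": "retrieval_miss_after_synonym_cleanup",
}

def run_gate(run_agent, grade_case, path="eval/regression_set.yaml"):
    with open(path) as f:
        cases = yaml.safe_load(f)["cases"]
    failures = [(c["case_id"], c["failure_mode"])
                for c in cases
                if not grade_case(run_agent(c["query"]), c["expected_behavior"])]
    for case_id, mode in failures:
        print(f"REGRESSION {case_id}: {mode}")
    if failures:
        sys.exit(1)  # a previously-passing case newly fails: the merge does not land
```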
The discipline of writing the regression case at the time of the fix is the only thing that makes this work. A case written three weeks after the fix has lost most of its context and is much harder to maintain. The shape we ship into client repos is a short script that takes a fix PR's description and a sample failing input and writes the regression case automatically. Engineers do not have to remember to add the case; the tooling does.
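A minimal version of that fix-time script might look like the sketch below. The CLI flags, the case_id scheme, and the default failure mode are all assumptions about how a team might wire it, not a fixed interface.

```python
# add_regression_case.py -- sketch of the fix-time case writer described above.
# Illustrative usage:
#   python add_regression_case.py "fix PR: expand 'K8s' in query expansion" \
#       --query "how do we autoscale K8s ingesters?" --expected "cites the autoscaling runbook"
import argparse
import datetime
import yaml

def main():
    p = argparse.ArgumentParser()
    p.add_argument("pr_description")
    p.add_argument("--query", required=True)
    p.add_argument("--expected", required=True)
    p.add_argument("--failure-mode", default="retrieval_miss")
    args = p.parse_args()

    path = "eval/regression_set.yaml"
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    data.setdefault("cases", [])
    case_id = f"reg-{datetime.date.today():%Y%m%d}-{len(data['cases']) + 1:04d}"
    data["cases"].append({
        "case_id": case_id,
        "query": args.query,
        "expected_behavior": args.expected,
        "failure_mode": args.failure_mode,
        "source": args.pr_description,  # keeps the fix context attached to the case
    })
    with open(path, "w") as f:
        yaml.safe_dump(data, f, sort_keys=False)
    print(f"added {case_id}")

if __name__ == "__main__":
    main()
```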
Operational agentic RAG vs architectural agentic RAG, side by side
The right-hand column is the shape we ship into client repos: rubric, regression set, corpus-drift cron, prod-trace ingest, multi-turn cases, all version-controlled and CI-gated. The left-hand column is what most teams have on day one: a clean architecture diagram and a vague intention to build the rest after launch.
| Feature | Architectural-only agentic RAG | Operational agentic RAG |
|---|---|---|
| What gets shipped on day one | Architecture diagram. The team will build the eval and the regression set 'after launch.' They never do. | Architecture diagram plus eval/rubric.yaml plus eval/regression_set.yaml plus scripts/corpus-drift-check.sh. The non-architecture artifacts are the part that determines whether the system survives a quarter. |
| How corpus changes get detected | Nobody hashes anything. The corpus changes when somebody re-runs the ingester. The team finds out from the support queue when the agent starts answering questions about removed products. | scripts/corpus-drift-check.sh runs nightly. It hashes the indexed corpus and compares against the pin in eval/corpus_pin.yaml. Drift triggers a re-eval and a PR if the drift is large enough. |
| What the staging-to-prod gap looks like | Staging looks great. Prod looks bad. The team blames 'the prod environment' generically and does not isolate the actual cause. | Staging eval and prod-trace eval run against the same rubric. The delta is tracked. If staging passes and prod fails by more than the rubric's tolerance, the gap is named (different corpus, different latency, different traffic mix) and addressed. |
| How agent-loop failures get caught | Single-turn eval cases only. The agent loop is tested by demoing it. Multi-turn failure modes (tool oscillation, context window blowout, summary distortion) ship as production incidents. | The eval set includes multi-turn cases that exercise the agent loop (tool selection, retry, fallback, summarization). Failure on a multi-turn case blocks merge. |
| How retrieval quality gets graded | 'It seems to find good stuff most of the time.' One number, sometimes. No regression rule. | Per-axis rubric: precision, recall, citation accuracy, freshness. Each axis has a threshold and a regression rule. A previously-passing query that now misses is a hard fail. |
| What happens after a model swap | The team eyeballs a few queries, decides it looks fine, and ships. A regression on a previously-fixed retrieval miss is rediscovered the hard way. | The rubric and regression set re-run against the new model. Per-slice deltas are visible. The shipping decision is mechanical against the rubric. |
| How the agentic loop interacts with retrieval failures | Retrieval miss equals hallucination, every time. The agent never learned to say 'I don't know' because the eval set never tested for it. | The agent's behavior on a retrieval miss is graded explicitly: did it acknowledge the gap, did it re-formulate the query, did it escalate, or did it hallucinate. Each behavior is a separate failure mode. |
Multi-turn cases: the slice the eval set most often skips
Most agentic RAG eval sets are single-turn because single-turn cases are easy to write and easy to grade. Production conversations are not single-turn. Customer support agents average four to seven turns. Coding agents average twenty. The failure modes that emerge in turn six (context drift, summary distortion, tool oscillation, persona slip) are invisible to a single-turn eval suite, which means the agent's most important failure modes never get tested before shipping.
The fix is to build a multi-turn slice into the eval set. A multi-turn case is a sequence of user messages plus the expected behavior at each turn. The grader runs the agent through the full sequence and grades each turn on the relevant axes. The case fails if any turn fails. Building multi-turn cases is more work than single-turn cases, but the work pays off because the failure modes the multi-turn slice catches are exactly the ones that drive the most user-facing complaints about agent quality.
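A minimal sketch of that grading loop, assuming each case is a list of turns with a user message and an expected behavior; run_agent_turn and grade_turn are hypothetical stand-ins for the team's agent entry point and per-axis judge, and the example case content is invented.

```python
# Sketch: run a multi-turn case turn by turn; the case fails on the first failing turn.
# `run_agent_turn` and `grade_turn` are hypothetical stand-ins for the team's own code.
def run_multi_turn_case(case, run_agent_turn, grade_turn):
    history = []
    for i, turn in enumerate(case["turns"], start=1):
        answer = run_agent_turn(history, turn["user"])
        history.append({"user": turn["user"], "assistant": answer})
        if not grade_turn(answer, history, turn["expected"]):
            return {"case_id": case["case_id"], "passed": False, "failed_turn": i}
    return {"case_id": case["case_id"], "passed": True}

# Illustrative case: the constraint stated in turn 1 must still hold at later turns.
example_case = {
    "case_id": "mt-0012",
    "turns": [
        {"user": "We run on-prem only, no cloud services.",
         "expected": "acknowledges the constraint"},
        {"user": "What ingestion options do we have?",
         "expected": "recommends on-prem options only"},
    ],
}
```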
A reasonable starting point is to mine the longest production conversations from the trailing month, anonymize them, and replay them as multi-turn eval cases. Roughly 50 to 100 multi-turn cases is enough to start catching the common multi-turn failure modes; the slice grows over time as new failure modes get discovered.
Where fde10x fits
fde10x is one option for teams that want a senior engineer embedded for two to six weeks to ship the operational layer of agentic RAG: the rubric, the regression set, the corpus-drift cron, the prod-trace ingest, and the multi-turn eval slice. The architecture diagram is usually already done by the time we arrive; the value of the embed is the unglamorous infrastructure that determines whether the architecture survives the quarter.
Plenty of teams build this themselves. The embed is the right call when the team has shipped agentic RAG and is now firefighting a recurring set of complaints (corpus drift, staging-to-prod gap, recurring retrieval misses) without having the operational artifacts to attribute the cause. The leave-behind is the operational layer running in CI, owned by the team, with a runbook for the on-call engineer when the cron flags drift at 02:00.
Want a senior engineer to ship the operational layer of your agentic RAG?
A 60-minute scoping call with the engineer who would own the build. You leave with a draft of the rubric, the regression-set seed list, the corpus-pin shape for your stack, and a fixed weekly rate to ship the cron, the multi-turn slice, and the first clean PR run inside your repo.
Agentic RAG and the operational layer, answered
Why is the architecture diagram the easy part?
Because it represents the parts of the system that are well understood and fit cleanly into a vendor's slide deck: the embedder, the vector store, the retriever, the reranker, the agent, the model. Those components are all real and they all matter. They are also the parts that have the smallest impact on production failure rates once the system is running. The hard parts are corpus management, the rubric, the regression set, the multi-turn behavior of the agent loop, and the operational discipline to keep all of that current. Those parts do not fit neatly into a diagram. They live in YAML files and crons that nobody puts on the slide.
What does corpus drift actually look like?
The corpus changes in three ways. Content drift: documents are added, removed, or updated. Pipeline drift: the chunker, the embedder, or the ingester changes behavior, so the same source documents produce different vectors. Distribution drift: the proportion of doc types in the corpus shifts, which changes which queries the retriever does well on. All three happen in production whether the team manages them or not. The corpus-drift cron makes all three visible. Content drift shows up as document-count and document-hash deltas. Pipeline drift shows up as embedding-distribution shifts. Distribution drift shows up as the per-doc-type retrieval recall changing on a fixed query set.
Why does staging-to-prod almost always show a metric drop?
Because staging is a controlled environment and prod is not. The staging corpus is usually smaller, cleaner, and curated. The staging traffic is a known sample. The staging latency budget is generous. Prod has all of the noise: the long tail of corpus content, the long tail of user queries, the long tail of latency events. The drop from staging to prod is almost never the model. It is almost always the corpus and the traffic mix. The fix is to make staging more like prod (sample real corpus content, replay real traces) and to track the gap explicitly so the cause of any new gap is named.
What does the regression set look like for agentic RAG?
It is a list of cases the team has previously fixed, each with a stable case_id, the original failing query, the expected behavior, and the failure mode it represents. Every PR runs the full regression set. A regression on a previously-passing case is a hard fail. The set grows over time, never shrinks; cases that have been passing for a year are still in the set because they are exactly the cases most likely to silently regress when nobody is watching. A typical agentic RAG project has 200 to 800 cases in the regression set after six months of operation.
How is agentic RAG different from plain RAG when it comes to evals?
Plain RAG is one shot: query in, retrieved chunks plus answer out. The eval is grade the answer. Agentic RAG has a loop: the agent decides what to retrieve, sometimes calls multiple retrievers, sometimes re-formulates the query, sometimes summarizes intermediate results into the next turn. The eval has to grade not just the final answer but the loop behavior: did the agent pick the right tool, did it know when to stop, did it gracefully handle a retrieval miss, did it preserve the user's constraint across turns. The eval set has to include multi-turn cases and the rubric has to grade per axis on each turn.
What does the corpus-drift cron actually do?
It runs nightly. It hashes the indexed corpus (document count, per-doc hash, per-doc-type counts, embedding distribution summary). It compares against the previous run's hash and against the pin in eval/corpus_pin.yaml. Small drift gets logged. Large drift triggers a re-run of the eval set and opens a PR with the diff so the team can review the change deliberately. The cron is roughly 80 lines of bash plus 200 lines of Python. It pays for itself the first time it catches a botched ingest before customers do.
How do I write a rubric that grades the agentic loop?
Per-axis, per-turn. The axes are the same as for plain RAG (faithfulness, helpfulness, completeness, citation accuracy) plus loop-specific axes (tool selection, retrieval gap acknowledgment, multi-turn coherence, escape behavior). For a multi-turn case, each turn gets graded on the relevant axes and the case fails if any turn fails. The judge prompt is pinned per axis. The rubric file lives in the repo and is reviewable like any other code.
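One possible shape for that rubric file, sketched here as an embedded string so it can be loaded with the same Python tooling as the rest of the harness. The axis names follow the answer above; the thresholds, judge-prompt paths, and regression-rule string are illustrative assumptions.

```python
# Sketch of a per-axis, per-turn rubric file; thresholds and prompt paths are illustrative.
import yaml

RUBRIC_YAML = """
per_turn: true
axes:
  faithfulness:          {threshold: 0.85, judge_prompt: prompts/faithfulness_v3.txt}
  citation_accuracy:     {threshold: 0.90, judge_prompt: prompts/citation_v2.txt}
  tool_selection:        {threshold: 0.80, judge_prompt: prompts/tool_selection_v1.txt}
  retrieval_gap_acknowledgment: {threshold: 0.80, judge_prompt: prompts/gap_ack_v1.txt}
  multi_turn_coherence:  {threshold: 0.80, judge_prompt: prompts/coherence_v1.txt}
regression_rule: previously_passing_case_now_fails_is_hard_fail
"""

rubric = yaml.safe_load(RUBRIC_YAML)
assert set(rubric["axes"]) >= {"faithfulness", "citation_accuracy", "tool_selection"}
```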
What is the role of synthetic eval cases vs production traces?
Synthetic cases cover the failure modes the team can articulate: known adversarial patterns, known retrieval edge cases, known tool failure shapes. They are valuable because they are deterministic and reproducible. Production traces cover the failure modes the team cannot articulate: the long tail of real user behavior. A mature agentic RAG eval set is roughly 30 percent synthetic and 70 percent production-mined, with the synthetic share weighted toward adversarial and high-stakes cases that production may not produce often enough to feed the harness.
Where does fde10x fit?
We are one option for teams that want a senior engineer to embed for two to six weeks and ship the rubric, the regression set, the corpus-drift cron, the prod-trace ingest, and the multi-turn eval cases into the client repo. Common engagement: week 1 authoring the rubric and pinning the corpus, week 2 building the regression set from past incidents and the agent's fix history, weeks 3 to 4 wiring the cron and the prod-trace ingest, weeks 5 to 6 stabilizing the multi-turn eval cases and connecting the gate to CI. The leave-behind is the operational layer that keeps agentic RAG honest after the architecture diagram is approved.
What is the smallest version of this I can ship next week?
Three things. First, eval/corpus_pin.yaml with the document count, the top-level hash of the indexed corpus, and the per-doc-type counts. Second, a nightly script that re-computes those values and writes the diff to the team's chat. Third, a YAML file with 30 cases (20 synthetic, 10 from past incidents) and a smoke-test script that runs them against staging. That is enough to detect the most common drift and the most embarrassing regressions. The full pipeline (per-axis judge, multi-turn cases, prod-trace ingest) is the version-two work that scales it.
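For reference, a sketch of what that pin file might contain, written from Python so the nightly script can reuse the same dump. Every value below is made up for illustration; the real pin is computed from the live index.

```python
# Sketch: write the minimal eval/corpus_pin.yaml described above.
# All values are illustrative placeholders, not real corpus statistics.
import yaml

pin = {
    "pinned_at": "2026-01-12",
    "doc_count": 47000,
    "corpus_hash": "sha256:9f2c...",  # top-level hash over the per-doc hashes
    "doc_type_counts": {"kb_article": 31000, "release_note": 9000, "policy": 7000},
}

with open("eval/corpus_pin.yaml", "w") as f:
    yaml.safe_dump(pin, f, sort_keys=False)
```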