Guide, topic: agentic RAG, eval harness, retrieval drift, 2026

The architecture diagram is easy. The eval setup is what breaks.

A clean agentic RAG diagram with three boxes, two routing arrows, and a vector store fits on a slide. What breaks once it ships is the eval setup, not the architecture. Teams celebrate a 92 percent ragas score and a clean human eval in staging, then watch faithfulness drop to 67 percent in week one because the production retrieval corpus drifted from the staging corpus while nobody was looking. Without a rubric and a regression set in the repo, multi-agent just means more failure surface to debug at 2am. This guide is the eval shape that catches retrieval drift before the customer does.

Matthew Diakonov
13 min read

Why a 92 percent staging score becomes 67 in week one

The staging score and the production score measure two different systems against two different distributions, and the team usually does not realize that until it has shipped. The staging system runs against a curated corpus, a curated eval set, and rehearsed traffic. The production system runs against whatever the ingest job produced last night, against the long tail of customer queries, with chunkers and tokenizers and embedding snapshots that are all moving on independent schedules. The 25-point drop is the sum of small drifts on each of those axes, and ragas does not surface any of them because ragas is calibrated on a single corpus snapshot.

The architecture diagram makes the system look like a controlled pipeline. The reality is that every box in the diagram has a freshness contract, an upstream config, and an unowned failure mode. Multi-agent multiplies the surface: now there are three or four boxes, each with its own freshness contract, plus the routing decisions between them. Without per-agent eval slices, the team cannot localize a regression to a single agent. They can only tell that "the system" is worse, which is not actionable at 2am.

The fix is not a better diagram. It is a rubric file in the repo, a regression set tagged with corpus snapshots, per-agent slices, and a CI gate that runs on every PR. The fix sounds boring because it is. It also closes the gap; nothing else does, because nothing else turns the failure modes into measurable artifacts.

Six concrete drift modes that staging eval cannot catch

Each of these is a real shape of the staging-to-production drop. Each one explains a specific class of "we tested it, why is it broken" outcome, and each points at a specific section of the rubric and regression set that has to exist if the agent is going to clear the production bar.

Staging corpus is curated. Production corpus is whatever the ingest job produced last night.

The staging corpus has been hand-tuned by the engineer who built the retriever. The chunks are clean, the duplicates are removed, the irrelevant sections are stripped, and the boilerplate is trimmed. The production corpus is whatever the ingest pipeline produced from the last cron run, which silently picked up a new doc template, doubled the boilerplate, and inserted three pages of front-matter into every chunk. Faithfulness on the same questions drops because the retrieved chunks are now mostly noise.

Staging chunk size is 800 tokens. Production chunk size is 800 tokens with a 100-token overlap that nobody enabled in staging.

Two engineers work on the project. One ships the retriever; the other ships the chunker. The chunker change merges with a 100-token overlap that improves recall on the staging eval. The retriever was tuned without overlap. Once both are in production, retrieval returns more chunks per query, the context window fills with redundant content, and answer relevance drops because the model is summarizing a lot of repeats. The architecture diagram does not show the overlap parameter.

Staging embeddings are from last month's snapshot. Production embeddings are recomputed weekly.

The vector index in staging was computed once at the start of the project and never updated. The production index is recomputed weekly because new content is being added. The two indices have diverged on roughly 8 percent of chunks; the embedding model is the same but the upstream tokenizer was patched. Cosine similarity rankings shift, top-k retrieval returns different chunks, and the eval score that was 92 in staging is 67 in production for reasons nobody can find for two weeks.
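One cheap way to see this divergence before the score drops is to probe both index snapshots with the same fixed query set and compare the returned chunk IDs. A minimal sketch, assuming each snapshot can be wrapped in a retrieve callable (all names here are illustrative, not a fixed interface):

```python
# Minimal sketch: quantify index divergence as top-k Jaccard overlap on a
# fixed probe-query set. The two retrieve callables wrap whatever vector
# store client you use; everything here is illustrative.
from typing import Callable

def topk_overlap(ids_a: list[str], ids_b: list[str]) -> float:
    """Jaccard overlap between two top-k chunk-ID sets."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def index_drift_report(
    queries: list[str],
    retrieve_old: Callable[[str], list[str]],  # query -> chunk IDs, old snapshot
    retrieve_new: Callable[[str], list[str]],  # query -> chunk IDs, new snapshot
) -> dict[str, float]:
    """Per-query overlap; low values flag queries whose top-k shifted."""
    return {q: topk_overlap(retrieve_old(q), retrieve_new(q)) for q in queries}
```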

Staging traffic is rehearsed. Production traffic is what actual users send.

The staging eval set was built from an internal beta with 40 employees who knew what the system was supposed to do. The production traffic is from 4,000 customers who have never seen the product before, who phrase their questions differently, who use abbreviations the team has never seen, and who type at half the average reading speed. The retriever's behavior on those queries is statistically different from staging, and the eval set never covered them.

Staging tests one agent. Production runs three.

The retriever passed eval in isolation. The responder passed eval in isolation. The planner is new and has no eval at all because nobody decided who owned it. Once the three are wired together, the planner sometimes routes around the retriever (because the prompt template suggested it could), and end-to-end faithfulness degrades on a slice that exercises the routing path. Multi-agent rolled out without per-agent slices and without an end-to-end slice; the regression is invisible until a customer hits it.

Staging measures aggregates. Production users live on the tail.

The aggregate ragas score in staging is 0.92. That number is dominated by the head of the distribution: easy questions, well-covered topics, short conversations. The tail of production is the long-context multi-turn questions, the questions about edge-case products, the questions where the user already escalated once. On the tail slice the score is 0.61, but the team is celebrating the 0.92 aggregate. The execution gap is the difference between those two distributions.
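The arithmetic behind that gap is worth making explicit. Assuming a 90/10 head-to-tail traffic split (the split is an assumption for illustration; the slice scores match the example above):

```python
# Head-heavy aggregation hides the tail. The 90/10 split is an assumed
# illustration; the slice scores mirror the example above.
head_score, tail_score = 0.954, 0.61
head_share = 0.90
aggregate = head_share * head_score + (1 - head_share) * tail_score
print(round(aggregate, 2))  # 0.92 -- the celebrated number, while the tail sits at 0.61
```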

The regression set, with corpus snapshot tagging

The regression set is what makes retrieval drift visible. Every case in the set is tagged with the corpus snapshot ID it was captured against, the retrieved chunk IDs, and the expected per-axis scores. The harness can replay the case against any later corpus snapshot and report which chunks changed, which scores moved, and which axes regressed. This turns retrieval drift from a vague feeling into a specific diff that names files, chunk IDs, and axes.
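A sketch of what one tagged case and one replay look like. The RegressionCase shape and field names are illustrative, not a fixed schema; retrieve and score are whatever the harness wires in:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegressionCase:
    case_id: str
    query: str
    corpus_snapshot_id: str            # snapshot the case was captured against
    retrieved_chunk_ids: list[str]     # chunk IDs retrieved at capture time
    expected_scores: dict[str, float]  # per-axis, e.g. {"faithfulness": 0.9}

def replay_case(
    case: RegressionCase,
    retrieve: Callable[[str], list[str]],
    score: Callable[[str, list[str]], dict[str, float]],
) -> dict:
    """Replay one case against a later corpus snapshot; return the drift diff."""
    got_chunks = retrieve(case.query)
    got_scores = score(case.query, got_chunks)
    return {
        "case_id": case.case_id,
        "captured_against": case.corpus_snapshot_id,
        "chunks_added": sorted(set(got_chunks) - set(case.retrieved_chunk_ids)),
        "chunks_dropped": sorted(set(case.retrieved_chunk_ids) - set(got_chunks)),
        "score_deltas": {
            axis: round(got_scores.get(axis, 0.0) - expected, 3)
            for axis, expected in case.expected_scores.items()
        },
    }
```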

The slices: head (high-volume queries with known-good answers), tail (low-volume queries with documented expected behavior), long-context (multi-turn conversations where the failure mode is in turn five plus), high-stakes (queries where 'mostly right' is unacceptable), adversarial (prompt injections, jailbreaks, off-topic deflections). Each slice gets its own threshold for each axis. Aggregate scores are sanity checks; slice-and-axis scores are the gate.
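Encoded as data, the gate might look like this. Every threshold and axis pairing below is a placeholder, not a recommendation; each team calibrates its own:

```python
# Per-slice, per-axis floors. All numbers and the adversarial axis name
# are illustrative placeholders. Aggregates are sanity checks; this is
# the gate.
THRESHOLDS = {
    "head":         {"faithfulness": 0.90, "context_recall": 0.85},
    "tail":         {"faithfulness": 0.80, "context_recall": 0.75},
    "long_context": {"faithfulness": 0.80, "answer_relevance": 0.75},
    "high_stakes":  {"faithfulness": 0.97, "context_precision": 0.90},
    "adversarial":  {"deflection_rate": 0.99},
}

def gate(scores: dict[str, dict[str, float]]) -> list[str]:
    """Return human-readable failures for any slice/axis below its floor."""
    failures = []
    for slice_name, axes in THRESHOLDS.items():
        for axis, floor in axes.items():
            got = scores.get(slice_name, {}).get(axis)
            if got is None or got < floor:
                failures.append(f"{slice_name}/{axis}: {got} < {floor}")
    return failures
```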

The growth pattern: the set ships with maybe 80 to 200 cases hand-curated across the slices. Within six months, the prod-trace ingest pipeline has promoted another 400 to 800 cases from real failure incidents. By year two, a healthy regression set is 1,500 to 3,000 cases and the team has a measurable record of every failure mode the system has ever had in production.

Per-agent eval slices for multi-agent loops

Multi-agent works in production when each agent has its own eval slice and the orchestration has its own. The retriever has a retrieval-quality slice (context recall, context precision, chunk-ID stability against snapshot). The responder has a faithfulness slice (does the answer reflect the retrieved context). The planner, if there is one, has a routing-correctness slice (does it call the right tool for the right query). The end-to-end run has an aggregate slice for the user's actual experience.

On a regression, CI names the failing agent in the PR comment. This is what makes multi-agent debugging tractable. Without per-agent slices, the team can only tell that the system is worse, which is not enough to root-cause at 2am. With per-agent slices, the PR comment says "the responder's faithfulness on the high-stakes slice dropped 4.2 points; the retriever and planner are unchanged" and the on-call engineer knows where to look.
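A sketch of the comparison behind that PR comment, assuming per-agent slice scores for the candidate run and the last main-branch run (the two-point regression margin is an assumed knob):

```python
def per_agent_regressions(
    baseline: dict[str, dict[str, dict[str, float]]],   # agent -> slice -> axis -> score
    candidate: dict[str, dict[str, dict[str, float]]],
    margin: float = 0.02,  # assumed knob: ignore sub-threshold noise
) -> list[str]:
    """PR-comment lines naming each agent whose slice score regressed."""
    lines = []
    for agent, slices in baseline.items():
        for slice_name, axes in slices.items():
            for axis, base in axes.items():
                got = candidate.get(agent, {}).get(slice_name, {}).get(axis, 0.0)
                if got < base - margin:
                    lines.append(
                        f"REGRESSION {agent}: {axis} on the {slice_name} slice "
                        f"dropped {100 * (base - got):.1f} points"
                    )
    return lines
```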

Production rubric vs ragas-only eval, side by side

The comparison below puts the common pattern first under each heading: a single ragas aggregate in staging, run once before launch, treated as the headline number. Against it sits the rubric shape that catches retrieval drift and isolates per-agent regressions: per-axis, per-slice, with corpus snapshot tagging. The ragas-only approach is fast on the first project. The rubric is the only thing that survives the first major corpus refresh.

What gets measured
Ragas-only: A single ragas aggregate run once in staging on the curated dev corpus. Reported in the launch deck as 0.92. Never recomputed against production traffic.
Production rubric: Per-axis scores (faithfulness, context recall, context precision, answer relevance, helpfulness) per slice (head, tail, high-stakes, fresh, stale). The retrieval slice is graded against a known-fresh ground-truth corpus snapshot.

How retrieval drift is detected
Ragas-only: A user reports a wrong answer four weeks after launch. Engineering re-ingests the corpus and guesses.
Production rubric: A nightly job replays the last 24 hours of production traces against the regression set. Any case whose retrieved chunks changed by more than the configured threshold gets flagged for human review.

Where the eval cases come from
Ragas-only: Hand-written by the engineer during the prototype. Frozen at launch. Drifts from reality on the same schedule as the corpus.
Production rubric: Promoted prod traces. Every case in the regression set started life as a real user session that failed and got triaged into the set by a human. Cases are tagged with the corpus snapshot they were captured against.

How CI handles a multi-agent change
Ragas-only: Tests run on the planner. The retriever and the responder are assumed fine. The actual regression is in the responder, found two weeks later by a customer.
Production rubric: The harness runs against every agent in the loop, plus the orchestration trace. CI fails on a slice regression for any single agent or for the end-to-end trace, with the failing agent named in the PR comment.

What 'multi-agent means more failure surface' actually costs
Ragas-only: Unbounded. Three agents, three corpora, two routing decisions, and a planner mean six places the regression can live, and the team plays whack-a-mole at 2am.
Production rubric: Bounded. Per-agent eval slices isolate the failing agent. Time-to-attribution for a regression is measured in PR comments, not weeks of incident response.

What the ragas score is used for
Ragas-only: The headline number. Reported to leadership monthly. Treated as a single proxy for system health.
Production rubric: One signal among five, on one slice among five. Never the headline number. Aggregates are sanity checks; slice-and-axis scores are the gate.

Where retrieval drift gets fixed
Ragas-only: In the prompt. The team adds a few new instructions hoping the model will compensate for a stale corpus. It does not.
Production rubric: In the corpus pipeline (the freshness owner), with a rubric-required diff between staging and prod corpus snapshots. Drift is treated as a config bug.

Where fde10x fits

fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the agentic-RAG eval rubric, the per-agent slices, the regression set with corpus snapshot tagging, the freshness pipeline ownership, and the CI gate into the client repo. Common engagement: week 1 collaborating on the rubric and the per-agent slice definitions, week 2 wiring the harness and the corpus snapshot store, week 3 backfilling the regression set from past incidents, weeks 4 to 6 connecting it to CI and the trace ingest. The leave-behind is the gate running against every PR, owned by the team, version-controlled in the repo.

Want a senior engineer to ship the rubric, the regression set, and the CI gate for your agentic RAG?

A 60-minute scoping call with the engineer who would own the build. You leave with a draft of the rubric file against your stack, the per-agent slice definitions, the corpus snapshot strategy, and a fixed weekly rate to ship the gate, the regression suite, and the first clean PR run inside your repo.

Agentic RAG and retrieval drift, answered

Why does ragas plus human eval in staging not catch retrieval drift?

Because both are calibrated against the staging corpus snapshot at the moment the eval ran. The staging corpus is by construction cleaner, smaller, and more curated than the production corpus. The eval cases are picked to cover the topics the team has prepared answers for. Once the system ships, the retrieval surface changes (new content, new chunkers, new tokenizer, new embedding snapshot, new traffic distribution) and the staging eval cannot tell you whether any of those changes broke things. The fix is not a better staging eval; it is a regression set tagged with the corpus snapshot it was captured against, replayed continuously against production traces.

What does the regression set look like for an agentic RAG system?

Roughly five slices, each as a separate JSONL file in the repo. Head: high-volume queries with known-good answers. Tail: low-volume queries with documented expected behavior. Long-context: multi-turn conversations where the failure mode is in turn 5+. High-stakes: queries where 'mostly right' is unacceptable (refund eligibility, regulatory disclosure, medical accuracy). Adversarial: prompt injections, jailbreaks, off-topic deflections. Every case has the corpus snapshot ID it was captured against, the retrieved chunk IDs, the expected per-axis scores, and a documented reason for being in the set.
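One line from the high-stakes file might look like this (every field name and value is illustrative, including the incident reference):

```
{"case_id": "hs-0042", "slice": "high_stakes", "query": "Is order 18233 eligible for a refund?", "corpus_snapshot_id": "prod-2026-01-12", "retrieved_chunk_ids": ["refund-policy-v3#c12", "refund-policy-v3#c13"], "expected_scores": {"faithfulness": 0.97, "context_recall": 0.90}, "reason": "INC-214: answer invented a refund window not present in the policy"}
```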

How do you keep retrieval drift from silently degrading the agent over time?

Three patterns together. First, a nightly diff job that re-ingests the production corpus and reports any chunk-level changes since the last run. Second, the regression set is tagged with corpus snapshots, so you can replay yesterday's regression against today's corpus and see exactly which cases changed. Third, the rubric has a freshness slice: cases that have a known-correct-as-of-date and an expected staleness budget. If the slice fails, the freshness pipeline owner gets paged before the customer hits it. None of these are clever. All of them require somebody to own the corpus pipeline as a first-class artifact, not a side effect.
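A minimal version of the first pattern, the nightly chunk-level diff. Content hashing is one of several ways to detect a changed chunk, and the snapshot representation here (a chunk-ID-to-text mapping) is an assumption:

```python
import hashlib

def fingerprints(chunks: dict[str, str]) -> dict[str, str]:
    """Map chunk_id -> content hash for one corpus snapshot."""
    return {cid: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for cid, text in chunks.items()}

def corpus_diff(prev: dict[str, str], curr: dict[str, str]) -> dict[str, list[str]]:
    """Chunk-level diff between two snapshots' fingerprint maps."""
    added = sorted(set(curr) - set(prev))
    removed = sorted(set(prev) - set(curr))
    changed = sorted(cid for cid in set(prev) & set(curr) if prev[cid] != curr[cid])
    return {"added": added, "removed": removed, "changed": changed}

# Usage: corpus_diff(fingerprints(yesterday_chunks), fingerprints(today_chunks))
```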

What does it mean to fail CI 'on a slice regression for any agent'?

The harness scores each agent in the multi-agent loop independently against a per-agent slice (the retriever has a retrieval-quality slice, the responder has a faithfulness slice, the planner has a routing-correctness slice), plus the end-to-end trace against an aggregate slice. CI compares the per-agent and end-to-end scores to the previous main-branch run and to the rubric thresholds. A regression on any single agent fails the build with the agent named in the PR comment, even if the end-to-end aggregate looks fine. This is what makes multi-agent debugging tractable; without per-agent slices, you cannot tell which agent in the loop introduced the regression.

Is multi-agent always a bad call for production RAG?

No, but it is always more expensive to operate than a single-agent system, and the cost shows up in eval surface, not in inference cost. Three agents have three eval surfaces (per-agent quality), one orchestration eval surface (routing correctness), and one end-to-end surface (the user's experience). Five surfaces mean five places a regression can live and five places the rubric has to cover. Teams that go multi-agent without budgeting for that surface end up with a system that 'works in staging' and breaks in production for reasons that take weeks to find. Teams that budget for it ship faster than single-agent teams in the long run because the per-agent isolation makes incremental improvement tractable.

Where does fde10x fit?

fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the agentic-RAG eval rubric, the per-agent slices, the regression set with corpus tagging, and the CI gate into the client repo. The work is collaborative: the engineer drives the eval design with engineering, owns the corpus snapshotting and trace replay infrastructure, and leaves the team with the freshness pipeline owned by a documented role. We are not the only path; teams build this themselves all the time. The embed is the right call when the team has shipped agentic RAG and is now firefighting retrieval drift incidents at a rate that is eating the engineering budget.