Guide, topic: eval harness pinning, retrieval pipelines, 2026
Pin chunking config in your eval harness, or every model release will look like a regression.
Anthropic published a postmortem this spring on a stretch of Claude Code regressions that turned out, in large part, not to be model regressions at all. The retrieval pipeline around the model had drifted between runs (chunking, tokenizers, retrieval config) and the harness logged only the model name. The model took the blame. That failure mode is generic. Any team running a retrieval-augmented agent against a moving model lives one tokenizer roll away from the same misdiagnosis. This guide is about the file in your repo that prevents it.
Why the model gets blamed for things the model did not do
A typical retrieval-augmented agent stack has eight moving pieces between the user message and the model output: the splitter, the tokenizer, the embedding model, the vector store, the retriever, the reranker, the prompt template, and the model itself. Seven of those eight are not the model. Each of those seven can quietly change in ways that look, downstream, identical to a model regression. Recall drops, the agent answers a question with worse grounding, the eval score for that case slides three points, and the conversation in the standup the next morning is about whether the latest snapshot from the vendor is worse than the previous one.
The reason this happens is structural, not careless. The eval harness is owned by the agent team. The indexer is owned by the data team. The embedding model upgrade is owned by the platform team. None of those owners thinks of themselves as the owner of the eval surface, and the harness output rarely captures enough state to attribute a regression to any one of them. The line in the run log says model_id and a score. It does not say which splitter ran, which tokenizer ran, which embedding model produced the index, what the retriever k was, or which reranker scored the candidates.
Until the harness logs all of those, every model release is a potential regression and every retrieval tweak is a potential silent regression. The team operates in a fog where the only signal is the score and the only suspect is the model. Anthropic's Claude Code postmortem is the public version of a private experience that most production agent teams have had at least twice.
The pin file: one YAML in the repo
The fix is unsexy. A single file in the eval directory, named something like eval/pipeline_pin.yaml, that pins every variable in the pipeline that can affect retrieval quality. Every eval run reads this file, writes its SHA into the run row, and refuses to start if the live indexer reports different values. The pin is versioned by PR. A real change to chunk size or embedding model is a one-line PR that bumps the pin and triggers a re-baseline. A silent change is impossible because the harness aborts.
The minimum fields the pin needs: chunk_size, chunk_overlap, splitter_name, splitter_sha, tokenizer_name, tokenizer_sha, embedding_model_id, embedding_model_sha, retriever_k, similarity_floor, reranker_id, reranker_sha, prompt_template_sha. Plus a corpus_hash that fingerprints the input documents so a re-ingest with new content is also detectable. The corpus_hash is often the field that catches the most regressions, because corpus drift (someone added new docs, someone re-cleaned old docs) is common and almost always invisible to the eval harness.
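For concreteness, here is a minimal sketch of what that file and the run-row stamping could look like, in Python. The field values are placeholders, not recommendations, and the YAML is inlined only so the snippet is self-contained; the real file lives at eval/pipeline_pin.yaml and the SHA is computed over the file bytes.

```python
# Minimal sketch of parsing a pin and stamping its SHA into the run row. The
# YAML is inlined with placeholder values only so the example is self-contained;
# a real harness reads eval/pipeline_pin.yaml from disk and hashes the file bytes.
import hashlib

import yaml  # PyYAML

PIN_YAML = """\
chunk_size: 512
chunk_overlap: 50
splitter_name: recursive_character
splitter_sha: "0000000"             # placeholder
tokenizer_name: cl100k_base
tokenizer_sha: "0000000"            # placeholder
embedding_model_id: text-embedding-3-large
embedding_model_sha: "0000000"      # placeholder: pin the snapshot, not the alias
retriever_k: 8
similarity_floor: 0.35
reranker_id: cross-encoder-local
reranker_sha: "0000000"             # placeholder
prompt_template_sha: "0000000"      # placeholder
corpus_hash: "0000000"              # fingerprint of the ingested documents
"""

pin = yaml.safe_load(PIN_YAML)
pin_sha = hashlib.sha256(PIN_YAML.encode("utf-8")).hexdigest()

# Every run row carries the pin SHA plus the pinned fields, so a score from six
# weeks ago traces back to an exact pipeline state.
run_row = {"model_id": "model-under-test", "pipeline_pin_sha": pin_sha, **pin}
print(run_row["pipeline_pin_sha"][:12], run_row["chunk_size"])
```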
The pin file does one more thing that matters: it makes the eval output legible to people who did not run it. An engineer reading the run row from six weeks ago can see the exact pipeline state that produced the score. They can check out the SHA of the pin file, run the harness, and reproduce the number. Without that, the scores from six weeks ago are not data; they are folklore.
Six failure modes the pin file would have caught
Each of these is a real shape of regression that the harness attributed to the model and that a pin file would have surfaced as a pipeline change in the same eval run.
Tokenizer changed under the splitter
An embedding model upgrade quietly changed the tokenizer. The recursive character splitter still says chunk_size=512, but its length function counts tokens, so the same setting now breaks chunks differently around code blocks. Recall drops 6 points across the eval set. The model that the team is qualifying takes the blame because the eval harness logged only the model name.
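To see the mechanism, assume the splitter's length function counts tokens and that the old and new tokenizers are the two tiktoken encodings below (the names are stand-ins, not a claim about any particular vendor's rollout):

```python
# Same text, two tokenizers: the counts differ, so a splitter whose length
# function counts tokens draws every chunk boundary differently even though
# chunk_size never changed.
import tiktoken

text = "def retrieve(query):\n    return store.search(query, k=8)  # code tokenizes unevenly"

for name in ("cl100k_base", "o200k_base"):
    n_tokens = len(tiktoken.get_encoding(name).encode(text))
    print(f"{name}: {n_tokens} tokens")
```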
Reranker swapped without a config bump
Someone moves the reranker from a hosted endpoint to a local one to save cost. The local one returns a slightly different ordering for ties. Top-3 recall drops 4 points on the long tail of the eval set. The week's model qualification gets thrown out because nobody can tell whether the new model is worse or the reranker is.
Embedding model snapshot rotated
The embedding model rolled over a minor version on the provider side. The vector store was rebuilt with the new embeddings on Wednesday. Friday's eval run uses the new index. The aggregate score is the same but the tail-prompt slice that asks about acronyms drops 11 points. The harness has no record of which embedding model SHA was used for which run.
Corpus re-ingested with a different overlap
The team raised chunk overlap from 50 to 100 to fix one bad answer in a demo. The change landed without a PR review. Two weeks later the eval set scores 3 points worse on the multi-document questions because the duplicated context confuses the cross-encoder. The model qualification meeting blames Anthropic's latest snapshot for the regression.
Indexer non-determinism inside the pipeline
A parallel indexer writes documents in non-deterministic order, which changes the candidate set the retriever sees for ties. Same pipeline config, two runs, three points apart. The harness has no way to flag that the index itself drifted, so the variance gets attributed to the model's sampling temperature.
Eval set got bigger and nobody noticed
Three new cases were added to mine a recent customer escalation. They happen to be hard. Aggregate score drops 2 points. The team rolls back to last quarter's model. The actual cause is that the eval set grew by three hard cases, a denominator change the harness should have surfaced in the run row.
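One way to make that denominator visible, assuming each eval case has a stable ID, is to fingerprint the eval set alongside the pin; a sketch:

```python
# Fingerprint the eval set itself: log the case count and a hash of the sorted
# case IDs into every run row, next to the pin SHA, so "the set got harder"
# shows up as a diff rather than an unexplained drop. Field names are
# illustrative, not a fixed schema.
import hashlib


def eval_set_fingerprint(case_ids):
    joined = "\n".join(sorted(case_ids)).encode("utf-8")
    return {
        "eval_case_count": len(case_ids),
        "eval_set_sha": hashlib.sha256(joined).hexdigest(),
    }
```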
Pinned harness vs default harness, side by side
The left column is the harness most teams have on day one. The right column is the pin shape we ship into client repos. The left column is not wrong; it is incomplete. It tells you the model produced a score; it does not tell you what state the pipeline was in when it did.
| Feature | Default eval harness | Pinned eval harness |
|---|---|---|
| Where chunking config lives | Inside the indexer, set once in code, no record in the eval run output. Two runs from two weeks apart can have different chunking and you find out from the answer quality, not the diff. | eval/pipeline_pin.yaml in the repo, version-controlled, loaded by every eval run. Chunk size, overlap, splitter, tokenizer SHA, embedding model SHA all pinned. Any change is a PR with a diff. |
| What gets logged per eval run | Just the score and the model name. Reproducing the run requires asking three engineers what state the indexer was in last Tuesday. | pipeline_pin.yaml SHA, embedding model SHA, retriever k, similarity floor, reranker SHA, tokenizer SHA, model_id, model snapshot date. Stored alongside scores so any regression has a config diff in the same row. |
| How chunking changes get detected | Nobody checks. The first signal is that a model the team trusted scores 8 points lower this week and the on-call engineer spends two days bisecting model snapshots. | Pre-eval check that hashes the indexer output for a fixed corpus and refuses to run if the hash differs from the pinned value without an accompanying PR. |
| Who gets blamed when scores drop | The model. Default assumption every time. Half the time the model is fine and the harness is the regression. | The diff. Either the pin changed (someone owns it) or the model changed (a vendor owns it). The harness puts the cause on screen before the meeting starts. |
| Cross-team reproducibility | The numbers from six weeks ago are unreproducible. The corpus has been re-indexed twice, the splitter changed once, and nobody wrote it down. | An engineer who joined yesterday can git checkout, run scripts/run-eval.sh, and produce the same numbers as the run from six weeks ago. The pin file makes the run hermetic. |
| How a model swap interacts with pipeline state | Model swap PR happens to land the same week as a pipeline tweak. Two changes, one number, no way to attribute. | Model swap is one PR that bumps model_id and re-runs the eval against the same pinned pipeline. The delta is attributable to the model because nothing else moved. |
| What the postmortem looks like after a bad week | A Slack thread three days long that ends with 'we think it was the upgrade, but we are going to roll back the model anyway.' | git log eval/pipeline_pin.yaml shows every change to the retrieval surface in the trailing window. The conversation is about the diff, not about whether the model regressed. |
What the rollout looks like in a week
The first version of the pin file is half a day of work. The difficult part is not the YAML; it is the conversation with the indexer team about exposing the live config so the harness can check it. Once that endpoint exists, the rest is mechanical: read the file, hash the live state, compare, abort on mismatch, log the pin SHA into every run row.
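A sketch of that mechanical part, under two assumptions: the indexer exposes a JSON config endpoint (the URL and field names here are hypothetical), and the comparison happens field by field so the abort message can name what drifted.

```python
# Pre-eval guard: load the pin, fetch the indexer's live config, compare field
# by field, abort on mismatch, and log the pin SHA into the run row.
# INDEXER_CONFIG_URL and the response field names are hypothetical; swap in
# whatever your indexer actually exposes.
import hashlib
import json
import sys
import urllib.request

import yaml  # PyYAML

PIN_PATH = "eval/pipeline_pin.yaml"
INDEXER_CONFIG_URL = "http://indexer.internal/config"  # hypothetical endpoint

CHECKED_FIELDS = [
    "chunk_size", "chunk_overlap", "splitter_name", "tokenizer_name",
    "embedding_model_id", "retriever_k", "similarity_floor", "reranker_id",
]


def load_pin(path):
    raw = open(path, "rb").read()
    return yaml.safe_load(raw), hashlib.sha256(raw).hexdigest()


def fetch_live_config(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def main():
    pin, pin_sha = load_pin(PIN_PATH)
    live = fetch_live_config(INDEXER_CONFIG_URL)

    mismatches = {
        field: {"pinned": pin.get(field), "live": live.get(field)}
        for field in CHECKED_FIELDS
        if pin.get(field) != live.get(field)
    }
    if mismatches:
        # Refuse to run: the error names the drifted fields, and the fix is a
        # PR that bumps the pin, not a silent re-baseline.
        print(f"pin_mismatch: {json.dumps(mismatches)}", file=sys.stderr)
        sys.exit(1)

    # No drift: every run row carries the pin SHA so the score is attributable.
    run_row = {"pipeline_pin_sha": pin_sha, **{f: pin[f] for f in CHECKED_FIELDS}}
    print(json.dumps(run_row))


if __name__ == "__main__":
    main()
```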
The second week is when the pin starts paying off. The first real-world eval after the pin lands will produce a clean run row that names every dependency. The first time someone changes the embedding model, the harness will refuse to run until the pin is bumped. The first time a model snapshot lands, the harness will run against the same pinned pipeline and the score delta will be attributable to the model alone.
By week three the team's behavior changes. Engineers stop blaming the model by default. Pipeline tweaks get PRs because the harness forces them to. The standup conversation about regressions shortens because the diff is on screen before the meeting starts. This is the unglamorous payoff. Less drama, fewer rollbacks, more real shipping decisions.
Where this fits in a six-week embed
fde10x is one option for teams that want a senior forward-deployed engineer to land the harness, the pin file, the corpus hash, and the canary embeddings in a two to six week embed. The work is not exotic. It is unglamorous infrastructure that happens to be the difference between a team that qualifies model upgrades cleanly and one that rolls them back twice a quarter for the wrong reason. Plenty of teams build this themselves and that is the right call when there is a senior eval engineer on staff already. When there is not, the cost of learning the failure modes from your own production traffic is higher than the cost of an embed.
The leave-behind on a six-week engagement is roughly: the pin file, the eval harness extension that reads it and aborts on mismatch, the corpus hash check, the indexer-side endpoint that exposes live config, a canary embeddings test for vendor model rollovers, and a runbook entry that walks the on-call engineer through "the harness aborted with pin_mismatch, what now". Six files. The team owns them. fde10x leaves.
Want a senior engineer to land your eval pin file and stop the rollback cycle?
A 60-minute scoping call with the engineer who would own the build. You leave with a draft pipeline_pin.yaml against your stack, a list of the indexer endpoints we need wired, and a fixed weekly rate to ship the harness, the pin, and the first clean run inside your repo.
Pinning chunking config in the eval harness, answered
Why is chunking config the part that drifts most quietly?
Because it lives inside the indexer, which usually sits in a separate repo or pipeline from the agent and the eval harness. Three teams touch it. None of them think of themselves as owning the eval surface. Splitter changes, tokenizer rolls, embedding upgrades, and overlap tweaks happen in the indexer for perfectly good reasons (cost, latency, freshness) and they never make it into the eval run record. The eval harness sees only the score and the model name, so the model gets the blame.
What did Anthropic's recent Claude Code postmortem actually say?
The short version is that several of the regressions reported against Claude Code over a stretch of weeks were not regressions in the model itself; they were drift in the surrounding pipeline (chunking, retrieval, prompt assembly) that made the model's behavior look worse on internal evals. The postmortem made the case explicitly that without pinning the surrounding pipeline, you cannot tell a model regression from a harness regression. That distinction is the entire reason the pin file in this guide exists.
What does the pin file actually contain?
The minimum: chunk_size, chunk_overlap, splitter_name, splitter_sha, tokenizer_name, tokenizer_sha, embedding_model_id, embedding_model_sha, retriever_k, similarity_floor, reranker_id, reranker_sha, prompt_template_sha. Plus a corpus_hash that hashes the input documents so a silent re-ingest is detectable. Every eval run reads this file, logs the values into the run row, and refuses to start if the live pipeline disagrees with the pin without a PR.
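One possible corpus_hash, sketched under the assumption that the corpus is a directory of files: hash each file's relative path and bytes in sorted order, so the fingerprint is deterministic but moves whenever anything is added, removed, or re-cleaned.

```python
# One way to compute corpus_hash: hash every file's relative path and bytes in
# sorted order. The fingerprint is deterministic across runs but changes when
# a document is added, removed, re-cleaned, or edited. The directory layout is
# an assumption; adapt the walk to however your corpus is stored.
import hashlib
from pathlib import Path


def corpus_hash(corpus_dir: str) -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(corpus_dir)).encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()
```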
How do I detect that the live pipeline has drifted from the pin?
Hash the live config at the top of the eval harness and compare. Specifically: ask the indexer for its current splitter, chunk size, overlap, tokenizer, and embedding model; compute a SHA of that struct; compare to the pin. If they disagree, the run aborts with an error that names the field. The right way to fix a real change is to bump the pin in a PR, which makes the change auditable and the score delta interpretable.
What about the embedding model itself rolling over?
Pin the embedding model by full version string or SHA, not by alias. text-embedding-3-large does not pin you. The exact provider snapshot does. If the provider does not expose a SHA, treat the alias as a moving target and add a canary that re-embeds a fixed sentence and checks the vector against a stored reference. If the canary changes, the model rolled, and the eval run flags the change before it scores anything.
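A sketch of that canary, where embed() stands in for whatever embedding client the team already uses and the tolerance is an arbitrary starting point rather than a recommendation:

```python
# Canary check: re-embed a fixed sentence and compare it against a reference
# vector stored when the pin was last bumped. embed() is a placeholder for
# whatever embedding client the team already uses; the 1e-6 tolerance is an
# arbitrary starting point, not a recommendation.
import json
import math

CANARY_TEXT = "The quarterly report was filed on the fourth of March."
REFERENCE_PATH = "eval/canary_embedding.json"  # hypothetical location


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


def canary_rolled(embed) -> bool:
    reference = json.load(open(REFERENCE_PATH))  # list of floats
    current = embed(CANARY_TEXT)
    return cosine(reference, current) < 1.0 - 1e-6

# If canary_rolled(my_embed_fn) returns True, the provider's alias now points
# at a different snapshot: flag or abort the run before scoring anything.
```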
Does this slow down the team?
It slows down ad hoc experimentation a little. It speeds up qualifying real model upgrades a lot. The cost of a pin file is one PR per intentional pipeline change. The cost of not having one is a recurring two-day blame cycle every time a vendor ships a snapshot. Two of the engagements fde10x shipped this quarter started because the team had been rolling back model upgrades for two months running and the actual cause turned out to be an embedding swap from January.
How is this different from just running evals more often?
Running evals more often catches drift faster but does not tell you what drifted. The pin file is what makes the diff in the run row legible. Without it, the daily eval just produces noise that nobody can attribute. With it, every regression has a one-line explanation in the run output: model changed, embedding changed, splitter changed, corpus changed, or the model genuinely regressed.
Where does fde10x fit in this?
We are one option for teams who want a senior engineer embedded in the repo for two to six weeks to ship the eval harness, the pin file, and the regression suite into your existing CI. Plenty of teams build this themselves; that is the right call when there is a senior eval engineer on staff. When there is not, embedding someone who has done this on five or six prod agents is faster than learning the failure modes from your own production traffic.
What is the smallest version of this I can ship next week?
Three things. First, a YAML file in the repo with chunk size, overlap, splitter name, embedding model name, retriever k, and reranker name. Second, a check at the top of the eval script that loads the file and writes its SHA into every run row. Third, a one-line guard that aborts the run if the live indexer reports different values. That is enough to make 80 percent of the harness drift visible. The rest (corpus hash, SHA pinning, canary embeddings) can land over the next two weeks.
Does the pin work across vendors (Anthropic, OpenAI, Google, Bedrock)?
Yes. The pin describes the pipeline around the model, not the model. Switching vendors changes one line (model_id) and re-runs the eval. The pin makes the swap interpretable: any score delta is attributable to the model because the harness around it is held constant. This is what makes the pin file the most expensive thing to skip when you are about to do a multi-vendor bake-off, which is the next guide in this series.