Guide, topic: LLM wiki eval set, 2026

Your LLM Wiki is quietly rotting between compiler runs, and the eval set that catches it is two YAML files plus 60 lines of Python.

The LLM Wiki pattern, the one Karpathy published as a gist and that a half-dozen follow-up blog posts walk through, has a clean shape: an intake folder, a wiki folder, a librarian agent, and sometimes a Hermes-style validator on the promotion gate. None of those posts describe what happens between compile cycles. The librarian re-emits an article and quietly summarizes "sqrt(d_k)" into "a normalization factor" into nothing. The source paragraph in intake/ never moved. Six compile cycles later the wiki has lost the named technique and a downstream agent answers a question wrong in production. This guide is the regression eval set we ship alongside the wiki to catch exactly that: a per-article challenge file, a paragraph-hash provenance file, and a 60-line nightly script with two exit codes for two failure modes.

Matthew Diakonov
13 min read
Same wiki regression contract shipped on five named agent engagements
Per-article challenge: question + expected_substring + must_also_include
Paragraph-hash provenance: SHA-256:12 of normalized whitespace, locked in PR
Nightly cron: exit 2 on rot (page on-call), exit 3 on source moved (open update PR)

Same wiki contract across Pydantic AI, LangGraph, custom orchestration, an automated ML pipeline, and a multi-model DAG. Librarian backend swappable; the eval set does not move.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Audit your LLM Wiki rot rate with an engineer

What every LLM Wiki write-up misses

Open the existing posts on the LLM Wiki pattern. The gist itself, the VentureBeat summary, the MindStudio walkthrough, the half dozen Medium and dev.to follow-ups. They all cover the same four things: an intake folder of raw sources, a wiki folder of compiled articles, a librarian agent that reads the first and produces the second, and sometimes a separate validator that scores draft articles before promotion. All of those things are correct. None of them tell you what happens between compile runs, when the wiki is live and the agent has not been re-run yet but a downstream system is querying it.

The failure mode we see, across every wiki we have inherited around month four, is the same. The librarian re-emits the attention article and the explanation drifts: from "divide by sqrt(d_k) to keep the softmax from saturating" to "divide by a normalization factor" to a sentence that no longer mentions the divisor at all. The source paragraph in intake/papers/vaswani_2017.md is byte-identical the whole time. The compiler is not malicious; it is averaging toward the model's preferred level of abstraction. After ten compile cycles a wiki built to capture "the named technique and the named threshold" reads like a Wikipedia stub.

The fix is not a better validator on the promotion gate. The fix is a regression test that runs against the live wiki between compile cycles, asks the librarian a fixed question, and refuses to grade the answer as passing unless a literal substring ("sqrt(d_k)", "Annex III", "clause boundary") is present. The test is anchored to a SHA-256 hash of the source paragraph the answer is derived from, so a real source update can be told apart from silent backward drift. Without that anchor, a regression looks identical to a legitimate edit and the on-call gives up.

Anchor: the six fields per challenge

Every row in eval/wiki_challenges.yaml carries an id, a question, an expected_substring, an optional must_also_include list, a severity, and a pinned_provenance reference. Each field is a deliberate choice; we have moved each of them on at least one engagement and re-pinned at the new value with a one-line note in the runbook.

wiki-<topic-slug>-NNN

Stable ID. Topic slug matches the article filename. NNN is the within-topic index. Greppable: grep eval/wiki_challenges.yaml for wiki-att-mech- and you get every challenge against the attention article. The id is the index, the index is the audit trail.

expected_substring

A literal token the answer MUST contain. Not a regex, not a fuzzy match. Exact substring search. We learned that fuzzy scoring lets the librarian degrade the article 10 percent at a time without tripping any threshold. A literal sqrt(d_k) is either there or it is not.

must_also_include[]

Optional list of secondary tokens that must all appear. Use sparingly: a long list converts a regression test into a snapshot test. Two or three tokens is the sweet spot for compliance and named-technique articles.

severity: standard | critical | blocker

blocker pages on-call immediately on rot. critical pages within 24 hours. standard goes into the Monday digest. Compliance and safety articles default to blocker. Named-technique articles default to standard.

pinned_provenance

The id of a row in eval/wiki_provenance.yaml. The relationship is many-to-one: many challenges can pin to the same source paragraph. That is fine. The lint refuses to merge a challenge with a pinned_provenance that does not resolve to a real provenance row.

eval/wiki_challenges.yaml, one row per article

Four real challenges shown below: an attention-mechanism article, a positional-encoding article, an EU AI Act compliance article, and a RAG-chunking article compiled from a customer incident. Each pins to a row in eval/wiki_provenance.yaml; each carries the severity that determines on-call behavior on regression.

eval/wiki_challenges.yaml
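A minimal sketch of those four rows, assuming the field names described above. The questions, the provenance IDs, and the article paths other than wiki/eu_ai_act_high_risk.md are illustrative, not copied from a real engagement file; only the ID prefixes, the expected substrings, and the blocker severity on the compliance row come from this guide.

```yaml
# eval/wiki_challenges.yaml -- illustrative sketch; values are examples, not a real file
- id: wiki-att-mech-001
  article: wiki/attention_mechanism.md          # hypothetical article path
  question: "Why is the dot product scaled before the softmax in attention?"
  expected_substring: "sqrt(d_k)"
  must_also_include: ["softmax"]
  severity: standard
  pinned_provenance: prov-vaswani-2017-scaling

- id: wiki-pos-enc-003
  article: wiki/positional_encoding.md          # hypothetical
  question: "How does the original Transformer encode token position?"
  expected_substring: "sinusoidal"              # illustrative token
  severity: standard
  pinned_provenance: prov-vaswani-2017-posenc

- id: wiki-eu-ai-act-007
  article: wiki/eu_ai_act_high_risk.md
  question: "Which annex lists the high-risk AI system categories?"
  expected_substring: "Annex III"
  must_also_include: ["high-risk"]
  severity: blocker
  pinned_provenance: prov-eu-ai-act-annex-iii

- id: wiki-rag-chunking-014
  article: wiki/rag_chunking_incident.md        # hypothetical
  question: "Where should chunk boundaries fall for contract documents?"
  expected_substring: "clause boundary"
  severity: critical
  pinned_provenance: prov-incident-chunking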

eval/wiki_provenance.yaml, the source-paragraph hash

One row per pinned source paragraph. source_path points at a file under intake/. paragraph_anchor is a human-readable description so reviewers can re-derive the hash by hand. paragraph_hash is the first 12 hex characters of SHA-256 over the whitespace-normalized paragraph. locked_at and locked_in_pr give the audit trail. A challenge that pins to a non-existent provenance ID fails the lint immediately.

eval/wiki_provenance.yaml
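A sketch of the matching provenance rows. The hash values, dates, and PR numbers are placeholders; the Vaswani anchor text is the one quoted in the FAQ below, the second anchor is illustrative.

```yaml
# eval/wiki_provenance.yaml -- illustrative sketch; hashes, dates and PR numbers are placeholders
- id: prov-vaswani-2017-scaling
  source_path: intake/papers/vaswani_2017.md
  paragraph_anchor: "Section 3.2.1, paragraph beginning 'We compute the dot products'"
  paragraph_hash: "a1b2c3d4e5f6"   # first 12 hex chars of SHA-256, placeholder value
  locked_at: "2025-11-03"
  locked_in_pr: 142                 # placeholder PR number

- id: prov-eu-ai-act-annex-iii
  source_path: intake/regulations/eu_ai_act.md
  paragraph_anchor: "Classification rules paragraph referencing Annex III"   # illustrative anchor
  paragraph_hash: "0f9e8d7c6b5a"
  locked_at: "2025-11-10"
  locked_in_pr: 151
```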

scripts/wiki_staleness_check.py, the rot detector

Sixty lines. Depends on pyyaml and the standard library only. Reads the two YAML files, recomputes the paragraph hash, asks the librarian agent the question against ONLY the wiki article (not the source), and decides one of three states. Exit 0 means everything passes. Exit 2 means rot: a challenge regressed but its source hash is unchanged. Exit 3 means a source paragraph moved: the article needs an honest re-read, but it is not rot. The Monday cron pipes the JSON output into a trend chart in Slack.

scripts/wiki_staleness_check.py
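A minimal sketch of the detector under the assumptions this guide states: the two YAML shapes above, and a `bin/librarian --article <path> --q <question>` CLI for the librarian agent (the `--temperature` flag name and the anchor-matching heuristic are assumptions, adjust to your repo).

```python
#!/usr/bin/env python3
"""Nightly wiki rot detector -- minimal sketch.
Exit 0 = all pass, 2 = rot (regression, source hash unchanged),
3 = source moved (hash changed). Depends only on pyyaml + stdlib."""
import hashlib
import json
import re
import subprocess
import sys

import yaml


def hash12(paragraph: str) -> str:
    """First 12 hex chars of SHA-256 over whitespace-normalized text."""
    normalized = " ".join(paragraph.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def find_paragraph(source_text: str, anchor: str) -> str:
    """Locate the pinned paragraph from its human-readable anchor.
    Sketch heuristic: use the quoted 'beginning ...' phrase if present,
    otherwise the first paragraph containing the anchor text."""
    paragraphs = [p for p in re.split(r"\n\s*\n", source_text) if p.strip()]
    m = re.search(r"beginning ['\"]?(.+?)['\"]?$", anchor)
    needle = m.group(1) if m else anchor
    for p in paragraphs:
        if needle.lower() in " ".join(p.split()).lower():
            return p
    return paragraphs[0]


def ask_librarian(article_path: str, question: str) -> str:
    """Ask the librarian agent the question against ONLY the wiki article."""
    out = subprocess.run(
        ["bin/librarian", "--article", article_path, "--q", question,
         "--temperature", "0"],          # flag name assumed
        capture_output=True, text=True, check=True,
    )
    return out.stdout


def main() -> int:
    with open("eval/wiki_challenges.yaml") as f:
        challenges = yaml.safe_load(f)
    with open("eval/wiki_provenance.yaml") as f:
        provenance = {row["id"]: row for row in yaml.safe_load(f)}

    saw_rot, saw_source_moved = False, False
    for ch in challenges:
        prov = provenance[ch["pinned_provenance"]]   # lint guarantees this resolves
        with open(prov["source_path"], encoding="utf-8") as f:
            paragraph = find_paragraph(f.read(), prov["paragraph_anchor"])
        hash_same = hash12(paragraph) == prov["paragraph_hash"]

        answer = ask_librarian(ch["article"], ch["question"])
        tokens = [ch["expected_substring"]] + ch.get("must_also_include", [])
        passed = all(tok in answer for tok in tokens)

        if hash_same and passed:
            state = "ok"
        elif hash_same:
            state = "rot"             # article regressed, source did not move
            saw_rot = True
        else:
            state = "source_moved"    # source changed, article needs a re-read
            saw_source_moved = True

        print(json.dumps({"id": ch["id"], "state": state,
                          "hash_same": hash_same, "severity": ch["severity"]}))

    if saw_rot:
        return 2      # rot outranks source_moved on the same run
    if saw_source_moved:
        return 3
    return 0


if __name__ == "__main__":
    sys.exit(main())
```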

.github/CODEOWNERS, the human gate

The compiler agent has write access to wiki/ and intake/. It must NOT have write access to the eval files. Otherwise a degraded compile cycle will rewrite the article AND the matching challenge in the same diff, and the rot detector will silently agree with itself. CODEOWNERS pins both eval files and the staleness script to the engineering lead so the agent cannot bypass the test by editing it.

.github/CODEOWNERS
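The relevant lines, sketched; the reviewer handle is a placeholder for your engineering lead or their team.

```
# .github/CODEOWNERS -- @org/eng-lead is a placeholder handle
/eval/wiki_challenges.yaml        @org/eng-lead
/eval/wiki_provenance.yaml        @org/eng-lead
/scripts/wiki_staleness_check.py  @org/eng-lead
```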

The numbers that govern the eval set

Four parameters. The script length, the hash width, the severity bucket count, and the intake folder count. Each is a deliberate choice we have re-pinned on at least one engagement.

60 lines of Python in scripts/wiki_staleness_check.py. 12 hex characters of SHA-256 per paragraph hash. 3 severity buckets: standard goes to the Monday digest, critical pages within 24 hours, blocker pages immediately. 4 required intake folders: papers, regulations, incidents, transcripts.

What rot looks like at 03:14 in the morning

A real run. Three challenges pass cleanly. wiki-eu-ai-act-007 fails with hash_same=true: the librarian rewrote wiki/eu_ai_act_high_risk.md last night and dropped the "Annex III" reference. wiki-rag-chunking-014 fails with hash_same=false: the underlying incident write-up was edited (someone added a new sub-section), so the script opens an update PR for the article, not a page. The script exits 2 because rot outranks source-moved when both happen on the same run.

$ wiki_staleness_check.py -- nightly cron, 03:14 UTC
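An illustrative transcript of that run. The two failing challenge IDs are the ones named above; the three passing IDs and the exact JSON values are placeholders in the output format the script sketch prints.

```
{"id": "wiki-att-mech-001", "state": "ok", "hash_same": true, "severity": "standard"}
{"id": "wiki-pos-enc-003", "state": "ok", "hash_same": true, "severity": "standard"}
{"id": "wiki-eu-ai-act-006", "state": "ok", "hash_same": true, "severity": "blocker"}
{"id": "wiki-eu-ai-act-007", "state": "rot", "hash_same": true, "severity": "blocker"}
{"id": "wiki-rag-chunking-014", "state": "source_moved", "hash_same": false, "severity": "critical"}
$ echo $?
2
```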

Where every challenge gets its source

Four kinds of intake feed the wiki on the left. The compiled article plus its locked challenge sit in the middle. Three things consume that pair on the right: the librarian agent serving production traffic, the nightly staleness check, and the auto-opened update-PR queue when a source hash drifts.

intake/papers/, intake/regulations/, intake/incidents/, intake/transcripts/ -> wiki/ + eval/wiki_challenges.yaml -> librarian agent, nightly staleness check, update-PR queue

The lifecycle of one wiki article, intake to nightly cron

Six steps. Same shape on every engagement. The thing that varies is which intake folder the source landed in (papers, regulations, incidents, transcripts) and which severity the challenge gets; the contract underneath is identical.

1. A new source lands in intake/.

A paper, a regulation, an incident write-up, a transcript. The compiler agent reads it and drafts a candidate article in wiki-staging/. Nothing is promoted to wiki/ yet. The article is allowed to be wrong; the safety net comes next.

2. The challenge file gets the matching question.

Whoever opens the PR adds at least one row to eval/wiki_challenges.yaml: an id, the article path, a question a downstream user might actually ask, the expected_substring, optional must_also_include tokens, and a severity. The point is to lock in what the article must keep saying. Without a challenge there is nothing to regress against.

3. Provenance is hashed, not just cited.

eval/wiki_provenance.yaml gets a row: source_path, paragraph_anchor, paragraph_hash (SHA-256:12 of normalized whitespace), locked_at, locked_in_pr. The challenge points at the provenance ID. The hash is the line between rot and a real source update; we cannot tell them apart without it.
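A reviewer can re-derive the pinned hash by hand from the anchored paragraph; a minimal sketch, with the paragraph pasted into a scratch file whose path is illustrative:

```python
import hashlib

# Paste the anchored source paragraph into this scratch file (path is illustrative).
paragraph = open("/tmp/pinned_paragraph.txt", encoding="utf-8").read()
normalized = " ".join(paragraph.split())                  # whitespace normalization
print(hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12])
```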

4. Promotion gate runs once.

The PR runs scripts/wiki_staleness_check.py against the new challenge alongside every existing one. If the new challenge passes and no existing challenge regressed, the article promotes from wiki-staging/ to wiki/. If anything regressed, the merge blocks until the regression is either repaired or acknowledged with a removal-ledger row (same shape as eval/cases_removed.yaml on the agent side).

5. Cron runs the rot detector nightly.

Once the article is in wiki/, scripts/wiki_staleness_check.py runs at 03:00 against the full challenge set. Failures with hash_same=true page the engineering lead via PagerDuty: that is the librarian quietly summarizing a fact out of the article between compile cycles. Failures with hash_same=false open a labeled PR for human review: the source moved, the article needs an honest re-read.
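A sketch of the cron workflow. The filename, the paging and update-PR wiring, and the Python version are assumptions; the PR-triggered .github/workflows/wiki-eval.yml mentioned later has the same steps with an on: pull_request trigger instead.

```yaml
# .github/workflows/wiki-nightly.yml -- illustrative sketch
name: wiki-nightly-staleness
on:
  schedule:
    - cron: "0 3 * * *"       # 03:00 UTC nightly
jobs:
  rot-detector:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml
      - run: python scripts/wiki_staleness_check.py
        # A real workflow would branch on the exit code: 2 feeds the paging
        # integration, 3 triggers a follow-up step that opens the update PR.
```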

6. The trend is the real signal.

Monday cron posts |passing| / |total| over the last 4 weeks in the engineering Slack, with the curve broken out by severity. A flat line is good. A downward bend on critical or blocker rows is escalated to the engagement owner the same hour. The wiki gains coverage; it never loses it silently.

Why each design choice, in one card

Four small decisions carry the whole pattern. Hash the paragraph not the file. Twelve hex characters of SHA-256 not sixty-four. Challenges in eval/ not inside the article. Librarian sees only the article on test, not the source.

Why hash the paragraph, not the file

Files get edited in unrelated places: a typo fix at the top, a renamed citation at the bottom. Hashing the file produces hash_same=false on cosmetic edits and the rot signal disappears under noise. Hashing the specific paragraph the answer is derived from keeps the signal sharp.

Why SHA-256 first 12 hex characters

12 hex chars = 48 bits, ~280 trillion buckets. We have run this on a 1,400-paragraph wiki for 14 months without a collision. Twelve fits in a YAML row without wrapping; full 64 was pure cosmetic noise.

Why challenges live in YAML, not in the wiki article

If the challenges live inside the wiki article (e.g., in a code fence at the bottom), the librarian agent that rewrites the article also rewrites the test. Putting them in eval/ under CODEOWNERS to a human means the agent cannot touch them in the same diff.

Why the librarian gets ONLY the article

If the staleness check lets the librarian see the original source AND the article, the librarian quietly answers from the source and the wiki could be empty without us noticing. The check feeds it only the article: the wiki is what is on trial, nothing else.

Why this matters on the next compile cycle

The wiki is what your agents trust at runtime. The eval set is what you trust about the wiki.

Every customer ticket that escalates to "your agent answered the wrong thing yesterday" has an upstream cause. Half the time the upstream cause is the wiki article the agent grounded against, and that article got worse on a compile run nobody reviewed. The librarian summarized away the named technique. The citation rotted. The fact survived in the source the whole time.

A 60-line script and two YAML files turn that whole class of failure into a PagerDuty page at 03:14 instead of a customer email at 14:30. The contract is cheap. The thing it protects, the trustworthiness of the wiki between compile runs, is the entire point of running the LLM Wiki pattern in production.

0 silent rot tolerated

The wiki was rewritten on a compile run we did not review. The article still looked clean. The downstream compliance answer dropped 'Annex III' the same week. We rebuilt the eval set after that.

anonymized engagement intake, 2026

Side by side: hash-pinned challenge set vs the typical "ask an LLM judge"

Left: the contract we ship as a 6-week leave-behind alongside the LLM Wiki. Right: the shape we walk into on month-four engagements where the wiki was "evaluated by running an LLM judge weekly." The judge approach degrades with the model itself; the hash-pinned approach does not.

What gets evaluated
Weekly LLM-judge wiki score (typical shape): The wiki is judged by an LLM-as-judge pass over a sample of articles for 'truth and completeness'. No fixed questions, no per-article assertions. Whatever the judge model decides today is the score; tomorrow's score is incomparable.
Hash-pinned wiki eval set (PIAS shape): Each wiki article carries at least one challenge in eval/wiki_challenges.yaml: a question + expected_substring + optional must_also_include. The librarian agent must answer it correctly using ONLY that article. The article is the unit of test, not the corpus.

How drift is detected
Weekly LLM-judge: A weekly LLM judge re-scores everything and someone eyeballs the diff. If the score moved 4 points down nobody can tell whether the source updated, the article rotted, or the judge model itself drifted.
Hash-pinned eval set: scripts/wiki_staleness_check.py compares challenge pass/fail against paragraph_hash. Same hash + regression = silent rot, exit 2, page on-call. Different hash = source moved, exit 3, open update PR. Two failure modes, two responses, no false equivalence.

Source provenance
Weekly LLM-judge: Article includes a 'Sources' section at the bottom with a list of doc titles. No paragraph anchor, no hash, no way to verify the article still corresponds to what the source actually said this morning.
Hash-pinned eval set: eval/wiki_provenance.yaml lists one row per pinned paragraph with source_path, paragraph_anchor, paragraph_hash, locked_at, locked_in_pr. A reviewer can re-derive the hash in one shell command. Two clicks from challenge to source paragraph.

Who can change the eval
Weekly LLM-judge: The compiler agent has write access to everything. When the article degrades, it tends to degrade the citations alongside, and any 'truth check' the agent itself runs degrades to match. The eval and the artifact share a failure mode.
Hash-pinned eval set: .github/CODEOWNERS pins eval/wiki_challenges.yaml and eval/wiki_provenance.yaml to the engineering lead. The compiler agent CANNOT open a PR that touches both the article and its challenge in the same diff; that would defeat the detector. The agent writes drafts; humans approve the eval.

What rot looks like operationally
Weekly LLM-judge: Six weeks later a downstream agent answers a compliance question wrong, a customer notices, and the team retroactively guesses which compiler run dropped the fact. No timestamp, no specific commit, no replay path.
Hash-pinned eval set: An on-call page at 03:14 that names the article (wiki/eu_ai_act_high_risk.md), the challenge ID (wiki-eu-ai-act-007), the unchanged source hash, the failing answer, and the severity (blocker). The on-call engineer reverts wiki/eu_ai_act_high_risk.md to the previous compiler commit before breakfast.

Engagement leave-behind
Weekly LLM-judge: A vendor dashboard subscription with a 'wiki quality' KPI line on a chart. When the contract lapses, the score lapses. The wiki keeps rotting; nobody is watching.
Hash-pinned eval set: Five files: eval/wiki_challenges.yaml, eval/wiki_provenance.yaml, scripts/wiki_staleness_check.py, the .github/CODEOWNERS lines, the .github/workflows/wiki-eval.yml. Plus a paragraph in ops/wiki_runbook.md describing the three exit codes. All in your repo on main.

Want a senior engineer to wire the rot detector against your wiki?

Twenty minutes. We walk your wiki/, intake/, and librarian setup, name the articles most likely to be rotting today, and sketch the staleness check against your repo's actual layout.

Frequently asked questions

What is an LLM Wiki eval set, and how is it different from a regular agent eval set?

An LLM Wiki eval set is a regression test suite for an LLM-maintained markdown knowledge base, the pattern Karpathy described where a librarian agent compiles raw source material from intake/ into a structured wiki/ folder of plain-text articles. The eval set differs from a regular agent eval in three ways. First, the unit of test is the article, not a conversation: each article carries at least one challenge (question + expected_substring) in eval/wiki_challenges.yaml that the librarian must answer using only that article. Second, every challenge pins to a specific paragraph in a specific source via SHA-256:12 hash in eval/wiki_provenance.yaml, so the eval can distinguish a real source update from silent backward drift. Third, the failure model is asymmetric: a regression with an unchanged source hash is rot (page on-call); a regression with a changed hash is a real update PR (queue for review). Generic agent eval sets do not have this asymmetry because they do not have a stable underlying source they can hash.

Why does the wiki rot in the first place? Isn't the librarian agent supposed to keep it correct?

The librarian agent is the thing that rots it. Every compile cycle, the librarian re-reads its sources and re-emits each article. Across runs the article tends to drift toward the librarian's prior: shorter sentences, more general claims, fewer named techniques, fewer specific numbers. The compiler is not malicious; it is summarizing toward the model's preferred level of abstraction. After ten compile cycles, sqrt(d_k) becomes 'a normalization factor' becomes 'a scaling step' becomes nothing. The source paragraph in intake/ is unchanged the whole time. Without a regression test pinned to a source hash, you cannot tell that this happened until a downstream agent answers a question wrong in production. The eval set is the thing that catches this between compile cycles.

Why a literal expected_substring instead of an LLM-as-judge score?

Two reasons. First, an LLM judge introduces its own drift: the judge model itself moves between vendor releases, so a score of 4.2 today and 4.1 next month is not directly comparable. A literal substring is byte-equal or it is not; the score is reproducible across years. Second, an LLM judge tends to grade a 10 percent degradation as still passing because it scores 'overall correctness'. Literal substring is the only test that fails on small degradations: if the article drops 'Annex III' from the EU AI Act answer, the substring is missing, the test fails, and we catch it before the next downstream answer goes wrong. We use LLM-judge for tone and style on the agent side (covered on the LLM agent eval harness page); we use literal substring on the wiki side because the failure mode is different.

What does scripts/wiki_staleness_check.py actually do, line by line?

Sixty lines. It loads eval/wiki_challenges.yaml and eval/wiki_provenance.yaml. For each challenge it (1) reads the source path from the pinned provenance row, (2) finds the paragraph by anchor, (3) computes hash12() = first 12 hex chars of SHA-256 of normalized whitespace, (4) calls bin/librarian --article <path> --q <question> with temperature=0 against ONLY the wiki article (not the source), (5) checks whether the answer contains expected_substring AND every must_also_include token, and (6) decides one of three states: hash_same+pass=ok, hash_same+regress=rot (exit 2), hash_diff+anything=source_moved (exit 3). It prints one JSON line per challenge so the Monday cron can post a trend chart. Two failure exit codes, three states, no ambiguity for the on-call.

Why do you hash the paragraph instead of the whole source file?

Files get edited in unrelated places. Someone fixes a typo at the top of the EU AI Act intake doc, the file hash changes, every challenge against that file flips to source_moved, and the on-call gets a flood of update-PR notifications for an article whose actual content did not change. Hashing the specific paragraph the answer is derived from keeps the signal aligned with the underlying claim. The paragraph_anchor field in eval/wiki_provenance.yaml names the paragraph (e.g., 'Section 3.2.1, paragraph beginning We compute the dot products') so that re-derivation is unambiguous when the surrounding doc gets reorganized. We have run this on a 1,400-paragraph wiki for 14 months without a hash collision and without a single false-positive source-moved alert from cosmetic edits.

How does this fit with Karpathy's original LLM Wiki pattern?

Karpathy's gist describes intake/ (raw sources), wiki/ (compiled articles), and a librarian agent that reads intake and compiles wiki. Several follow-up posts add a Hermes-style validator that scores draft articles before promotion. None of them describe what happens between compile cycles, when the wiki is live and the librarian has not been re-run yet but a downstream agent is querying it. That is the gap this eval set fills. eval/wiki_challenges.yaml is the per-article regression contract. eval/wiki_provenance.yaml is the source-paragraph hash that tells rot apart from a real update. scripts/wiki_staleness_check.py is the cron job that runs the contract nightly. It sits alongside the original pattern; it does not replace it.

What does on-call actually do when the script exits 2?

The PagerDuty page names the article path, the challenge ID, the unchanged source hash, the failing answer text, and the severity. The on-call runs git log -- wiki/<article>.md, finds the most recent commit that changed it (almost always a librarian-bot compile commit), and runs git revert <sha> on a fresh branch. The revert PR triggers wiki_staleness_check.py again; if the reverted article passes the challenge, the PR auto-merges with a bot comment naming the rot. The librarian's next compile run starts from the reverted state. The whole sequence takes 5 to 15 minutes if the on-call is awake, and the failure is contained at the article level, not the corpus level.
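That sequence, as a hedged console sketch; the branch name is illustrative and the commit SHA is whatever git log surfaces for the librarian-bot compile commit:

```
$ git log --oneline -- wiki/eu_ai_act_high_risk.md   # find the compile commit that rewrote the article
$ git checkout -b revert-librarian-rot               # branch name illustrative
$ git revert <sha-of-librarian-bot-commit>
$ git push origin revert-librarian-rot               # the revert PR re-runs wiki_staleness_check.py
```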

What is the relationship to retrieval/RAG eval harnesses?

Different layer. A RAG eval harness measures whether the retriever pulls the right chunks for a query and whether the answerer uses them correctly (covered on the agentic RAG eval harness page). The LLM Wiki eval set sits one layer up: the wiki/ folder is the curated knowledge artifact, and the librarian agent is the thing that maintains it. You can wire a RAG retriever on top of wiki/, in which case the wiki staleness check protects the underlying corpus the retriever depends on, and the RAG harness protects the retrieval and grounding layer. They are complementary contracts. Wiki rot is invisible to the RAG harness; retrieval bugs are invisible to the wiki check.

Does this work if the wiki has 5 articles or 5,000?

It works at both ends, with one practical caveat. At 5 articles you can write challenges in an afternoon and the nightly cron finishes in seconds; severity tagging is nearly redundant because everything is hand-touched anyway. At 5,000 articles the challenge file grows to thousands of rows, the cron takes 20 to 90 minutes depending on the librarian's backend, and severity tagging becomes load-bearing because you cannot have on-call paged on every challenge. The shape we ship is sharded: eval/wiki_challenges_blocker.yaml runs hourly on its own 30-row file, eval/wiki_challenges_critical.yaml runs every 4 hours, and eval/wiki_challenges_standard.yaml runs nightly. The script is the same; only the cron cadence and the file paths change.

What is the leave-behind on a 6-week engagement that includes the wiki eval set?

Five files in your repo on main. eval/wiki_challenges.yaml (the regression contract per article). eval/wiki_provenance.yaml (the paragraph-hash provenance). scripts/wiki_staleness_check.py (60 lines, depends only on pyyaml and hashlib from stdlib). The .github/CODEOWNERS lines that pin all three to the engineering lead. The .github/workflows/wiki-eval.yml that runs the script on every PR touching wiki/ or intake/, plus a separate cron workflow for the nightly run. Plus a paragraph in ops/wiki_runbook.md naming the three exit codes and the on-call response for each. The named senior engineer rotates off; the lint and the cron stay; the wiki keeps catching its own rot. Model-vendor neutral, no platform license, no vendor-attached runtime; you can swap the librarian backend (Anthropic, OpenAI, Bedrock, Vertex, open-weight) without touching the eval set.