Guide, topic: llm agent eval harness, 2026

Your LLM agent eval harness is silently rotting because the judge is itself a model.

Most articles on this topic frame the eval harness as a rubric file plus a case set plus a CI job. That is the agent layer. The half nobody talks about is the judge layer: the LLM that grades subjective axes is itself a moving target, and a single model release or a one-sentence prompt tweak underneath it invalidates every historical score on the scoreboard. This guide is the inverse: it covers the eval/judge_prompts.yaml file we pin per axis with a prompt SHA, the 60-line scripts/calibrate-judges.sh cron we run every Friday against 20 frozen cases, the 70/30 split between deterministic asserts and LLM-judge asserts, and the per-axis drift threshold in points that opens a PR before the scoreboard stops being comparable.

Matthew Diakonov
15 min read
Same judge layer shipped across 5 named production agents
judge_prompts.yaml pinned per axis, by SHA, never inlined
Weekly cron re-runs N=20 frozen cases per axis; drift > 3 points opens a PR
70/30 split: 7 deterministic axes carry most of the rubric, 3 axes need the LLM

Same judge layer across Pydantic AI, LangGraph, custom orchestration, automated ML pipeline, and a multi-model DAG.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Calibrate the judge layer with an engineer
eval/judge_prompts.yaml · eval/judge_calibration/ · eval/judge_history.jsonl · scripts/calibrate-judges.sh · rubric.yaml · ragas · DeepEval · promptfoo · G-Eval · MT-Bench · judge_model: claude-4-7-opus · judge_model: gpt-4-1 · drift_threshold_points: 3 · calibration_set_id: cal-2026-04-21 · prompt_template_sha256: 0x9af1c2

What every other guide on this gets wrong

Open the guides on this topic that landed in the last quarter. They cover three things: a rubric file with thresholds, a cases file with workload-grounded examples, and a CI job that runs the rubric on every PR. All three are correct. None of them tell you what happens to the LLM that grades the subjective axes when a new release of its underlying model ships, or when an engineer rewrites a single sentence of the judge prompt in week 8 to fix a perceived false positive.

Both events change the scoreboard. The agent did not change. The cases did not change. The rubric thresholds did not change. The judge changed, and now week-8 scores are not comparable to week-1 scores. Most teams notice this around month four, when a regression they would have caught at week 2 slips past because the bar moved underneath them. The team's response is usually 'recalibrate the judge', which without a pinned prompt and a frozen calibration set is a euphemism for 'guess at new numbers'.

The fix is to treat the judge the way you treat the agent: pin its model name and its prompt by SHA in a YAML file, freeze 20 calibration cases per axis with a target score two humans agreed on, and run a weekly cron that grades the calibration set with the pin and reports the drift in points. Drift over the threshold opens a PR. The scoreboard stays comparable for as long as the pin holds.

Anchor: the 70/30 split between deterministic and LLM-judge axes

Most rubric axes are not subjective. They are bug-tight asserts you can write with regex, JSON-schema, or a number compare. Running them through an LLM judge costs a forward pass per case, adds latency to the harness, and introduces variance that swamps real regressions. The judge belongs on the axes where it genuinely cannot be replaced.

Deterministic, 70 percent

schema_valid, must_not_include, calls_correct_tool, citations_resolve, latency_p95_under, token_budget, no_repeated_text. Pure code asserts. 0 tokens. Run in milliseconds. The bug-tight floor of the harness.

LLM-judge, 30 percent

answer_relevancy, factual_grounding, tone_match. Pinned by SHA. Cross-vendor where the bias matters. Re-calibrated weekly on a frozen 20-case set per axis.

Things that look subjective but are not

Generation loops feel subjective ('it sounds repetitive') but a 50-character n-gram repeated three times is a deterministic check. Whenever a 'subjective' axis can be expressed as a regex or a count, it goes in the deterministic 70.
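A minimal sketch of that check, with the window size and repeat count as the two knobs (the function name is ours, not from any library):

```python
def has_generation_loop(text: str, window: int = 50, max_repeats: int = 3) -> bool:
    """Deterministic repetition check: True if any 50-character n-gram
    appears `max_repeats` or more times in a single output."""
    seen: dict[str, int] = {}
    for i in range(max(len(text) - window + 1, 0)):
        gram = text[i:i + window]
        seen[gram] = seen.get(gram, 0) + 1
        if seen[gram] >= max_repeats:
            return True
    return False
```

The axis assert is then just `not has_generation_loop(output)`: zero tokens, no judge variance.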

Things that look objective but are not

factual_grounding looks like a citation check, but the agent can cite the right doc and still mis-state what it says. The judge has to read both. Same with tone_match: a regex cannot tell you it sounds like the brand voice.

Side by side: which axes are deterministic, which need the LLM

Anchor split

7 deterministic axes. 3 LLM-judge axes. Same eval/run.py.

Deterministic, 70 percent

  • schema_valid Output parses against a Pydantic model. Pure regex/JSON-schema check. No model needed. Cost: 0 tokens. Runs in 4ms per case.
  • must_not_include List of strings that are flat-out illegal in the output (system prompt leak, competitor name, PII keys). String contains. No model.
  • calls_correct_tool When the case expects a tool call, assert the tool name and the JSON shape of the args. Read it from the trace, not from a judge.
  • citations_resolve If the agent cites a doc, assert the doc id exists in your retrieval index. A regex extracts ids; a set membership check confirms them.
  • latency_p95_under Read latency from the trace. Assert against a number. Latency regressions are deterministic and you must catch them in CI, not in prod.
  • token_budget Total input plus output tokens per case is a number. Assert it stays under a per-axis ceiling. The judge cannot make this question subjective.
  • no_repeated_text A 50-character n-gram repeated three times in a single output is a generation loop. Catch with a deterministic check; the LLM judge will rate it 'fluent' and pass it.
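Two of these asserts as a minimal sketch; the Pydantic model and the banned-string list are placeholders for whatever the workload actually defines:

```python
from pydantic import BaseModel, ValidationError


class AgentOutput(BaseModel):
    """Placeholder output schema; the real one comes from the workload."""
    answer: str
    citations: list[str]


def schema_valid(raw_json: str) -> bool:
    """Pure parse check: no model call, no tokens, runs in milliseconds."""
    try:
        AgentOutput.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False


def must_not_include(output: str, banned: list[str]) -> bool:
    """String-contains check for content that is flat-out illegal in the output."""
    lowered = output.lower()
    return not any(term.lower() in lowered for term in banned)
```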

LLM-judge, 30 percent

  • answer_relevancy Does the response actually address the user's question? An LLM judge with a frozen prompt scores 0/1/2/3. ragas-style. Pinned by SHA.
  • factual_grounding Are the asserted facts supported by the cited documents? An LLM judge reads the docs and the response and votes. Cannot be regex'd.
  • tone_match Does the email match the brand voice? Subjective and irreducible. An LLM judge with a calibrated prompt scores against three reference styles.

eval/judge_prompts.yaml, in full

The file the calibration cron reads. Each LLM-judge axis in rubric.yaml has exactly one row here. judge_model is named explicitly. prompt_template_sha256 is a hash of the file at prompt_template_path on disk. drift_threshold_points is the number of points of drift that triggers a PR. The cron will refuse to run if the SHA on disk does not match the pin; if you change the prompt, you change the SHA, and you do that in a separate PR titled judge-pin.

eval/judge_prompts.yaml
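The real file is generated per engagement; the sketch below shows the shape it describes, using the axis names, models, and ids mentioned elsewhere on this page (the judge_id values and the truncated hashes are illustrative):

```yaml
# eval/judge_prompts.yaml -- exactly one entry per LLM-judge axis in rubric.yaml.
axes:
  answer_relevancy:
    judge_id: answer_relevancy_v1          # illustrative
    judge_model: claude-4-7-opus
    prompt_template_path: eval/judge_prompts/answer_relevancy.md
    prompt_template_sha256: "9af1c2..."    # full 64-hex digest in the real file
    calibration_set_id: cal-2026-04-21
    drift_threshold_points: 3
    last_calibration_date: 2026-04-21
  factual_grounding:
    judge_id: factual_grounding_v1
    judge_model: claude-4-7-opus
    prompt_template_path: eval/judge_prompts/factual_grounding.md
    prompt_template_sha256: "..."
    calibration_set_id: cal-2026-04-21
    drift_threshold_points: 3
    last_calibration_date: 2026-04-21
  tone_match:
    judge_id: tone_match_v1
    judge_model: gpt-4-1                   # cross-vendor on purpose; the agent is Claude
    prompt_template_path: eval/judge_prompts/tone_match.md
    prompt_template_sha256: "..."
    calibration_set_id: cal-2026-04-21
    drift_threshold_points: 3
    last_calibration_date: 2026-04-21
```

Changing any line here is a judge-pin PR, never an inline edit in the runner.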

scripts/calibrate-judges.sh, in full

Sixty lines of bash. No deps beyond bash, yq, and python. Every Friday at 14:00 UTC the cron runs each axis against its 20-case frozen calibration set with the pinned judge model and the pinned prompt template, computes the drift in points, appends one row to eval/judge_history.jsonl, and either passes or opens a PR. The script lives in the client repo on every PIAS engagement.

scripts/calibrate-judges.sh
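The real script is about sixty lines; below is a condensed sketch under the same deps (bash, yq v4, python), assuming a small helper -- eval/grade_calibration.py, hypothetical here -- that re-grades the 20 frozen cases with the pinned judge and prints the drift in points:

```bash
#!/usr/bin/env bash
# scripts/calibrate-judges.sh -- condensed sketch of the Friday calibration cron.
set -euo pipefail

PINS=eval/judge_prompts.yaml
HISTORY=eval/judge_history.jsonl
TODAY=$(date -u +%F)
drifted=()

for axis in $(yq '.axes | keys | .[]' "$PINS"); do
  model=$(yq ".axes.${axis}.judge_model" "$PINS")
  tmpl=$(yq ".axes.${axis}.prompt_template_path" "$PINS")
  pin=$(yq ".axes.${axis}.prompt_template_sha256" "$PINS")
  limit=$(yq ".axes.${axis}.drift_threshold_points" "$PINS")
  cases="eval/judge_calibration/${axis}.yaml"

  # Gate: the prompt on disk must match the pinned SHA, or we refuse to grade.
  disk=$(sha256sum "$tmpl" | awk '{print $1}')
  if [[ "$disk" != "$pin" ]]; then
    echo "FATAL: ${axis}: prompt on disk does not match pinned SHA" >&2
    exit 1
  fi

  # Re-grade the 20 frozen calibration cases with the pinned judge; prints drift in points.
  drift=$(python eval/grade_calibration.py \
            --axis "$axis" --model "$model" --prompt "$tmpl" --cases "$cases")

  # One JSON line per axis per run, appended to the history committed back to the repo.
  printf '{"date":"%s","axis":"%s","judge_model":"%s","drift_points":%s,"threshold":%s}\n' \
    "$TODAY" "$axis" "$model" "$drift" "$limit" >> "$HISTORY"

  # Over the threshold: remember the axis so the job exits non-zero.
  if awk "BEGIN { exit !($drift > $limit) }"; then
    drifted+=("$axis")
  fi
done

if (( ${#drifted[@]} )); then
  echo "judge drift over threshold: ${drifted[*]}" >&2
  exit 1   # CI turns this into the 'judge-drift detected' PR
fi
echo "all axes within drift threshold"
```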

How the runner applies the split

The runner reads rubric.yaml, walks every axis, and routes each one to the deterministic checker or the pinned-judge call. The assert_pin_matches_disk line is the gate: if the SHA in eval/judge_prompts.yaml does not match the SHA of the file at prompt_template_path, the runner refuses to grade that axis. A silent prompt edit cannot sneak in: either the pin matches or the runner stops.

eval/run.py
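A sketch of the routing and the gate, assuming the YAML shape above; DETERMINISTIC_CHECKS and call_judge are stand-ins for the seven code asserts and the vendor call:

```python
import hashlib
from pathlib import Path

import yaml

JUDGE_PINS = yaml.safe_load(Path("eval/judge_prompts.yaml").read_text())["axes"]

# Stand-ins: the real registry holds the seven deterministic asserts,
# and the real call_judge wraps the vendor SDK for the pinned judge_model.
DETERMINISTIC_CHECKS = {
    "must_not_include": lambda case, out: float(not any(s in out for s in case["banned"])),
    "token_budget": lambda case, out: float(case["total_tokens"] <= case["token_ceiling"]),
}

def call_judge(model: str, prompt: str, case: dict, output: str) -> float:
    raise NotImplementedError("pinned-judge vendor call goes here")

def assert_pin_matches_disk(axis: str) -> dict:
    """Refuse to grade an LLM-judge axis if the prompt file has drifted from its pin."""
    pin = JUDGE_PINS[axis]
    disk_sha = hashlib.sha256(Path(pin["prompt_template_path"]).read_bytes()).hexdigest()
    if disk_sha != pin["prompt_template_sha256"]:
        raise RuntimeError(f"{axis}: prompt on disk does not match pinned SHA; open a judge-pin PR")
    return pin

def grade_axis(axis: str, case: dict, output: str) -> float:
    if axis in DETERMINISTIC_CHECKS:        # the deterministic 70: regex, schema, number asserts
        return DETERMINISTIC_CHECKS[axis](case, output)
    pin = assert_pin_matches_disk(axis)     # the LLM-judge 30: only graded while the pin holds
    prompt = Path(pin["prompt_template_path"]).read_text()
    return call_judge(pin["judge_model"], prompt, case, output)
```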

What a clean Friday cron looks like

Three axes. Twenty cases each. Drift well under the three-point threshold. No PR opens, the run line in eval/judge_history.jsonl appends, and the harness keeps gating merges with the same calibration baseline.

judge-calibrate.yml -- 3 of 3 PASS
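On a clean run the only artifact is the appended history rows, one per axis. The fields match step 3 of the cadence below; the dates and drift values here are made up to show the shape:

```json
{"date": "2026-04-24", "axis": "answer_relevancy",  "judge_model": "claude-4-7-opus", "drift_points": 0.8, "threshold": 3}
{"date": "2026-04-24", "axis": "factual_grounding", "judge_model": "claude-4-7-opus", "drift_points": 1.1, "threshold": 3}
{"date": "2026-04-24", "axis": "tone_match",        "judge_model": "gpt-4-1",         "drift_points": 0.6, "threshold": 3}
```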

What a Friday cron with judge drift looks like

factual_grounding drifts 4.7 points after a routine claude-4-7-opus point release. The cron exits non-zero, opens a PR titled judge-drift detected on 2026-04-26, and stops gating merges on factual_grounding until a human triages. The agent rubric still grades; the drifted axis just no longer blocks the build.

judge-calibrate.yml -- 1 of 3 FAIL, drift PR opened

How the calibration loop wires together

Three inputs feed the calibration cron on the left. Three outputs land in the repo and the PR thread on the right. Nothing vendor-hosted in between; the judge spec lives in YAML, the history lives in a JSONL committed back to the repo, and the drift PR opens against your default branch.

rubric + cases + calibration -> calibrate-judges.sh -> history + drift PR

Inputs: rubric.yaml, eval/cases.yaml, eval/judge_calibration/
Script: scripts/calibrate-judges.sh
Outputs: eval/judge_history.jsonl, judge-drift PR, PR scorecard

The weekly cadence, in order

One Friday tick. Six steps. Same shape every week. The thing that changes is the drift number and whether a PR opens.

1

Friday 14:00 UTC: judge-calibrate.yml fires

GitHub Actions cron runs scripts/calibrate-judges.sh on a fresh checkout of main. The job uses the pinned judge_model and the prompt_template_sha256 from eval/judge_prompts.yaml. Nothing on the agent side runs in this job; the agent code is irrelevant to whether the judge has drifted. A minimal sketch of the workflow file follows step 6.

2

Each axis re-runs against its 20-case calibration set

Each LLM-judge axis has a frozen calibration set in eval/judge_calibration/<axis>.yaml. Every case has a target score that two humans agreed on during the week-2 scoping call. The job re-grades all 20 cases with the pinned judge and computes the absolute-diff drift in points.

3

Drift result appends to eval/judge_history.jsonl

One JSON line per axis per week: date, axis, judge_model, drift_points, threshold. The file is committed back to the repo by the cron, never lives in a vendor dashboard. A year of judge stability is one grep eval/judge_history.jsonl away.

4

Drift > threshold opens a 'judge-drift detected' PR

The PR is titled judge-drift detected on YYYY-MM-DD (N axis(es)). The body links the per-axis report and a human-review template. Until a human triages and either re-pins the prompt or accepts the drift, the affected axis is held at its previous threshold. The agent rubric still grades; the drifted axis just does not gate merges.

5

Re-pin or extend calibration in a single PR

The triaging engineer either rewrites the judge prompt (new prompt_template_sha256, new last_calibration_date, fresh 20-case calibration), or accepts the drift and tightens the calibration set. Either path is a single PR that touches eval/judge_prompts.yaml plus its calibration_set file. The PR cannot land without a passing scripts/calibrate-judges.sh run.

6

rubric.yaml is unaffected. eval/cases.yaml is unaffected.

The judge layer is the only thing that moves on this cadence. The rubric thresholds (rubric_min_score, ragas_faithfulness_min) and the case set in eval/cases.yaml are owned by the agent team on a separate cadence. Decoupling those is what lets you swap a model release on Friday without re-running the entire benchmark from scratch.
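The trigger in step 1 is a plain GitHub Actions schedule. A minimal sketch of .github/workflows/judge-calibrate.yml, assuming the script above; the PR-opening step and the vendor keys will vary per repo:

```yaml
# .github/workflows/judge-calibrate.yml -- sketch; adapt secrets and PR automation to your repo.
name: judge-calibrate
on:
  schedule:
    - cron: "0 14 * * 5"        # Fridays, 14:00 UTC
  workflow_dispatch: {}

jobs:
  calibrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the calibration cron against the pinned judges
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: bash scripts/calibrate-judges.sh
      # Follow-up steps (gated with if: always()) commit the appended
      # eval/judge_history.jsonl rows back to the repo and, when the script
      # exits non-zero, open the 'judge-drift detected' PR.
```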

Receipts: calibrated judge layer vs the playbook in most existing pages

For each feature below: the library-default LLM-as-judge shape that shows up in most existing pages on this topic, and the pinned judge layer we ship. Neither is wrong. The library default is incomplete: it tells you which library to call, not what to do when the library moves underneath you.

How LLM-as-judge is treated
  • Library default: a configuration choice, 'use ragas' or 'use DeepEval', with no axis-level pinning.
  • Pinned judge layer: a per-axis spec in eval/judge_prompts.yaml: judge_id, judge_model, prompt_template_sha256, calibration_set_id, drift_threshold_points.

Judge-prompt drift
  • Library default: not addressed. The judge prompt is wherever ragas/DeepEval ships it; you do not see it move.
  • Pinned judge layer: a weekly cron re-runs N=20 frozen cases per axis. Drift > 3 points opens a PR. eval/judge_history.jsonl is committed to the repo.

Judge-model swap
  • Library default: defaults to whatever 'gpt-4' or 'claude-3-5' is wired up. A vendor sunset can move it underneath you.
  • Pinned judge layer: judge_model is one line in judge_prompts.yaml. Changing it is a separate PR titled 'judge-pin' and requires a fresh calibration run attached.

Cross-vendor judge
  • Library default: usually same-vendor by default; the 'judge grades its own family' bias is treated as a research aside.
  • Pinned judge layer: the judge for tone_match is GPT while the agent is Claude. Same-model judge-grades-itself drift is the worst-known shape; we ship cross-vendor by default on subjective axes.

Deterministic vs LLM-judge split
  • Library default: everything goes through the LLM judge. The eval bill scales with the agent and the harness latency makes CI feel slow.
  • Pinned judge layer: a 70/30 split. Seven deterministic axes (schema, must_not_include, tool calls, citations, latency, tokens, repetition) cover most rubric lines; the LLM judge stays on three axes that genuinely need it.

Where the judge spec lives
  • Library default: inside the SaaS dashboard (LangSmith, Braintrust). The harness is portable; the judge spec is not.
  • Pinned judge layer: plain YAML in your repo. Versioned by PR. Survives the engineer leaving.

Definition of 'the harness is calibrated'
  • Library default: the harness is calibrated when the engineer remembers to spot-check. There is no file you can grep to prove it.
  • Pinned judge layer: git log shows scripts/calibrate-judges.sh has run on at least four of the last six Fridays, with eval/judge_history.jsonl appending each week.

The 8-bullet pass-state of a calibrated judge layer

  • Every LLM-judge axis in rubric.yaml has a row in eval/judge_prompts.yaml
  • judge_model is named explicitly per axis (not 'gpt-4' as a default)
  • prompt_template_sha256 matches the file at prompt_template_path on disk
  • calibration_set_id points at a real eval/judge_calibration/<axis>.yaml file with >= 20 rows
  • scripts/calibrate-judges.sh has been run on at least 4 of the last 6 Fridays
  • eval/judge_history.jsonl appends one row per axis per cron run
  • drift_threshold_points is set to <= 3 (the empirical floor on a 20-case set)
  • At least one judge axis is cross-vendor (judge model from a different family than the agent model)
3 of 3

Three LLM-judge axes per agent, all pinned by SHA, all calibrated weekly against a 20-case frozen set. Same shape across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom orchestration on OpenLaw, automated nightly DAG on PriceFox, and multi-model pipeline on OpenArt. The judge layer is the half of the harness that survives the engineer leaving.

PIAS leave-behind across 5 named production agents, model-vendor neutral

Counts that anchor the rest of the page

Engagement-level facts, not invented benchmarks. Per-client production metrics live on /wins.

70% of rubric axes are deterministic in the harness shape we ship
20 frozen calibration cases per LLM-judge axis, scored by 2 humans at week 2
3 pts drift threshold per axis. Above this, a PR opens, not a Slack ping
60 lines in scripts/calibrate-judges.sh, no deps beyond bash + yq + python

70% deterministic. 20 calibration cases per axis. 3 pts drift threshold. 60 lines of bash. The four numbers that make the judge layer either a control or a vibe.

Want a senior engineer to pin your judge layer and write the calibration cron?

60-minute scoping call with the engineer who would own the build. You leave with the 70/30 split applied to your rubric, eval/judge_prompts.yaml drafted against your axes, and a fixed weekly rate to ship the calibration cron and the first re-pin PR.

LLM eval harness, the judge layer, the calibration cron, answered

What is an LLM agent eval harness, beyond the rubric file?

It is the set of files that turn 'we have evals' into a control the on-call engineer trusts. Five components: rubric.yaml at the repo root that names every axis, eval/cases.yaml with the workload-grounded benchmark, eval/judge_prompts.yaml that pins the judge model and the prompt template SHA per LLM-judge axis, scripts/calibrate-judges.sh that re-runs the judge against a 20-case frozen set every Friday, and eval/judge_history.jsonl committed back to the repo so the on-call engineer can read a year of judge stability with one grep. The runner (eval/run.py) ties them together and the CI workflow (.github/workflows/eval.yml) posts the scorecard on every PR. Without the judge layer pinned, the eval scoreboard slowly diverges from reality and you only notice when a regression slips past.
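Laid out as a tree, using the paths named in this answer and in the leave-behind list at the end of the page:

```text
repo/
├── rubric.yaml                         # every axis + thresholds, owned by the agent team
├── eval/
│   ├── cases.yaml                      # workload-grounded benchmark
│   ├── run.py                          # routes each axis: code assert or pinned judge
│   ├── judge_prompts.yaml              # per-axis judge pin: model, prompt SHA, drift threshold
│   ├── judge_prompts/<axis>.md         # the frozen prompt templates themselves
│   ├── judge_calibration/<axis>.yaml   # 20 frozen cases with human target scores
│   └── judge_history.jsonl             # one row per axis per cron run
├── scripts/
│   └── calibrate-judges.sh             # the Friday calibration cron
└── .github/workflows/
    ├── eval.yml                        # posts the scorecard on every PR
    └── judge-calibrate.yml             # Friday 14:00 UTC schedule
```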

Why split rubric axes 70/30 between deterministic and LLM-judge?

Because most rubric axes are not actually subjective and the LLM judge taxes you for nothing on those. schema_valid, must_not_include, calls_correct_tool, citations_resolve, latency, token budget, and repeated-text detection are all expressible as a regex, a JSON-schema check, or a number assert. Running them through a judge costs a forward pass per case (real money), adds 200 to 1500 ms of latency to the harness, and introduces variance that can swamp real regressions. The judge belongs on the three axes where it cannot be replaced: answer_relevancy, factual_grounding, and tone_match. The 70/30 split is the empirical ratio across the five named production agents we run; your repo will land between 60/40 and 80/20 depending on how much of the workload is text generation versus tool use.

What is judge-prompt drift, and why does it kill harnesses?

Two failure modes that look identical from a metrics dashboard but have different fixes. Judge-prompt drift: an engineer tweaks the judge prompt by one sentence in week 8 to fix a perceived false positive, the prompt SHA changes, and every score from week 8 onward is uncomparable to week 1. Judge-model drift: the vendor releases claude-4-7-opus and the harness has a default that resolves to 'whatever opus is latest', so the judge model swaps under you. Both produce a moving scoreboard the team will trust for one quarter and then stop reading. The fix is a pinned prompt SHA and an explicit judge_model name per axis in eval/judge_prompts.yaml, plus a weekly calibration cron that re-grades a 20-case frozen set and opens a PR if the drift exceeds three points.

Why 20 calibration cases per axis and not 100, and why three points as the drift threshold?

20 is the smallest set we have seen reliably surface real drift while keeping the cron cheap enough to run every Friday. Below 12 the variance dominates: run-to-run noise is louder than the drift you want to detect. At 100 the cron costs roughly five times as much, runs five times longer, and the engineer stops looking at the result. Three points on a four-point scale (0/1/2/3) is the empirical floor: under three points, two humans on the same case set will themselves disagree by that much. Above three points, you are seeing a real prompt or model shift, not noise. Tuning the threshold lower locks the harness too tightly and produces churny PRs that the team learns to ignore. Both numbers are defaults from the same five named production agents; you should re-pick them at week 2 with your product lead.

Should the judge model be the same family as the agent model?

On objective axes (factual_grounding against retrieved docs, schema-derived axes), it does not matter. On subjective axes (tone_match, answer_relevancy on free-text generation), the judge should be a different family than the agent. Same-family judge-grades-itself bias is well-documented: a Claude judge will rate Claude outputs higher on subjective axes than a GPT judge will, by a measurable amount. The fix is to ship cross-vendor on subjective axes by default. In our judge_prompts.yaml shape, tone_match is graded by gpt-4-1 even when the agent is Claude. Same with the inverse on a GPT-based agent: tone_match goes through Claude. The cost is a second vendor key in the harness; the benefit is the scoreboard stops flattering the model you ship.

How is this different from ragas, DeepEval, promptfoo, G-Eval, or LangSmith?

Those are runners and judge libraries. The harness is the wiring that pins them. ragas computes faithfulness and answer_relevancy with built-in prompts; the harness writes the prompt SHA into eval/judge_prompts.yaml and gates merges if the SHA on disk does not match the pin. DeepEval ships G-Eval as a metric; the harness adds the calibration set, the weekly cron, and the drift threshold. promptfoo runs the cases; the harness owns which cases run, on which PRs, and what the failure does. LangSmith is a useful trace store for debugging the judge; it is not a judge spec, and the spec must live in your repo or it leaves with the vendor. None of the libraries answer the question 'has the judge drifted in the last six weeks?'. judge_history.jsonl does.

Where do the judge prompts come from? Do you write them or use library defaults?

Library defaults are the starting point, not the ship state. ragas faithfulness, ragas answer_relevancy, and G-Eval coherence are reasonable v1 prompts; we copy them into eval/judge_prompts/<axis>.md and pin the SHA. The first calibration run scores 20 cases. Every case where the judge disagrees with the human target by more than two points is a row the engineer rewrites the prompt against until the agreement floor is met. The prompt that ships is rarely the library default; usually it is the default plus three to six axis-specific clarifications added during the week-2 scoping call. After that the prompt is frozen by SHA and only the calibration cron is allowed to suggest a re-pin.
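A sketch of what the frozen set looks like on disk; the field names and the two example rows are illustrative, the contract is simply a frozen input plus the target score two humans agreed on at week 2:

```yaml
# eval/judge_calibration/answer_relevancy.yaml -- 20 frozen rows; two shown, shape illustrative.
calibration_set_id: cal-2026-04-21
axis: answer_relevancy
cases:
  - id: cal-001
    question: "What is the cancellation window on the annual plan?"
    response: "You can cancel within 30 days of renewal for a full refund."
    target_score: 3      # agreed by two humans at the week-2 scoping call
  - id: cal-002
    question: "What is the cancellation window on the annual plan?"
    response: "Our annual plan offers great value and flexible billing."
    target_score: 0      # fluent, but does not address the question
```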

What does the judge-drift PR actually look like when it opens?

Title: 'judge-drift detected on YYYY-MM-DD (N axis(es))'. Body: the drift in points per axis, a link to the per-axis JSON report with case-by-case judge output, the diff between the calibration target and what the judge said today, and a checklist with two paths. Path A is re-pin: rewrite the judge prompt, increment the SHA, re-run scripts/calibrate-judges.sh, attach a green run, merge. Path B is accept: extend the calibration set with the rows the judge now scores differently, document the rationale in the PR body, attach a fresh calibration with the new set as cal-YYYY-MM-DD-extended, merge. Either path is one PR. The drifted axis stops gating merges until the PR lands. The agent rubric still grades on every other axis in the meantime.

Does this work for non-text agents (vision, audio, code generation)?

Yes, and the deterministic 70 actually grows. Code outputs add three deterministic axes (compiles, passes a unit test, lints clean) that previously would have been judge calls. Vision adds deterministic image checks (a perceptual-hash match, or 'the output contains a face in the lower-left third' from a feature extractor), which need no LLM judge. Audio adds transcript checks. The LLM-judge 30 shrinks but does not disappear: 'is this a recognizable brand voice in the audio' still needs a judge. The judge_prompts.yaml shape is identical across modalities; only the calibration sets and the deterministic axes differ. The OpenArt multi-scene video agent ships with one judge axis (visual_consistency_across_scenes) and seven deterministic axes (resolution, fps, scene count, transition smoothness from optical flow, NSFW gate, file size, duration).

What is the leave-behind on a 6-week PIAS engagement that includes the judge layer?

Eight files in the client repo on main: rubric.yaml, eval/cases.yaml, eval/judge_prompts.yaml, eval/judge_calibration/<axis>.yaml (one per LLM-judge axis), eval/run.py, scripts/calibrate-judges.sh, .github/workflows/eval.yml, .github/workflows/judge-calibrate.yml. Plus eval/judge_history.jsonl which is generated by the cron and committed back to the repo. The harness is model-vendor neutral: the judge model is named in YAML, no vendor-attached runtime is required, and there is no SaaS license to renew. The named senior engineer who shipped it leaves a runbook in ops/failure_playbook.md that walks the on-call engineer through reading judge_history.jsonl when the agent regresses. Every line of this is your repo, your CI, your on-call rotation. We leave; the harness stays.