Guide, topic: ai agent eval harness, 2026

Your eval harness is shelfware until six shell commands say it is not.

Most articles on AI agent eval harness draw architecture diagrams (guides plus sensors, a six-layer testing stack, four to twelve weeks of build) and never tell you whether the harness in your repo today is actually load-bearing. This guide is the inverse. Six shell commands you run against your repo right now, each with a hard pass or fail. The five pull requests, in order and by day, that produce a harness which passes them. The cases.yaml shape we ship across five named production agents. And the one PR (the first model swap) that proves the harness is wired into the team and not just the repo.

Matthew Diakonov, PIAS, forward-deployed ML engineering

Published April 26, 202614 min read

4.9from 6-question audit, 5-PR build, applied across 5 named production agents

Six shell commands, six hard pass/fails, run on a checkout you already have

Five PRs in 21 days; PR 5 is the first model swap as the proof

Same cases.yaml shape on Pydantic AI, LangGraph, custom orchestration, multi-model DAG

Eval harness, audited in six commands

Five PRs to make it load-bearing

Six shell commands, six honest answers

Five PRs in the order an engineer ships them

cases.yaml from real traffic, not vibes

model_primary swap is a one-line edit

PR comment scorecard or the harness is a notebook

0:00 / 0:08

Same harness shape across Pydantic AI, LangGraph, custom orchestration, automated ML pipeline, and a multi-model DAG.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Audit the harness with an engineer

rubric.yamleval/cases.yamleval/ragas.jsonl.github/workflows/eval.ymlagents/contracts.pyops/failure_playbook.mdops/runbook.mdARCHITECTURE.mdragasOTELpromptfooDeepEvalScenario red-teampgvectorBedrockAnthropicOpenAI

What every other article on this gets wrong

Open the dozen guides on this topic that landed in the last quarter. They share the same shape. A two-pillar diagram (guides plus sensors). A six-layer testing stack (data validation, unit, integration, E2E, adversarial, CI/CD). A timeline (four to twelve weeks, five thousand to twenty thousand lines of infrastructure). A closing observation that LangChain and OpenAI both moved their agents from average to top-tier by improving their harness, not their model. All of that is true. None of it tells you whether the harness in your repo today is doing its job.

The failure mode they all miss is the one we run into on day zero of most engagements: the harness exists, the eval directory is in the repo, the README points at it, and nothing about it is load-bearing. cases.yaml was last touched 87 days ago. The CI job runs and writes to a build artifact nobody opens. model_primary appears in three Python files, and the last model swap was a four-file PR that took an afternoon. There is no named owner. The harness has never blocked a single PR. The team will tell you in standup that they have evals; their git log will tell you they do not.

That is shelfware. The architecture diagram cannot detect it because the diagram only describes the components, not whether they are wired into the workflow. The audit below detects it in five minutes, by asking six questions whose answers are commands you can run on a checkout you already have.

Anchor: the 6-question shelfware audit, in shell

Six commands. Six hard answers. Run from the root of your agent repo on the same checkout your CI uses. The audit script is sixty lines of bash and lives in the leave behind on every PIAS engagement. A repo that passes all six is a repo where a model swap on Friday does not page anyone on Saturday. A repo that fails on Q3 or Q4 is the common case, and the lowest-numbered failure names the next PR.

Anchor fact

6 commands. 6 hard pass/fails. 5 minutes from clone to verdict.

Q1. Is rubric.yaml a real file on main, or a Notion link? test -f rubric.yaml. If the file does not exist at the repo root on main, the harness is undefined. The single most common shelfware pattern: a one-page rubric in Notion that nothing in the repo references and no PR can be checked against.
Q2. Does model_primary appear exactly once in rubric.yaml? grep -c '^model_primary:' rubric.yaml must return 1. Anything else means the model name is hard-coded in agent code, in a YAML elsewhere, or in three places. A model swap that takes more than a one-line edit is a model swap that gets postponed, and a postponed swap is the shelfware feedback loop in action.
Q3. Has eval/cases.yaml been edited by a human in the last 30 days? git log --since='30 days ago' --pretty=format:'%ae' eval/cases.yaml. Zero commits means the case set is frozen at week 2 and the agent has shipped two model upgrades against an expired benchmark. The harness still runs, but it is grading against the wrong test.
Q4. Does .github/workflows/eval.yml actually post to the PR? find . -path '*workflows/eval.yml' AND grep -l 'pull_request\|issue_comment\|gh pr comment' .github/workflows/eval.yml. A CI job that runs the rubric and writes the score to a build artifact nobody opens is shelfware. The job has to comment on the PR, on every PR that touches agents/, eval/, or rubric.yaml, or the team will not see the regression in the four hours before merge.
Q5. Does rubric.yaml name an on-call rotation and a primary engineer email? grep -E '^(on_call_rotation|primary_engineer):' rubric.yaml. A rubric with no named owner is a rubric that breaks silently. The first model release that fails the rubric will not page anyone, and the harness becomes evidence in a postmortem instead of a control.
Q6. Has the harness ever blocked a PR? grep the git history. git log --all --grep='regression\|eval failed\|rubric blocked' eval/ rubric.yaml .github/workflows/eval.yml. Zero matches in the last six months is the most damning result: the harness has not surfaced a single regression. Either the agent has been perfect (it has not), or the harness is a notebook that nobody reads.

The script that wraps the six commands prints a per-question pass/fail and a verdict line. Six of six is live. Anything less is shelfware on at least one axis, and the lowest-numbered FAIL is the next PR.

audit-eval-harness.sh, in full

Sixty lines of bash, no dependencies beyond bash, git, grep, and find. Drop it at scripts/audit-eval-harness.sh, commit it, and run it on every Friday afternoon as a health check. The CTO can run it during a coffee. The procurement reviewer can run it during a security review.

scripts/audit-eval-harness.sh

What a live harness looks like on a shipped agent

A literal run against the Monetizy.ai-shape repo at the end of a 6-week engagement. Six of six. The harness is wired into the workflow, not just the repo.

monetizy/outbound-agent -- 6 of 6 PASS

What a shelfware harness looks like on a real pre-engagement repo

A run against a representative pre-engagement repo. The eval directory exists, the CI job runs, the README points at it. Five of six come back FAIL. Each FAIL names the next PR.

acme/legacy-agent -- 1 of 6, harness is shelfware

The 5-PR build order that produces a harness which passes the audit

Five pull requests in 21 days. The first four ship the artifacts. The fifth is the first model swap, which is what proves the harness is load-bearing. Most teams stop at PR 3 because the harness exists at that point; the audit was designed to catch teams who stopped at PR 3.

Days 1, 7, 10, 14, 21

PR 1, day 1
rubric.yaml at the repo root. model_primary, model_fallback, thresholds, ownership block. 30 lines of YAML.
PR 2, day 7
eval/cases.yaml with 80 cases drawn from real traffic plus a 12-row adversarial set written with the product lead.
PR 3, day 10
.github/workflows/eval.yml that runs the rubric on every PR touching agents/, eval/, or rubric.yaml and posts a scorecard comment.
PR 4, day 14
ops/failure_playbook.md naming primary, fallback, fallback-of-fallback, with trigger conditions and a links to dashboards.
5
PR 5, day 21
First model swap PR. One-line edit to rubric.yaml. The harness either passes or surfaces the regression. This is the proof.

The same five PRs, with the why for each

Each PR is a discrete artifact and a discrete failure mode it closes. PR 5 is not optional. A harness that has never been used in a model swap is a harness the team has not yet trusted, and untrusted harnesses do not survive the next quarter.

Day 1, PR 1: rubric.yaml at the repo root

30 lines of YAML. model_primary, model_fallback, thresholds (rubric_min_score, ragas_faithfulness_min, ragas_answer_relevancy_min, max_per_case_regression), ownership block (primary_engineer email and on_call_rotation). The first time you commit this file is the first time production-readiness is defined for this workload. Without it the harness has no asserts to make.

Day 7, PR 2: eval/cases.yaml drawn from your real traffic

80 rows minimum. 68 from production_log, 12 adversarial rows the engineer wrote with the product lead during a 90-minute session. Each row has an id, source, input, expected_traits, must_not_include, and rubric_weight. The cases predict incidents on your workload; a vendor leaderboard does not.

Day 10, PR 3: .github/workflows/eval.yml as the gate

Triggers on pull_request when paths include agents/**, eval/**, or rubric.yaml. Runs eval/run.py against the case set. Posts the scorecard with gh pr comment so reviewers see the score in the PR thread, not buried in a build artifact. Fails the build when max_per_case_regression is breached. Without the comment, the harness is a metric without a feedback loop.

Day 14, PR 4: ops/failure_playbook.md linking the rubric to on-call

Names the primary model, the fallback, the fallback-of-fallback, the trigger condition for each, and the dashboards that monitor latency and error rate per provider. Updated in the same PR every time model_primary changes in rubric.yaml. The on-call engineer reads this at 3am Saturday; the harness gives them the score, the playbook gives them the next move.

Day 21, PR 5: the first model swap, as proof the harness works

A one-line edit to rubric.yaml: model_primary changes from one provider release to the next. The CI job runs the rubric, posts the scorecard, and either passes or surfaces a regression with a per-case diff. Either outcome is correct. The wrong outcome is not running the swap because nobody trusts the harness yet. PR 5 is what makes the harness load-bearing for every model release after.

Day 30+, weekly: append to eval/cases.yaml

Every Friday, the engineer triages the week's incidents and adds 5 to 20 rows to eval/cases.yaml from the real traffic that surfaced them. This is the line that separates a maintained harness from a frozen one. The audit checks for a human commit in the last 30 days because that line is where most harnesses go shelf.

rubric.yaml, in full

The exact shape we ship in week 1 across every named production agent. The audit reads this file on Q1, Q2, and Q5: file present, model_primary appears once, ownership block is named.

rubric.yaml

eval/cases.yaml, the part most articles never show

Four rows out of an 80-row file. Two from production_log (the engineer pulled them out of the last 30 days of real traffic). Two adversarial rows the engineer wrote with the product lead in a 90-minute session. Each row has an id, a source, an input, an expected_traits list, and a must_not_include list. The adversarial rows carry a higher rubric_weight because a single jailbreak failure costs more than a single style miss.

eval/cases.yaml

.github/workflows/eval.yml, the line that makes Q4 pass

The single line that distinguishes a live harness from a shelfware one is the gh pr comment line. A CI job that writes the score to an artifact nobody opens is a metric without a feedback loop. Posting to the PR forces the score into the review-window where the engineer can act on it.

.github/workflows/eval.yml

How the harness wires together

The CI job is the hub. Three classes of PR feed it on the left. Three downstream surfaces consume the result on the right. Nothing vendor-hosted in the diagram on purpose; the audit will fail on Q4 if the result lives only in a third-party dashboard.

PRs -> rubric scorecard -> review surface

Receipts: how the audit and the build sequence compare to existing playbooks

Left: the audit and the 5-PR build. Right: the architecture-diagram playbook that shows up in most existing pages on this topic. Neither column is wrong. The right column is incomplete: it tells you what a harness is, not whether yours is alive.

Feature	Architecture-diagram playbook	Audit + 5-PR build
How the harness is described	Architecture diagrams labeled guides + sensors, six-layer stacks, abstract pillars	Six shell commands you can run against your repo today, each with a hard pass/fail and a fix
Time to find out if your harness is shelfware	A reading list of 6 articles, each with a different rubric and no run-it-yourself check	Five minutes, one shell script, on a checkout you already have
Build sequence	4 to 12 weeks, 5K to 20K lines of infrastructure code, no PR-by-PR sequence	5 PRs in 21 days, with day-by-day milestones and the first model swap as the proof PR
Cases.yaml shape	Mentions adversarial testing as one of six layers; never shows the file	80 rows, 68 from real traffic + 12 adversarial, exact YAML schema with id/source/expected_traits/must_not_include/rubric_weight
Where the model name lives	Hard-coded in code samples, never measured against the eval pipeline	Exactly one line in rubric.yaml, enforced by grep -c, swap is a one-line PR
Who owns the harness	Ownership treated as an org-design problem; never enforced in the artifact	primary_engineer email and on_call_rotation are required fields in rubric.yaml; audit fails without them
Definition of working	Defined as the harness existing; never asks whether it has ever fired	git log shows the harness has blocked at least one PR in the last 6 months

The 8-bullet pass-state of a live harness

rubric.yaml exists at the repo root on main
model_primary appears exactly once in rubric.yaml
eval/cases.yaml has at least one human commit in last 30 days
.github/workflows/eval.yml posts a comment on every relevant PR
rubric.yaml names a primary engineer email and on-call rotation
git log shows the harness has blocked at least one PR in the last 6 months
eval/cases.yaml includes adversarial rows authored with the product lead
ragas thresholds (faithfulness, answer_relevancy) live in rubric.yaml, not in code

6 of 6

“Six shell commands, five PRs in 21 days, one model-swap PR as the proof. Same shape across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom orchestration on OpenLaw, an automated nightly DAG on PriceFox, and a multi-model pipeline on OpenArt. The harness is the thing that survives the engineer leaving.”

PIAS leave-behind across 5 named production agents, model-vendor neutral

Counts that anchor the rest of the page

Engagement-level facts, not invented benchmarks. Per-client production metrics live on /wins.

0Shell commands in the audit, each with a hard pass/fail

0PRs that produce a harness which passes the audit

0Minimum cases in eval/cases.yaml; 68 real, 12 adversarial

0 daysFrom day-1 PR to first model-swap proof PR

0 commands. 0 PRs. 0 minimum cases. 0 days to the model-swap proof PR. The five numbers that distinguish a live harness from a described one.

Want a senior engineer to run the audit on your repo and write the missing PRs?

60-minute scoping call with the engineer who would own the build. You leave with the 6-command audit run against your repo, the lowest-numbered FAIL named, and a fixed weekly rate to land each of the 5 PRs.

Eval harness, the audit, the 5 PRs, answered

What is an AI agent eval harness, in one paragraph?

An AI agent eval harness is a thin set of files in your repo that defines what correct means for your workload (rubric.yaml), the case set that grades against (eval/cases.yaml plus eval/ragas.jsonl), the CI job that runs the grade on every relevant PR (.github/workflows/eval.yml), and the on-call link that tells you what to do when the grade slips (ops/failure_playbook.md). It is not a SaaS product, not a six-layer architecture diagram, and not a four-month infrastructure project. It is the smallest thing that lets you swap model_primary on Friday without paging anyone on Saturday.

Why is the audit six questions and not twelve or three?

Because each of the six maps to a distinct shelfware failure mode we have seen in real repos. Q1 catches the harness that lives in Notion. Q2 catches the one where model swaps are four-file changes nobody runs. Q3 catches the case set that froze at week 2. Q4 catches the CI job whose output nobody reads. Q5 catches the rubric that has no owner. Q6 catches the harness that exists but has never blocked a single PR. Adding a seventh question we have not seen failures of would dilute the audit; cutting any of the six would let a known failure mode slip past.

Can I run the audit against an agent that already runs in production?

Yes, and that is the more common entry point. We run the six-command audit on day zero of every engagement, including the ones where the agent already serves traffic. A representative pre-engagement repo passes Q1 and Q5 and fails the rest: rubric.yaml exists, ownership block is present, but cases.yaml is 87 days stale, model_primary is hard-coded in two .py files, eval.yml writes to a build artifact instead of the PR, and the harness has never surfaced a regression in git log. The first PR after the audit is usually the one that adds the gh pr comment line to eval.yml. The lowest-numbered failure is always the right next PR.

Why do you require eval/cases.yaml to be 80 rows, with 12 adversarial?

80 is the smallest case-set we have seen reliably catch a per-prompt regression on a real workload across five named production agents (Monetizy.ai, Upstate Remedial, OpenLaw, OpenArt, PriceFox). Below 60 the variance dominates; the same model output can flip the score by 4 points run-to-run, which is louder than the regression you are trying to catch. The 12 adversarial rows force the engineer to sit with the product lead for a 90-minute session and write the inputs that would make the agent dangerous. That session is half the value; the other half is that adversarial rows turn vague safety claims into a row that either passes or does not.

Does this work for LangGraph, CrewAI, AutoGen, Pydantic AI, or only one stack?

All five. The harness lives outside the agent code by design. eval/run.py imports your agent's entry point, runs it against the cases, and scores against the rubric. We have shipped the same harness shape on Pydantic AI (Monetizy), LangGraph + Bedrock (Upstate Remedial), Anthropic with custom orchestration (OpenLaw), an automated nightly DAG (PriceFox), and a multi-model pipeline DAG (OpenArt). The framework choice is downstream of the harness. Treating the harness as a property of the framework is the inversion that produces the shelfware in the first place.

Where do the rubric thresholds (0.82, 0.78) come from? Can I just inherit yours?

The numbers are defaults from agents we shipped, and you should overwrite them at week 0. They come from the scoping call with your product lead, where you write down what fraction of cases must score above what bar before the agent is allowed to ship. rubric_min_score 0.82 means at least 82 percent of cases must score 1.0 against the rubric. ragas_faithfulness 0.78 is a citation-grounding floor that has held up across legal-tech (OpenLaw) and email personalization (Monetizy). You picking your own numbers is the discipline; inheriting ours and never revisiting them is how a rubric becomes evidence in a postmortem instead of a control.

What does it mean for the harness to block a PR? Is that not annoying for engineers?

It means the PR cannot merge until either the regression is fixed or the threshold is explicitly raised in a separate PR with a written rationale. Engineers complain for the first week. They stop complaining around week 3, when the harness catches a quiet regression in a model swap nobody else would have noticed and the agent does not page anyone the following Saturday. The annoying-vs-load-bearing tradeoff is the entire point of Q6 in the audit. A harness that has never blocked a PR is not friendly. It is invisible.

How is this different from promptfoo, DeepEval, ragas, or LangSmith?

Those are tools the eval harness uses; they are not the harness. ragas computes faithfulness and answer_relevancy against your case set, and the rubric.yaml threshold for each is what makes the score actionable. promptfoo and DeepEval are runners; the harness is the wiring that decides which runner is invoked, on which cases, on which PR, and what the failure does. LangSmith is a useful trace store and a non-required complement. The shelfware audit asks whether your repo has the wiring, not which runner is in your requirements.txt.

What if my workload is non-text (vision, audio, code generation)? Does the rubric still work?

Yes, with two adjustments. The expected_traits in cases.yaml become per-modality (an image must contain a face in the lower-left third, an audio clip must transcribe to a known string, a code output must compile and pass a unit test). The rubric scorer in eval/run.py becomes per-modality (a vision model judges an image, a sandboxed Python runner grades a code output). The five-PR build sequence and the six-question audit do not change. The OpenArt pipeline (multi-scene commercial video) ships with an image-grading rubric and the same shape of cases.yaml, just with different expected_traits keys.

Why a 60-line bash audit and not a SaaS dashboard?

Because the harness is supposed to be in your repo, run by your CI, owned by your engineer. A SaaS dashboard that grades your harness is a vendor-attached runtime that the harness itself is designed to flag. Sixty lines of bash needs no license, runs on the same checkout your CI runs on, can be committed to your repo as scripts/audit-eval-harness.sh, and survives the engineer leaving. The discipline is the audit, not the tool that runs it. Every PIAS engagement leaves behind both the audit script and a baseline run on day-0.

Adjacent guides