Guide, topic: production agent eval harness for enterprise, 2026

Your eval harness is not enterprise-grade until six properties survive a vendor exit test.

Most pages on this topic describe an eval harness as architecture: layers, components, vendor dashboards. An enterprise-grade harness has six procurement-grade properties orthogonal to whether it grades correctly. It lives in your repo, model_primary is one swappable line, it emits EU AI Act Annex IV artifacts as a byproduct, it commits a signed per-PR scorecard to git, it names a primary engineer in the rubric file itself, and it survives any engineer or vendor leaving. Each property has a shell test. The vendor exit script at the end is the one we run on a candidate vendor's reference repo before our clients sign.

Matthew Diakonov
6 procurement-grade properties, mapped to Annex IV section numbers, with a 5-step bash exit test
Property 1 to 6 verified on a candidate vendor's repo with bash + git + grep, in 15 minutes
Annex IV §2(g), §3, §9 mapped to rubric.yaml, eval/cases.yaml, and the post-deploy live judge
Same harness shape across Pydantic AI, LangGraph + Bedrock, custom Anthropic, automated DAG, multi-model pipeline

Same procurement rubric across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom Anthropic on OpenLaw, an automated DAG on PriceFox, and a multi-model pipeline on OpenArt.

Five named production agents, model-vendor neutral by construction


Direct answer (verified 2026-05-01 against artificialintelligenceact.eu/annex/4)

What makes an eval harness enterprise-grade?

Six properties, each verifiable by a shell command on the candidate vendor's repo:

  1. Lives in the client repo (rubric.yaml, eval/cases.yaml, .github/workflows/eval.yml), not a vendor SaaS.
  2. model_primary is exactly one line in rubric.yaml, swappable without vendor permission.
  3. Emits EU AI Act Annex IV §2(g) (validation procedures, data characteristics, accuracy and robustness metrics), §3 (monitoring, functioning, and control), and §9 (post-market monitoring) artifacts as a byproduct of normal operation.
  4. Commits a signed per-PR scorecard to git history, so the audit trail does not depend on a vendor database.
  5. Names a primary engineer email and an on-call rotation directly inside rubric.yaml; the audit fails if either line is missing.
  6. Survives any engineer or vendor exit, verified by a five-step bash script run before signing and weekly in CI after handoff.

Source for the Annex IV citations: the text of Annex IV as published at artificialintelligenceact.eu/annex/4. Article 11 enforcement for Annex III high-risk systems begins August 2, 2026.

Why most pages on this topic are answering a different question

Open the dozen or so guides that currently come up for this topic. Five of them are vendor pages selling a SaaS dashboard. Three are framework blog posts describing harness anatomy as a property of the framework. Two are repo READMEs with simulator counts. One is a generic "five layers from demo to production" article. They all answer the question "what is an eval harness." None of them answers "what makes an eval harness enterprise-grade," because the second question is not about architecture. It is about the properties of the artifact that a procurement reviewer, a legal reviewer, and an on-call engineer can independently verify.

We learned this on the third engagement we ran. The agent worked. The harness ran. The CI was green. The vendor we replaced had built a perfectly competent harness, and it was useless to the client the day the contract ended. The rubric lived in the vendor's hosted console. The case set was a CSV in a vendor S3 bucket. The runner was a vendor binary on a vendor runner. The thresholds were a dashboard slider, not a YAML key. Everything described in every existing guide on this topic was satisfied. None of it was enterprise-grade.

The six properties below are the inversion. They are properties of the harness as code, not of the vendor's company. They are independently verifiable on a checkout, by a reviewer who has never spoken to the vendor. They survive a vendor exit because we test the exit before we sign, and we test it again every Friday.

The procurement rubric: six properties, each with a shell test

Each row names a property, then the SaaS-attached version we replaced on three engagements, then the repo-native version with its leave-behind file. Neither version is wrong about what the harness does. The SaaS-attached version is wrong about who owns the harness when the relationship ends.

| Feature | SaaS-attached harness | Repo-native harness (PIAS shape) |
| --- | --- | --- |
| Property 1. Where the harness lives | A vendor SaaS that hosts the rubric, the case set, and the runner. Stops grading the day the contract lapses or the vendor pivots. | rubric.yaml, eval/cases.yaml, .github/workflows/eval.yml in your repo. Runs on your CI minutes. Survives the vendor's billing portal going dark. |
| Property 2. Where model_primary lives | Hard-coded in agent code, in a vendor dashboard, or in three .py files. Swapping requires a vendor ticket or a refactor PR. | Exactly one line in rubric.yaml. grep -c '^model_primary:' rubric.yaml returns 1. Swap is a one-line PR, no vendor permission required. |
| Property 3. EU AI Act Annex IV mapping | Annex IV mentioned as a feature on a marketing page. Documentation lives in a vendor PDF you cannot regenerate from your repo. | rubric.yaml + eval/cases.yaml + the per-PR scorecard satisfy §2(g) (validation/testing procedures, metrics for accuracy and robustness) and §9 (post-market monitoring) by construction. The procurement reviewer reads the same files. |
| Property 4. Audit trail per PR | Scorecards live in a vendor's database. Export is JSON-on-demand. The audit depends on the vendor still serving the data. | eval/_artifacts/scorecard-<sha>.md committed to a long-lived branch on every CI run. Signed with the GitHub commit signature. The audit reconstructs any past decision from git log. |
| Property 5. Named owner inside rubric.yaml | Ownership is an org-design problem solved in Confluence. The harness has no opinion on who gets paged. | primary_engineer: <human@email>, on_call_rotation: <slack-or-pagerduty-tag>. The audit fails on Q5 if either line is missing. The rubric pages a human, not an inbox. |
| Property 6. Vendor exit survival | Vendor exit is a 90-day rip-and-replace covered by a contractual clause that nobody has tested. | Run scripts/vendor-exit-test.sh: delete the vendor wrapper, swap model_primary to a different provider, run the rubric. Pass means the harness survives. Failure names the lock-in. |

Property 3 in detail: how the harness satisfies EU AI Act Annex IV

Annex IV is the technical documentation schedule referenced by Article 11. It has nine numbered sections. Three of them are satisfied by a well-shaped eval harness with no separate documentation pass. The mapping is mechanical, which is the point; documentation that requires regeneration every quarter is documentation that is wrong every quarter.

  • Annex IV §2(g)

    Validation and testing procedures used, including information about the validation and testing data used and their main characteristics; metrics used to measure accuracy, robustness. The artifact in your repo: rubric.yaml (the metrics and thresholds) plus eval/cases.yaml (the validation data and its characteristics, with id, source, expected_traits, and rubric_weight per row).

  • Annex IV §3

    Information about monitoring, functioning, and control of the AI system, including capabilities and limitations in performance, degrees of accuracy for specific persons or groups, and foreseeable unintended outcomes. The artifact in your repo: eval/cases.yaml adversarial rows authored with the product lead, plus the per-case scorecard that records degree of accuracy on each subgroup over time.

  • Annex IV §9

    Detailed description of the system to evaluate AI system performance in the post-market phase, including the post-market monitoring plan referred to in Article 72. The artifact in your repo: the post-deploy live-judge that samples production traffic against the same rubric.yaml, the 5-point drift threshold, and the Friday triage that lifts the worst rows back into eval/cases.yaml.

The rest of Annex IV (general description, system architecture, data requirements, human oversight, lifecycle changes, standards compliance, declaration of conformity) is not satisfied by the harness alone. That material sits alongside it in ARCHITECTURE.md, ops/runbook.md, and ops/conformity-declaration.md. The procurement rubric only claims the three sections that the harness covers.
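The post-market check described under §9 can be sketched in a few lines of shell. This is a minimal sketch under stated assumptions: the source does not define the score scale or the comparison, so it assumes integer scores on a 0 to 100 scale and reads "5-point drift" as a drop of more than 5 points below the CI baseline. The names `drift_exceeded` and `triage_line` and the sample scores are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical drift check: compares a live-judge score with the CI baseline.
# Assumption: scores are integers on a 0-100 scale and "5-point drift"
# means a drop of more than 5 points below the baseline.

DRIFT_THRESHOLD=5

# Returns 0 (true) when the live score has dropped by more than the threshold.
drift_exceeded() {
  local baseline="$1" live="$2"
  local drop=$(( baseline - live ))
  [ "$drop" -gt "$DRIFT_THRESHOLD" ]
}

# Prints a one-line verdict that a Friday triage could sort on.
triage_line() {
  local baseline="$1" live="$2"
  if drift_exceeded "$baseline" "$live"; then
    echo "DRIFT baseline=$baseline live=$live"
  else
    echo "OK baseline=$baseline live=$live"
  fi
}

triage_line 88 86   # drop of 2, within the threshold
triage_line 88 81   # drop of 7, flagged
```

Rows flagged as DRIFT would be the candidates to lift back into eval/cases.yaml during triage.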

rubric.procurement.yaml: the file procurement actually reads

One YAML file, lives at the repo root next to rubric.yaml, edited by PR. Procurement reads the same file the engineer edits. Legal maps the threshold lines to Annex IV section numbers. The on-call engineer reads the ownership block. The vendor exit script asserts on the vendor_exit block. One artifact, three audiences, no PDFs to regenerate.

rubric.procurement.yaml
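A hypothetical sketch of what such a file might contain, written into a scratch directory so the block checks can be tried without touching a real repo. Only the file name, the ownership and vendor_exit blocks, the threshold names, and the Annex IV section numbers come from this page; the key layout, every value, and the `live_drift_max_points` key are illustrative assumptions, not the actual schema.

```shell
#!/usr/bin/env bash
# Writes a hypothetical rubric.procurement.yaml and checks that the blocks
# each audience reads are present. Keys and values are assumptions.

workdir="$(mktemp -d "${TMPDIR:-/tmp}/procurement.XXXXXX")"
cat > "$workdir/rubric.procurement.yaml" <<'YAML'
classification: annex_iii_high_risk      # hypothetical value
thresholds:                              # legal maps each to an Annex IV section
  rubric_min_score:       { annex_iv: "2(g)" }
  ragas_faithfulness_min: { annex_iv: "2(g)" }
  live_drift_max_points:  { annex_iv: "9" }   # hypothetical key name
ownership:                               # read by the on-call engineer
  primary_engineer: engineer@example.com
  on_call_rotation: "@agent-oncall"
vendor_exit:                             # asserted on by the exit script
  script: scripts/vendor-exit-test.sh
  steps: 5
YAML

# Returns 0 when the file has a top-level block with the given name.
has_block() {
  grep -q "^$1:" "$workdir/rubric.procurement.yaml"
}

missing=0
for block in classification thresholds ownership vendor_exit; do
  if has_block "$block"; then
    echo "ok      $block"
  else
    echo "missing $block"
    missing=$((missing + 1))
  fi
done
echo "missing blocks: $missing"
```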

Anchor: the vendor exit test, in shell

Property 6 is the only property that gets harder to test after the relationship ends, which is why we run it before signing. Five steps, no dependencies beyond bash, git, and grep. Run from the root of the candidate vendor's reference repo. We have walked away from two engagements over a 0-of-5 result on this script.

scripts/vendor-exit-test.sh
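A minimal sketch of what a five-step exit test of this kind might look like, reconstructed from the checks named elsewhere on this page (harness files in the repo, a single model_primary line, CI on your own runner, scorecards in the tree, a reversible provider swap). It is not the PIAS script. The function names, messages, fixture contents, the `runs-on:` proxy for "your CI minutes", and the filesystem-only scorecard check are all assumptions, and the sketch does not run the rubric against a second provider.

```shell
#!/usr/bin/env bash
# Hypothetical five-step exit test. A real run would also ask git whether
# the scorecards are committed and would execute the rubric after the swap.

# Prints PASS/FAIL for one step and returns the step's status.
report() {
  local label="$1" status="$2"
  if [ "$status" -eq 0 ]; then echo "PASS $label"; else echo "FAIL $label"; fi
  return "$status"
}

vendor_exit_test() {
  local repo="$1" passed=0 rubric="$1/rubric.yaml" backup

  # 1. Harness files live in the repo.
  [ -f "$rubric" ] && [ -f "$repo/eval/cases.yaml" ] && [ -f "$repo/.github/workflows/eval.yml" ]
  report "1 harness lives in repo" $? && passed=$((passed + 1))

  # 2. model_primary is exactly one line.
  [ "$(grep -c '^model_primary:' "$rubric" 2>/dev/null)" = "1" ]
  report "2 model_primary is one line" $? && passed=$((passed + 1))

  # 3. The workflow declares a runner (assumed proxy for "your CI minutes").
  grep -q 'runs-on:' "$repo/.github/workflows/eval.yml" 2>/dev/null
  report "3 CI declares its runner" $? && passed=$((passed + 1))

  # 4. At least one scorecard exists under eval/_artifacts.
  ls "$repo"/eval/_artifacts/scorecard-*.md >/dev/null 2>&1
  report "4 scorecards in repo" $? && passed=$((passed + 1))

  # 5. Swap the provider, confirm still one line, restore, compare bytes.
  backup="$(mktemp "${TMPDIR:-/tmp}/rubric-backup.XXXXXX")"
  cp "$rubric" "$backup" 2>/dev/null &&
    sed 's|^model_primary:.*|model_primary: other-provider/other-model|' "$backup" > "$rubric" &&
    [ "$(grep -c '^model_primary:' "$rubric")" = "1" ] &&
    cp "$backup" "$rubric" && cmp -s "$backup" "$rubric"
  report "5 provider swap is reversible" $? && passed=$((passed + 1))
  rm -f "$backup"

  echo "$passed of 5"
}

# Demo on a scratch fixture shaped like a repo-native harness.
fixture="$(mktemp -d "${TMPDIR:-/tmp}/exit-test.XXXXXX")"
mkdir -p "$fixture/eval/_artifacts" "$fixture/.github/workflows"
printf 'model_primary: provider-a/model-x\nrubric_min_score: 80\n' > "$fixture/rubric.yaml"
echo '- id: real-001' > "$fixture/eval/cases.yaml"
echo '    runs-on: ubuntu-latest' > "$fixture/.github/workflows/eval.yml"
echo '# scorecard' > "$fixture/eval/_artifacts/scorecard-abc1234.md"

vendor_exit_test "$fixture"
```

Each step reports on its own line, so a failing run names the specific lock rather than returning a single red status.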

What a passing exit test looks like on a repo-native harness

A literal run against the Monetizy.ai-shape repo at the end of a 6-week PIAS engagement. Five of five. The harness survives PIAS leaving, and survives a future provider swap.

monetizy/outbound-agent -- 5 of 5 PASS

What a failing exit test looks like on a representative SaaS-attached repo

A run against a representative pre-engagement candidate repo we audited on day -7 of a follow-on engagement. Zero of five. Each failure names a vendor lock that costs real money to unwind during an exit, and is cheap to fix during a fresh engagement.

acme/legacy-agent -- 0 of 5, harness has 5 vendor locks

The pre-signature checklist, in 6 grep lines

If your procurement team prefers a one-page checklist over a bash script, here it is. Six lines, six properties, one to one with the rubric above. Drop it into the technical-due-diligence section of the SoW review template.

docs/pre-signature-checklist.sh
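A hedged sketch of how six such checks could be written, one per property. The file names come from this page; the grep patterns (in particular `annex_iv` for Property 3), the output format, and the fixture are assumptions. The fixture deliberately omits the on_call_rotation line and the exit script so that two checks fail.

```shell
#!/usr/bin/env bash
# Hypothetical pre-signature checklist: one check per property.

# Prints "P<n> ok" or "P<n> FAIL" from an exit status.
p() { if [ "$2" -eq 0 ]; then echo "P$1 ok"; else echo "P$1 FAIL"; fi; }

# Runs in a subshell so the cd does not leak into the caller.
pre_signature_checklist() (
  cd "$1" || exit 1
  ls rubric.yaml eval/cases.yaml .github/workflows/eval.yml >/dev/null 2>&1; p 1 $?
  [ "$(grep -c '^model_primary:' rubric.yaml 2>/dev/null)" = "1" ]; p 2 $?
  grep -q 'annex_iv' rubric.procurement.yaml 2>/dev/null; p 3 $?
  ls eval/_artifacts/scorecard-*.md >/dev/null 2>&1; p 4 $?
  grep -q '^primary_engineer:' rubric.yaml 2>/dev/null &&
    grep -q '^on_call_rotation:' rubric.yaml 2>/dev/null; p 5 $?
  [ -f scripts/vendor-exit-test.sh ]; p 6 $?
)

# Scratch fixture: passes Properties 1 to 4, fails 5 and 6.
repo="$(mktemp -d "${TMPDIR:-/tmp}/presign.XXXXXX")"
mkdir -p "$repo/eval/_artifacts" "$repo/.github/workflows"
printf 'model_primary: provider-a/model-x\nprimary_engineer: engineer@example.com\n' > "$repo/rubric.yaml"
: > "$repo/eval/cases.yaml"
: > "$repo/.github/workflows/eval.yml"
: > "$repo/eval/_artifacts/scorecard-abc1234.md"
echo 'annex_iv: ["2(g)", "3", "9"]' > "$repo/rubric.procurement.yaml"

pre_signature_checklist "$repo"
```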

How the rubric runs across the engagement timeline

The procurement rubric is not a one-time check. It runs at five points: during the scoping call, during the technical review, during the legal review, on the kickoff PR, and then weekly after handoff. The point is not the audit; the point is that the audit cannot lie because every artifact it grades is in your repo.

1. Day minus 14, scoping call: read the candidate's rubric.yaml in the call

Ask the vendor to share-screen and walk through rubric.yaml in their reference repo. If the file does not exist, or the rubric lives only in a SaaS console, Property 1 fails before the proposal is written. We have walked away from engagements over this single property.

2. Day minus 7, technical review: clone the reference repo, run vendor-exit-test.sh

The vendor sends a representative checkout of an engagement they shipped (with NDA-friendly redactions). Five steps, one bash script, fifteen minutes. A 4-of-5 or better passes; a 3-of-5 is renegotiable; anything below 3-of-5 is a no.

3. Day minus 3, legal review: map rubric.procurement.yaml to Annex IV by section number

Property 3 is a legal-review property, not an engineering one. Open Annex IV. For each of sections 2(g), 3, and 9, point at the file in the candidate repo that satisfies it. A vendor who cannot do this in 30 minutes will not be ready when the August 2, 2026 enforcement date arrives.

4. Day 1, kickoff PR: PIAS lands rubric.yaml, rubric.procurement.yaml, and vendor-exit-test.sh

Three files in the first week. The CI gate goes green by day 10. The first model swap PR (the proof PR) lands by day 21. By day 42, the harness has blocked at least one regression in production, which is what makes Property 6 real instead of theoretical.

5. Day 42 handoff: vendor-exit-test.sh runs in your CI as a weekly cron

The leave-behind includes a .github/workflows/vendor-exit-cron.yml that runs the test every Friday and posts to your engineering channel. The day PIAS leaves, the cron keeps running. Six months in, the cron is what proves the leave-behind held up.
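The day minus 7 scoring rule in the timeline above can be written down as a small function, which keeps the negotiation stance out of anyone's memory. This is a sketch under one assumption: the rule is read as 4 or 5 passes, exactly 3 is renegotiable, and 2 or fewer is a no. The function name `verdict` and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Maps an exit-test score (0-5) to a negotiation stance.
# Assumption: 4 or 5 passes, exactly 3 is renegotiable, 2 or fewer is a no.

verdict() {
  local score="$1"
  if [ "$score" -ge 4 ]; then
    echo "proceed"
  elif [ "$score" -eq 3 ]; then
    echo "renegotiate"
  else
    echo "walk away"
  fi
}

for s in 5 3 0; do
  echo "$s of 5: $(verdict "$s")"
done
```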

Receipts: how the procurement rubric compares to the vendor pitch

Each row compares the typical vendor pitch on this topic, simplified but recognizable, with the rubric and the leave-behind. Neither is dishonest. The vendor pitch answers a different question, and the difference is what makes Property 1 through Property 6 worth checking before the contract is signed.

| Feature | Typical vendor pitch on this topic | Procurement rubric (PIAS shape) |
| --- | --- | --- |
| What the page asks | What is an eval harness, in concept? Six layers, four pillars, a vendor dashboard demo. | Is this harness enterprise-ready? Six properties orthogonal to grading correctness, each with a shell test. |
| Definition of enterprise | Self-hosted deployment for enterprise customers, plus SOC 2 on the vendor's website. | Procurement reviews the harness. Legal maps it to Annex IV. The harness pages a named human. The vendor can leave without the harness leaving with them. |
| EU AI Act treatment | Mentions compliance as a feature pillar. PDFs available on request, regenerated by a vendor team you cannot ping at 2am. | Annex IV §2(g), §3, §9 are mapped to specific files in the repo. The procurement reviewer reads the same artifact the engineer edits. |
| Audit trail | Scorecards live in a vendor database. Export is a JSON download. Audit depends on the vendor's data retention policy. | eval/_artifacts/scorecard-<sha>.md committed to a long-lived branch on every CI run. git log reconstructs any past decision. |
| Vendor exit | A 90-day exit clause in the contract, never tested mechanically. | scripts/vendor-exit-test.sh, run on a candidate's reference repo before signing, and weekly in your CI after handoff. |
| Model vendor neutrality | Single-provider runner with a marketing claim about portability. Swapping requires a vendor ticket. | model_primary is one line. Anthropic, OpenAI, Bedrock, Vertex, Azure, or open-weight per the client's choice. The runner abstracts the provider. |
| Definition of working | Defined as the harness existing and the dashboard rendering. Never asks whether it has fired. | git log shows the harness has blocked at least one PR in the last 6 months. The vendor-exit cron has run green for 8 consecutive weeks. |
6 of 6

Six procurement-grade properties, one bash script for the vendor exit test, and an Annex IV mapping a legal reviewer can verify on a fresh checkout. Same shape across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom Anthropic on OpenLaw, an automated nightly DAG on PriceFox, and a multi-model pipeline on OpenArt. The harness is the thing that survives the vendor leaving.

PIAS leave-behind across 5 named production agents, model-vendor neutral, no platform license

Want a senior engineer to run the procurement rubric on a candidate vendor's repo with you?

60-minute call with the engineer who would own the build. You leave with the 6-property rubric scored against your candidate, the failing properties named, and a fixed weekly rate to land the missing PRs.

Procurement-grade eval harness, the rubric, the exit test, answered

What does enterprise-grade mean for an eval harness, in one sentence?

An enterprise-grade eval harness is one that survives three independent reviews on day -7 of an engagement: an engineering review (does it grade correctly), a procurement review (does it live in our repo, can we audit it, does it map to Annex IV), and a legal review (will it still be ours if the vendor exits). Most existing harnesses pass the first review and fail the other two. The six properties on this page are the three-review checklist collapsed into shell tests.

Why these six properties and not a SOC 2 control list?

SOC 2 is a control framework for the company that builds the system; it does not describe properties of the system itself. A vendor can be SOC 2 Type II compliant and still ship a harness that lives in their SaaS, has model_primary hard-coded in three .py files, and produces no Annex IV artifacts. The six properties on this page are properties of the harness as code, not of the vendor's company. They are independently verifiable on a checkout, by a reviewer who has never spoken to the vendor.

Is the August 2, 2026 EU AI Act deadline real, and does it apply to my agent?

Yes, August 2, 2026 is the date Annex III high-risk AI system requirements become enforceable for new systems placed on the EU market. Whether it applies depends on whether your agent meets the Annex III definition (the most common triggers are creditworthiness scoring, employment screening, essential public services, and law-enforcement use). If it does, Annex IV technical documentation is mandatory, and the eval harness is the cheapest place to source most of it. If it does not, the procurement rubric still holds; you just skip the rubric.procurement.yaml classification line.

What does Annex IV section 2(g) actually require, and how does my eval harness satisfy it?

Section 2(g) requires the technical documentation to describe the validation and testing procedures used, the validation and testing data and their main characteristics, and the metrics used to measure accuracy and robustness. The harness covers it directly: rubric.yaml is the metrics and the thresholds (rubric_min_score, ragas_faithfulness_min, ragas_answer_relevancy_min, max_per_case_regression). eval/cases.yaml is the validation data and its characteristics (id, source, expected_traits, must_not_include, rubric_weight per row). eval/run.py is the procedure. A reviewer reading these three files satisfies §2(g) without a separate documentation pass.
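A reviewer could confirm that the four thresholds named above are actually declared with a short check like the following. This is a sketch: only the key names come from this page, while the sample values, the assumption that the keys sit at the top level of the file, and the function name `missing_thresholds` are illustrative.

```shell
#!/usr/bin/env bash
# Checks that a rubric file declares the four thresholds named in the text.
# Assumption: the keys are top-level YAML keys; sample values are made up.

required_keys="rubric_min_score ragas_faithfulness_min ragas_answer_relevancy_min max_per_case_regression"

# Prints each missing key; returns 0 only when none are missing.
missing_thresholds() {
  local file="$1" key status=0
  for key in $required_keys; do
    if ! grep -q "^${key}:" "$file"; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}

# Sample rubric that deliberately leaves out max_per_case_regression.
sample="$(mktemp "${TMPDIR:-/tmp}/rubric.XXXXXX")"
cat > "$sample" <<'YAML'
model_primary: provider-a/model-x
rubric_min_score: 80
ragas_faithfulness_min: 0.85
ragas_answer_relevancy_min: 0.80
YAML

missing_thresholds "$sample"   # prints max_per_case_regression
```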

What is the vendor exit test, and why is it Property 6?

scripts/vendor-exit-test.sh is a five-step bash script that runs against a candidate vendor's reference repo before you sign, and against your own repo weekly after handoff. It checks that the harness lives in your repo, model_primary is editable, CI runs on your minutes, scorecards are committed to git, and a provider swap is reversible. It is Property 6 because the other five properties are useless if the harness silently relies on a vendor service that disappears with the contract. We added this property after the third engagement where a previous vendor's exit clause turned out to be a 90-day rip-and-replace that nobody had ever tested mechanically.

Does this work if my agent is on Bedrock, Vertex, Azure OpenAI, or a self-hosted open-weight model?

Yes, any of the above. The runner (eval/run.py) abstracts the provider behind a thin client class. The same rubric.yaml, the same eval/cases.yaml, and the same .github/workflows/eval.yml ship on Bedrock with Claude (Upstate Remedial), Anthropic direct (Monetizy), an automated nightly DAG with mixed providers (PriceFox), and a multi-model pipeline (OpenArt). The procurement rubric is provider-neutral by construction; if it were not, Property 2 would be unenforceable.

How is the per-PR scorecard signed, and why does that matter?

The CI job commits eval/_artifacts/scorecard-<sha>.md back to a long-lived branch with the standard GitHub Actions bot identity. On GitHub-hosted runners the commit is signed by GitHub when it is created through the GitHub API with GITHUB_TOKEN (a plain git push from the runner is not signed by default); on self-hosted runners it is signed with the org's signing key. The signature matters because procurement reviews the audit trail, not the dashboard. A signed commit on the org's own GitHub is evidence; a row in a vendor database is a screenshot at best. When a regulator or an auditor asks how the agent scored on case real-014 on 2026-03-12, git show <sha>:eval/_artifacts/scorecard-<sha>.md returns the answer in one command.

What does PIAS leave behind on day 42 that makes the six properties durable?

Three files plus one cron. rubric.yaml (week-1 PR) defines the thresholds and the model_primary line. rubric.procurement.yaml (week-2 PR) maps each threshold to its Annex IV section number, names the primary engineer, and points at the vendor exit script. scripts/vendor-exit-test.sh (week-2 PR) is the five-step shell test you can run before any future vendor engagement. .github/workflows/vendor-exit-cron.yml (week-6 PR) runs the test every Friday and posts to the engineering channel. The 90-minute transfer session walks through each file with a named engineer on your side. After day 42, none of these files reference PIAS; they reference your engineers, your CI, your runner.
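For illustration, a weekly cron workflow of the kind described above might look like the file this snippet writes into a scratch checkout. Only the Friday cadence, the workflow file name, and the script path come from this page. The 09:00 UTC hour, the job layout, the `ENG_CHANNEL_WEBHOOK` secret name, and the JSON payload shape are assumptions and would need to match your own channel integration.

```shell
#!/usr/bin/env bash
# Writes a hypothetical vendor-exit-cron.yml into a scratch checkout.
# The schedule hour, job layout, and webhook secret name are assumptions.

checkout="$(mktemp -d "${TMPDIR:-/tmp}/cron.XXXXXX")"
mkdir -p "$checkout/.github/workflows"
workflow="$checkout/.github/workflows/vendor-exit-cron.yml"

cat > "$workflow" <<'YAML'
name: vendor-exit-cron
on:
  schedule:
    - cron: "0 9 * * 5"          # day-of-week 5 is Friday
  workflow_dispatch:
jobs:
  exit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bash scripts/vendor-exit-test.sh
      - if: always()             # report green and red runs alike
        env:
          WEBHOOK: ${{ secrets.ENG_CHANNEL_WEBHOOK }}
        run: |
          curl -fsS -X POST -H 'Content-Type: application/json' \
            -d "{\"text\": \"vendor-exit-test: ${{ job.status }}\"}" "$WEBHOOK"
YAML

# Show the schedule line so the cadence is easy to review.
grep -n 'cron:' "$workflow"
```

Because the workflow calls only the script in the repo and a webhook the client controls, nothing in it depends on the departing vendor.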

What if my current vendor fails the procurement rubric? Can the harness be rebuilt without rebuilding the agent?

Yes, and this is the more common entry point. We have audited 4 in-flight vendor engagements on day zero of follow-on work; in 3 of them the harness failed Property 1 (it lived in the vendor's SaaS), Property 4 (no scorecards in git), and Property 6 (the exit test failed because model_primary was spread across 4+ files). The rebuild was 5 PRs over 21 days, with no changes to the agent runtime. The agent kept serving traffic; the harness moved into the client's repo, the model swap in week 3 took a one-line PR, and the August 2, 2026 documentation pass became a two-day exercise instead of a six-week scramble.
