Guide, topic: production agent eval rubric survival, 2026
The rubrics that survive an AI bubble are the ones keyed to user-observable behavior, not vendor branding.
Vendor pricing changes overnight. Model snapshots get deprecated. Runtimes get acquihired. AI budgets get cut. Whichever shape the correction takes, the question is the same: does your agent keep shipping? The answer is in your rubric.yaml. Clauses keyed to a literal token, regex, schema, length, or class label in the input or output keep grading through a model swap. Clauses keyed to vendor taxonomy (agentic, reasoning quality, tool use score) rot the morning the vendor pivots. This is the audit, the swap diff, and the rubric.yaml shape we ship across five named production engagements.
Direct answer (verified 2026-05-09)
Eval rubrics that survive an AI bubble are behavior-keyed: every clause references a literal token, regex, schema, length, or discrete class label that lives in the input or the output. Eval rubrics that do not survive are vendor-keyed: clauses like agentic_score, reasoning_quality, and tool_use_score depend on a vendor's current branding and lose their referent when the vendor pivots. The model_primary string belongs in agent.yaml as a one-line swappable value; rubric.yaml itself must never name a model or a vendor.
Source: rubric.yaml shape inspected across five named production engagements (Monetizy.ai, Upstate Remedial Management, OpenLaw, PriceFox, OpenArt). Verified operationally during model_primary swaps on each.
The two kinds of clauses, side by side
The same business outcome can be graded two ways. The vendor-keyed column is what most existing playbooks describe. The behavior-keyed column is the shape we ship. Read every row as a single sentence: when the vendor renames or rebrands, the vendor-keyed column stops grading; the behavior-keyed column does not.
| Feature | Vendor-keyed (rots) | Behavior-keyed (survives) |
|---|---|---|
| Order-number presence in a refund email | rubric_trait: agentic_response_quality, scored 0-1 by judge against 'how complete and helpful is the answer' | must_include: ['order_number']; judge returns 'present' / 'absent' on the literal token from input.order_id |
| No fabricated dollar amounts | rubric_trait: hallucination_score, scored 0-1 against 'does the response sound grounded' | must_not_include_pattern: '\$\d+(\.\d{2})?'; judge fails if the regex matches a span absent from the input or retrieved context |
| Output is valid JSON for a downstream API | rubric_trait: structured_output_quality, scored 0-1 against 'how well-formed is the JSON' | format: json.loads(output) succeeds AND jsonschema.validate(output, schema_v3) passes; binary |
| Refuses to draft a contract clause it should escalate | rubric_trait: tool_use_score, scored 0-1 against 'did the agent use the right tool' | expected_class: 'escalate'; judge classifies among {answer, escalate, refuse}; binary against expected_class |
| Cites the contract clause by section | rubric_trait: reasoning_quality, scored 0-1 against 'show your work' | must_include_pattern: '\bSection \d+(\.\d+)*\b'; judge fails if no section reference appears in the cited span |
| Stays under 280 characters for SMS | rubric_trait: conciseness, scored 0-1 by judge | format: len(output) <= 280; binary |
| What the rubric depends on | Vendor's product taxonomy: 'agentic', 'reasoning', 'tool use'. Renamed when the vendor rebrands. Reweighted when a stronger model ships | Customer-facing behavior the user can verify in the response. Stable across model_primary swaps, vendor rebrands, and judge model upgrades |
The right column never references a model name, a vendor, or a vendor's product taxonomy. That is the survival property.
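The behavior-keyed column reads naturally as a set of deterministic checks. The sketch below is a minimal illustration, not the shipped harness: the two regexes are copied from the table, while the function names, the sample email, and the required-key check (a stdlib-only stand-in for jsonschema.validate against schema_v3) are assumptions made for the example.

```python
import json
import re

# Patterns copied from the table; everything else here is illustrative.
DOLLAR = re.compile(r"\$\d+(\.\d{2})?")
SECTION = re.compile(r"\bSection \d+(\.\d+)*\b")


def must_include(output: str, token: str) -> bool:
    """Pass when the literal token taken from the input appears in the output."""
    return token in output


def no_fabricated_amounts(output: str, context: str) -> bool:
    """Pass when every dollar amount in the output also occurs in the context."""
    allowed = {m.group(0) for m in DOLLAR.finditer(context)}
    return all(m.group(0) in allowed for m in DOLLAR.finditer(output))


def valid_json_with_keys(output: str, required: set) -> bool:
    """Pass when the output parses as a JSON object holding the required keys.

    Assumption: a key check stands in for full JSON Schema validation.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())


def cites_section(output: str) -> bool:
    """Pass when the output contains a section reference such as 'Section 4.2'."""
    return SECTION.search(output) is not None


def fits_sms(output: str, limit: int = 280) -> bool:
    """Pass when the output fits the SMS length limit."""
    return len(output) <= limit


# Hypothetical refund email and retrieved context.
email = "Refund of $19.99 issued for order A-1042, per Section 4.2."
context = "order_id=A-1042 charged $19.99"
verdicts = {
    "order_number": must_include(email, "A-1042"),
    "no_fabricated_amounts": no_fabricated_amounts(email, context),
    "cites_section": cites_section(email),
    "sms_length": fits_sms(email),
}
print(verdicts)
```

None of these functions takes a model or vendor name as an argument, which is the property the table is arguing for: the same checks grade any model's output.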
The shape we ship
Here is the literal rubric.yaml from a refund-email agent we shipped. Five clauses, all behavior-keyed, weighted to a single pass criterion. Three previously-shipped vendor-branded clauses are listed in the deletion log at the bottom; we cut them during a 2026-02 audit and the agent's regression rate went down, not up. The model_primary line is not in this file because it is not allowed in this file.
“When Sonnet 4-class shipped, the entire model_primary swap was a one-line PR in agent.yaml. rubric.yaml did not change. eval/cases.yaml gained six new live rows from the Friday triage that week, none of them required by the swap. CI ran the same eval against the new candidate, the per-clause floors held, and 400K+ outbound emails went out under the new model_primary inside the next five days.”
Upstate Remedial Management, April 2026 model_primary swap
What the swap PR actually looks like
People say 'model swap' the way they say 'database migration' and assume the same blast radius. With a behavior-keyed rubric, it is one line. Here is the diff from a real swap: same client, same rubric, new model_primary. The CI scorecard before and after is included so you can see the per-clause numbers move (or not move).
What happens when the correction hits, day by day
The 'AI bubble survival' question sounds abstract until a concrete trigger fires. Here is the timeline we have run on every engagement that has had a vendor event since 2024.
T+0. The trigger.
Vendor announces model_primary deprecation in 90 days, raises per-million-token pricing 2.4x, gets acquihired by a hyperscaler with a different SLA, or the customer's CFO cuts AI budget 40 percent for the next quarter. Whichever shape the correction takes, the question is the same: does your agent keep shipping?
T+1 day. The audit.
Open rubric.yaml. Search for model and vendor names (claude, gpt, gemini, llama, anthropic, openai). Search for vendor-branded traits (agentic, reasoning_quality, tool_use_score, helpfulness). Every match is a bubble-fragile clause. The number of matches predicts the size of your migration.
T+2 days. The swap PR.
If the rubric is behavior-keyed, the swap is a one-line change to model_primary in agent.yaml. CI runs the same eval/cases.yaml against the new candidate. The scorecard either passes the per-clause floor or it does not. If it does not, the new model is not a viable swap and you try the next one. The rubric is the constant.
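The pass-or-fail decision described above can be sketched as a small gate over the scorecard. This is an illustration under stated assumptions: the Clause fields, the weights, the floors, and the sample results are hypothetical, and the real CI harness may compute its scorecard differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    name: str
    weight: float  # share of the weighted average (illustrative values)
    floor: float   # minimum pass rate across the case set (illustrative)


def scorecard(results: dict) -> dict:
    """Per-clause pass rate, given clause name -> list of per-case booleans."""
    return {name: sum(verdicts) / len(verdicts) for name, verdicts in results.items()}


def swap_passes(clauses: list, results: dict, average_floor: float):
    """Return (ok, failing_clauses, weighted_average) for a candidate model."""
    rates = scorecard(results)
    failing = [c.name for c in clauses if rates[c.name] < c.floor]
    total_weight = sum(c.weight for c in clauses)
    weighted = sum(c.weight * rates[c.name] for c in clauses) / total_weight
    ok = (not failing) and weighted >= average_floor
    return ok, failing, weighted


# Hypothetical candidate run over 20 cases.
clauses = [
    Clause("order_number", weight=0.4, floor=0.95),
    Clause("no_fabricated_amounts", weight=0.4, floor=0.95),
    Clause("sms_length", weight=0.2, floor=0.90),
]
results = {
    "order_number": [True] * 20,
    "no_fabricated_amounts": [True] * 17 + [False] * 3,
    "sms_length": [True] * 20,
}
ok, failing, weighted = swap_passes(clauses, results, average_floor=0.90)
print(ok, failing, round(weighted, 2))
```

In this made-up run the weighted average clears its floor while one clause sits below its own floor, so the candidate is rejected. That is the case per-clause floors exist to catch.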
T+3 days. The deploy.
Merge the swap PR. Clock 2 (the live judge against the same rubric.yaml) runs against production traffic for 24 hours. The 7-day baseline drift signal will fire if the new model regressed on inputs the case set never anticipated. If it does not fire, the swap is done. We have shipped this on Monetizy.ai (8K emails per day within 1 week of the swap) and Upstate Remedial Management (400K+ emails sent under a new model_primary in 5 days).
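One way to picture the drift signal is as a per-clause comparison between the 7-day baseline pass rate and the post-swap window. The sketch below assumes drift means a drop in pass rate measured in percentage points, and borrows the 5-point threshold from the two-clock guide's summary; the function name and sample rates are hypothetical, and the production signal may be defined differently.

```python
def drifted_clauses(baseline: dict, current: dict, threshold_points: float = 5.0) -> list:
    """Return clauses whose pass rate fell by more than threshold_points.

    baseline and current map clause name -> pass rate in [0, 1].
    A clause missing from current is treated as a pass rate of 0.
    """
    return sorted(
        name
        for name, base_rate in baseline.items()
        if (base_rate - current.get(name, 0.0)) * 100 > threshold_points
    )


# Hypothetical rates: 7-day baseline versus the first 24 hours after the swap.
baseline = {"order_number": 0.98, "no_fabricated_amounts": 0.97}
post_swap = {"order_number": 0.97, "no_fabricated_amounts": 0.90}
print(drifted_clauses(baseline, post_swap))
```

An empty list corresponds to the "does not fire" branch above; any entry names the clause to investigate before the swap is called done.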
T+30 days. The leave-behind.
rubric.yaml, eval/cases.yaml, eval/judge.yaml, scripts/friday_triage.py, ops/model_swap.md, ops/drift_runbook.md. All in the client repo. Owned by the named senior engineer (us during the engagement, the client's team after handoff). No platform license, no vendor-attached runtime. The 'survival' property is operational, not aspirational.
The audit you can run on your repo this afternoon
Eight checks, each grep-able. If any check fails, the failing clause is on the rot list. We run this on every engagement we inherit; the median client comes in with three to five fragile clauses and a model_primary string copied into rubric.yaml that should not be there.
The eight survival criteria
- Every rubric clause references a token, regex, schema, length, or class label that lives in the input or the literal output, not in a vendor's marketing taxonomy
- model_primary is a one-line string in agent.yaml; rubric.yaml never names the model
- model_primary in agent.yaml has been changed at least once without a rubric change
- The judge model in eval/judge.yaml comes from a different vendor than model_primary, or has been swapped at least once
- eval/cases.yaml contains at least one live-YYYY-MM-DD-NNN row lifted from production traffic, not a synthetic prompt
- Every must_include / must_not_include clause has an inverse case in eval/cases.yaml that asserts the clause itself is calibrated
- The runbook ops/model_swap.md exists and names the rollback commit if the next model_primary regresses on the same rubric
- There is no SaaS dashboard your rubric depends on; rubric.yaml + judge.yaml + cases.yaml live in the client repo and run on the client's CI
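Two of the checks above concern eval/cases.yaml and can be automated once the cases are loaded. The sketch below assumes a hypothetical case shape (an id plus an expect mapping of clause name to the expected verdict) and reads "inverse case" as a row where the clause is expected to fail; neither the field names nor that reading are confirmed by the rubric files themselves.

```python
import re

# Matches ids of the form live-YYYY-MM-DD-NNN.
LIVE_ID = re.compile(r"^live-\d{4}-\d{2}-\d{2}-\d{3}$")


def has_live_row(cases: list) -> bool:
    """True when at least one case id follows the live-YYYY-MM-DD-NNN shape."""
    return any(LIVE_ID.match(case["id"]) for case in cases)


def clauses_missing_inverse(clause_names: list, cases: list) -> list:
    """Return clauses that lack either an expected-pass or an expected-fail row."""
    seen = {name: set() for name in clause_names}
    for case in cases:
        for name, expected in case["expect"].items():
            if name in seen:
                seen[name].add(expected)
    return sorted(name for name, verdicts in seen.items() if verdicts != {True, False})


# Hypothetical case rows, as they might look after loading eval/cases.yaml.
cases = [
    {"id": "live-2026-04-03-001", "expect": {"order_number": True}},
    {"id": "synthetic-007", "expect": {"order_number": False, "sms_length": True}},
]
print(has_live_row(cases))
print(clauses_missing_inverse(["order_number", "sms_length"], cases))
```

A clause that only ever appears with an expected pass cannot show that the check is able to fail, which is why it lands on the missing list.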
Clauses that fail the audit
Any clause matching one of these patterns is bubble-fragile. Either rewrite it as a behavior-keyed clause grounded in the input or output, or delete it. The list is not exhaustive; it is the seven shapes we see most often when we open a client rubric for the first time.
The rot list
- rubric_trait: agentic_score (depends on a vendor's current definition of 'agentic')
- rubric_trait: reasoning_quality (depends on judge model having the same internal definition as the vendor's marketing site)
- rubric_trait: tool_use_score (depends on a fixed tool API that the vendor will rev when they ship Tools v2)
- rubric_trait: helpfulness (subjective, drifts with judge model upgrade, has no inverse case)
- rubric_trait: 'uses Claude / GPT / Gemini reasoning' (the rubric becomes false the morning the runtime swaps)
- Pass criterion: 'judge score above 0.8' with no per-clause floors (one strong clause carries five fragile ones)
- Pass criterion stored in a vendor's eval-platform UI rather than rubric.yaml in the repo
What survival looks like in numbers
Four numbers from the engagements running this rubric shape today. None are vendor benchmarks; all four are properties of the rubric.yaml file itself.
Why most rubrics fail the survival test
The same forces that produce vendor-branded rubrics produce vendor-branded careers. An eval engineer at a typical AI team gets handed a vendor's eval platform on day one. The platform ships with a built-in trait taxonomy: agentic, reasoning, tool-use, helpfulness. The engineer wires those traits into dashboards, those dashboards become a quarterly review artifact, and the trait names ossify into the team's vocabulary. When the vendor renames 'agentic' to 'agent quality v2' the next quarter, the dashboards break, the quarterly review still has the old chart, and the team spends two sprints rebuilding what they had. Repeat every 9 months.
A behavior-keyed rubric is the opposite work shape. The engineer writes a clause once, in plain language, grounded in what a customer or a downstream system can verify. The clause survives every vendor rebrand because it does not depend on the vendor's vocabulary at all. The only thing that changes when a vendor pivots is the model_primary line in agent.yaml. That is the work we ship.
The bubble-survival framing is not anti-vendor; it is anti-coupling. Use whichever vendor wins on price and capability for your workload. Pin the rubric to your customer's behavior, not to the vendor's marketing site.
The 60-second self-test
Open your rubric.yaml right now. Run the following greps. If any of them return a non-empty result, you have at least one bubble-fragile clause. The number of matches predicts the size of your next migration.
grep -iE 'claude|gpt|gemini|llama|anthropic|openai|bedrock|vertex' rubric.yaml
grep -iE 'agentic|reasoning_quality|tool_use_score|helpfulness' rubric.yaml
grep -E 'score:.*0\.[0-9]+' rubric.yaml | wc -l   # count of subjectively scored clauses; divide by the total clause count for the ratio
Zero matches on the first two greps and a low ratio on the third is the survival shape. Anything else is a sign your rubric is pinned to the vendor cycle, not to the customer.
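The same self-test can be run in Python when you want line numbers and the actual ratio rather than a raw count. This is a sketch: the three search patterns are the ones from the greps, but the rule for counting clauses (a list item that starts with kind: or rubric_trait:) is an assumption about how rubric.yaml is laid out, and the sample text is made up.

```python
import re

VENDOR_NAMES = re.compile(r"claude|gpt|gemini|llama|anthropic|openai|bedrock|vertex", re.I)
VENDOR_TRAITS = re.compile(r"agentic|reasoning_quality|tool_use_score|helpfulness", re.I)
SCORED = re.compile(r"score:.*0\.[0-9]+")
# Assumption: each clause is a YAML list item opening with one of these keys.
CLAUSE_START = re.compile(r"^\s*-\s*(kind|rubric_trait):")


def audit(text: str) -> dict:
    """Report fragile lines (1-indexed) and the subjective-clause ratio."""
    lines = text.splitlines()
    scored = sum(1 for line in lines if SCORED.search(line))
    clauses = sum(1 for line in lines if CLAUSE_START.match(line))
    return {
        "vendor_name_lines": [i for i, l in enumerate(lines, 1) if VENDOR_NAMES.search(l)],
        "vendor_trait_lines": [i for i, l in enumerate(lines, 1) if VENDOR_TRAITS.search(l)],
        "subjective_ratio": scored / clauses if clauses else 0.0,
    }


# Hypothetical rubric text; in practice read it with open("rubric.yaml").read().
sample = "\n".join([
    "clauses:",
    "  - kind: must_include",
    "    spec: order_number",
    "  - rubric_trait: helpfulness",
    "    score: 0.8",
])
report = audit(sample)
print(report)
```

Like grep, the patterns match substrings, so a hit is a prompt to read the line, not proof that the clause is fragile.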
Run the rubric audit on your repo with a senior engineer
30 minute call. We open rubric.yaml, run the eight checks live, and tag every clause as keep, rewrite, or delete. You leave with a written one-pager and the tagged clauses.
Frequently asked questions about rubric survival
What does 'production AI agents eval rubric survival' actually mean?
It means: when something about the AI market changes that you do not control (vendor pricing, model deprecation, runtime acquihire, judge model upgrade, customer AI budget cut), the rubric you grade your agent against keeps grading and your agent keeps shipping. The opposite is a rubric that has to be rewritten every time a vendor rebrands, which is what most teams have. The shape that survives keys every clause to user-observable behavior in the input or the output: a token (must_include order_number), a regex (must_not_include fabricated dollar amount), a schema (output parses against schemas/refund_action_v3.json), a length (under 280 chars), or a discrete class label (escalate vs refuse vs answer). The shape that does not survive keys clauses to vendor taxonomy: agentic_score, reasoning_quality, tool_use_score, helpfulness. When the vendor pivots, those clauses lose their referent and the rubric stops grading meaningfully.
What is a 'bubble-fragile' clause? How do I find them in my rubric?
Open rubric.yaml. Search for any of the following strings: claude, gpt, gemini, llama, anthropic, openai, agentic, reasoning, tool_use, helpfulness, structured_output_quality, hallucination_score, response_quality. Every match is a candidate for fragility. A clause is bubble-fragile if it (a) names a vendor or a model snapshot, (b) names a vendor's product taxonomy that the vendor controls, or (c) is graded by a 'how good is the answer' subjective rubric with no inverse case in eval/cases.yaml that asserts the clause itself is calibrated. Each fragile clause is one clause you will need to rewrite the next time the vendor changes the names. Five fragile clauses is a quarterly migration. Twenty is a permanent stall.
Aren't subjective traits like reasoning quality the whole point of LLM-as-judge?
They are not the whole point; they are one tool, and most teams overuse them. Behavior-keyed clauses (must_include, must_not_include, format, expected_class) carry the rubric. Subjective traits are useful when they have an inverse case in eval/cases.yaml: a row with the trait at high score and a row with the trait at low score, both hand-labeled, both used to calibrate the judge every 30 days. Without the calibration cases, the subjective trait drifts on judge-model upgrade and you cannot tell whether the agent regressed or the judge re-learned. We keep at most one or two subjective traits per rubric and we always pair them with calibration cases. The behavior-keyed clauses are 70 to 80 percent of the weighted average and they do not require calibration to keep grading.
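Calibrating a subjective trait against hand-labeled rows comes down to measuring how often the judge agrees with the human beyond chance. The sketch below uses Cohen's kappa as that measure; choosing kappa here is an assumption prompted by the kappa threshold quoted in the adjacent model-swap guide, and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(hand: list, judge: list) -> float:
    """Chance-corrected agreement between hand labels and judge labels."""
    if len(hand) != len(judge) or not hand:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(hand)
    observed = sum(h == j for h, j in zip(hand, judge)) / n
    hand_freq, judge_freq = Counter(hand), Counter(judge)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(hand_freq[label] * judge_freq[label] for label in hand_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration rows: ten hand labels and the judge's verdicts.
hand_labels = ["pass"] * 5 + ["fail"] * 5
judge_labels = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
kappa = cohens_kappa(hand_labels, judge_labels)
print(round(kappa, 2))
```

Recomputing this after a judge-model upgrade is one way to tell whether the judge moved or the agent did: if kappa on the fixed hand-labeled rows drops, the judge changed.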
What does a vendor-neutral rubric.yaml look like in practice?
It contains zero references to a model name. It contains zero references to a vendor's product taxonomy. Every clause has a kind in {must_include, must_not_include, must_not_include_pattern, format, expected_class}, a spec keyed to the input or the output (a token, a regex, a JSON schema, a length, a class label), and a per-clause weight. Pass criteria are per-clause floors plus a weighted average floor, not a single average. The model_primary string lives in agent.yaml, one line, swappable. The judge model lives in eval/judge.yaml, one line, swappable. rubric.yaml itself is hundreds of lines and does not name any of them. The full shape is in the snippet above.
How do I migrate a vendor-branded rubric to a behavior-keyed one without losing what I already have?
Run the audit (T+1 day in the timeline above) and tag every clause as keep, rewrite, or delete. For a 'reasoning_quality' clause, find the 5 to 10 cases where you actually cared about reasoning and look at what you wanted the output to literally do. Did it have to cite a section number? Did it have to break a calculation into steps? Did it have to refuse instead of guess? Each of those is a behavior-keyed rewrite. Cut subjective clauses that do not have an inverse case. The migration takes one to two weeks for a typical 15-clause rubric, and the side effect is that the new rubric grades more deterministically than the old one. We have run this migration on three of the five named engagements; the rubric got shorter and the regression rate went down.
How does this connect to the two-clock production evals system?
rubric.yaml is the file both clocks share. Clock 1 (the offline harness in CI) reads it on every PR. Clock 2 (the live judge sampling production traffic) reads the same file on every cron tick. If the rubric is behavior-keyed, both clocks survive a model swap and a judge model swap with no rubric change. If the rubric is vendor-branded, both clocks break together when the vendor pivots, which is the worst case because the offline harness goes silent at the same moment production traffic stops being graded. We covered the two-clock mechanics in /t/production-ai-agent-evals; this guide is about the file content that both clocks read.
What about high-stakes domains like legal, medical, or financial agents?
Behavior-keyed rubrics are stricter for high-stakes domains, not looser. A legal-tech rubric we ship has clauses like 'must_include_pattern: Section \d+', 'must_not_include_pattern: \$\d+(\.\d{2})? unless allowed_if_present_in retrieved_context', 'expected_class: escalate when input.intent in [contract_drafting, fee_dispute]'. Each clause is a literal check against the output that a paralegal can verify. The high-stakes shape adds 100 percent always_judge sampling on rows that include numeric values or named persons, which is a Clock 2 setting. The rubric.yaml shape is unchanged.
What survives if the customer cuts AI budget by 40 percent next quarter?
The rubric, the case set, the runbook, the eval harness, the judge worker, and the named senior engineer's continuity. We size every engagement so that the 6-week handoff produces a system the client's team can operate on a 0.5-engineer-month maintenance budget: a Friday triage script that costs 30 minutes, a judge worker that costs ~$24 per agent per month at 8K calls per day, and a rubric that does not require a platform subscription. If the customer cuts budget further, the agent runs on a cheaper model_primary with a one-line swap; the rubric grades the new model and either passes the per-clause floors or does not. There is no platform invoice that survives the budget cut and no vendor-attached runtime that disappears with it.
Where does fde10x sit in this picture?
We are the studio that embeds named senior ML engineers in your repo for 2 to 6 weeks and ships the rubric.yaml, eval/cases.yaml, eval/judge.yaml, scripts/friday_triage.py, and ops/model_swap.md as a leave-behind. The 'survival' property is structural to how we engage: we do not sell a platform subscription, we do not run a vendor-attached runtime, and the model_primary line is yours to swap forever after we leave. The differentiator is that the IP stays with the client when the engagement ends, which means the survival property is not contingent on us. If you want the rubric audit run on your existing setup, the call link below is the fastest path.
What if I want to keep a vendor-branded rubric anyway because that's what my eval platform reports on?
Keep it as the public dashboard if a stakeholder asks for 'reasoning quality up and to the right'. But the rubric that gates the merge in CI and pages on-call from production drift cannot be the dashboard rubric. Those are the two clocks; both clocks must read a behavior-keyed rubric.yaml that the client repo owns, not a vendor's eval platform. The dashboard becomes a derived view, computed from the same judge_scores rows the behavior-keyed rubric writes. We have shipped that split where stakeholders required it; the dashboard runs on the side, the load-bearing rubric stays in the repo.
Adjacent guides
Production AI agent evals: the two-clock system
rubric.yaml is the file both clocks read. Pre-merge harness in CI plus post-deploy live judge with a 5-point drift threshold.
Agent eval set, model swap, trust
Two-number trust card you publish before any model swap: kappa over 0.85, eval-to-production Spearman over 0.7.
Active recall eval harness for AI agents
Why the offline harness only catches what it has already seen, and how the Friday triage loop closes that.