Guide, topic: production ready llm releases, enterprise
A new LLM release is a three-file pull request, not a six-week review.
When GPT-5, Claude 4.7 Opus, or Gemini 3 ships, the teams that already have a PIAS leave-behind in their repo qualify it the same afternoon. Flip one line in rubric.yaml. Push the branch. Let .github/workflows/eval.yml score the release against the case set the agent was accepted under in week 2. A scorecard posts to the PR thread. If it clears, the old primary drops to fallback and the engagement with the new model begins. No evaluation platform license. No vendor runtime. Five production agents run on this shape today.
Model-vendor neutral. Your keys, your CI, your cloud. No PIAS-hosted runtime.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
The frozen pipeline problem (and why most guides skip it)
Most writing on production-ready LLM releases for the enterprise covers three things: LLM gateways, evaluation platforms, and a generic rubric for “what production-grade looks like”. That is useful reference material, but none of it answers the only question a CTO actually asks on release day: is this new model better than the one we are running, on the workload we ship?
The honest answer is that most teams cannot answer it quickly, because the thing that would answer it — a rubric tied to real traffic, sitting inside CI — was never built. A new release lands, security opens a ticket, procurement schedules a call with the evaluation platform vendor, the gateway team tests tool-calling compatibility, and by the time anyone can say “yes, merge it” the next model has shipped. The pipeline freezes by default.
The PIAS leave-behind is designed to make that question a CI job. The six weeks of the engagement are not really about shipping the agent; they are about earning the right to evaluate every future LLM release in a single afternoon, on your case set, with a scorecard the reviewer can trust.
Release day, two engagement shapes
A new LLM release lands Tuesday morning. Security opens a ticket. The evaluation platform vendor schedules a call. Someone on legal asks whether the gateway contract covers the new SKU. Six weeks of sandbox benchmarks, procurement review, and a renewed scoping doc later, the team decides whether to even pilot it.
- Security review scheduled, not started
- Benchmark suite is the vendor's, not yours
- No case set from your own traffic exists
- Verdict arrives weeks after the release
Anchor fact: the three files a qualifying PR touches
The week-6 leave-behind delivers four artifacts to main: the orchestration definition, the eval harness, the failure playbook, and the on-call runbook. Three of those four are the mechanism that qualifies every future LLM release. The fourth (orchestration) is untouched on a pure model swap — that is the discipline. If a model qualification PR starts editing agents/*.py, it is not a model qualification anymore.
Three files, one scorecard, one decision.
- rubric.yaml. One-line change: model_primary. Thresholds (rubric_min_score 0.82, ragas 0.78, max_per_case_regression 3) are unchanged across releases because they encode what correct means for your workload, not what a given model can do.
- .github/workflows/eval.yml. Not edited by the qualification PR. Triggered by it. Runs the case rubric, runs ragas, runs the per-case regression check, posts the scorecard. Shipped on day 42 and never edited again unless the case shape itself changes.
- ops/failure_playbook.md. Updated in the same PR on a pass: old primary promoted to fallback, rate limits preserved, scorecard URL recorded. Your on-call engineer at 3am on a Saturday reads this file, not a dashboard.
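The three-file boundary described above can be checked mechanically. The following is a minimal sketch, assuming only the file paths named in this guide; the function names are hypothetical and this check is not claimed to be part of the shipped leave-behind.

```python
from fnmatch import fnmatch

# Files a pure model-qualification PR is expected to edit (paths from this guide).
EXPECTED_EDITS = {"rubric.yaml", "ops/failure_playbook.md"}
# Orchestration code that should stay untouched on a pure model swap.
ORCHESTRATION_GLOB = "agents/*.py"


def touches_orchestration(changed_files):
    """Return the changed paths that fall under the orchestration glob."""
    # fnmatch's "*" also matches "/", so nested agent modules are caught too.
    return sorted(p for p in changed_files if fnmatch(p, ORCHESTRATION_GLOB))


def is_pure_model_swap(changed_files):
    """True when every changed path is one of the expected qualification edits."""
    return set(changed_files) <= EXPECTED_EDITS
```

A reviewer could run a check like this against the PR's changed-file list and split the PR whenever `touches_orchestration` returns anything.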
rubric.yaml, with the only line a qualification PR changes
This is the exact shape of rubric.yaml that lands on main in the week-6 leave-behind. Thresholds were captured on the week-0 scoping call with your product lead. Cases came from your real traffic plus a short adversarial set the engineer wrote with you. On release day, one line changes.
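To make the "thresholds stay, model changes" idea concrete, here is a minimal Python sketch of the values this guide names. It is an illustration under stated assumptions, not the shipped file: the class name, the `ragas_min` field name, and the `clears` helper are hypothetical, and the real rubric.yaml may carry more keys.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rubric:
    # Only this field is expected to change on a qualification PR.
    model_primary: str
    # Threshold values quoted in this guide; they encode the workload, not the model.
    rubric_min_score: float = 0.82
    ragas_min: float = 0.78  # hypothetical key name for the ragas minimum
    max_per_case_regression: int = 3

    def clears(self, rubric_score, ragas_score):
        """True when both aggregate scores meet their minimums."""
        return rubric_score >= self.rubric_min_score and ragas_score >= self.ragas_min


current = Rubric(model_primary="claude-4-6-sonnet")
# The release-day change: swap the model, leave every threshold untouched.
candidate = replace(current, model_primary="claude-4-7-opus")
```

Because the dataclass is frozen and `replace` copies every other field, the candidate is scored against exactly the thresholds the current primary was accepted under.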
eval.yml, the harness that runs every future release for you
The workflow ships on day 42 and keeps running after the engagement ends. It is triggered by any PR that changes the agent, the eval set, rubric.yaml, or the failure playbook. On a model-qualification PR, the interesting steps are the case rubric and the per-case regression check; ragas is the guardrail.
What a real qualification PR looks like end-to-end
A literal session from a recent claude-4-7-opus qualification run. No screenshots, no dashboards, no vendor consoles. Git and GitHub Actions do the whole thing. The scorecard is a comment on the PR.
The ops/failure_playbook.md diff, in the same PR
This is the second file the qualification PR touches. It is not an afterthought. The playbook is what keeps the on-call runbook true after a model swap. If the new primary fails in production a month later, the file that actually gets opened at 3am has to name the fallback correctly. We put that update inside the qualifying PR, not a follow-up.
Six steps, plain order
The qualification run, as written on the engineer's checklist. This is the order we teach your team during the week-6 handoff so the next release does not require us.
Step 0: confirm the release supports the interfaces your agent uses
Read the release notes for the three things your orchestration actually depends on: tool-use schema, structured-output mode, and streaming token shape. If any of those changed, the work is not a one-PR swap, it is a shim; stop here and schedule a scoping call. In practice, most 4.x to 4.y bumps are clean swaps; most x to x+1 major bumps are not.
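Step 0 reduces to a set intersection. A minimal sketch, assuming the engineer has already turned the release notes into a list of changed interface names by hand; the identifiers below are hypothetical labels for the three dependencies named above.

```python
# The three interfaces this guide says the orchestration depends on
# (identifier spellings are illustrative).
REQUIRED_INTERFACES = {"tool_use_schema", "structured_output", "streaming_token_shape"}


def swap_kind(changed_in_release):
    """Classify a release as a clean swap or a shim, given what its notes say changed."""
    broken = REQUIRED_INTERFACES & set(changed_in_release)
    if broken:
        # Any overlap means the work is no longer a one-PR model swap.
        return "shim", sorted(broken)
    return "clean-swap", []
```

Anything the release changed outside those three interfaces is ignored here, which matches the guide's point that only the agent's actual dependencies matter.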
Step 1: open a branch named model/<provider>-<version>
Branch off main. The branch name is load-bearing because the eval harness tags the scorecard with the branch, so the artifact that lands in the PR thread is searchable from day one.
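Since the branch name is what the scorecard is tagged with, it can help to validate it before pushing. A small sketch, assuming lowercase provider names without hyphens; the exact character rules are an assumption, as the guide only specifies the `model/<provider>-<version>` pattern.

```python
import re

# model/<provider>-<version>; the provider is taken up to the first hyphen.
BRANCH_RE = re.compile(r"^model/(?P<provider>[a-z0-9]+)-(?P<version>[a-z0-9][a-z0-9.\-]*)$")


def parse_branch(name):
    """Return (provider, version) for a qualification branch, or None if it does not match."""
    match = BRANCH_RE.match(name)
    if match is None:
        return None
    return match["provider"], match["version"]
```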
Step 2: change rubric.yaml model_primary, commit
One-line change: model_primary: claude-4-6-sonnet to model_primary: claude-4-7-opus. Commit message includes the release date and a link to the provider changelog so a future engineer can reproduce the decision later.
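The one-line edit can also be scripted and verified. A minimal sketch that treats rubric.yaml as plain text and assumes `model_primary` appears exactly once as a top-level key; the helper names are hypothetical.

```python
import re

MODEL_LINE = re.compile(r"^(model_primary:\s*)(\S+)", flags=re.MULTILINE)


def set_model_primary(rubric_text, new_model):
    """Rewrite the model_primary value, refusing to act unless exactly one line matches."""
    updated, count = MODEL_LINE.subn(lambda m: m.group(1) + new_model, rubric_text)
    if count != 1:
        raise ValueError(f"expected exactly one model_primary line, found {count}")
    return updated


def changed_lines(old_text, new_text):
    """Pair up the lines that differ between two texts with the same line count."""
    pairs = zip(old_text.splitlines(), new_text.splitlines())
    return [(old, new) for old, new in pairs if old != new]
```

Checking that `changed_lines` returns a single pair is a cheap way to confirm the commit really is a one-line change before pushing.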
Step 3: push the branch, let GitHub Actions run eval.yml
The workflow already exists on main because it was in the week-6 leave-behind. It runs python -m eval.rubric against eval/cases.yaml, then ragas faithfulness and answer_relevancy, then posts a scorecard comment to the PR. No human runs anything. No dashboard gets opened.
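For readers who have not seen a scorecard comment, the sketch below shows one way such a comment body could be rendered. The layout is an assumption, not the format the harness actually posts, and it collapses the two ragas metrics into a single score for brevity.

```python
def render_scorecard(branch, rubric_score, ragas_score, regressed_cases,
                     rubric_min=0.82, ragas_min=0.78):
    """Build a plain-text scorecard body suitable for a PR comment."""

    def status(score, minimum):
        return "PASS" if score >= minimum else "FAIL"

    lines = [
        f"Eval scorecard for {branch}",
        "",
        "| Metric | Score | Minimum | Status |",
        "|---|---|---|---|",
        f"| case rubric | {rubric_score:.2f} | {rubric_min:.2f} | {status(rubric_score, rubric_min)} |",
        f"| ragas | {ragas_score:.2f} | {ragas_min:.2f} | {status(ragas_score, ragas_min)} |",
        "",
    ]
    if regressed_cases:
        # Name the regressed cases so the reviewer does not have to dig through logs.
        lines.append("Regressed cases: " + ", ".join(sorted(regressed_cases)))
    else:
        lines.append("Regressed cases: none")
    return "\n".join(lines)
```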
Step 4: read the scorecard, decide
Three outcomes. (a) All thresholds clear and no case regressed by more than 3 points: merge after review. (b) Thresholds clear but one case regressed within that 3-point limit: add it to the per-case allowlist with a justification and request a second reviewer. (c) Any threshold fails: the PR cannot merge, and the branch stays as a record that this release was evaluated and did not pass on this workload.
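The three outcomes can be written as a small decision function. This is a sketch of one reading of the rule, assuming per-case drops are expressed in the same "points" as the 3-point limit and that a drop above the limit counts as a failed threshold; the enum and function names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    MERGE_AFTER_REVIEW = "a"
    ALLOWLIST_AND_SECOND_REVIEWER = "b"
    BLOCKED = "c"


def decide(rubric_score, ragas_score, per_case_drops,
           rubric_min=0.82, ragas_min=0.78, max_drop=3):
    """Map scorecard numbers to outcome (a), (b), or (c).

    per_case_drops maps case id -> points lost versus the previous model_primary.
    """
    if rubric_score < rubric_min or ragas_score < ragas_min:
        return Outcome.BLOCKED
    worst = max(per_case_drops.values(), default=0)
    if worst > max_drop:
        # A drop beyond the limit fails the build like any other threshold.
        return Outcome.BLOCKED
    if worst > 0:
        # Tolerated only if documented: allowlist entry plus second reviewer.
        return Outcome.ALLOWLIST_AND_SECOND_REVIEWER
    return Outcome.MERGE_AFTER_REVIEW
```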
Step 5: on a pass, update ops/failure_playbook.md in the same PR
Promote the old model to fallback in the failure playbook, keep its rate-limit and timeout settings, record the scorecard URL and date. Now the on-call runbook is still truthful at 3am on a Saturday. This is the step most teams skip; it is why a second regression a month later turns into a two-day incident.
The same PR shape on five live production agents
Each card is a named production agent with the qualification shape already on main. Stacks, fallbacks, and case sets differ; the three-file PR does not.
Monetizy.ai: Pydantic AI, Anthropic + OpenAI dual-route
Around 8K emails per day. When Claude 4.6 shipped, the qualification PR flipped model_primary and the eval job ran against the same deliverability-scored rubric used in week 2. Merged the same day.
Upstate Remedial: LangGraph, Bedrock primary + OpenAI fallback
More than 400K legal-compliance emails. New-model qualification is stricter here because the rubric includes deterministic compliance checks that must not regress. The PR shape is the same; the case set has more adversarial rows.
OpenLaw: Anthropic + citation-verification subagent
Publicly released AI-native law editor. The citation-verification subagent has its own rubric in eval/cases.yaml, so model swaps are evaluated twice: once for the drafter, once for the verifier. Scorecard splits the two.
PriceFox: automated eval CI on every release
Multi-tenant retrieval agent. Nightly canary rollouts mean new-model qualification does not wait for a human; a provider release triggers a scheduled PR that runs the rubric on the last 24 hours of real traffic. The canary takes 10 percent for four hours, then the scorecard gates promotion.
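The canary rule (10 percent of traffic for four hours, then the scorecard gates promotion) can be sketched generically. This is not PriceFox's implementation; the hash-based bucketing and both function names are assumptions chosen to illustrate the idea.

```python
import hashlib
from datetime import datetime, timedelta


def in_canary(request_id, percent=10):
    """Deterministically route roughly `percent` of request ids to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def may_promote(canary_started, now, scorecard_passed, window=timedelta(hours=4)):
    """Promotion needs both a full canary window and a passing scorecard."""
    return scorecard_passed and (now - canary_started) >= window
```

Hashing the request id rather than sampling randomly keeps a given request on the same side of the split if it is retried, which makes canary traces easier to compare.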
OpenArt: custom DAG, multi-model inference
Commercial video auto-generation, per-scene quality gates. Because the DAG already routes per scene, new models can be qualified on a single node class (for example, prompt-repair) without touching the rest of the tree. The eval harness scopes by node id.
The wiring: provider release, your repo, your scorecard
The hub is the rubric. Upstream, provider releases. Downstream, the three files the PR writes and the scorecard it posts. Nothing vendor-hosted in between, on purpose.
Provider release → rubric.yaml → scorecard on your PR
What the common playbook gets wrong
Right column: what the PIAS leave-behind commits to on day 42. Left column: the shape most teams default to when a release lands with no harness already in their repo. The rows are not a caricature; they are the difference between merging by Tuesday afternoon and scheduling a follow-up call for next quarter.
| Feature | No harness in the repo | PIAS leave-behind |
|---|---|---|
| Time from release to a yes/no verdict on your workload | Weeks of sandbox benchmarks, a security re-review, and a renewed scoping doc | A few hours: one PR, one CI run, one scorecard on real case slices |
| Where the qualification job lives | In a vendor sandbox or a gateway dashboard you log into with SSO | In your repo, on main, as .github/workflows/eval.yml, keyed to the same thresholds the production agent was accepted under |
| What the rubric is scored against | A general benchmark suite chosen by the platform vendor, occasionally refreshed | The case set your week-2 gate was scored against, stored in eval/cases.yaml, with rubric_min_score 0.82 and ragas 0.78 |
| What happens to the previous model on a pass | Disabled or hidden in the gateway console, nothing written to a playbook your on-call can read at 3am | Gets promoted to fallback in ops/failure_playbook.md in the same PR, keeps getting traces on canary rollouts |
| What happens on a regression | Manual rollback on the gateway, post-mortem doc drafted later, unclear which cases regressed | Build fails on rubric delta, the PR cannot merge, the old primary keeps serving traffic, the scorecard names the failing cases |
| Who runs the qualification job | A vendor-hosted runtime that owns the eval traces and the rate limits | Your engineer on your GitHub Actions runner, on your keys, your cloud |
| Licensing cost per new release evaluated | Per-seat evaluation platform fee, sometimes per-model-tested fee | $0 in platform license, CI minutes + API tokens only |
“Three files, one scorecard, one decision. Every new LLM release goes through the same PR shape on every one of our shipped agents. The engagement is not really six weeks long; it is six weeks to earn the right to evaluate every future release in an afternoon.”
PIAS rubric, from the day-42 leave-behind on Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt
Receipts
The numbers below are file counts and engagement-level facts, not invented benchmarks. Per-client production metrics are on /wins.
The three-file rule is the discipline that keeps the audit trail readable three releases later. A qualification PR that edits more than three files is not a qualification anymore; we split it, because $0 in platform license is only useful if the process stays clean enough to repeat.
Want every future LLM release to be a one-afternoon PR on your repo?
60-minute scoping call with the senior engineer who would own the build. You leave with a one-page week-0 memo: the rubric, the case set shape, the model-primary and fallback policy, the weekly rate.
Book the scoping call →
Qualifying new LLM releases, answered
What does "production ready LLM releases enterprise" actually mean in this rubric?
It means one and only one thing on a PIAS leave-behind: can the PR that flips model_primary in rubric.yaml clear the same thresholds the agent was accepted under in week 2, on the case set that was captured during the week-0 scoping call. The case set is eval/cases.yaml on main, the thresholds are rubric_min_score 0.82 and ragas 0.78, and the workflow is .github/workflows/eval.yml. If the PR clears it, the release is production-ready for your workload. If it does not, it is not, regardless of what the provider changelog says.
Why a PR against your own repo and not a vendor evaluation platform?
Because the case set that matters is yours, not theirs. A vendor platform benchmark can only score against cases the vendor curated. The week-0 rubric was written by the engineer with your product team, capturing the failure modes that matter for your workload: a wrong outbound is a regulatory incident on Upstate Remedial, a broken scene is a retried render on OpenArt, a miscited case is a red-team failure on OpenLaw. A platform that reports MMLU going up 2 points cannot tell you any of that. A CI job running eval/cases.yaml on your keys can.
What exactly changes in a qualifying PR? Three files, no exceptions?
One line in rubric.yaml (model_primary), one short diff in ops/failure_playbook.md to promote the old primary to fallback, and the scorecard artifact that GitHub Actions attaches to the PR. Nothing in agents/*.py changes on a pure model swap. If the orchestration code changes, it is not a model qualification, it is a feature PR with a model change bolted on, and we split it. A disciplined three-file boundary is what makes the audit trail readable three releases later.
What happens if the new release regresses on one case but clears every threshold?
The per-case regression check in eval.yml fails the build if any case drops more than 3 points versus the previous model_primary. On a smaller regression, the scorecard surfaces it and the PR requires a second reviewer plus a note in the allowlist file explaining why this case is acceptable (for example, the regressed case is a known adversarial prompt the new model is less brittle on in a different way). A tolerated regression is a documented one, with an author and a date. Untracked regressions are how small drift turns into the next incident.
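The per-case check described here amounts to a dictionary diff plus an allowlist lookup. A minimal sketch, assuming per-case scores are integers in "points" and the allowlist is a set of case ids; neither data shape is confirmed by the harness itself.

```python
def per_case_drops(baseline, candidate):
    """Points lost per case id, for cases where the candidate scored lower."""
    return {
        case_id: baseline[case_id] - candidate[case_id]
        for case_id in baseline
        if case_id in candidate and candidate[case_id] < baseline[case_id]
    }


def check_regressions(baseline, candidate, allowlist, max_drop=3):
    """Split regressions into build-failing ones and small ones still lacking an allowlist note."""
    drops = per_case_drops(baseline, candidate)
    build_failing = sorted(c for c, d in drops.items() if d > max_drop)
    undocumented = sorted(
        c for c, d in drops.items() if d <= max_drop and c not in allowlist
    )
    return build_failing, undocumented
```

An empty first list means the build can proceed; an empty second list means every tolerated regression already has an author and a note behind it.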
How is this different from using an LLM gateway like LiteLLM, Portkey, or a vendor router?
A gateway routes traffic at runtime. It does not score your model against your rubric. You still need an eval harness to decide whether to send traffic through the gateway to a new model at all, and most teams do not have one, which is why they default to "wait and see". The PIAS leave-behind inverts it: the eval harness is the gate, the gateway (if you run one) is downstream of the merged PR. We are neutral on whether you run LiteLLM, route directly to provider SDKs, or use Bedrock / Vertex routing. The qualification job is the same.
What if we already have a production agent and no rubric file? Can we still use this shape?
Yes, and a good chunk of our engagements start there. Week 0 is a read of your existing agent: named failure modes with no fallback, nodes with no eval signal, production traces with no scoring. Week 1 is the first PR that writes eval/cases.yaml from the last 30 days of real traffic plus the adversarial rows the rubric needs. Week 2 is the gate against that case set. From week 6 forward, every new LLM release is a three-file PR against the same file set. We do not require you to replace your orchestration. We require a case set we can score against.
What model vendors are actually covered? Do you pin to Anthropic or OpenAI?
We have shipped with Anthropic, OpenAI, Bedrock (Anthropic + Mistral), Vertex (Gemini), Azure OpenAI, and on OpenArt we run a multi-provider inference fleet the client operates. The five production agents use different primaries and different fallbacks. The rubric file makes model_primary a string, not an opinion, and the failure playbook names the fallback explicitly. Because the qualification job does not care which SDK the provider ships, a new release from any provider uses the same three-file PR.
How long does a qualification PR actually take to run end-to-end?
On a typical case set (80 cases plus ragas dataset), the CI job runs between 6 and 14 minutes on a standard GitHub Actions runner. The human side — branch, edit, push, read the scorecard — is a few minutes. From provider release announcement to a reviewed merge is usually the same afternoon, sometimes the same hour if the reviewer is online. On larger case sets (PriceFox runs on the last 24 hours of real traffic) the job takes longer and runs on a self-hosted runner with matrix parallelism, but the PR shape is identical.
What does a scoping call produce that makes this possible in week 6?
The scoping call produces three artifacts: the week-0 rubric memo (a one-page document naming the problem, the failure modes, the rubric thresholds, and the case set shape), a first cut of eval/cases.yaml (the 10 to 20 cases that define "correct" for your workload, written by the engineer with your product lead), and a named fallback policy (what happens when the primary is unavailable or regressed). Those three artifacts become the spine of every future LLM qualification. Every new release is scored against them. That is why the engagement is six weeks, not a quarter: the point of the six weeks is to earn the right to evaluate every future release in an afternoon.
Adjacent guides
More on the six-week rubric
AI agents in production: the 6-week contract rubric
First PR in 7 days, week-2 prototype gate, week-6 leave-behind. The rubric that produces the files this page qualifies releases against.
The 6-week FDE engagement model
How the scoping call produces the rubric, the case set, and the fallback policy. The three artifacts that make every future release a one-PR qualification.
Shipped systems, cited on the record
Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt. Named clients, production metrics, per-system stacks and fallbacks.