Guide, topic: ai pilot to production

Production is not a milestone. It is a rubric.yaml, two CI gates, and one PR.

Most writing on this topic is a narrative: the Five V's, change management, data strategy, the 88 percent failure rate. None of it names the file in your repo that decides whether a pilot is in production today. This is the file. The two required CI gates that read it. The promotion PR. The numeric threshold wired to a refund webhook. The artifacts from the last five engagements.

Matthew Diakonov
15 min read
4.9 from 5 named production agents, same rubric.yaml + two-gate shape
rubric.yaml on main in week 2, read by both required CI gates
pilot-gate.yml threshold miss fires the billing refund webhook automatically
production-gate.yml blocks the promotion PR unless 7 nightly runs are green

Shipped on Monetizy.ai (Bedrock us-east-1), Upstate Remedial (Anthropic direct), OpenLaw (ZDR + Splunk), PriceFox (Vertex multi-tenant), OpenArt (per-scene DAG, mixed model fleet).

rubric.yaml, .github/workflows/pilot-gate.yml, .github/workflows/production-gate.yml, eval/cases.yaml, runbook/<agent>.md, flags/<agent>.yaml

rubric.yaml gates section, pilot-gate.yml, production-gate.yml, rubric_min_score 0.82, per_case_regression_max 3, promote: <agent> pilot -> production, feature flag 10 -> 100, prorated_refund_and_exit, no_handoff_no_invoice, eval/cases.yaml, runbook/<agent>.md, model vendor neutral, Anthropic, OpenAI, Bedrock, Vertex

The slide that quietly kills pilots

Every guide that currently exists on this topic walks you through frameworks, change management, stakeholder alignment, and data governance. All of that matters in some abstract sense, and none of it is what actually moves a pilot into production in an enterprise repo. What moves a pilot into production is a commit.

The POC demo lands. The board sees a 94 percent accuracy slide. Somebody writes the phrase "scale to production" into next quarter's OKRs. Twelve weeks later the same agent is still a pilot. Every incremental PR adds a config knob. Nobody knows which score threshold decides production, and no one on the team can point at a file on main and say the pilot graduated on this commit. The exit ramp from pilot is a deck review, not a merge.

The mistake is letting the word "production" stay a slide. The fix is to define it, once, as a file in the repo, and to make two required CI checks read that file every week. Everything else in a PIAS engagement follows from this.

Five things a pilot without the file gets wrong

Each card is a real failure mode we have watched, on at least one engagement, before the pilot-to-production wiring landed. Each one is fixed by a specific artifact in the repo.

The word "production" is a slide, not a commit

Every stakeholder has a private definition. Sales means the agent answers demo questions. SRE means it has a runbook. Compliance means a BAA is signed. Finance means ROI is measurable. Nobody has the authority to promote the pilot because nobody can say which file flipped. The PIAS engagement writes that definition into rubric.yaml in week 2, and every stakeholder reads the same threshold.

Threshold drift between the POC eval and the production cases

The POC rubric was three cases an engineer wrote on a Tuesday. The production rubric is 40 cases that reflect the actual input distribution. Most teams keep patching the old POC rubric. We replace it in week 2 with eval/cases.yaml and make the new rubric_min_score the gate.

No owner for the promotion PR

The promotion PR is titled promote: <agent> pilot -> production. It flips one feature flag from 10 percent to 100 percent. Its diff is one config file. Its merge is gated by production-gate.yml being green. The engagement owner authors it. Most teams never write it because nobody knows what should be in it.

Refunds are words in an email, not a workflow step

The week-2 refund clause only works if it is wired to a threshold. pilot-gate.yml reads rubric.yaml, runs the eval against the staging flag, and posts a PR comment with the score. If the score is under 0.82 on the week-2 snapshot, the workflow opens an issue and the invoice is paused. No renegotiation, no meeting.

Handoff gets skipped and the pilot stays a pilot forever

Without production-gate.yml requiring a runbook/<agent>.md file, an eval/cases.yaml file with at least 30 rows, and the week-6 handoff checklist green, the feature flag never goes to 100 percent. The PR sits open. The gate is the only reason the handoff actually happens.

Anchor fact: six files, two required checks, zero dashboards

Every PIAS engagement lands the same six artifacts on main. The file names are identical across the five shipped agents, so an engineer who has worked on one can read the others cold. Nothing is hosted by us.

Production is a green check on a one-file PR.

  1. rubric.yaml. Carries the numeric thresholds, the owner field, the production flag path, and the gates section with pilot_gate and production_gate config. Single source of truth.
  2. .github/workflows/pilot-gate.yml. Required check from day 14. Reads rubric.yaml, scores eval/cases.yaml, comments the score history on the PR, fires the billing refund webhook on Monday-snapshot misses.
  3. .github/workflows/production-gate.yml. Required check from day 42. Runs only on the promotion PR. Asserts title prefix, one-file diff, trailing seven nightly runs green, runbook present, handoff checklist complete.
  4. eval/cases.yaml. Grows from 30 rows at week 2 to 40+ rows by week 6 via weekly tail-mining PRs. The corpus size floor is a gate check.
  5. runbook/<agent>.md. 300+ words, five canonical sections (alerts, oncall, rollback, eval-dashboards, owner-rotation). Read cold by three on-call engineers in week 5.
  6. flags/<agent>.yaml. The feature flag the promotion PR flips from 10 to 100. Its diff is the entire PR. Any other change in the same PR fails the diff-contract check.

rubric.yaml, the file that defines production

The exact shape we ship. The gates section is what the two workflows read. The promotion_pr section is what production-gate.yml enforces on the merge diff. When a stakeholder wants to change the bar, they open a PR to this file, and the gates update in lockstep.
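As a rough illustration of what "the two workflows read" could mean in practice, here is a minimal Python sketch that validates an already-parsed rubric mapping. The names pilot_gate, production_gate, rubric_min_score, per_case_regression_max, prorated_refund_and_exit, and no_handoff_no_invoice come from this guide. The on_fail, trailing_green_nightlies, min_eval_cases, and flag_path key names, and the example values for owner, are assumptions made for the sketch, not the shipped schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotGate:
    rubric_min_score: float        # 0.82 in this guide
    per_case_regression_max: int   # 3 in this guide
    on_fail: str                   # assumed key; value from the guide


@dataclass(frozen=True)
class ProductionGate:
    trailing_green_nightlies: int  # assumed key; 7 in this guide
    min_eval_cases: int            # assumed key; 30 in this guide
    on_fail: str                   # assumed key; value from the guide


@dataclass(frozen=True)
class Rubric:
    owner: str
    flag_path: str
    pilot_gate: PilotGate
    production_gate: ProductionGate


def load_rubric(raw: dict) -> Rubric:
    """Validate a parsed rubric mapping and return typed gate config."""
    gates = raw["gates"]
    pilot = PilotGate(**gates["pilot_gate"])
    prod = ProductionGate(**gates["production_gate"])
    if not 0.0 <= pilot.rubric_min_score <= 1.0:
        raise ValueError("rubric_min_score must be within [0, 1]")
    if not raw.get("owner"):
        raise ValueError("owner field must be set")
    return Rubric(raw["owner"], raw["flag_path"], pilot, prod)


# Hypothetical parsed content; a real gate would parse rubric.yaml first.
example = {
    "owner": "engagement-owner",
    "flag_path": "flags/<agent>.yaml",
    "gates": {
        "pilot_gate": {
            "rubric_min_score": 0.82,
            "per_case_regression_max": 3,
            "on_fail": "prorated_refund_and_exit",
        },
        "production_gate": {
            "trailing_green_nightlies": 7,
            "min_eval_cases": 30,
            "on_fail": "no_handoff_no_invoice",
        },
    },
}
rubric = load_rubric(example)
```

Keeping the validation strict (unknown keys raise, a missing owner raises) is one way to make a bad edit to the rubric fail in CI instead of silently loosening a gate.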

rubric.yaml

.github/workflows/pilot-gate.yml, the week-2 check

Three jobs. read-rubric loads the thresholds. run-eval scores the agent against eval/cases.yaml and comments on the PR. refund-signal only runs on the scheduled Monday snapshot, only when run-eval failed, and only then fires the billing webhook. This is what makes the refund clause automatic instead of a negotiation.
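The decision the three jobs encode can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow itself: the function and field names are hypothetical, and treating a regression-cap breach as a gate miss on equal footing with a score miss is our reading of the two thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    fire_refund_webhook: bool
    reasons: tuple


def evaluate_pilot_gate(score, regressions, min_score=0.82,
                        regression_max=3, is_monday_snapshot=False):
    """Return the gate outcome for one eval run.

    The refund signal fires only when the run is the scheduled Monday
    snapshot AND the eval failed; ordinary PR runs never fire it.
    """
    reasons = []
    if score < min_score:
        reasons.append(
            f"score {score:.2f} < rubric_min_score {min_score:.2f}")
    if regressions > regression_max:
        reasons.append(
            f"regressions {regressions} > per_case_regression_max {regression_max}")
    passed = not reasons
    return GateResult(
        passed=passed,
        fire_refund_webhook=is_monday_snapshot and not passed,
        reasons=tuple(reasons),
    )


# The Monday miss described later in this guide: 0.79 weighted, 5 regressions.
monday = evaluate_pilot_gate(0.79, 5, is_monday_snapshot=True)
# The same numbers on an ordinary PR run fail the check but fire nothing.
pr_run = evaluate_pilot_gate(0.79, 5, is_monday_snapshot=False)
```

A score exactly at the threshold and a regression count exactly at the cap both pass in this sketch, which matches "under 0.82" and "a cap of 3" as written.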

.github/workflows/pilot-gate.yml

.github/workflows/production-gate.yml, the week-6 check

Six independent jobs. title-contract, diff-contract, trailing-seven-nightly, runbook-present, eval-corpus-size, handoff-checklist. Each one is a specific assertion. Any failure blocks the merge and triggers the no_handoff_no_invoice escalation. The gate runs only on the promotion PR, so the cost of green is paid exactly once per agent.

.github/workflows/production-gate.yml

The promotion PR, in full

The entire diff. One file, one line changed. Everything else is a workflow check, a CODEOWNERS rule, or a comment the bot attaches for reviewers. This is the "production" event. If you cannot point at a PR like this in your repo, you do not have an agent in production, you have a pilot running at a higher rollout percentage.

PR #1043 -- promote: discharge-summary-drafter pilot -> production

The wiring: rubric in, gates middle, outcomes out

Left: the inputs the gates consume. Center: the two required CI checks. Right: the outcomes each gate produces. Every arrow is a file path or a webhook; nothing here is metaphorical.

Anatomy of the promotion: one rubric, two gates, one PR

  • Inputs (left): rubric.yaml (week 2), eval/cases.yaml, nightly cron snapshot
  • Gates (center): two required gates
  • Outcomes (right): refund webhook (on fail), promotion PR (on pass), runbook + handoff (week 6)

The two gates, not eight stages

Most pilot-to-production frameworks define 5 to 10 phases. Phases are expensive, because every phase invites a meeting. A six-week engagement cannot absorb eight meetings. So the pilot-to-production wiring has exactly two decision points: the week-2 gate that decides trajectory, and the week-6 gate that decides handoff. Everything else is a merge on green.

The two required gates, nothing between

  1. Week 2: pilot-gate.yml

    The first required check. Reads rubric.yaml, scores the pilot against eval/cases.yaml on every PR and every Monday 09:00 UTC. If the Monday snapshot misses the threshold, the billing webhook fires, the invoice pauses, and an issue opens. No meeting, no renegotiation.

  2. Week 6: production-gate.yml

    The second required check. Runs only on the promotion PR. Asserts the title prefix, the one-file diff, seven green nightly runs in a row, runbook present with the five canonical sections, eval corpus at least 30 rows, and the handoff checklist fully checked. If any of these miss, no handoff and no final invoice.
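"Seven green nightly runs in a row" is the assertion most often misread as "seven of the last N". A minimal Python sketch of the stricter reading follows. The function name is hypothetical, and the input is assumed to be a newest-last list of run conclusions such as the "success" and "failure" strings GitHub Actions reports.

```python
def trailing_runs_green(conclusions, required=7):
    """True only if the most recent `required` nightly runs all succeeded.

    `conclusions` is ordered oldest first, newest last. Fewer than
    `required` runs on record counts as not green.
    """
    if len(conclusions) < required:
        return False
    return all(c == "success" for c in conclusions[-required:])


# One failure inside the trailing window blocks the promotion PR.
history = ["failure"] + ["success"] * 7
blocked = ["success"] * 5 + ["failure"] + ["success"] * 2
```

Here `trailing_runs_green(history)` holds because the old failure has aged out of the window, while `trailing_runs_green(blocked)` does not.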

A literal Monday miss, traced

Captured from a real week-3 snapshot on a Monetizy-style engagement. The score missed by 0.03. The regression count was 5 against a cap of 3. No human clicked anything. The webhook fired, the invoice paused, the issue opened. One week later, the snapshot was green, billing resumed, and a prorated credit was applied.
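For readers who want to see how a regression count like "5 against a cap of 3" could be produced, here is a minimal Python sketch. It assumes a case regresses when its score in the current snapshot is lower than in the previous one; that definition, the function name, and the sample case ids are assumptions for illustration, not the engagement's actual scoring code.

```python
def count_regressions(previous, current, tolerance=0.0):
    """Return sorted ids of cases whose score dropped since the last snapshot.

    `previous` and `current` map case id -> score. Cases that are new in
    `current` cannot regress; cases missing from `current` are skipped.
    """
    regressed = []
    for case_id, old_score in previous.items():
        new_score = current.get(case_id)
        if new_score is not None and new_score < old_score - tolerance:
            regressed.append(case_id)
    return sorted(regressed)


# Hypothetical per-case scores from two consecutive snapshots.
last_week = {"c01": 1.0, "c02": 0.8, "c03": 0.5, "c04": 0.9}
this_week = {"c01": 0.7, "c02": 0.8, "c03": 0.6, "c04": 0.4, "c05": 0.2}
regressed = count_regressions(last_week, this_week)
over_cap = len(regressed) > 3  # compare against per_case_regression_max
```

Counting per case, not on the average, is what lets a gate catch a snapshot whose overall score held steady while individual cases got worse.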

pilot-gate.yml -- Monday snapshot miss, refund signal, recovery

The gate, the webhook, the paused invoice

The sequence that makes the week-2 refund clause operational. Every actor is a file or an endpoint in your infrastructure, not in ours. The engagement owner is pinged by the issue, not by a consultant deciding to tell you.

pilot-gate.yml fires on Monday, billing auto-pauses the invoice

Actors: engineer PR, GitHub Actions, rubric.yaml, eval/cases.yaml, billing webhook, engagement owner

  1. Monday 09:00 UTC cron tick
  2. read rubric thresholds (min_score 0.82, reg_max 3)
  3. score the 41 cases (weighted 0.79, reg 5)
  4. POST .pilot-gate/latest.json
  5. page + open refund issue
  6. invoice paused, 7d credit queued

The shift: from narrative pilot to rubric-gated pilot

Compare the shape of the same pilot before and after the wiring lands. The inputs, the team, and the agent are the same. The artifact that decides production is different.

Pilot that never graduates vs. pilot wired to gates

A POC eval with three cases, a feature flag stuck at 10 percent, and a roadmap slide that says production by Q3. Nobody owns the promotion PR. The word production drifts: one team means demo, one team means monitored, one team means the lawyers approved. The week-2 refund clause is a sentence in an MSA email. The week-6 handoff is a Zoom recording no on-call engineer has watched.

  • no file on main defines production
  • no CI check enforces the threshold
  • refund clause is not wired to a number
  • handoff is a meeting, not a merge
  • promotion PR does not exist

How this differs from the common playbook

For each feature, first the shape most teams end up with when the pilot-to-production process is a deck and a steering committee, then the PIAS two-gate wiring. Every row is a difference we have watched matter on a real engagement.

What "production" means
  • Narrative pilot-to-production: A slide in the QBR deck. Different stakeholders hold different definitions because nothing in the repo forces them to share one.
  • PIAS two-gate wiring: A feature flag flipped from 10 to 100 percent in a single PR whose merge is blocked by production-gate.yml until every numeric threshold passes.

Who decides the pilot is ready
  • Narrative pilot-to-production: A steering committee that meets every two weeks, half of whom have not touched the repo in a month, deciding on screenshots.
  • PIAS two-gate wiring: rubric.yaml does. Two required GitHub checks read it. Humans can argue, but they argue about a PR to rubric.yaml, not about the agent.

How the week-2 refund clause is enforced
  • Narrative pilot-to-production: A sentence in the MSA that requires someone to schedule a call, escalate to legal, wait for an invoice adjustment, and lose the moment.
  • PIAS two-gate wiring: pilot-gate.yml cron fires on Monday 09:00 UTC. If the threshold misses, a webhook POSTs to billing and an issue opens. The invoice is paused by Monday 09:02.

What stops a random teammate from declaring the pilot is in production
  • Narrative pilot-to-production: Usually a Slack thread. Occasionally a wiki page that gets edited after the fact. Once, in an engagement we watched, a VP declared production over lunch.
  • PIAS two-gate wiring: The feature flag diff is the only thing that matters, and production-gate.yml blocks its merge unless seven nightly runs are green in a row and the runbook is present.

How the pilot eval corpus grows
  • Narrative pilot-to-production: Frozen at whatever the POC shipped with. New failure modes from production traffic are discussed, not tested.
  • PIAS two-gate wiring: eval/cases.yaml starts at 30 rows in week 2 and grows via weekly tail-mining PRs. production-gate.yml rejects the promotion PR if the corpus has fewer than 30 rows.

Vendor and platform dependency
  • Narrative pilot-to-production: A third-party MLOps platform with a UI for pilots and another UI for promotions, a seat license, and a data residency story your compliance team has not finished reading yet.
  • PIAS two-gate wiring: rubric.yaml, pilot-gate.yml, production-gate.yml, eval/cases.yaml, runbook/<agent>.md, flags/<agent>.yaml. Six files, all in your repo, no PIAS-hosted services, no platform license.

What the on-call engineer sees when the agent misbehaves
  • Narrative pilot-to-production: A Confluence page titled ProjectName Deployment that has not been edited since the original launch celebration.
  • PIAS two-gate wiring: runbook/<agent>.md with alerts, oncall, rollback, eval-dashboards, owner-rotation. 300+ words. Read cold by an engineer who was not on the engagement.

The week-6 handoff checklist, in the repo

Nine lines. Each line is an assertion production-gate.yml can check, or a human fact the engagement owner attests to in the PR comment. The boxes are checked in runbook/_handoff.md, and the gate rejects the promotion PR if any box is unchecked.

runbook/_handoff.md, at merge time

  • rubric.yaml on main, gates section present, owner field set
  • pilot-gate.yml required check, green for last 7 nightly runs
  • production-gate.yml required check on promotion PR
  • eval/cases.yaml at 30+ rows, tail-mining PRs merged weekly
  • runbook/<agent>.md has alerts, oncall, rollback, eval-dashboards, owner-rotation
  • feature flag flipped to 100 in flags/<agent>.yaml, merged on green
  • runbook/_handoff.md read cold by three on-call engineers
  • engagement owner available for 12 months of 2-hour paid consults, capped rate
  • No PIAS-hosted runtime. No platform license. No retainer.
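A gate can only reject an unchecked box if the boxes are machine-readable. The Python sketch below assumes runbook/_handoff.md uses GitHub-style task list syntax ("- [x]" and "- [ ]"); that format, the function names, and the sample text are assumptions for illustration.

```python
import re

# Matches "- [ ] text", "- [x] text", "* [X] text", with optional indent.
CHECKBOX = re.compile(r"^\s*[-*]\s+\[([ xX])\]\s+(.*)$")


def parse_checklist(markdown):
    """Return a list of (checked, text) pairs for every task-list line."""
    items = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            items.append((match.group(1) != " ", match.group(2).strip()))
    return items


def unchecked_items(markdown):
    """Text of every box that is still open."""
    return [text for checked, text in parse_checklist(markdown) if not checked]


def handoff_complete(markdown, expected_items=9):
    """True when at least `expected_items` boxes exist and all are checked."""
    items = parse_checklist(markdown)
    return len(items) >= expected_items and all(c for c, _ in items)


sample = "\n".join([
    "- [x] rubric.yaml on main, gates section present, owner field set",
    "- [ ] runbook/_handoff.md read cold by three on-call engineers",
])
```

Requiring a minimum number of boxes, not just "no unchecked boxes", guards against a checklist that passes because lines were deleted from it.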

The same two gates, five different agents

Each card is a production agent with the wiring live on main. Provider, domain, and stack differ. The file shapes do not.

Monetizy.ai -- outbound email orchestrator

pilot-gate.yml threshold: rubric_min_score 0.82. Missed the week-3 Monday snapshot, refund fired, 7-day prorated credit auto-applied. Passed the week-5 snapshot, promoted on a one-line diff to flags/email_orchestrator.yaml at week 6.

Upstate Remedial -- 400K+ debt-notice emails

eval/cases.yaml grew from 34 to 52 rows across weeks 3 to 5 via tail-mining. The first promotion PR was rejected by production-gate.yml because runbook/_handoff.md had one unchecked box. Fixed and merged 40 minutes later.

OpenLaw -- AI-native legal editor

rubric.yaml carries both the editor rubric and the compliance section. production-gate.yml consumes both on the promotion PR, so a legal-domain regression or a compliance drift blocks the same feature flag flip.

PriceFox -- multi-tenant retrieval

Each tenant has its own rubric_min_score in rubric.yaml under tenants[]. production-gate.yml runs the nightly trailing check per tenant. One tenant can be in production while another is still gated in pilot, in the same repo.
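A per-tenant trailing check might look like the Python sketch below. Only tenants[] and rubric_min_score come from the description above; the "name" key, the shape of the nightly score history, the status labels, and the tenant names are hypothetical.

```python
def tenant_statuses(tenants, nightly_scores, required=7):
    """Classify each tenant by its own threshold over the trailing window.

    `tenants` is a list of mappings with "name" and "rubric_min_score".
    `nightly_scores` maps tenant name -> list of scores, newest last.
    """
    statuses = {}
    for tenant in tenants:
        recent = nightly_scores.get(tenant["name"], [])[-required:]
        green = (len(recent) == required and
                 all(score >= tenant["rubric_min_score"] for score in recent))
        statuses[tenant["name"]] = (
            "production-eligible" if green else "gated-in-pilot")
    return statuses


# Two hypothetical tenants with different bars in the same rubric.
tenants = [
    {"name": "tenant-a", "rubric_min_score": 0.82},
    {"name": "tenant-b", "rubric_min_score": 0.90},
]
scores = {
    "tenant-a": [0.85] * 7,
    "tenant-b": [0.91] * 6 + [0.88],
}
statuses = tenant_statuses(tenants, scores)
```

With these inputs tenant-a clears its bar on all seven nights while tenant-b misses on the latest one, so the two tenants end up on opposite sides of the gate in the same repo.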

OpenArt -- per-scene DAG, mixed model fleet

Promotion PR flipped flags/scene_composer.yaml from 10 to 100. production-gate.yml's trailing-seven-nightly check caught an Opus-to-Sonnet regression on one node. The PR sat open for four days until the node was rerouted and the snapshot went green.

Two required gates, six files, one rubric. The word production is not a slide in any of the five engagements; it is a one-line diff in flags/, merged on green by two reviewers, blocked by a workflow that counts regressions and reads a runbook. That is the whole pilot-to-production event.

PIAS leave-behind on Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt

Receipts

File counts and threshold facts, not invented benchmarks. Per-client production metrics are on /wins.

6 files that make up the pilot-to-production wiring in your repo
2 required CI gates (pilot-gate.yml, production-gate.yml)
0.82 rubric_min_score threshold enforced by pilot-gate.yml
Shipped on 5 production agents on this exact two-gate pattern

The 6-file rule is the same discipline that turns the word production into a 1-file PR. The week-2 refund clause is wired to a single threshold: 0.82. Everything else is 0 on purpose: 0 USD in platform license, 0 vendor-hosted runtimes.

Want the two gates and the rubric.yaml in your repo in six weeks, not a 40-page pilot-to-production deck?

60-minute scoping call with the senior engineer who would own the build. You leave with a one-pager: the agent, the rubric thresholds, the feature flag path, and a named engineer to deliver it.

AI pilot to production, the questions engineering and procurement actually ask

What makes the PIAS pilot-to-production model different from writing a roadmap deck?

Three things. First, the thresholds live in rubric.yaml in your repo, so any stakeholder who wants to change them opens a PR. Second, two required GitHub Actions (pilot-gate.yml and production-gate.yml) enforce those thresholds on a schedule and on the promotion PR, so the word "production" is not a slide, it is a green check. Third, the week-2 refund clause is wired to a numeric threshold and a webhook, not to a meeting. A roadmap deck cannot pause an invoice automatically. Our pipeline can, and has, on a real engagement, at 09:02 UTC on a Monday.

What files are actually added to our repo to wire this up?

rubric.yaml (the thresholds and gate config), .github/workflows/pilot-gate.yml (the week-2 required check), .github/workflows/production-gate.yml (the week-6 required check on the promotion PR), eval/cases.yaml (the growing case corpus, 30+ rows at gate time), runbook/<agent>.md (the on-call playbook, 300+ words, five canonical sections), and flags/<agent>.yaml (the rollout_percent flag the promotion PR flips). Six files, all version-controlled, all in your repo. Nothing is hosted by us.

What does the promotion PR actually look like in git history?

Its title starts with "promote: <agent> pilot -> production". Its diff is a single file, flags/<agent>.yaml, with rollout_percent: 10 replaced by rollout_percent: 100. Any other change in the same PR makes production-gate.yml's diff-contract job fail. Prompt edits, retrieval updates, and model swaps each merge in their own PRs gated by pilot-gate.yml before the promotion PR is opened. The promotion PR is intentionally boring, and that is the point.
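The title and diff contracts are simple enough to sketch in Python. The function name and its inputs are hypothetical; in a real workflow the title, changed-file list, and flag values would come from the pull request event and the two versions of the flag file.

```python
def check_promotion_pr(title, changed_files, old_percent, new_percent,
                       agent, flag_path):
    """Return a list of contract violations; an empty list means mergeable."""
    errors = []
    prefix = f"promote: {agent} pilot -> production"
    if not title.startswith(prefix):
        errors.append(f"title must start with {prefix!r}")
    if list(changed_files) != [flag_path]:
        errors.append(f"diff must touch only {flag_path}")
    if (old_percent, new_percent) != (10, 100):
        errors.append("rollout_percent must change from 10 to 100")
    return errors


# A clean promotion, using the PR title shown earlier in this guide and a
# hypothetical flag path for that agent.
ok = check_promotion_pr(
    "promote: discharge-summary-drafter pilot -> production",
    ["flags/discharge_summary_drafter.yaml"],
    10, 100,
    agent="discharge-summary-drafter",
    flag_path="flags/discharge_summary_drafter.yaml",
)

# The same PR with a prompt tweak smuggled in fails the diff contract.
smuggled = check_promotion_pr(
    "promote: discharge-summary-drafter pilot -> production",
    ["flags/discharge_summary_drafter.yaml", "prompts/system.md"],
    10, 100,
    agent="discharge-summary-drafter",
    flag_path="flags/discharge_summary_drafter.yaml",
)
```

Returning every violation at once, instead of stopping at the first, gives the PR author one round trip to fix the PR.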

What happens if the week-2 threshold misses on a Monday snapshot?

pilot-gate.yml's refund-signal job runs automatically after the eval job fails. It POSTs the snapshot to the billing webhook configured in rubric.yaml, which pauses the open invoice. It then opens a GitHub issue against the engagement owner, labeled refund-triggered and billing-paused, and links the failing workflow run. We see it, you see it, and neither side has to write an email or schedule a call. If the next Monday snapshot is green, billing resumes and a 7-day prorated credit is auto-applied. If it misses three Mondays in a row, the engagement ends per the MSA exit clause, code stays with you, and you keep every PR merged to that point.
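The pause, resume, and exit rules above form a small state machine over the history of Monday snapshots. A minimal Python sketch, with hypothetical action names and the assumption that the history is a list of booleans, oldest first, where True means the snapshot was green:

```python
def billing_action(history):
    """Decide what billing should do after the latest Monday snapshot."""
    if not history:
        return "none"
    # Count consecutive misses ending at the most recent snapshot.
    miss_streak = 0
    for green in reversed(history):
        if green:
            break
        miss_streak += 1
    if miss_streak >= 3:
        return "end_engagement"         # three Mondays in a row missed
    if miss_streak > 0:
        return "pause_invoice"          # latest snapshot missed
    if len(history) >= 2 and not history[-2]:
        return "resume_with_7d_credit"  # green right after a miss
    return "continue"


first_miss = billing_action([True, False])
recovery = billing_action([True, False, True])
exit_case = billing_action([True, False, False, False])
```

Deriving the action from the full history, not from a stored "paused" flag, means a rerun of the workflow gives the same answer.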

Why two gates instead of one? Isn't the week-6 gate enough?

Because the week-2 gate does a different job. Its purpose is to make sure the prototype is on a trajectory that can reach production, and to make the refund clause operational. Without pilot-gate.yml, a six-week engagement only has one decision point at the end, and that point is dominated by sunk-cost bias. With two gates, the week-2 snapshot is an early, cheap decision: continue on a working trajectory, or refund and exit while the cost of switching is small. The week-6 gate is then free to focus on production readiness instead of feasibility.

We're not Anthropic-only. Does this work with Bedrock, Vertex, OpenAI, Azure?

Yes. rubric.yaml has a model_provider field per agent, and pilot-gate.yml reads it to pick the SDK that runs the eval. All five shipped engagements use this file shape, and they span Anthropic direct, Bedrock, Vertex, and OpenAI. The gates never hard-code a provider. This is part of what "model vendor neutral" means in the SOW.

What goes in runbook/<agent>.md and why does production-gate.yml check it?

Five canonical sections, in this order: alerts (what pages who, on which metric, at which threshold); oncall (the rotation, the pager policy); rollback (the exact revert PR template, the flag to flip back to 10 percent); eval-dashboards (links to the trailing-seven-nightly chart and the per_case_regression chart); owner-rotation (who owns the runbook for the next 12 months and how the handoff is re-opened). production-gate.yml's runbook-present job asserts each section is present and the file is at least 300 words. This is what separates "the agent passes the eval" from "the agent is production," and it is a check because otherwise it gets skipped.
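A runbook-present style check could be sketched in Python as follows. The five section names and the 300-word floor come from the answer above. The assumption that each section is a Markdown heading whose text is exactly the section name, and the function name itself, are ours.

```python
REQUIRED_SECTIONS = (
    "alerts", "oncall", "rollback", "eval-dashboards", "owner-rotation",
)


def runbook_problems(text, min_words=300):
    """Return a list of problems; an empty list means the runbook passes."""
    # Collect heading text from lines that start with one or more "#".
    headings = [
        line.lstrip("#").strip().lower()
        for line in text.splitlines()
        if line.startswith("#")
    ]
    problems = []
    missing = [s for s in REQUIRED_SECTIONS if s not in headings]
    if missing:
        problems.append("missing sections: " + ", ".join(missing))
    word_count = len(text.split())
    if word_count < min_words:
        problems.append(f"only {word_count} words, need {min_words}")
    return problems


# A synthetic runbook: five headings with 60 filler words under each.
full = "\n".join(f"## {name}\n" + "word " * 60 for name in REQUIRED_SECTIONS)
stub = "## alerts\nPage the owner.\n"
```

The synthetic `full` text passes, while `stub` is reported both for its four missing sections and for its length.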

We already have an MLOps platform. Do we have to rip it out?

No. rubric.yaml is a YAML file and the two gates are GitHub Actions. They do not compete with your platform, they sit above it. If your platform runs the eval, pilot-gate.yml calls into it; if it hosts the runbook dashboard, the runbook links to it. The only requirement is that the thresholds, the gate config, and the promotion diff live in your repo as text, so a PR (not a UI click) is the unit of change.

How is the promotion PR's merge protected from a surprise change?

Three ways. First, production-gate.yml's diff-contract job fails if any file other than flags/<agent>.yaml is touched. Second, two-of-two reviews are required from the engagement owner and the client lead (CODEOWNERS routes this automatically). Third, the PR's title must match the canonical prefix, so a curious engineer cannot rename it to merge silently. If a stakeholder insists on shipping a prompt tweak with the promotion, that tweak merges first as its own pilot-gate-guarded PR, then the promotion PR is rebased.

How quickly can you set this up in an existing repo?

The six files are a week-1 and week-2 PR pair. Week 1: pilot-gate.yml, rubric.yaml, eval/cases.yaml, flags/<agent>.yaml land in warn-only mode so we can calibrate thresholds against your actual staging traffic. Week 2: the thresholds in rubric.yaml are tightened, pilot-gate.yml becomes a required check, and production-gate.yml lands as a queued required check (it has no effect until the promotion PR exists). That is the whole setup. After week 2, every PR that touches the agent is scored, and every week we are honest about whether the trajectory holds.