Guide, keyword: ai agents in production

Most guides survey the field. This one is the contract rubric.

Search “ai agents in production” and the top ten results are adoption surveys, framework explainers, and rubric frameworks. None of them publishes a six-week calendar with named exit clauses, named clients, and the specific artifacts that land in your repo at the end. This guide is that calendar. Five production AI agents have already been shipped against it.

Matthew Diakonov
13 min read
4.9 from named clients, production metrics
First PR by day 7, or weekly billing pauses
Week 2 rubric gate or prorated refund and exit
Runbook, eval harness, and CI/CD land in your repo on day 42

Your stack, your keys, your repo. No PIAS-hosted runtime.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Book the scoping call
Pydantic AI · LangGraph · Custom DAG · Anthropic · OpenAI · Bedrock · Vertex · pgvector · ragas · LangSmith · OTEL · GitHub Actions · Postgres audit · Canary rollouts · MCP · A2A

Why the SERP cannot answer this question

The top results for ai agents in production are Gartner adoption statistics, a Google Cloud Agent Builder explainer, a LangChain rubric framework, a Galileo evaluation survey, and a DeepEval comparison table. Each is useful on its own axis. None of them commits a specific team to a specific rubric on a specific calendar with specific exit clauses you can hold them to.

They cannot. A survey article is a piece of content, not a contract. A framework explainer is written by a vendor whose business model depends on you adopting their framework. A rubric framework is an abstraction. This page is a contract rubric that already shipped five production systems with named clients, named metrics, named stacks, and named exit clauses.

5 production AI agents shipped against this exact six-week rubric since 2024. First PR by day 7 on every engagement. Zero week-2 exits so far, which is the success case: the clause is an alignment mechanism, not a statistic.

PIAS case studies: Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt

Anchor fact: the three clauses the rubric is built around

Most “ai agents in production” engagements die in one of three places: integration (week 4 is when you discover the vendor prototype cannot reach your repo), evaluation (the demo looked great but fails on your data), and handoff (the vendor leaves and your team inherits a black box). The PIAS rubric puts a dated, contractual clause on each of those three points.

Anchor fact

7, 14, 42 days. Three clauses, one MSA.

  • Day 7 / week 1 PR clause. The first merged PR lands in your repo by end of day 7. If it does not, weekly billing pauses until it does. The engineer is in your CI, not on a demo URL.
  • Day 14 / week 2 gate clause. A running prototype scores against the rubric you wrote on the scoping call. If the gate fails, you get a prorated refund and the engagement ends. The clause is in the MSA.
  • Day 42 / week 6 leave-behind. The orchestration definition, the eval harness (ragas plus the week-0 rubric) in your GitHub Actions, the failure playbook, and a runbook keyed to your on-call rotation. Plus a 12-month option on 2-hour paid consults with the same engineer.

Survey guide vs. contract rubric

Left column: what the PIAS rubric commits to, in writing. Right column: what a typical “ai agents in production” engagement actually delivers. The rows are not a caricature. They are the shape of every post-mortem we have read from teams that hired the vendor on the other side of this table.

| Feature | Typical AI agent engagement | PIAS contract rubric |
| --- | --- | --- |
| First signal the engagement is working | Kickoff deck, stakeholder interviews, discovery doc | First merged PR into your repo by end of week 1, or weekly billing pauses until it lands |
| Week 2 gate | Revised scope, change order, extension to sprint 3 | Running prototype scored against the rubric you wrote on the scoping call, or prorated refund and exit |
| Where the code lives | Vendor platform, their cluster, their secrets | Your monorepo, your CI, your cloud, your keys |
| Evaluation | Manual spot checks on a shared demo URL | Case-specific rubric plus ragas in your GitHub Actions, failing the build on regression |
| Failure domain named on day 1 | Hallucination risk mentioned in appendix, unmitigated | A wrong outbound is a regulatory incident, a wrong scene is a retried scene |
| What leaves in week 6 | Access to a dashboard behind a vendor login | Orchestration code, eval harness, failure playbook, on-call runbook |
| After the engagement ends | Renewal conversation driven by an account manager | Same engineer available for 2-hour paid consults at a capped rate for 12 months |

The five shipped agents, and what the rubric held them to

Each card below is a named production AI agent with the framework choice that shipped, and the one-sentence version of the week-0 rubric. Client names, stacks, and metrics are on /wins. The rubric clauses live in the MSAs and are the same on every engagement.

Monetizy.ai: Pydantic AI, shipped in 1 week

Auto-orchestrated email campaign at around 8K messages per day. Pydantic AI plus Anthropic and OpenAI with pgvector retrieval. Scored pipeline, not a conversation. First PR landed on day 4.

Upstate Remedial: LangGraph, 400K plus emails

Legal compliance flow for auto-debt notices. Bedrock primary, OpenAI as a conditional-edge fallback, a deterministic compliance node on every drafter turn, and one Postgres audit row per transition. Week 2 rubric included on-record legal sign-off.

OpenLaw: Anthropic plus citation-verification subagent

AI-native law editor, publicly released. Domain retrieval, a separate citation-verification agent, and a red-team eval rubric scored by licensed attorneys in the evaluation harness. Week 6 leave-behind included the attorney review playbook.

PriceFox: industry-leading automated eval CI

Multi-tenant retrieval agent with pgvector, reranking, canary rollouts, and an automated evaluation pipeline in GitHub Actions. Nightly regression gates, OTEL traces, humans only on threshold breaches. Rubric ran itself in CI from week 3.

OpenArt: custom scene-graph DAG

Commercial video auto-generation with per-scene quality gates and prompt-repair retries. Custom DAG because the state is a tree of sub-states, not a turn log. First PR on day 5, prototype gate hit on day 13.
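The per-scene gate-and-retry loop described above can be sketched in plain Python. This is a hypothetical illustration, not OpenArt's code: `generate`, `score`, and `repair` are stand-ins for the real model calls, and the 0.8 floor is illustrative.

```python
# Minimal sketch of a per-scene quality gate with prompt-repair retries.
# generate/score/repair are hypothetical stand-ins for the real model calls.
from dataclasses import dataclass


@dataclass
class Scene:
    prompt: str
    output: str = ""
    score: float = 0.0
    attempts: int = 0


def run_scene(scene, generate, score, repair, floor=0.8, max_retries=2):
    """A wrong scene is a retried scene: regenerate with a repaired prompt
    until the quality gate passes or retries are exhausted."""
    for attempt in range(max_retries + 1):
        scene.output = generate(scene.prompt)
        scene.score = score(scene.output)
        scene.attempts = attempt + 1
        if scene.score >= floor:
            return scene
        scene.prompt = repair(scene.prompt, scene.output)
    raise RuntimeError(f"scene failed gate after {scene.attempts} attempts")
```

The point of the shape: the retry is local to the scene node, so one bad scene never fails the whole video, which is exactly why a turn-log framework was the wrong fit here.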

The six-week calendar, in order

This is the literal week-by-week calendar printed on the scoping memo. Read it as a rubric, not as a timeline: each row names the artifact that must exist by end of that week for the engagement to continue under the MSA. Week 1 and week 2 are the ones with financial teeth.

1. Week 0: scoping and rubric, on the record

90-minute scoping call with the named senior engineer who will own the build. You leave with a one-page memo: the problem shape, the framework the engineer would pick, the week 2 gate rubric, named inputs and outputs, and a weekly rate. The framework picked on the call is a recommendation, not a billed commitment.

2. Week 1: first PR in 7 days or billing pauses

The engineer is in your repo on day 1 with seat access, CI access, and a branch. A reviewable PR lands by end of day 7. If it does not land, weekly billing pauses until it does. No change orders, no re-scoping calls, no explaining why integration took longer than expected.

3. Week 2: prototype on your rubric, or prorated refund and exit

By end of week 2, a running prototype executes the happy path end-to-end and scores against the rubric you wrote in week 0. If the rubric gate fails, you get a prorated refund and the engagement ends. The clause is in the MSA. No PIAS engagement has hit this exit so far, but the clause is the point.

4. Weeks 3 to 5: production hardening, your stack, your keys

Fallback routing, eval in CI, audit traces, canary rollouts, SLOs named and alerted. Everything lands in your monorepo, your cloud account, your secret manager. No PIAS-hosted runtime. No vendor platform in the critical path. On-call pages your team from day one of week 5.

5. Week 6: handoff with runbook, eval, CI/CD in your repo

Four artifacts on main: the orchestration definition, the eval harness (ragas plus the week-0 rubric) running in your GitHub Actions, the failure playbook naming fallback models and the audit-log schema, and a runbook keyed to your on-call rotation. 90-minute handoff session with the team, then the engineer leaves.

6. 12 months after: same engineer on 2-hour paid consults

For 12 months after handoff, the same engineer is available for 2-hour paid consults at a capped rate. No retainer, no SaaS, no account manager. Your team is the owner; the engineer is on call when you hit a regression or a new failure mode and want a second pair of eyes.

What the week-0 rubric actually looks like

The scoping memo is a YAML file, not a slide deck. It is checked into your repo on day 1 and the week-1 PR clause, the week-2 gate clause, and the week-6 leave-behind all point at named keys in it. When the week-2 gate fires, there is no ambiguity about what “rubric” means: it is line 14 of this file.

rubric.yaml
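The memo itself is not published on this page. A hypothetical sketch of its shape, with illustrative key names and the threshold values quoted elsewhere on this page, might look like:

```yaml
# Hypothetical sketch of a week-0 scoping memo; key names are illustrative,
# not the actual schema. Thresholds are the examples quoted on this page.
engagement: <client>
problem_shape: scored pipeline        # vs. conversation, scene graph, ...
framework_pick: pydantic-ai           # written rationale checked in alongside
inputs: [crm_export.csv, outbound_templates/]
outputs: [drafted_messages, audit_rows]
week_2_gate:
  min_rubric_score: 0.82
  ragas:
    faithfulness: 0.78
    answer_relevancy: 0.78
clauses:
  first_pr_by_day: 7        # else weekly billing pauses
  gate_by_day: 14           # else prorated refund, exit
  handoff_by_day: 42
```

Because the clauses point at named keys, "did we pass the gate" is a lookup, not a negotiation.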

What the eval harness looks like on day 42

The week-6 leave-behind includes a GitHub Actions workflow that runs on every PR touching the agent. It runs the case-specific rubric (thresholds from the week-0 memo) and ragas faithfulness plus answer_relevancy, and fails the build on regression. When the engagement ends the only thing that changes is the name on the commits. The workflow keeps running.

.github/workflows/eval.yml
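The workflow ships per engagement, so what follows is a minimal sketch under the assumptions on this page; the `evals.rubric` and `evals.ragas_gate` module names are hypothetical stand-ins for the case-specific eval code.

```yaml
# Hypothetical sketch of the leave-behind eval workflow; job, step, and
# module names are illustrative, not the shipped artifact.
name: agent-eval
on:
  pull_request:
    paths: ["agents/**", "rubric.yaml"]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      # Case-specific rubric from the week-0 memo, then ragas; either step
      # exiting non-zero fails the build on regression.
      - run: python -m evals.rubric --config rubric.yaml --fail-under 0.82
      - run: python -m evals.ragas_gate --faithfulness 0.78 --answer-relevancy 0.78
```

The shape is the point: the gate runs on every PR touching the agent, in your Actions, on your keys.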

The wiring shape the rubric actually ships

Every engagement lands on the same diagram at the boundary: your inputs and your repo on one side, the six-week rubric as the orchestrator in the middle, your production agent plus the four leave-behind artifacts on the other. Nothing PIAS-hosted in the middle, on purpose.

6-week rubric: your inputs, the rubric gate, your production agent

[Diagram: your monorepo, your data, and your cloud flow into the 6-week rubric, which leaves behind the orchestration code, the eval harness in CI, and the runbook plus playbook, all in your repo.]

Receipts

The numbers below are from named production systems. No invented benchmarks, no sector averages. Counts and metrics are checkable on /wins.

5 production AI agents shipped against this rubric, named on /wins
7 days from engagement start to first merged PR, or billing pauses
400K+ emails on the Upstate Remedial LangGraph system
8K/day messages on the Monetizy.ai Pydantic AI system

The 400K+ email count is on the LangGraph system at Upstate Remedial, shipped against this rubric. The 8K/day throughput is on the Pydantic AI system at Monetizy.ai; the first PR landed on day 4 of the engagement.

AI agents in production, answered

What does "AI agents in production" mean in the PIAS rubric, specifically?

An agent is in production when four artifacts live on main in your repo: the orchestration definition (LangGraph, Pydantic AI, or a custom DAG), an eval harness (ragas plus a case-specific rubric) running in your GitHub Actions and failing the build on regression, a failure playbook naming the fallback model and the audit-log schema, and an on-call runbook keyed to your rotation. Anything short of those four is a demo, not a production agent. The rubric thresholds were written on the week-0 scoping call and the week-2 gate is scored against them.

What actually happens in week 1 of a PIAS engagement?

The named senior engineer is in your repo on day 1 with seat, CI, and branch access, and a reviewable PR lands by end of day 7. The PR is not a refactor or a discovery doc: it is the first piece of the happy-path agent executing end-to-end on a slice of your real input data, with an initial eval case checked in. If the PR does not land in seven days, weekly billing pauses until it does. The clause is in the MSA, not on a slide.

What is the week 2 gate and what happens if we fail it?

By end of week 2, a running prototype executes the happy path end-to-end and scores against the rubric you wrote with the engineer in week 0 (a min rubric score, typically around 0.82 on a case-specific scale, plus ragas thresholds around 0.78 on faithfulness and answer_relevancy). If the rubric gate fails, you get a prorated refund for weeks 1 and 2 and the engagement ends. The clause is in the MSA. No PIAS engagement has hit this exit so far, across five shipped production systems; the clause is the alignment mechanism, not the expected outcome.
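The gate arithmetic above is simple enough to sketch. A minimal, hypothetical check, using the illustrative threshold values quoted in this answer (the real keys and floors live in the week-0 memo):

```python
# Hypothetical week-2 gate check; metric names and floors are the
# illustrative examples from this page, not a fixed schema.
THRESHOLDS = {
    "rubric_score": 0.82,       # min case-specific rubric score
    "faithfulness": 0.78,       # ragas faithfulness
    "answer_relevancy": 0.78,   # ragas answer_relevancy
}


def gate(scores: dict) -> tuple:
    """Return (passed, failures); each failure names the metric and both numbers."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {floor:.2f}"
        for metric, floor in THRESHOLDS.items()
        if scores.get(metric, 0.0) < floor
    ]
    return (not failures, failures)


passed, failures = gate(
    {"rubric_score": 0.85, "faithfulness": 0.80, "answer_relevancy": 0.76}
)
# answer_relevancy 0.76 is below the 0.78 floor, so passed is False
```

Every failure is a named metric with two numbers, which is why the refund clause never turns into an argument about what the gate meant.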

What leaves in our repo in week 6?

Four artifacts on main, no external dependencies on PIAS: the orchestration definition (for example graph.py for LangGraph, pipeline.py for a custom DAG, or an agents/ module for Pydantic AI), the eval harness (ragas plus the week-0 rubric) running in your GitHub Actions, a failure playbook that names the fallback model and the audit-log schema, and a runbook keyed to your on-call rotation. A 90-minute handoff session with the on-call team closes the engagement. You own all of it.

What does the 12-month option cost and how is it priced?

For 12 months after handoff, the same senior engineer is available for 2-hour paid consults at a capped rate. Not a retainer. Not a SaaS subscription. Not an account manager calling every quarter. You buy 2-hour blocks when you hit a regression or a new failure mode, or when you want a second pair of eyes on a roadmap change. The cap prevents rate ratcheting. Your team owns the agent by design; the engineer is a named fallback, not a dependency.

Why is the week 1 PR deadline seven days and not two weeks?

Seven days forces environment access, repo conventions, CI permissions, and code review to happen in week 1 instead of week 4. That is where most AI pilots fail: the prototype works on a vendor laptop but cannot integrate with the client's monorepo, so integration is pushed to the end of the engagement and never finishes. We invert the order: integration is the first week, modeling work is weeks 2 to 5. The PR clause is the mechanical enforcement of that order.

How is this different from the "production AI agents" guides from Google Cloud, LangChain, and Gartner?

Those guides describe a framework or survey the industry. They do not commit a specific consultancy to a specific rubric with specific exit clauses on a specific calendar. This guide is the contract rubric five named production agents were held to: Monetizy.ai (around 8K emails per day on Pydantic AI), Upstate Remedial (400K plus emails on LangGraph), OpenLaw (publicly released), PriceFox (industry-leading automated eval CI), and OpenArt (custom scene-graph DAG). You can hold PIAS to this rubric. You cannot hold a survey article to anything.

Do you handle model and framework choice, or do we have to pick?

The engineer picks on the scoping call, with written rationale, and the rationale is checked in next to the orchestration file so a future engineer can re-evaluate. Across the five shipped systems we picked Pydantic AI once, LangGraph once, and a custom DAG three times. The framework is downstream of the problem shape, not an opinion we walk in with. If the framework you picked before hiring PIAS is the wrong fit for the problem, we will say so on the scoping call, with the reason, before you spend another sprint on it.

Where does the eval harness actually run, and who owns the dashboards?

The eval harness runs in your GitHub Actions (or your CI system of choice) on every PR that touches the agent code. Ragas plus the case-specific rubric from week 0, with thresholds that fail the build on regression. Traces land in your OTEL collector and your LangSmith project, on your keys. No PIAS-hosted dashboard. No vendor-hosted eval you have to log into. When the engagement ends, the only thing that changes is the name on the commits.

Can you retrofit this rubric onto an agent we already have in production?

Yes, and a chunk of our engagements start that way. Week 0 is a read of your existing code plus your existing traces: named failure modes with no fallback, nodes with no eval signal, transitions with no audit row. Week 1 is the first PR that fixes one of those. Week 2 is the rubric gate against the cases you actually care about. We do not rewrite working code. If the framework you picked is the wrong fit for the shape of the problem, we will say so before touching a line of your repo.

Want the rubric run against your agent, not ours?

60-minute scoping call with the senior engineer who would own the build. You leave with a one-page week-0 memo: the problem shape, framework pick, week-2 gate rubric, weekly rate.

Book the scoping call