
Every orchestration guide stops at production. We start at 2am on-call.

A multi agent AI orchestration system is not “done” when it runs. It is done when an on-call engineer who did not build it can diagnose a broken turn in under 10 minutes from the files in your repo. This guide documents the four leave-behind artifacts and the per-transition audit row that make that test pass. No vendor page can publish this rubric; the runtime they sell is the artifact they would otherwise have to hand over.

Matthew Diakonov
12 min read
4.9 from named production systems, cited on /wins
4 files on main when we leave, not a vendor dashboard
1 audit row per node transition, in your Postgres
10-minute diagnosis target, measured from pager to fix

4 files. 1 audit row schema. 1 rubric your board can hold us to.

Named senior engineer. Your repo. Your Postgres. Your on-call.

Book the scoping call
graph.py · pipeline.py · evals/ragas_rubric.py · .github/workflows/evals.yml · runbook.md · failure_playbook.md · audit_rows.sql · model_adapter.py · pgvector · ragas · Bedrock · OpenAI fallback · MCP · A2A · LangSmith · OTEL

Why the SERP cannot answer this question

Search multi agent ai orchestration and the top 10 results will each define the pattern, list a few frameworks, and stop at "runs in production." None document what has to exist in your own repo after the build so an unrelated engineer can own a pager shift. The reason is the business model: the runtime the vendors sell is the artifact they would otherwise have to hand over. Publishing the leave-behind rubric would mean publishing the exit path off their product.

We do not sell a runtime. We sell a named senior engineer who ships the orchestration, checks the four files into your main branch, and leaves. That lets us publish the rubric openly: here is what "done" looks like, file by file.

4 files must exist on main before we close an engagement on a multi agent AI orchestration build: graph.py or pipeline.py; evals/ with a nightly Actions job; runbook.md keyed to on-call; failure_playbook.md. No hosted dashboard, no vendor license, no attached runtime. The handoff is the product.

PIAS engagement rubric, shipped across Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt

The four files, in your repo, on main

This is the literal directory your team will diff against at the end of the engagement. Every file sits in your repo, behind your branch protection, owned by your team after handoff. No hosted counterpart.

1. The orchestration source itself

graph.py or pipeline.py, in your repo, on main. Model adapters sit behind a small pick_primary / pick_fallback function so swapping Bedrock for Vertex is a one-line change. No imports from a vendor runtime. No decorators tied to a control plane we keep after we leave.
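A minimal sketch of that indirection. The two function names (pick_primary, pick_fallback) come from the rubric above; the provider classes, the PRIMARY_MODEL/FALLBACK_MODEL settings, and the ModelHandle shape are illustrative assumptions, not the shipped code.

```python
# Sketch of the adapter indirection: the graph only ever sees a ModelHandle,
# never a vendor SDK. Provider names and env vars here are assumptions.
import os
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelHandle:
    name: str
    invoke: Callable[[str], str]  # prompt -> completion

def _bedrock() -> ModelHandle:
    # Real code would wrap the bedrock-runtime client here.
    return ModelHandle("bedrock-claude", lambda prompt: f"[bedrock] {prompt}")

def _openai() -> ModelHandle:
    return ModelHandle("openai-fallback", lambda prompt: f"[openai] {prompt}")

_PROVIDERS = {"bedrock": _bedrock, "openai": _openai}

def pick_primary() -> ModelHandle:
    # Swapping Bedrock for another provider is a one-line config change.
    return _PROVIDERS[os.environ.get("PRIMARY_MODEL", "bedrock")]()

def pick_fallback() -> ModelHandle:
    return _PROVIDERS[os.environ.get("FALLBACK_MODEL", "openai")]()
```

Because the graph calls pick_primary() rather than importing a model SDK, a provider rotation touches one mapping entry and an eval pass, nothing else.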

2. evals/ with ragas + a case rubric

Nightly GitHub Actions job runs ragas plus a case-specific rubric on a fixed golden set. Regression threshold breaches fail the workflow and page the on-call. The rubric is checked in, not hosted on a vendor dashboard you have to re-subscribe to.

3. runbook.md keyed to on-call

Not a Confluence page. A file in the repo, versioned with the code. It names the dashboards, the failure edges, the expected latency budget per node, and the three most likely failure modes with exact queries to run against the audit log.

4. failure_playbook.md

Names the fallback model, the circuit-breaker thresholds, the audit-row schema, and the escalation ladder. Written so an engineer who did not build the system can read it once and own a pager shift. Updated in the same PR that changes the graph.

The handoff viewed from your terminal

On handoff day the on-call team runs three commands. The answer to each one has to come from the repo itself, not a vendor UI.

handoff check

Anchor: the audit row shape

The entire 2am test rides on one table. Every node in the orchestration writes exactly one row into audit_rows before it returns. The on-call does not need Grafana, LangSmith, or a vendor trace viewer to answer "what happened at 23:40." A psql prompt is enough. Two indexes keep the rollup queries under a second even at hundreds of thousands of turns per day.

audit_rows.sql
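A sketch of that table. The column list and both indexes come from the schema described later in this guide; the exact Postgres types and defaults are assumptions.

```sql
-- One row per node transition, written before the node returns.
CREATE TABLE audit_rows (
    case_id        text        NOT NULL,
    turn_id        text        NOT NULL,
    node           text        NOT NULL,
    model          text        NOT NULL,
    input_hash     text        NOT NULL,
    output_hash    text        NOT NULL,
    rubric_score   numeric     NOT NULL,
    latency_ms     integer     NOT NULL,
    fallback_taken boolean     NOT NULL DEFAULT false,
    edge_label     text,
    ts             timestamptz NOT NULL DEFAULT now()
);

-- The 30-minute rollup the on-call runs first.
CREATE INDEX audit_rows_node_ts ON audit_rows (node, ts DESC);

-- Partial index for the regression query; 0.82 is the rubric threshold.
CREATE INDEX audit_rows_regression ON audit_rows (ts)
    WHERE rubric_score < 0.82;
```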

How the graph emits one row per transition

One hook, installed on every node, writes the audit row on exit. That is the only piece of “framework magic” we ship, and it is thirty lines of Python you can read in one sitting. The on-call query above depends on this being non-optional.

graph.py
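The hook, sketched. The decorator pattern matches the "one hook on every node" description above; the write_audit_row callable and the state dict keys are assumptions about the surrounding graph code.

```python
# Exit hook sketch: a decorator wraps every node so it cannot return
# without emitting exactly one audit row.
import hashlib
import time
from functools import wraps

def _digest(value) -> str:
    return hashlib.sha256(repr(value).encode()).hexdigest()[:16]

def audited(node_name: str, write_audit_row):
    """Wrap a node function; write its audit row on exit, then return."""
    def decorate(node_fn):
        @wraps(node_fn)
        def wrapper(state: dict) -> dict:
            started = time.monotonic()
            result = node_fn(state)
            write_audit_row({
                "case_id": state["case_id"],
                "turn_id": state["turn_id"],
                "node": node_name,
                "model": result.get("model", "unknown"),
                "input_hash": _digest(state),
                "output_hash": _digest(result),
                "rubric_score": result.get("rubric_score", 0.0),
                "latency_ms": int((time.monotonic() - started) * 1000),
                "fallback_taken": result.get("fallback_taken", False),
            })
            return result
        return wrapper
    return decorate
```

Because the row is written inside the wrapper, before control leaves the node, a node author cannot opt out by forgetting an instrumentation call.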

The 2am on-call walkthrough, minute by minute

Not a hypothetical. This is the walkthrough we rehearse with the on-call team in the 90-minute handoff session before we leave. The clock is the rubric: alert to named fallback in under 10 minutes, without the person on pager ever needing to read graph.py.

1. 00:00 Page fires

PagerDuty incident: rubric_score_p95 below 0.82 for 15 minutes on the Upstate email flow. On-call engineer has never touched graph.py.

2. 00:02 Open the runbook

runbook.md is at the repo root. First section is 'pager triggers' keyed to the alert name. The entry for this alert points to one Postgres query against audit_rows and one Grafana panel. No judgment call required yet.

3. 00:04 Run the audit query

SELECT node, model, AVG(rubric_score), COUNT(*) FROM audit_rows WHERE ts > NOW() - INTERVAL '30 min' GROUP BY node, model. Result: drafter_primary on Bedrock dropped from 0.89 to 0.71 after 23:40 UTC. All other nodes healthy.

4. 00:06 Consult the failure playbook

failure_playbook.md entry 'drafter_primary rubric regression' names the exact fallback: flip pick_primary() to the OpenAI adapter via the feature flag in LaunchDarkly. Named fallback edge, not a guess.

5. 00:08 Flip the fallback

Flag flipped. Next audit row shows model=openai-4.6 and rubric_score back to 0.87. Alert clears at 00:11. Incident stays in Severity 3 because the graph had a named fallback edge that did not require a code push.

6. 00:12 Hand back to day shift

On-call writes a one-paragraph postmortem quoting the exact audit rows. The day shift has all the evidence in the same repo. No vendor support ticket opened, no cross-team Slack, no screenshare with an integrator.

The wiring that makes the test passable

Three real-world inputs feed one orchestrator, and every transition fans out to a set of artifacts the on-call can reach without us. No ghost runtime sitting between the graph and the evidence.

Inputs → orchestrator → leave-behind artifacts in your repo

Inbound case
Compliance rules
Retrieval
Orchestrator
audit_rows
evals/ nightly
runbook.md

The six-item on-call readiness checklist

Before we sign off on a production release, we walk this checklist with the on-call team. Missing one line means the release is staged but not cut. These are the non-negotiables that make the 2am test passable.

production gate: multi agent ai orchestration

  • Alert names match runbook.md section headers exactly.
  • Every alert maps to one audit-log query and one dashboard panel.
  • Named fallback edge for each model call, not a retry loop over the same model.
  • Audit row is written before a node returns, not in a batched sidecar.
  • Rubric thresholds live in the repo, not in a vendor dashboard admin panel.
  • A postmortem can be written from SELECT statements alone, no UI required.

Vendor control plane vs. your repo

Both shapes can ship a working multi agent AI orchestration system. They diverge the moment you need to own the outcome without the party that built it in the room. This is the same table we walk through on scoping calls when a team is deciding between a hosted platform and an owned build.

Feature | Hosted orchestration platform | FDE10x owned-repo handoff
Where the orchestration source lives | Inside a vendor control plane, exported as YAML | graph.py in your repo, on main, behind your branch protection
Where the eval rubric lives | A hosted dashboard you pay per seat for | evals/ directory, ragas + case rubric, runs in your GitHub Actions
Where the audit trail lives | Vendor-managed trace store with 30-day retention | audit_rows table in your Postgres, joined to your existing BI
How fallback between models is configured | A rules engine in the vendor UI | One conditional_edge in graph.py, model adapters behind pick_primary()
What happens on a pager event | Open vendor UI, click through runs, escalate to integrator | Open runbook.md, paste one SELECT, flip the named fallback flag
What happens on vendor outage | Your orchestration is down until the vendor restores | Fallback edge fires, alert clears, vendor outage is a routine incident
12-month exit cost | Rewrite the graph against the next vendor's DSL | None. The graph is already yours. Rotate models, keep the code.

Anchor fact

The leave-behind is the product. Everything else is preamble.

At handoff you will find exactly four new top-level files on main plus an evals/ directory: graph.py or pipeline.py, runbook.md, failure_playbook.md, and audit_rows.sql. Each file compiles, runs, or is consulted by a named rotation. No hidden SaaS. No vendor-managed trace store. Model adapters behind pick_primary() so swapping Bedrock for Vertex is a one-line change. Framework rationale checked in next to the code so the engineer who replaces us can re-evaluate without calling us.

Receipts

The numbers below come from the engagement rubric itself, not a sector benchmark. All four are contract terms, not aspirations. If we miss the week-2 prototype gate, billing pauses; if we ship fewer than four leave-behind files, the engagement is not closed.

4 files checked into main on handoff day
1 audit row per node transition, no exceptions
10 min on-call diagnosis target from alert to named fallback
0 vendor-managed runtimes in the exit contract

The 10-minute target is measured from page-fire to named fallback flipped. The four files are enumerated above and checked in the handoff PR. The zero vendor runtimes is written into the MSA.

Run the 2am test on your own graph

Sixty-minute scoping call with the senior engineer who would own the build. You leave with a written rubric scored against your current orchestration: which of the four files you already have, which are missing, and what it would cost to close the gap.

Book a call

Multi agent AI orchestration, answered

What exactly is a multi agent AI orchestration system in production?

In production, it is three things: a graph or pipeline definition (nodes, edges, state), an eval harness running on a fixed golden set, and an audit trail that records every node transition. The LLM calls are the easy part. The graph, eval, and audit layer are what make it survive a pager event. At PIAS we ship all three into the client repo on main. No vendor control plane owns any of them after we leave.

What is the 2am on-call test and why does it matter?

The 2am on-call test asks: can a pager-level engineer who did not build the system diagnose a broken turn in under 10 minutes, using only the files in your repo and queries against your own Postgres? Most vendor pages treat 'running in production' as the finish line. We treat it as the start. If your orchestration fails the 2am test, you do not own it; the vendor or the consultancy that built it does, and you will find out on the worst shift of the year. The entire leave-behind rubric is designed to pass this test on day one of the handoff.

What are the four files you hand over at the end of a multi agent AI orchestration engagement?

One, graph.py or pipeline.py (the orchestration source, with model adapters behind a pick_primary / pick_fallback indirection). Two, evals/ with ragas and a case-specific rubric wired into GitHub Actions nightly. Three, runbook.md keyed to on-call, naming every alert, every dashboard, and the exact audit-log query for each pager trigger. Four, failure_playbook.md, naming the fallback model, the circuit-breaker thresholds, and the audit-row schema. All four are checked into main, not attached to a vendor dashboard.

What does the per-transition audit row look like?

It is a single row in an audit_rows Postgres table with: case_id, turn_id, node, model, input_hash, output_hash, rubric_score, latency_ms, fallback_taken, edge_label, ts. One row per node transition. The row is written before the node returns, not in a batched sidecar, so the on-call query is always current. Two indexes: (node, ts DESC) for the 30-minute rollup, and a partial index on rubric_score < 0.82 for the regression query. That shape is enough to answer 'what happened at 23:40' from a psql prompt, no UI required.

Why not use a hosted orchestration platform for this?

Because the handoff is exactly what a hosted platform cannot give you. If the graph lives in a vendor control plane, the artifact that survives the engagement is a license, not a repo. When the vendor raises prices, deprecates an API, or gets acquired, your orchestration is now someone else's roadmap problem. We pick open frameworks (LangGraph, Pydantic AI, custom DAG) precisely so the leave-behind has no SaaS attached. You keep the code. You keep the traces. You keep the bill.

How long does it take to get a multi agent AI orchestration system to a passing 2am on-call test?

Two to six weeks from scoping to a production system that passes the on-call test. Week 0 is a scoping call with a senior engineer, no SOW theater. Week 1 the named engineer is in your GitHub, Slack, and standup. Week 2 a prototype runs in your staging on your data. Week 6 production plus the four leave-behind files, and a 90-minute handoff with the on-call team. If we miss the week-2 gate, billing pauses. That is on the contract.

What happens if our workflow is a scored pipeline, not a stateful conversation?

Then the leave-behind shape is the same, but the orchestration file is pipeline.py with Pydantic AI rather than graph.py with LangGraph. On Monetizy.ai, where the orchestration is retrieve, score, personalize, send, we chose Pydantic AI because the problem is a scored pipeline not a conversation. The audit row schema, the runbook, the failure playbook, and the evals/ directory are identical. The framework is downstream of the problem shape. The handoff rubric is not.

How does model-vendor neutrality fit into this?

The model adapter is a single file (model_adapter.py) with pick_primary() and pick_fallback() returning handles to Bedrock, Vertex, Azure OpenAI, Anthropic, or a local provider. The graph never imports a model SDK directly. Rotating from Bedrock Claude to Vertex Gemini is a one-line change plus an eval pass. No vendor agreement is signed as part of the engagement. You keep the keys. You keep the bill. If a regulator ever asks who the ultimate controller of the model is, the answer is you.

What does the on-call do if both the primary and the fallback model fail?

failure_playbook.md names a degraded mode: the orchestrator short-circuits to a deterministic templated response tagged with a model='deterministic' audit row, and alerts on every case that lands there. This keeps the regulatory audit trail intact (every case has an outcome and a row), protects customers from a silent outage, and gives the day shift a complete list of cases to reprocess once the LLM side is healthy. Two failed providers is an incident; an unanswered customer is a lawsuit.
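The control flow above can be sketched in a few lines. The model='deterministic' tag comes from the playbook description; the template text, the alert hook, and the function shape are illustrative assumptions.

```python
# Degraded-mode sketch: try primary, then fallback; if both raise, return a
# deterministic templated reply so every case still gets an outcome and a row.
def draft_with_degraded_mode(state: dict, primary, fallback, alert) -> dict:
    for call in (primary, fallback):
        try:
            return call(state)
        except Exception:
            continue  # provider failed; try the next edge
    alert(f"degraded mode for case {state['case_id']}")  # page on every landing
    return {
        "model": "deterministic",  # audit-row tag named in the playbook
        "text": "We received your message and a specialist will follow up.",
        "fallback_taken": True,
    }
```

The returned dict flows through the same audit hook as any model output, which is what keeps the trail complete and gives the day shift its reprocessing list.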

Do you retrofit this rubric onto an existing multi agent AI orchestration system?

Yes, and about a third of engagements start that way. Week 0 we read your existing graph or chain and score it against the four-file rubric. Most existing systems fail on the audit-row layer (either missing or batched in a way that is useless at 2am) and on runbook.md (which is usually a Confluence page that has drifted from the code). We retrofit node hooks and the audit schema first, then the runbook and the failure playbook, and leave the framework choice alone unless it is actively costing you retries. No forced rewrites.