Guide, keyword: multi agent orchestration architecture design

Every architecture guide draws the happy path first.

We draw the failure graph.

Multi agent orchestration architecture design guides publish topology taxonomies (hierarchical, sequential, parallel) and happy path patterns (supervisor, router, hand-off). None of them name the failure modes their diagrams can emit, because naming them forces a design commitment vendors cannot make on your behalf. We can. The first artifact we check into your repo at week 2 is a typed FailureReason Literal in state.py, six named modes, plus a node-by-failure-mode matrix that every conditional edge reads from. The happy path is the residual.

Matthew Diakonov
13 min read
4.9 from named production clients
6 named FailureReason values on a 400K+ email system
0 generic error branches in graph.py
1 SQL query returns the production failure taxonomy

Grounded in one shipped production system.

Upstate Remedial / 400K+ emails / LangGraph + Postgres

Book the architecture review

Why the happy path is the wrong starting point

Open any of the top SERP results for multi agent orchestration architecture design and you get the same three-act structure. Act one: a taxonomy of topologies (hierarchical, sequential, parallel). Act two: a gallery of happy path patterns (supervisor, router, peer-to-peer, hand-off). Act three: a paragraph at the end that says “add retries and guardrails.”

The third act is where production happens. It is also the act the vendor writing the guide cannot finish, because finishing it requires naming the failure modes their framework emits, and naming them invites a comparison they would rather avoid. So the failure graph gets hand-waved and the on-call engineer inherits a system whose architecture document ends where their job starts.

We can finish the third act because we do not sell the framework. Our engagement ends at week 6, the senior engineer walks out, and what stays behind is an eval harness, a runbook, and CI/CD wiring on main. Whatever we wired into the graph has to be explicit enough for the next engineer to operate without a walkthrough. That is why the design starts with failure.

The five rules of failure-first architecture design

  • Every node in the graph declares at least one named failure mode it can emit. A node with no declared failure mode is a design bug, not a simple node.
  • Failure modes live in one typed Literal in state.py. No per-node strings, no vendor error classes, no Exception subclasses leaking into the router.
  • Every conditional edge reads state.last_failure, not a try/except wrapping the node. Failure is data, not control flow.
  • Every failure mode has a named destination: a fallback node, a retry with a different model, or a terminal escalation edge that writes its own audit row.
  • The Postgres audit table has failure_reason as a NOT NULL enum column. Rows where failure_reason IS NULL are runtime bugs, not happy-path rows.

Six named failure modes in the enum. The ceiling is ten; past ten, the enum is leaking internal state. Under four, failures are being squashed into a generic bucket and the on-call cannot triage without reading logs.

PIAS failure-first architecture rubric, applied to the shipped Upstate Remedial system

Anchor file: state.py with the FailureReason Literal

This is the first file we check in. It has zero imports from any orchestration framework. The FailureReason Literal is the spec every downstream file reads. When a new failure mode shows up in production, it gets a new name here first, and the matrix and code follow.

state.py

The node-by-failure-mode matrix

Before we write graph.py, we write this table. Each row names a node, the failure modes that node can emit, and the destination each failure routes to. This is the committed architecture; the code is just the implementation.

| Node | Failure modes the node emits | Destination edge in graph.py |
| --- | --- | --- |
| intake | schema_invalid (payload missing required fields) | route to escalation |
| compliance_check | compliance_fail, schema_invalid | route to escalation |
| drafter_primary | model_down, latency_blow, schema_invalid | route to drafter_fallback |
| drafter_fallback | model_down, latency_blow | route to escalation |
| reviewer | rubric_fail (score below 0.82), handoff_exhausted | rubric_fail with budget > 0 returns to drafter_fallback; else escalation |
| escalation | (terminal; emits ok or the upstream failure in audit) | END |

Anchor file: graph.py reads last_failure, never a generic error

Every conditional edge names the failure mode it is routing on. There is no try/except in the router. Nodes set state.last_failure and return. The router is pure data routing.

graph.py
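The original listing is not preserved. A sketch of the routing functions, assuming the matrix above (the real file would register these with LangGraph's add_conditional_edges; here they are plain functions over a trimmed stand-in state):

```python
# graph.py -- hypothetical router sketch; no try/except, only the typed field.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaseState:                      # trimmed stand-in for state.py
    last_failure: Optional[str] = None
    handoff_budget: int = 2

def route_after_drafter_primary(state: CaseState) -> str:
    # Per the matrix: these three modes route to the fallback drafter.
    if state.last_failure in ("model_down", "latency_blow", "schema_invalid"):
        return "drafter_fallback"
    return "reviewer"

def route_after_reviewer(state: CaseState) -> str:
    # rubric_fail with budget remaining retries on the fallback drafter...
    if state.last_failure == "rubric_fail" and state.handoff_budget > 0:
        return "drafter_fallback"
    # ...otherwise rubric_fail and handoff_exhausted both escalate.
    if state.last_failure in ("rubric_fail", "handoff_exhausted"):
        return "escalation"
    return "end"
```

Each branch names a FailureReason value literally, which is what lets CI grep the router and diff it against the matrix.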

How a single transition flows on failure

The sequence below shows what happens when the primary drafter times out. State sets last_failure = “latency_blow”; the audit hook writes one Postgres row; the router reads the field and dispatches to drafter_fallback. No exceptions cross node boundaries.

Failure-first routing, one turn

  1. intake passes CaseState (last_failure = None) to compliance_check.
  2. compliance_check approves; record_turn(ok, compliance_check); the approved case proceeds to drafter_primary.
  3. drafter_primary times out; state.last_failure = latency_blow; record_turn(latency_blow, drafter_primary) writes one audit row.
  4. The router reads last_failure and routes to drafter_fallback.
  5. drafter_fallback succeeds; record_turn(ok, drafter_fallback); control passes to reviewer.
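The record_turn() hook referenced above is not preserved in this extract. A dependency-free sketch, assuming the case_turns table and failure_reason column described later (conn is any DB-API Postgres connection; the case_id and node column names are assumptions):

```python
# audit.py -- hypothetical record_turn sketch: exactly one row per transition.
def record_turn(conn, case_id: str, node: str, failure_reason: str) -> None:
    """Write one audit row, success ('ok') or failure alike."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO case_turns (case_id, node, failure_reason)"
            " VALUES (%s, %s, %s)",
            (case_id, node, failure_reason),
        )
    conn.commit()
```

Because the column is a NOT NULL enum, a transition that forgets to name its failure_reason fails at insert time rather than silently producing an untyped row.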

Anchor: the Postgres audit schema

The enum that lives in state.py is mirrored one-to-one in Postgres. The audit column is NOT NULL, so a row without a failure_reason is a runtime bug, not a happy-path shape. The on-call gets the production failure taxonomy in a single GROUP BY.

migrations/0001_case_turns.sql
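The migration listing is not preserved here. A sketch consistent with the prose (the enum values, the NOT NULL constraint, and the partial index are all described in this guide; column names beyond failure_reason and ts are assumptions):

```sql
-- migrations/0001_case_turns.sql -- hypothetical sketch mirroring state.py.
CREATE TYPE failure_reason_t AS ENUM (
  'ok', 'compliance_fail', 'latency_blow', 'rubric_fail',
  'handoff_exhausted', 'schema_invalid', 'model_down'
);

CREATE TABLE case_turns (
  id             bigserial PRIMARY KEY,
  case_id        text NOT NULL,
  node           text NOT NULL,
  failure_reason failure_reason_t NOT NULL,  -- NULL here is a runtime bug
  ts             timestamptz NOT NULL DEFAULT now()
);

-- Failure rates stay cheap to compute: index only the non-ok rows.
CREATE INDEX case_turns_failures
  ON case_turns (node, ts)
  WHERE failure_reason <> 'ok';
```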

Why this shape beats the happy path shape

What most architecture design guides publish

Boxes and arrows of the happy path: intake -> router -> drafter -> reviewer -> end. Failure is an afterthought paragraph that says 'add retries.' No named failure modes. No enum. No per-edge design artifact.

What we check into your repo in week 2

A typed FailureReason Literal in state.py and a node-by-failure-mode matrix committed next to graph.py. The happy path is whatever is left when every failure edge is wired. The design document is the commit.

Why this survives re-staffing

A new engineer does not have to read the author's head. They read state.py, then graph.py, then run one GROUP BY on case_turns to see what is failing today. No walkthrough required.

Why this survives a framework swap

FailureReason is a Python Literal. The enum is Postgres. Both survive a port from LangGraph to a custom DAG. Only graph.py rewrites. The architecture survives the wiring.

What sources feed the design, what leaves at week 6

The inputs to the failure-first architecture are your problem shape, your policy, and your latency budget. The outputs are four framework-agnostic files plus an operable runbook. The orchestrator is the only file that imports the framework.

Failure-first architecture: sources and leave-behinds

Sources: problem shape, policy constraints, latency budget, ragas + case rubric.
Leave-behinds: FailureReason enum + matrix, state.py, graph.py, case_turns + enum, ARCHITECTURE.md + runbook.

The five-step design walk

These are the steps in order. Step one is the hard one; steps two through five are mechanical. Most engagements that get stuck are stuck on step one, not on LangGraph semantics.

Failure-first architecture, in order

1. Enumerate failure modes before drawing a single node

Before we open graph.py, we write the FailureReason Literal in state.py. Six names, no more. Each name maps to an operational response a human can take, not an internal model error.

2. Build the node-by-failure-mode matrix

Name every node. For each node, name the failure modes it can emit and where each failure routes. If a cell says 'escalation' for every failure, the node is too coarse and has to be split.

3. Write graph.py so every conditional edge reads state.last_failure

No try/except in the node bodies. Nodes set state.last_failure and return. The router is the only place that reads it. This keeps node logic pure and routing explicit.
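A node body consistent with this rule could look like the sketch below. The draft_email adapter is an assumption: somewhere below the node, provider exceptions must be converted to named modes, and this sketch assumes a thin adapter that returns (text, failure) so the node itself never raises.

```python
# Hypothetical node body: failure is data, not control flow.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CaseState:                      # trimmed stand-in for state.py
    case_id: str
    draft: str = ""
    last_failure: Optional[str] = None

def draft_email(state: CaseState) -> Tuple[str, Optional[str]]:
    # Assumed adapter: catches provider errors internally and maps them to
    # a named mode. Stubbed here to simulate a primary-drafter timeout.
    return "", "latency_blow"

def drafter_primary(state: CaseState) -> CaseState:
    text, failure = draft_email(state)
    if failure is not None:
        state.last_failure = failure  # set the field and return; the
        return state                  # router reads it, nothing raises
    state.draft = text
    state.last_failure = "ok"
    return state
```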

4. Add the Postgres audit schema with failure_reason as a NOT NULL enum

One row per node transition. The enum column is the taxonomy. NULL means a runtime bug, not a happy-path value. The on-call can group by it and get a real taxonomy with one query.

5. Gate the PR on the matrix matching the code

A simple CI test parses state.py for the FailureReason Literal, greps graph.py for the branches, and fails if the matrix is out of date. The matrix is the spec; the graph is the implementation.
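One way such a CI gate could be sketched, under the assumptions this guide makes about the file shapes (a FailureReason Literal assignment in state.py, literal `state.last_failure ==` branches in graph.py):

```python
# test_matrix_matches_graph.py -- hypothetical CI gate sketch.
import ast
import re

def literal_values(state_src: str) -> set:
    """Extract the FailureReason Literal members from state.py source."""
    tree = ast.parse(state_src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and any(
            getattr(t, "id", None) == "FailureReason" for t in node.targets
        ):
            # The string constants inside Literal[...] are the taxonomy.
            return {c.value for c in ast.walk(node.value)
                    if isinstance(c, ast.Constant)}
    raise AssertionError("FailureReason Literal not found in state.py")

def routed_values(graph_src: str) -> set:
    """Every failure mode graph.py routes on, by literal comparison."""
    return set(re.findall(r"state\.last_failure == ['\"](\w+)['\"]", graph_src))

def gate(state_src: str, graph_src: str) -> None:
    """Fail the PR if the graph routes on a mode the enum does not name."""
    missing = routed_values(graph_src) - literal_values(state_src)
    assert not missing, f"graph.py routes on unnamed modes: {missing}"
```

A fuller gate would also parse the ARCHITECTURE.md table and diff it cell by cell; this sketch shows only the enum-versus-graph half.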

What the audit looks like in a live terminal

The audit is two commands. First, grep the repo to prove every branch names a failure mode. Second, run one GROUP BY against the Postgres audit table to see the failure taxonomy over the last 24 hours. This is the design document, running.

architecture-audit.sh
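The script itself is not preserved in this extract. A sketch of the two-command audit, assuming graph.py sits in the current directory and DATABASE_URL points at the Postgres with the case_turns table:

```shell
#!/bin/sh
# architecture-audit.sh -- hypothetical sketch of the two-command audit.

# 1) Prove every conditional branch routes on a named failure mode.
if [ -f graph.py ]; then
  grep -n 'state.last_failure ==' graph.py
fi

# 2) Pull the 24-hour failure taxonomy from the audit table.
if command -v psql >/dev/null 2>&1; then
  psql "$DATABASE_URL" -c \
    "SELECT failure_reason, COUNT(*) FROM case_turns
     WHERE ts > now() - interval '24 hours'
     GROUP BY 1 ORDER BY 2 DESC;"
fi
```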

Receipts

The numbers below are from the shipped Upstate Remedial system. They are not benchmark suite numbers; they are measurements of the deployed architecture.

6 named FailureReason values in the enum
1 Postgres column that enumerates every production failure
0 generic 'error' branches in graph.py
400K+ emails routed through the failure-first graph at Upstate Remedial

The 400K+ email count runs on a failure-first LangGraph system with the enum schema above live in Postgres; the zero generic error branches claim is verifiable by running the grep commands in the terminal section above.

Anchor fact

One SQL query returns the whole production failure taxonomy.

On the Upstate Remedial system, failure_reason_t is a Postgres enum with seven values: ok, compliance_fail, latency_blow, rubric_fail, handoff_exhausted, schema_invalid, model_down. Every row in case_turns has the column populated. The on-call query is SELECT failure_reason, COUNT(*) FROM case_turns WHERE ts > now() - interval '24 hours' GROUP BY 1. That query is how we resolved a rubric regression in eight minutes at 2 AM in week 4 without waking the person who wrote the drafter. No vendor dashboard, no custom tracing backend, one Postgres table and one enum.

Run the failure-first audit on your existing multi agent system

Sixty minutes with the senior engineer who would own the design. You leave with a draft FailureReason enum and a matrix candidate for your graph, regardless of whether we continue.

Book the architecture review

Multi agent orchestration architecture design, answered

What is failure-first multi agent orchestration architecture design?

A design method where the first artifact you check in is not a happy-path diagram, it is a typed enum of named failure modes the system can emit. We write the FailureReason Literal in state.py before we write a single node body. Then we build a matrix of every node by every failure mode it can emit and name the destination for each one. Only after that matrix is complete do we write graph.py. The happy path drops out as the residual: whatever routing is left when every failure edge is explicitly wired. The benefit is that the design document and the code converge on one source of truth; there is no separate architecture deck that can drift from reality.

Why name failure modes in a Literal instead of raising exceptions?

Exceptions are control flow. Named failure modes are data. If a node raises, the router has to wrap every call in try/except and map error classes to branches, which couples the router to every agent's internal error taxonomy. If a node sets state.last_failure = 'rubric_fail' and returns, the router is pure data routing and the node body stays pure domain logic. The second shape is also trivially auditable: every transition writes one Postgres row with the failure_reason column populated, so the on-call can run SELECT failure_reason, COUNT(*) FROM case_turns GROUP BY 1 and see the whole taxonomy in production. The exception shape hides that information in logs.

How many failure modes should the enum have?

On the production systems we have shipped, six is the typical ceiling. compliance_fail covers policy rejections, latency_blow covers budget exceedances, rubric_fail covers ragas plus case-rubric scoring below threshold, handoff_exhausted covers the handoff_budget ceiling from our MAST-anchored rules, schema_invalid covers tool output that does not validate against its Pydantic model, and model_down covers provider 5xx or timeouts. If the enum grows past ten, the granularity is wrong and the enum is leaking internal state. If it stays at three or four, failure modes are being squashed into 'other' and the on-call cannot triage without reading logs.

How does this differ from a supervisor or router pattern?

Supervisor and router are happy-path patterns. They describe who decides what node runs next when everything is going well. They do not specify what the router reads on failure, how failure is represented, or how the audit log captures it. Failure-first architecture is orthogonal: you can apply it inside a supervisor topology, a sequential pipeline, or a custom DAG. The artifact is the FailureReason enum and the node-by-failure-mode matrix, not the topology. We ship systems that use a LangGraph router topology and systems that use a custom DAG, and both have the same enum shape.

What is the node-by-failure-mode matrix and how do you commit it?

A two-dimensional table: rows are nodes, columns are failure modes. Each cell is either dashes (this node cannot emit that mode) or the name of the destination node the graph routes to. We check it in as a markdown table next to graph.py with the title ARCHITECTURE.md. A simple CI test parses the FailureReason Literal from state.py, greps graph.py for every 'state.last_failure ==' branch, and fails the PR if any cell in the markdown table does not match the code. The matrix is the spec; graph.py is the implementation; CI enforces equality.

How does failure-first design change the audit schema?

It promotes failure_reason to a NOT NULL enum column on the case_turns table, with 'ok' as the success value. Every node transition writes exactly one row, success or failure. That gives the on-call engineer a production failure taxonomy in one SQL query instead of a grep through JSON logs. The schema also adds a partial index WHERE failure_reason <> 'ok' so failure rates are cheap to compute per case_id, per node, per hour. We have watched a 2 AM incident resolve in eight minutes because the first query the on-call ran returned 'rubric_fail: 4200' and pointed at a drafter regression, not a routing bug.

Does failure-first design slow down the week 2 prototype?

No. It shortens it. The enum and the matrix are a one-day exercise on day three of the engagement, and they eliminate most of the week-three-onward thrash because every PR has an unambiguous place to declare new failure behavior. On the Upstate Remedial system we hit the week 2 prototype rubric on day 13, and the week-six production cutover required zero rearchitecting of the graph: the only changes in weeks three to six were new nodes and new enum values, never a rewrite of the routing.

Do you ship the architecture design itself as a deliverable?

Yes. At week 2 the leave-behind-in-progress includes four files: state.py with the FailureReason Literal, graph.py with the conditional edges reading it, the Postgres migration creating failure_reason_t as an enum and case_turns as the audit table, and ARCHITECTURE.md with the node-by-failure-mode matrix. At week 6 we add evals/ (ragas plus case rubric in CI), runbook.md keyed to your on-call rotation naming every failure_reason and the first query to run, and RATIONALE.md explaining which framework decisions are load-bearing and which are replaceable. The design document is the code. We do not ship separate decks.

Can you apply failure-first design to an existing multi agent system we already have?

Yes, that is most of the week 0 refactor engagements we run. We read your current graph, grep for exception handlers, and propose a FailureReason enum that captures the modes your code already implicitly raises. The first PR replaces try/except sites with state.last_failure assignments and introduces the Postgres enum column. The second PR flips the conditional edges to read from it. Usually this is a two-week delta, not a rewrite, and it leaves your current framework in place. The goal is to make failure visible in one query, not to move you to a new framework.