Guide, keyword: multi agent llm orchestration
Vendors sell complexity. We publish the ceiling.
Every production multi agent LLM orchestration graph we have shipped lives under the same published ceiling. No more than 8 edges. A handoff_budget: int = 3 field on the typed state object. No more than 2 models in the critical path. No more than 4 hops of depth from entry to any END. The ceiling is there because the MAST failure taxonomy put inter-agent misalignment at the top of the list, and every extra edge is one more place for a system at 2am to drift into a loop it cannot exit.
Edges, models, handoffs, depth. Four numbers. Checked in main.
Named senior engineer. Your repo. Your Postgres. Your on-call.
What the SERP never publishes
Search multi agent llm orchestration and the top ten results will walk you through definitions, framework pickers, failure-mode taxonomies, and at least one “we built our multi-agent research system” engineering post. None of them give you a number. None say “here is the ceiling we refuse to cross.” The closest thing, the MAST taxonomy paper, catalogues 14 failure modes but leaves the remediation to the reader.
There is a reason for the silence. A vendor that sells an orchestration runtime cannot publish a numerical ceiling on edges or models, because the business model is unbounded complexity. Publishing the ceiling would be publishing the exit path off the product. We do not sell a runtime; we sell a named senior engineer who ships the graph into your repo and leaves. The ceiling is publishable because it is the artifact, not the runtime.
“Edges in every production multi agent LLM orchestration graph PIAS has shipped. The counts by client are 4, 6, 8, 5, and 7. None has exceeded 8. None has required more than 2 models in the critical path. The ceiling is a contract term, not a coding style.”
PIAS engagement rubric, verified against Monetizy.ai, Upstate Remedial Management, OpenLaw, PriceFox, OpenArt
The four numbers, spelled out
These live at the top of every production graph.py or pipeline.py we ship, encoded either as a comment block the CI reads or as explicit asserts in a graph-walker test. Your team can tighten any of them on a branch; raising any of them requires an engineer-of-record signoff, not a PR description.
8 edges is the structural cap. 3 is the default handoff_budget on CaseState. 2 models in the critical path, primary plus named fallback. 4 is the max depth from entry to any END, asserted at CI time by a graph walker script.
Anchor: the typed state field that enforces it
The whole ceiling rides on one Pydantic field and one boolean. The field is bounded between 0 and 3 at the type layer, so an agent that attempts to set it to something else fails validation before a single LLM call is made. This is the part no vendor page can copy; it is a rule on a model the on-call can read in one minute.
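A minimal sketch of that field, assuming Pydantic v2. The case_id field is a placeholder; only handoff_budget and its bounds come from the ceiling as published.

```python
# Sketch of the typed state field, assuming Pydantic v2. case_id is a
# placeholder; only handoff_budget and its ge=0, le=3 bounds come from
# the ceiling as published.
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class CaseState(BaseModel):
    # Re-validate on mutation, so an agent that sets the field mid-run
    # is caught at the type layer, not discovered in production.
    model_config = ConfigDict(validate_assignment=True)

    case_id: str
    # Bounded at the type layer: a value outside 0..3 fails validation
    # before a single LLM call is made.
    handoff_budget: int = Field(default=3, ge=0, le=3)

state = CaseState(case_id="case-001")
print(state.handoff_budget)  # 3

try:
    CaseState(case_id="case-002", handoff_budget=5)
except ValidationError:
    pass  # rejected before any LLM call
```

With validate_assignment on, the bound holds for the lifetime of the state object, not just at construction.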
The conditional edge that implements the rule
One function, budget_gate. Called on every edge that could loop back into the agent side of the graph. When handoff_budget <= 0, the route is not a retry, it is a named deterministic escalation node. The ceiling becomes a one-line code fact, not a prompt.
The single place the budget decrements
There is exactly one decrement site in the codebase. If you grep for handoff_budget -= you find this file and no other. The decrement happens inside the same after-hook that writes the Postgres audit row, so "the budget changed" and "the audit row landed" are the same event. No sidecar, no batched flush, no way for a production incident to look fine to the on-call because the decrement queued up.
Watch the budget exhaust in a real incident
The sequence below is the exact trace of a case where both drafter passes scored below the rubric. Each reviewer-to-drafter loop costs one budget unit. When the third attempt fails, the graph exits through the deterministic escalation node rather than looping a fourth time. A human picks up the case. The audit trail shows exactly where the budget went.
case with budget exhaustion, 9 transitions
The shape we refuse versus the shape we ship
The "flexible" shape is what most multi agent LLM orchestration tutorials endorse and what most first-cut production graphs look like when we are called in to retrofit them. The capped shape is what we actually leave behind on main.
The graph topology decision
```python
# The "flexible" shape most multi agent LLM orchestration guides endorse.
# Looks powerful. Fails at 2am.
graph = StateGraph(State)
graph.add_node("planner", planner)
graph.add_node("researcher", researcher)
graph.add_node("synthesizer", synthesizer)
graph.add_node("critic", critic)
graph.add_node("specialist_legal", specialist_legal)
graph.add_node("specialist_finance", specialist_finance)
graph.add_node("specialist_compliance", specialist_compliance)
graph.add_node("specialist_retrieval", specialist_retrieval)
graph.add_node("coordinator", coordinator)
graph.add_node("executor", executor)
# 14+ edges. Critic can call any specialist. Coordinator can re-route.
# A path through the graph is not enumerable. An eval rubric cannot cover it.
# At 2am on-call, "what happened" is a LangSmith trace reader problem.
```
- 10+ agent nodes, no enumerable path set
- Critic can re-route to any specialist
- No typed cap on handoff depth
- Eval rubric cannot cover every path
- On-call at 2am is reading a LangSmith trace
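For contrast, a framework-free sketch of the capped shape: the same kind of graph expressed as a plain edge list so the ceiling facts are checkable by eye. Node names are illustrative; in production this topology is wired as a StateGraph, with budget_gate guarding the loop-back edge.

```python
# The capped shape, as a plain edge list (framework-free sketch; node
# names are illustrative). Every path is enumerable, the loop-back edge
# is guarded by budget_gate, and the counts sit under the ceiling.
EDGES = [
    ("entry", "drafter"),
    ("drafter", "reviewer"),
    ("reviewer", "drafter"),            # loop-back, routed via budget_gate
    ("reviewer", "END"),                # rubric passed
    ("reviewer", "escalate_to_human"),  # budget exhausted
    ("escalate_to_human", "END"),       # deterministic exit, human owns it
]

assert len(EDGES) <= 8  # the structural cap, the same fact CI asserts
print(len(EDGES))  # 6
```

Six edges, two agent nodes, one named terminal per outcome. The whole graph fits in one screenful, which is the point.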
The wiring, end to end
Three ingress lanes feed one orchestrator. Every node transition fans out to three leave-behind artifacts in your repo: a Postgres audit row that includes the current handoff_budget, the nightly evals that assert the ceiling, and the runbook entry keyed to the budget_exhausted alert. There is no ghost runtime between the graph and the evidence.
Ingress -> capped orchestrator -> leave-behind artifacts
Six reasons the ceiling is not arbitrary
Every number above is tied to a named failure class from the MAST taxonomy or to a shipped-system constraint. If any one of these stops being true, the ceiling moves.
why we published these numbers
- Inter-agent misalignment is the #1 production failure mode in the MAST taxonomy. Capping edges caps the misalignment surface area.
- Infinite handoff loops (A -> B -> C -> A) are MAST 2.3. A decrementing counter short-circuits the loop, a prompt cannot.
- Premature termination is MAST 3.2. The ceiling forces every terminal state to be a named node, not an LLM judgment call.
- Cost escalation: workflows that cost $0.50 in staging have hit $50K/month at 100K executions. A hard agent cap caps the bill.
- On-call readability: 8 edges fit on one page. 23 edges require a PDF. At 2am no one reads a PDF.
- Eval harness tractability: a rubric has to cover every path. Paths grow combinatorially with edges. 8 edges = 4 to 6 paths to score nightly.
The handoff viewed from your terminal
On handoff day, three commands answer three questions. Is the budget decrement in one place. Does the graph pass the ceiling asserts. Does the production audit log already show budget exhaustion happening cleanly. Every answer comes out of the repo and your Postgres, not from a vendor UI.
Vendor-runtime orchestration vs. owned-repo ceiling
Both shapes can ship a multi agent LLM orchestration system that technically runs. They diverge the moment you want to hold a specific number to the wall. Vendors cannot publish numerical ceilings; the ceilings would double as exit plans. We can, because the thing we hand over is the ceiling.
| Feature | Hosted orchestration runtime | FDE10x owned-repo ceiling |
|---|---|---|
| Cap on number of edges in a production graph | None published. "As many as the use case requires." | 8 edges. Fits on one screen. If the graph needs more, the problem was scoped wrong. |
| Handoff cap between agents | Implicit. Retry loops with a max_iterations prompt setting. | handoff_budget: int = 3 on the typed state object. Decrements in code, not in a prompt. |
| Cap on models in the critical path | None. More providers is usually sold as more resilience. | 2. Primary and named fallback. Rotation is a one-line change in model_adapter.py. |
| Cap on graph depth before forced termination | None. Deep trees of sub-agents. | Depth <= 4 from entry to any END. Enforced in a test that walks the graph at CI time. |
| What happens when the cap is hit | Retry. Loop. Hope the next LLM call goes better. | Deterministic escalation node writes a budget_exhausted audit row and returns a human-route payload. |
| Who owns the ceiling after the engagement ends | Vendor runtime; you cannot tighten it without re-subscribing. | Three numbers in a config file you own on main. Your on-call can tighten them today. |
| How the ceiling is published | Not published. The business model sells unbounded complexity. | In writing, in the MSA, in your repo's README, in this guide. |
Anchor fact
One field on the state object stops most MAST failures.
handoff_budget: int = Field(default=3, ge=0, le=3). It is declared on a Pydantic model. It is decremented in exactly one function, record_transition, which is the same function that writes the Postgres audit row. A conditional edge routes to a deterministic escalation node the moment the field is 0. There is no retry after zero; a human owns the case from there. The SERP can describe infinite-handoff loops in prose; we put the counter in the type system.
Score your graph against the ceiling
Sixty-minute scoping call with the senior engineer who would own the build. You leave with a written score of your current multi agent LLM orchestration against the four ceiling numbers: edges, handoff budget, models in path, depth. Which you already pass, which are breached, what it would cost to close the gap.
Book a call →
Multi agent LLM orchestration, answered
What exactly is multi agent LLM orchestration?
Multi agent LLM orchestration is the control layer that coordinates multiple LLM-driven nodes (agents) toward one outcome. In production it is three artifacts: a graph or pipeline file (nodes, edges, typed state), an eval harness running on a fixed golden set, and an audit trail of every node transition. At PIAS we ship the orchestration as source code in your repo on main, with a published ceiling on how many agents are allowed to talk to each other. That ceiling is 8 edges, a 3-turn handoff budget encoded as a typed state field, and no more than 2 models in the critical path.
Why publish a numerical ceiling at all? Is this not just hard-coding opinions?
The MAST failure taxonomy (Cemri et al., 2025, arXiv 2503.13657) catalogued 14 failure modes in production multi-agent LLM systems, with inter-agent misalignment as the single most common and infinite handoff loops in the top three. Those failure modes scale with the number of edges and the number of possible agent-to-agent transitions. A numerical ceiling caps the failure surface directly. Prose rules like 'write clearer specifications' do not; a Pydantic field with ge=0, le=3 does. The ceiling is an opinion and we defend it as one.
What is the handoff_budget field and where exactly does it live?
It is a Pydantic-constrained int field on the state object (CaseState), defaulted to 3, bounded between 0 and 3. Every production multi agent LLM graph we ship has it. It is decremented in one place: the after-hook record_transition in audit.py, and only when the transition is agent-to-agent (both source and destination are in AGENT_NODES). When it reaches 0, a conditional edge routes to a deterministic escalation node that writes a budget_exhausted audit row and returns a human-routing payload. There is no retry after the budget exhausts; that is the whole point of the field.
Why 3 turns specifically? Not 2, not 5?
Three is the smallest number that allows a primary-then-fallback-then-review pattern to complete under normal conditions and still catches a loop before the 4th hop. On Upstate Remedial Management, where regulatory exposure on a wrong email is the failure cost, we shipped with 3. On Monetizy.ai, where the failure cost is a reschedule of an email campaign, we also shipped with 3. Two starves the fallback path on legitimate primary failures. Five lets loops accumulate enough context to look like progress. Three is an empirical choice from shipped systems, not a theorem. It is also a single number in one config file, so your team can tighten it to 2 on any engagement if the domain calls for it.
How does this compare to max_iterations in AutoGen or CrewAI?
max_iterations as typically configured is a retry ceiling on a single agent, usually set in a prompt or a runtime config. It does not travel with the state object across handoffs, it does not enforce determinism on exhaustion, and it does not write an audit row that an on-call engineer can query. handoff_budget is a typed field on the state object that every node reads and one hook decrements. When it exhausts, a named node runs, a row lands in Postgres, and the alert can query for budget_exhausted = TRUE. max_iterations is a loop break; handoff_budget is a contract term.
What happens on a budget_exhausted event in production?
The escalation node runs. It is deterministic and has no LLM call. It writes a row to audit_rows with budget_exhausted=TRUE, emits a CloudWatch or Datadog event on the budget_exhausted metric, pages the on-call if the metric exceeds its threshold, and returns a response payload that routes the case to a human queue (for Upstate Remedial Management: a Zendesk ticket with the case context). The regulatory audit trail stays intact (every case has a row) and the customer is never left with a silently-hallucinated outcome. Two failed models is an incident; an unanswered customer is a lawsuit.
Would a ceiling not stop my team from doing real agentic work?
A ceiling on edges does not stop agentic work; it forces the design to allocate edges deliberately. Across five shipped systems, the edge counts we ended up with were 4, 6, 8, 5, and 7. None exceeded 8. Two of those systems (OpenLaw, PriceFox) do real multi-step reasoning with retrieval, scoring, and critique. The work fits under the ceiling because the graph is designed around named outcomes per node rather than open-ended agent chatter. If your problem shape genuinely needs a 15-edge graph, the correct move is to split the problem into two orchestrators with a queue in between, not to raise the ceiling.
Do the same ceilings apply to non-LangGraph frameworks?
Yes. We use LangGraph when the state is a typed conversation, Pydantic AI when the state is a scored pipeline (Monetizy.ai shape), and a custom DAG when the state is a tree (OpenArt scene graph). In every case the state object is a Pydantic BaseModel, handoff_budget is a field on it, the decrement lives in one hook, and a conditional edge or gate routes to a deterministic escalation node on 0. The framework choice is downstream of the problem shape. The ceiling is not.
How is the ceiling verified at CI time, not just at runtime?
Two checks in .github/workflows/evals.yml. One, a graph_walker script that imports graph.py, walks the compiled graph, and asserts that len(graph.edges) <= 8 and max_depth(graph) <= 4. Two, a unit test that constructs a CaseState with handoff_budget=0 and asserts the router routes to escalation. Both checks are required status checks on the main branch, so a PR that widens the graph past the ceiling cannot merge. The ceiling is a branch-protection rule, not a style guide.
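A sketch of what the walker's asserts can boil down to, assuming the compiled graph reduces to an edge list. The toy graph is illustrative, and depth here is defined as the longest simple (no-revisit) path from entry to END, so the loop-back edge does not make the walk diverge.

```python
# Sketch of the graph walker's two asserts over a toy edge list. In the
# real check the edges come from importing graph.py. Depth = longest
# simple path from entry to END, so cycles terminate the walk cleanly.
def max_depth(edges, entry="entry", end="END"):
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    best = 0
    def dfs(node, depth, seen):
        nonlocal best
        if node == end:
            best = max(best, depth)
            return
        for nxt in adj.get(node, []):
            if nxt not in seen:  # no revisits: loop-backs cannot recurse
                dfs(nxt, depth + 1, seen | {nxt})
    dfs(entry, 0, frozenset({entry}))
    return best

EDGES = [
    ("entry", "drafter"), ("drafter", "reviewer"),
    ("reviewer", "drafter"),  # loop-back
    ("reviewer", "END"), ("reviewer", "escalate_to_human"),
    ("escalate_to_human", "END"),
]

assert len(EDGES) <= 8        # the edge cap
assert max_depth(EDGES) <= 4  # the depth cap
print(max_depth(EDGES))  # 4
```

Wired as a required status check, a PR that widens the graph past either number fails before review, which is what makes the ceiling a branch-protection rule rather than a style guide.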
How do you pick between primary and fallback models, and why only 2?
The model_adapter.py file exports pick_primary() and pick_fallback(). Primary and fallback are chosen per engagement based on the provider mix the client already runs (Bedrock + OpenAI, Vertex + Anthropic, Azure + Bedrock). The cap at 2 in the critical path exists because a 3rd model does not add resilience that a primary-plus-fallback with the right rubric does not already cover, but it does add a 3rd vendor contract, a 3rd set of rate limits, and a 3rd audit surface. If a customer needs provider diversity for regulatory reasons (EU data residency, SOC 2 scope), the 3rd model goes in a separate graph with a queue, not into the critical path of the same orchestrator.
How long does it take to retrofit these ceilings onto an existing multi agent LLM orchestration system?
Two to four weeks for most systems we have retrofitted. Week 1 is reading the existing graph, counting edges, scoring it against the ceiling, and writing the CI asserts so further growth is capped. Week 2 is adding handoff_budget to state, moving the decrement into one hook, and wiring the deterministic escalation node. Week 3 is retrofitting the audit row schema if it is not already there. Week 4 is the runbook and the 90-minute on-call handoff. If the existing graph is already under 12 edges, it is usually 2 weeks. If it is a 30-edge tree of sub-agents, the honest answer is to split it and that is closer to 6 weeks.
Adjacent guides
More on shipping production multi agent systems
The 2am on-call test
The four files we leave behind on main, the per-transition audit row schema, and the 10-minute diagnosis rubric for a broken turn.
Framework selection matrix
Five shipped systems, three orchestration patterns. When LangGraph, when Pydantic AI, when a custom DAG.
Claude Code subagent orchestration
The four-file pattern for keeping parallel subagent workflows safe in a production repo: named subagents, a rubric, a CI gate, a runbook.