Guide, keyword: langchain multi agent orchestration
Tutorials teach you what to add. We ship by what we delete.
Search langchain multi agent orchestration and every top result walks you through how to build a supervisor, a swarm, or a Command(goto=...) handoff. None of them say which of those idioms a senior engineer removes from a client repo before a v1 tag. This page is that list. Five patterns, one grep script, one Pydantic replacement each. The five most common reasons a LangGraph graph looks clever in the PR and broken at 2am.
Five grep patterns. Five Pydantic replacements. One shell script that fails the build if any of them come back.
Senior engineer embedded in your repo. Your Postgres. Your CI.
Why the SERP is full of additions and empty of subtractions
The first page of results for langchain multi agent orchestration is uniformly instructive: here is how to add a supervisor, here is how to build a swarm, here is how to use Command(goto=...) for handoffs, here is how to wire interrupt() for human-in-the-loop. All real features. All documented. All correct on turn one of a greenfield repo.
What no one publishes: what those features look like on turn 400, three weeks after the PoC shipped, when a worker has restarted twice and the p95 prompt is 18k tokens of stale conversation. A framework vendor cannot publish that list because the items on it are things the framework encourages you to use. We can, because the product is not a framework, it is a senior engineer inside your repo for six weeks and the artifact they leave behind.
“LangGraph idioms we grep out of every client repo in week 1. The list was 11 entries long as of mid-2025. Six moved into fixed framework behavior or stopped repeating. Five remain. Every engagement starts by running a shell script that counts matches for each one and scoring the codebase against zero.”
PIAS engagement checklist, applied across Monetizy.ai, Upstate Remedial Management, OpenLaw, PriceFox, OpenArt
The numbers that live in every post-refactor graph
These four numbers are not a style preference. They are contract terms that the eval harness and the graph walker both enforce at CI time. Any PR that violates them is a failing required status check, not an argument in review comments.
5 patterns removed. 8 edges in the final graph. 2 models in the critical path. 1 file where graph topology lives.
Anchor: the five patterns, named and explained
This is the part of the page you can take back to your repo today. Grep for each string. If any count is above zero in src/, you have a known LangGraph production risk on main. The replacement is named next to each.
the deletion list, in engagement order
1. Command(goto="node_x") returned from an agent
An agent function returns langgraph.types.Command and the framework quietly adds an edge that is not in the static graph. Your CI graph walker counts edges by reading graph.py and misses every runtime-only edge. A PR that adds three Command(goto=) returns looks like a 0-edge diff on the review screen.
2. interrupt() called mid-node for human review
Tutorials love interrupt() because it feels elegant. In production it means "pause this node, wait for a frontend that may or may not exist, then resume with injected state." Resume paths are the single hardest thing to evaluate, because half the turn already ran and state is now partly stale.
3. MessagesState with unbounded add_messages
The built-in MessagesState plus add_messages reducer grows forever. On day 30 the p95 prompt is 18k tokens of conversational soup. Evals on turn-1 prompts still pass; evals on turn-12 prompts cannot even be constructed because no two cases have the same shape.
4. with_structured_output() called inside a supervisor node
When the supervisor is the one calling with_structured_output, you have coupled routing to a single LLM's function-calling API. Swapping vendors becomes a rewrite. And the supervisor's output class usually has a free-text reason field, which means routing is effectively controlled by a string the eval cannot grade.
5. checkpointer=MemorySaver() in anything that touches prod
We still see this in client repos where a LangGraph PoC graduated to 'just run it on staging.' MemorySaver loses every thread the moment the worker restarts. On the next deploy, half-complete cases evaporate and the on-call has no trace of where they went.
The state shape we leave behind
Typed. Bounded. No MessagesState. No free-text routing. The Field(max_length=12) on history is the single line that forces a compaction function to exist in the codebase.
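A minimal sketch of that state shape, assuming pydantic v2. Only history, its max_length=12 bound, handoff_budget, and awaiting_human come from this page; the Turn roles and case_id field are illustrative.

```python
from typing import Literal
from pydantic import BaseModel, ConfigDict, Field

class Turn(BaseModel):
    role: Literal["user", "agent", "system"]
    content: str

class CaseState(BaseModel):
    # validate_assignment makes pydantic re-check bounds on every write,
    # not only at construction time.
    model_config = ConfigDict(validate_assignment=True)

    case_id: str
    # The one line that forces a compaction function to exist:
    # a 13-turn history fails validation instead of growing the prompt.
    history: list[Turn] = Field(default_factory=list, max_length=12)
    handoff_budget: int = Field(default=3, ge=0, le=3)
    awaiting_human: bool = False
```

Constructing a CaseState with 13 turns raises a ValidationError at the write site, which is the whole point.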
The routing pattern we leave behind
Every edge out of the supervisor is enumerable by reading one pure function. The graph walker in CI can draw the complete graph from graph.py alone, no runtime execution required. A reviewer who opens a PR can count edges in twenty seconds.
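A sketch of such a router. The node names follow this page; the trimmed-down RouterState and its tried_primary field are illustrative, and the commented wiring shows where add_conditional_edges would consume the function (that part requires langgraph).

```python
from typing import Literal
from pydantic import BaseModel, Field

NodeName = Literal["primary_drafter", "fallback_drafter", "human_review"]

class RouterState(BaseModel):
    """Only the fields routing reads; an illustrative subset of CaseState."""
    handoff_budget: int = Field(default=3, ge=0, le=3)
    tried_primary: bool = False

def route(state: RouterState) -> NodeName:
    # Pure function over state: every outgoing edge of the supervisor
    # is one of the literals above, countable by reading this file.
    if state.handoff_budget == 0:
        return "human_review"      # deterministic exit, never a third LLM call
    if state.tried_primary:
        return "fallback_drafter"
    return "primary_drafter"

# In graph.py (shown for shape only):
# builder.add_conditional_edges("supervisor", route, {
#     "primary_drafter": "primary_drafter",
#     "fallback_drafter": "fallback_drafter",
#     "human_review": "human_review",
# })
```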
The shell script that runs on day 1 and every PR after
Forty lines of bash. Five rg calls, each expected to return zero. If any count is above zero in a PR, the required status check fails and the PR cannot merge. The cheapest, most-read file in the repo.
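A trimmed sketch of that gate, not the engagement script: it uses plain grep -rE so it runs where ripgrep is absent (swap in rg for speed), and the with_structured_output check here scans the whole tree rather than only supervisor files.

```shell
#!/usr/bin/env bash
# langgraph_lint.sh (sketch) -- fail the build if a banned pattern reappears.
lint_dir() {   # usage: lint_dir <src-dir>; returns non-zero on any match
  local src="$1" fail=0 n label pat
  while IFS='|' read -r label pat; do
    n=$(grep -rE --include='*.py' "$pat" "$src" 2>/dev/null | wc -l)
    echo "$label: $n"
    [ "$n" -eq 0 ] || fail=1
  done <<'EOF'
Command(goto=|Command\(goto=
interrupt(|\binterrupt\(
MessagesState import|from langgraph\.graph import .*MessagesState
with_structured_output(|with_structured_output\(
MemorySaver(|MemorySaver\(
EOF
  return $fail
}
```

Wire the exit code into the required status check and the counts stay zero after handoff.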
The day-1 scan, from your terminal
Three commands answer three questions on handoff day. Did we actually remove the five patterns. Does the graph pass the ceiling asserts. Does the eval harness still pass on the golden set. Every output comes from your repo and your Postgres.
What a single case looks like going through the new graph
A case comes in, the supervisor routes it to the primary drafter, the scorer rejects it, the supervisor re-routes to the fallback drafter, the scorer rejects again, and the turn budget runs out. The graph exits through the deterministic human review node, not through a third LLM call and a silent close.
one case, two failed rubrics, budget exhausted, human takes it
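The same walkthrough as executable pseudocode. The scorer is stubbed as a list of numbers and PASS_AT is a hypothetical rubric threshold; node names follow the page.

```python
def run_case(scores, budget=3):
    """Trace one turn: scores are stubbed scorer outputs, one per
    drafting attempt. The budget caps attempts; exhaustion routes
    to human_review, never to a silent close."""
    PASS_AT = 0.8
    drafters = ["primary_drafter", "fallback_drafter"]
    trace = []
    for attempt, score in enumerate(scores):
        if budget == 0:
            break
        trace += [drafters[min(attempt, 1)], "scorer"]
        budget -= 1
        if score >= PASS_AT:
            return trace + ["close"]
    return trace + ["human_review"]   # deterministic exit
```

run_case([0.4, 0.6]) walks primary, scorer, fallback, scorer, then exits through human_review.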
Before and after, in the same graph.py
The left side is recognizable from most LangGraph tutorials and from every client repo we have opened. The right side is what is on main after week 1. Same framework, same number of logical agents, totally different operational surface area.
supervisor routing in a LangChain multi agent graph
```python
# What LangChain tutorials show you
# (and what we find in every client repo we open)

@tool
def research(query: str) -> str: ...

def supervisor(state: MessagesState) -> Command:
    # Routing lives in an LLM output. An unprintable string.
    decision = llm.with_structured_output(Route).invoke(state["messages"])
    return Command(goto=decision.next, update={"messages": [decision.note]})

builder = StateGraph(MessagesState)
builder.add_node("supervisor", supervisor)
builder.add_node("researcher", researcher)
builder.add_node("writer", writer)
# No add_edge calls for supervisor.
# Edges exist only at runtime via Command(goto=).
# graph.draw() shows a supervisor with zero outgoing edges.
# CI graph walker cannot enforce a ceiling.

app = builder.compile(checkpointer=MemorySaver())
# MemorySaver in prod. Worker restart = lost threads.
```
- Command(goto=) creates runtime-only edges
- MessagesState grows without bound
- with_structured_output couples routing to one vendor
- MemorySaver loses threads on worker restart
- graph_walker sees 0 outgoing edges from supervisor
Your inputs, the capped graph, your leave-behind
Three ingress lanes feed one orchestrator. Every transition fans out to three leave-behind artifacts: a Postgres audit row, the nightly eval pass, and a runbook entry. There is no vendor runtime between your code and the evidence.
Inputs -> cleaned LangGraph orchestrator -> your repo
Tutorial LangGraph vs. production LangGraph, feature by feature
Same framework. Different subset of its API surface. The left column is what you will find in the top 10 results for this keyword. The right column is what sits on your main branch after we hand off.
| Feature | Tutorial LangGraph (SERP default) | Production LangGraph (PIAS leave-behind) |
|---|---|---|
| Handoff API used between agents | Command(goto="other_agent") returned from an agent callable. | add_conditional_edges(node, router_fn). Router is a pure function over state. Graph walker counts every edge. |
| Human-in-the-loop | interrupt() in the middle of a node, resume path depends on frontend. | A named human_review node, deterministic, writes awaiting_human=TRUE. Resume creates a new run. |
| Conversational state | MessagesState with add_messages. Grows without bound. Prompt shape drifts with wall-clock. | CaseState(BaseModel) with history: list[Turn] bounded by max_length. Compaction function is mandatory. |
| Routing decision carrier | Supervisor calls with_structured_output(), couples routing to one vendor's function calling. | Supervisor returns RouteDecision with an enum next_node. model_adapter is the only file that talks to vendors. |
| Checkpointing for long runs | MemorySaver in PoC, never replaced. Worker restart loses every in-flight case. | PostgresSaver against your existing Postgres. Thread IDs map to case_id. One SELECT joins checkpoints and audit rows. |
| CI enforcement | Style guide in a Notion page. New PR adds a new Command(goto=...) and no one notices. | langgraph_lint.sh runs on every PR. Match count must be 0 for each pattern. Required status check on main. |
| Who owns the rules after the engagement | Vendor docs. They change with framework releases. | Five lines in a shell script and one pydantic file in your repo. Your on-call can tighten them today. |
The pre-merge checklist for any LangGraph PR
If a PR on the graph repo wants to merge to main, these seven statements have to be true. Six of them are machine-checked by CI. The seventh is the runbook entry, which is a file a human reads during an incident.
PR-level sanity checks
- langgraph_lint.sh returns 0 matches for every pattern in scripts/ and src/ excluding tests/.
- graph_walker asserts len(graph.edges) <= 8 and max_depth(graph) <= 4, reading graph.py as source of truth.
- Every node that mutates state has a corresponding row in the audit_rows table keyed by (case_id, turn_id).
- CaseState has a history field with Field(max_length=...) and a compact_history() function that runs before every LLM send.
- Supervisor returns RouteDecision, never Command. The word Command does not appear in src/ for control flow.
- The eval harness replays RouteDecision outcomes from a golden set and asserts next_node distribution has not drifted more than 10 percentage points.
- runbook.md has a budget_exhausted section that names the on-call alert, the Postgres query, and the Zendesk queue the case lands in.
Anchor fact
Five ripgrep calls. Every count must be zero before v1.
On day 1 of a LangGraph engagement we run scripts/langgraph_lint.sh. Five patterns: Command(goto=, interrupt(, from langgraph.graph import MessagesState, with_structured_output( inside a supervisor file, MemorySaver( on any code path that reaches prod. Plus a guard that add_messages only appears on fields with a max_length. Every count must hit zero before we tag v1. The script is a required status check on main, so it stays zero after we leave. The SERP can keep telling you to add these patterns. We will keep deleting them.
Grep your LangGraph repo with us
Sixty-minute scoping call with the senior engineer who would own the week-1 cleanup. You leave with a written match-count for each of the five patterns in your current LangChain multi agent orchestration codebase, and an estimate for what it takes to get every count to zero.
Book a call →

LangChain multi agent orchestration, answered
What is LangChain multi agent orchestration, in one paragraph?
LangChain multi agent orchestration is the practice of coordinating several LLM-driven nodes (agents) inside one directed graph, usually via LangGraph. The graph holds a typed state object, each node reads and writes that state, and conditional edges route based on state. In production it is three artifacts: graph.py (nodes, edges, typed state), an eval harness on a fixed golden set, and an audit trail of every node transition. The hard part is not building the first graph; it is keeping edges, models, and handoff depth bounded so the on-call can reason about what happened at 2am.
Why remove Command(goto="...") handoffs? It is in the LangGraph docs.
Command(goto=...) lets any agent return a runtime-only edge. The edge is not in the static graph, which means graph.draw() misses it, a CI graph walker that reads graph.py misses it, and a PR that adds three new Command(goto=) returns looks like a no-op on the review screen. We replace it with add_conditional_edges(node, router_fn) where router_fn is a pure function over state. Every edge becomes string-searchable. The graph walker in CI becomes the source of truth on topology again. For LangChain multi agent orchestration graphs that will run in production for months, that is non-negotiable.
What is wrong with interrupt() for human-in-the-loop flows?
interrupt() pauses a node mid-execution, waits for an external resume, then continues from the interrupted point with injected state. In production this creates three problems: the node is half-executed so state is partly stale, the resume path depends on a frontend that may have shipped a breaking change since the interrupt, and your eval harness cannot replay an interrupted run because it has no concept of 'partial turn.' Our replacement is a deterministic human_review node. The graph enters it, writes awaiting_human=TRUE to the audit row, and terminates. A separate endpoint (POST /cases/{id}/resume) starts a new run with the human decision already in state. One turn per transition, always.
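The replacement pattern can be sketched as a node that only writes and terminates. The in-memory audit_rows list stands in for the Postgres INSERT, and the dict-shaped state is illustrative.

```python
from datetime import datetime, timezone

def human_review(state: dict) -> dict:
    """Deterministic terminal node: record the audit row, flag the case,
    end the run. No interrupt(), no half-executed node, no stale resume."""
    row = {
        "case_id": state["case_id"],
        "turn_id": state["turn_id"],
        "node": "human_review",
        "awaiting_human": True,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    state.setdefault("audit_rows", []).append(row)  # stand-in for Postgres INSERT
    state["awaiting_human"] = True
    return state  # graph terminates here; POST /cases/{id}/resume starts a NEW run
```

One turn per transition: the resume endpoint constructs a fresh state with the human decision already in it and starts a new run, which the eval harness can replay like any other.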
Why not use MessagesState? It is the LangGraph default.
MessagesState plus the add_messages reducer is a list that grows forever. For a 12-turn conversation it is fine. For a case that comes back three weeks later it is a prompt full of irrelevant history. By day 30 the p95 prompt is often 15 to 20k tokens of soup. Evals on turn-1 prompts pass; evals on turn-12 prompts cannot even be constructed because no two cases have the same shape. We replace MessagesState with a CaseState(BaseModel) where history is list[Turn] = Field(max_length=12) and a compact_history() function runs before every LLM send. Prompt shape becomes a function of turn count, not wall-clock accumulation. The eval harness can now build deterministic prompts for any turn.
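A sketch of the compaction step: keep the newest turns and fold everything older into one summary turn. The truncating summarizer here is a placeholder; in production that line would be an LLM or template call.

```python
from pydantic import BaseModel

class Turn(BaseModel):
    role: str
    content: str

MAX_TURNS = 12  # must match Field(max_length=12) on CaseState.history

def compact_history(history: list[Turn]) -> list[Turn]:
    """Runs before every LLM send. Result length is <= MAX_TURNS, so
    prompt shape is a function of turn count, not wall-clock age."""
    if len(history) <= MAX_TURNS:
        return history
    cut = len(history) - (MAX_TURNS - 1)
    old, recent = history[:cut], history[cut:]
    summary = Turn(
        role="system",
        content=f"[{len(old)} earlier turns compacted] "
                + " | ".join(t.content[:40] for t in old),  # placeholder summarizer
    )
    return [summary] + recent
```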
Why move with_structured_output() out of the supervisor node?
When the supervisor uses with_structured_output() directly, two things get coupled: routing and a single vendor's function-calling API. Swapping vendors means rewriting the supervisor. Worse, the structured output class usually has a free-text reason field, which means routing is effectively controlled by a string the eval cannot grade. Our replacement: the supervisor returns a typed RouteDecision with an enum next_node: Literal[...] and a score: float. Structured output calls stay inside the scorer nodes that need domain schemas. model_adapter.py is the only file that talks to provider SDKs. Swapping Bedrock for Vertex becomes a one-line change in pick_primary().
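The decoupling described above can be sketched in two pieces: a RouteDecision with an enum next_node, and a model_adapter that is the only place provider identifiers live. The model name strings are placeholders, not real config.

```python
from typing import Literal
from pydantic import BaseModel, Field

class RouteDecision(BaseModel):
    """What the supervisor returns. next_node is an enum the eval
    harness can grade; no free-text reason field to drift on."""
    next_node: Literal["primary_drafter", "fallback_drafter", "human_review"]
    score: float = Field(ge=0.0, le=1.0)

# model_adapter.py -- the only file that talks to provider SDKs.
# Identifiers below are placeholders, not real model names.
PRIMARY = "provider-a/model-x"
FALLBACK = "provider-b/model-y"

def pick_primary() -> str:
    # Swapping Bedrock for Vertex is a one-line change here,
    # invisible to the supervisor and the router.
    return PRIMARY
```

An invalid next_node fails validation at construction, so a vendor's function-calling quirks cannot leak into routing.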
What CI check keeps these patterns out once we have removed them?
A shell script in scripts/langgraph_lint.sh that runs five ripgrep searches. Each expected match count is 0 in src/ (tests can use whatever). One check for Command(goto=, one for interrupt(, one for MessagesState imports, one for with_structured_output( in supervisor files, one for MemorySaver. Plus a guard that add_messages only appears on fields with a max_length. The script exits non-zero on any violation. We wire it into .github/workflows/evals.yml as a required status check on main. A PR that re-introduces any of the five patterns cannot merge. It is 40 lines of bash and it has never been the reason an engagement slipped.
Does this mean LangGraph is a bad choice for multi agent orchestration?
No. LangGraph is the framework we use on most LangChain-stack engagements. What we remove are five specific idioms the framework offers that look elegant in tutorials and fail silently in production. The framework itself is solid: StateGraph, add_node, add_edge, add_conditional_edges, PostgresSaver, and the compile/stream APIs are genuinely useful. We just force the implementation to use the string-searchable, type-checkable subset. The five deletions narrow the API surface; they do not replace the framework.
How long does the week-1 cleanup actually take on a real client repo?
Three to five working days on a 500 to 2,000 line LangGraph codebase. Day 1: run the grep script, write down the counts, score the repo. Day 2: replace MessagesState with a typed CaseState and wire add_messages to a bounded field. Day 3: replace every Command(goto=) with add_conditional_edges plus a pure router function. Day 4: pull with_structured_output out of supervisor nodes into a RouteDecision shape. Day 5: swap MemorySaver for PostgresSaver, wire audit_rows, land the langgraph_lint.sh CI gate. If the existing graph is a 30-edge tree of sub-agents, it takes longer because the honest fix is to split the graph into two orchestrators with a queue between them, and that is closer to two weeks.
What does your leave-behind look like at the end of week 6?
One senior engineer has been named on your repo for the full six weeks. What stays after they leave: a graph.py with 6 to 8 named nodes and 7 to 8 edges, a state.py with a typed Pydantic CaseState, a model_adapter.py with pick_primary and pick_fallback, an eval harness under tests/eval/ replaying a frozen golden set nightly, a langgraph_lint.sh plus a graph_walker.py in CI as required status checks, a Postgres schema with audit_rows and checkpoints in the same database, and a runbook.md keyed to the alerts your on-call will actually get paged on. No vendor runtime. No platform license. Your team can tighten or loosen every number in a PR on the same afternoon they decide to.
Why do you insist on a typed state object when LangGraph accepts TypedDict?
TypedDict is checked at type-check time and ignored at runtime. A bug that writes state['handoff_budget'] = -1 runs fine until the conditional edge divides by it. A pydantic BaseModel with Field(ge=0, le=3) raises on the assignment. For LangChain multi agent orchestration graphs where state is the only thing stitching agents together, we want the type system to catch the error at the write site, not at the read site three nodes later. It costs a few microseconds per transition and saves an unknowable number of 2am debug sessions.
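The difference can be demonstrated in a few lines, assuming pydantic v2 with validate_assignment enabled; the handoff_budget bound comes from this page.

```python
from typing import TypedDict
from pydantic import BaseModel, ConfigDict, Field

class DictState(TypedDict):
    handoff_budget: int

class ModelState(BaseModel):
    model_config = ConfigDict(validate_assignment=True)
    handoff_budget: int = Field(default=3, ge=0, le=3)

d: DictState = {"handoff_budget": 3}
d["handoff_budget"] = -1   # runs fine; the bug surfaces three nodes later

m = ModelState()
try:
    m.handoff_budget = -1  # raises at the write site
except ValueError as e:
    print("caught at write site:", type(e).__name__)
```

pydantic's ValidationError subclasses ValueError, and the failed assignment leaves the previous value intact.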
How do you know these five patterns are the worst ones and not the tenth-worst?
Every engagement we do starts with reading the client's existing code against a checklist of LangGraph idioms that have broken a previous client. The list was 11 entries long as of mid-2025; it sits at 5 now because the other 6 moved out of LangGraph into fixed framework behavior or are rare enough that they do not repeat. The five that remain repeat in every engagement. If you are reading this in 2027 and your repo has a sixth antipattern we have not listed, that is an engagement, not a docs bug.
Adjacent guides
More on shipping production multi agent systems
The numerical ceiling we refuse to cross
Eight edges, a three-turn handoff budget, two models in the critical path. The four numbers we publish, the MAST failure modes each one caps, and the conditional edge that enforces it.
The 2am on-call test
The four files we leave behind on main, the per-transition audit row schema, and the 10-minute diagnosis rubric for a broken turn.
Framework selection matrix
Five shipped systems, three orchestration patterns. When LangGraph, when Pydantic AI, when a custom DAG.