The framework debate is the wrong debate. Pick the shape, not the vendor.
Every vendor page for multi agent orchestration tells you to pick their framework. We are framework neutral by contract, and we have shipped five named production systems using three different orchestration patterns. This guide reverse-engineers those five decisions into a selection matrix you can run yourself. Anchor example included: why OpenArt’s multi-scene video pipeline is a custom DAG and not a LangGraph.
LangGraph, Pydantic AI, custom DAGs. Whatever your graph wants.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
Why vendor pages cannot answer this question
Search multi agent orchestration and the top results will each recommend their own framework. LangChain pages push LangGraph. CrewAI pages push CrewAI. Microsoft pushes AutoGen. n8n pushes its visual builder. None of them can publish a cross-framework selection matrix, because admitting that another framework is the right fit for some problem shape is against their business model.
We can. We do not sell a framework. We sell a named senior engineer who ships the system, checks the framework rationale into your repo next to graph.py, and leaves. That lets us say the quiet part out loud: the framework is downstream of the problem shape.
“Shipped production multi agent systems since 2024, named on /wins. Three used LangGraph or Pydantic AI. Two used custom DAGs. The decision was always a function of the problem shape, never the framework's roadmap.”
PIAS case studies: Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt
The decision matrix, as a single question: what shape is your state?
Two columns. Left: the multi agent orchestration pattern you want when the state is a tree of sub-states (custom-DAG-shaped). Right: the pattern you want when the workflow is one linear typed state (LangGraph-shaped). Pydantic AI fits when neither of those is the interesting question and you just need a scored pipeline with typed tools.
| Feature | DAG-shaped problem | LangGraph-shaped problem |
|---|---|---|
| Shape of the workflow | Tree of sub-workflows with per-branch gates | Linear, one auditable conversation per case |
| State model you want | Per-scene or per-tenant state, merged later | Single typed state, mutated across nodes |
| Fallback path | Retry-and-repair loop scoped to one scene | Conditional edge to a second model on SLO breach |
| Observability need | Offline eval on scene quality, not per-turn | Every transition logged to Postgres for compliance |
| Failure domain | A bad scene is retried; a failed render is rejected | A wrong email is a regulatory incident |
| Who reads the graph | Pipeline engineer during business hours | On-call engineer at 2am |
| Recommended pattern | Custom DAG with per-node quality gates | LangGraph stateful graph + audit hooks |
The five systems, the five choices
Each card below is a named production system with the framework choice that shipped, and the one-sentence reason we picked it. The client names are on /wins. The stacks are on the public case studies page. The framework rationale is here.
Monetizy.ai: Pydantic AI
Auto-orchestrated email campaign, ~8K per day. Short turn count, deliverability scoring, single-model path with retrieval. We picked Pydantic AI because the orchestration was a scored pipeline, not a conversation with branches. Shipped in 1 week.
Upstate Remedial: LangGraph
Legal-compliance email flow for auto-debt notices. Bedrock primary, OpenAI as conditional-edge fallback, deterministic compliance node before every drafter turn, per-transition row in Postgres. LangGraph's stateful graph was the right model because the workflow is a single auditable conversation. 400K+ emails sent.
OpenLaw: Custom + citation subagent
AI-native law editor ("Cursor for Lawyers"). Domain retrieval, citation verification as a separate agent, red-team eval rubric scored by licensed attorneys. We kept orchestration custom because the citation-verification pass needed its own model and its own failure policy, independent of the drafter.
PriceFox: Automated eval CI, nightly
Multi-tenant retrieval agent, automated ML engineering pipeline. Retrieval tuning, prompt variants, offline eval, canary rollouts run nightly. Human sign-off only on regression-threshold breaches. The orchestration sits in CI, not in a live graph: the problem is the eval loop, not the turn routing.
OpenArt: Custom scene-graph DAG
Multi-scene commercial video auto-generation. Scene-graph generation, per-scene quality gate, prompt-repair retry per scene. LangGraph would have forced the scene tree into turns and lost the per-scene boundary. A custom DAG gave us true per-scene retry semantics without flattening the graph.
Anchor: why OpenArt is a custom DAG and not a LangGraph
OpenArt’s pipeline generates multi-scene commercial video. Every scene is its own sub-problem: draft, quality-gate, maybe prompt-repair, gate again, emit. The gate score, the retry budget, and the repair prompt are per-scene, not per-turn. The scenes can run in parallel and only assemble in post-order.
LangGraph models state as one typed object mutating across nodes. To put the scene tree in that shape you have to flatten it into a sequence of turns, which serializes what should be parallel and conflates per-scene retries with conversation-level retries. The abstraction starts costing you real retries and real latency. That is when you build a small custom DAG instead, in about thirty lines of Python, and model the problem exactly.
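That thirty-line DAG is easy to gesture at; here is a minimal sketch of the shape. All names, the scoring logic, and the retry budget are illustrative stand-ins, not OpenArt's actual code: each scene carries its own budget and gate score, and the runner assembles only the scenes that pass.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scene:
    prompt: str
    retries_left: int = 2              # per-scene retry budget, not conversation-level
    draft: Optional[str] = None
    gate_score: float = 0.0

def draft_scene(scene: Scene) -> None:
    # stand-in for the real model call
    scene.draft = f"render of: {scene.prompt}"

def quality_gate(scene: Scene) -> float:
    # stand-in scorer; the real gate would be a learned or rubric-based model
    return 0.9 if "repaired" in scene.prompt else 0.4

def repair_prompt(scene: Scene) -> None:
    scene.prompt = f"repaired: {scene.prompt}"

def run_scene(scene: Scene, threshold: float = 0.8) -> bool:
    """Draft, gate, retry with prompt repair -- all scoped to one scene."""
    while True:
        draft_scene(scene)
        scene.gate_score = quality_gate(scene)
        if scene.gate_score >= threshold:
            return True
        if scene.retries_left == 0:
            return False               # budget exhausted: reject the render
        scene.retries_left -= 1
        repair_prompt(scene)

def run_pipeline(scenes: list) -> list:
    # scenes are independent; in production this fan-out runs in parallel
    return [s for s in scenes if run_scene(s)]
```

The point of the sketch is the boundary: nothing above the scene level ever sees a retry, so a per-scene repair can never be conflated with a conversation-level restart.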
Counter-example: why Upstate Remedial is LangGraph
Same company, different problem shape, different framework. Upstate Remedial’s auto-debt email flow is one linear conversation per case: intake, compliance check, drafter, reviewer, end. The state is one typed object. The dominant requirement is that every transition writes an audit row into Postgres that a regulator can read, and that one specific edge (Bedrock to OpenAI) fires when the latency budget breaks. LangGraph makes those two things one-liners.
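A dependency-free sketch of that wiring, with sqlite3 standing in for Postgres. The node names come from the case study; everything else is illustrative. In LangGraph proper the fallback would be a conditional edge and the audit write a per-transition hook:

```python
import sqlite3
import time

# sqlite3 stands in for the Postgres audit sink; schema is illustrative
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit (case_id TEXT, node TEXT, ts REAL)")

def audit(case_id: str, node: str) -> None:
    # one row per transition: this is the property a regulator reads
    db.execute("INSERT INTO audit VALUES (?, ?, ?)", (case_id, node, time.time()))

def drafter(state: dict) -> dict:
    # conditional edge: route to the secondary model on an SLO breach
    breached = state.get("latency_ms", 0) > 2000
    model = "openai-fallback" if breached else "bedrock-primary"
    state["draft"] = f"[{model}] notice for {state['case_id']}"
    return state

def run_case(state: dict) -> dict:
    # one linear conversation per case: intake -> compliance -> drafter -> reviewer
    for node in ("intake", "compliance_check", "drafter", "reviewer"):
        audit(state["case_id"], node)
        if node == "drafter":
            state = drafter(state)
    return state
```

Because the state is one dict mutating across a fixed sequence of nodes, both requirements (the audit row and the named fallback) attach to single, obvious seams.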
The two wiring shapes, drawn
Upstate is the diagram on the left side of every “production agent” deck: a router with a named fallback edge and an audit sink. OpenArt is the diagram that does not fit the deck: sources fan into a hub, but the hub fans out to a set of per-scene agents that each run their own retry loop before the reviewer assembles.
Upstate Remedial (LangGraph): linear router with fallback and audit
OpenArt (custom DAG): scene fan-out with per-scene gate + repair
The five-step decision, in order
This is the literal questionnaire our engineers run on a scoping call. Start at the failure domain and walk down. If the first answer does not settle the framework, the next one will.
1. Name the failure domain
A wrong outbound is a compliance incident. A bad scene is a retried scene. A stale retrieval score is a ranking regression. The shape of the worst case picks the shape of the graph.
2. Draw the state
If the state is one typed object that mutates across nodes, LangGraph is the cheapest abstraction. If it is a tree of sub-states that merge at the end, you want per-branch isolation and a custom DAG pays for itself.
3. Locate the fallback
Model fallback in one place (a conditional edge) is LangGraph territory. Fallback scoped to a sub-branch (per-scene retry with prompt repair) is DAG territory. Fallback implemented as a scored re-rank is Pydantic AI territory.
4. Decide who reads the runbook at 2am
If it is an on-call with pager access and a compliance question, the graph must emit per-transition audit rows your data team can query. If it is a pipeline engineer during business hours, offline eval metrics dominate.
5. Pick the smallest framework that honors the shape
Pydantic AI if the problem is a scored pipeline. LangGraph if the problem is a stateful conversation. A custom DAG if the state is a tree. No ceremony, no license, no platform attached.
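The five steps compress into a few lines. A toy encoding of the questionnaire (the labels and thresholds are ours, not a product; a real scoping call weighs more context than three strings):

```python
def pick_framework(failure_domain: str, state_shape: str, fallback_scope: str) -> str:
    """Toy encoding of the five-step questionnaire above."""
    if state_shape == "tree":
        return "custom DAG"        # per-branch isolation pays for itself
    if failure_domain == "compliance" or fallback_scope == "conditional_edge":
        return "LangGraph"         # typed state, audit hooks, named fallback edge
    return "Pydantic AI"           # scored pipeline with typed tools, no graph ceremony
```

Note the ordering: the state shape is checked first, because a tree settles the question regardless of the other answers.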
Vendor loyalty vs. shape-first
Pick one framework once, force every problem into its shape for two years, rebuild at the next architecture review.
- Scene graph flattened into LangGraph turns
- Scored pipeline rebuilt as a state machine
- Audit hooks bolted onto a framework that did not want them
- One vendor's conference roadmap on your critical path
Anchor fact
Same consultancy, same year, opposite framework choices. Both shipped.
On OpenArt we picked a custom DAG because a multi-scene pipeline is a tree of sub-states with per-scene retries and independent quality gates. On Upstate Remedial we picked LangGraph because a compliance email flow is one typed state with a named Bedrock-to-OpenAI fallback edge and a per-transition audit row. On Monetizy.ai we picked Pydantic AI because the problem was a scored pipeline, not a conversation. The framework is downstream of the shape. If your current vendor picks the same framework for all three, that is the tell.
Receipts
The numbers below are from named production systems. No invented benchmarks, no sector averages. The counts are checkable on /wins.
The 400K+ email count is on the LangGraph system at Upstate Remedial; the ~8K/day throughput is on the Pydantic AI system at Monetizy.ai. Both numbers come from the public case studies.
Want the decision run on your problem, not ours?
Sixty-minute scoping call with the senior engineer who would own the build. You leave with a written one-pager: the problem shape, the framework we would pick, the rationale, and a fixed weekly rate. The framework picked on the call is a recommendation, not a billed commitment.
Book the call →
Multi agent orchestration, answered
When should we reach for LangGraph on a multi agent orchestration project?
When the workflow is a single linear conversation with a typed state, and the dominant requirement is per-transition observability (think compliance, audit, rubric-gated merge). LangGraph's stateful graph plus its conditional-edge model makes fallback routing a first-class citizen, and the transition boundary is the natural place to write an audit row. We use it at Upstate Remedial because the failure domain is a regulatory incident, so every node needs a queryable trace in Postgres.
When is Pydantic AI a better choice than LangGraph?
When the orchestration is a scored pipeline rather than a conversation: retrieve, score, send. Fewer turns, simpler state, no branching on per-message rubrics. Monetizy.ai is the canonical shape: generate, personalize, deliverability-score, send. Pydantic AI gave us typed tool calling and eval without the ceremony of a state graph. We shipped the first production run in one week.
Why would you ever build a custom DAG instead of using LangGraph or CrewAI?
When the state is a tree, not a turn. On OpenArt's multi-scene video pipeline, every scene is its own sub-DAG with its own quality gate and its own prompt-repair retry budget. Modeling that inside a LangGraph would require flattening the scene tree into turns and losing the per-scene boundary; the abstraction mismatch would start costing us real retries and real latency. A custom DAG keeps the per-scene retry semantics intact and lets us run the scene gates in parallel.
Do you have a framework of choice?
No, and that is the whole point. We have shipped LangGraph, Pydantic AI, and custom DAGs. Whichever framework we pick for your system, the selection lives next to a written rationale, so a future engineer who replaces us can re-evaluate and swap without calling us. Framework neutrality is in the MSA; no platform license is signed as part of the engagement.
What is the anchor decision on OpenArt's custom DAG?
We rejected LangGraph because its graph treats state as one object that mutates across nodes. Our scenes are independent sub-states that each carry a retry budget and a gate score, then get assembled in post-order. Flattening that into LangGraph turns would serialize scenes that should run in parallel and would conflate a scene-level retry with a conversation-level retry. The custom DAG is thirty lines of Python, and it models the problem exactly.
Can you take our existing LangGraph or CrewAI prototype and make it production-ready?
Most engagements start that way. We do a week 0 read of your graph, identify nodes with no fallback, no eval, or no audit signal, and commit to production-hardening them against a named rubric. We do not rewrite working code. If the framework you picked was the wrong fit for the problem shape, we will say so on the scoping call, with the reasoning, before you spend another sprint on it.
How do you handle model-vendor neutrality inside the orchestrator?
The model boundary lives behind a small adapter in every system we ship. On Upstate Remedial it is pick_primary() and pick_fallback() returning Bedrock and OpenAI handles. On OpenArt it is a model router keyed to scene type. You keep the keys, you keep the bill, and we never sign a vendor agreement that makes it hard to swap a provider.
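A sketch of that adapter boundary. The `pick_primary()`/`pick_fallback()` names are from the case study; the handle type, model IDs, and routing logic here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ModelHandle:
    provider: str
    model_id: str

def pick_primary() -> ModelHandle:
    # provider names from the case study; model IDs are placeholders
    return ModelHandle("bedrock", "primary-model-id")

def pick_fallback() -> ModelHandle:
    return ModelHandle("openai", "fallback-model-id")

def complete(prompt: str, slo_breached: bool = False) -> str:
    handle = pick_fallback() if slo_breached else pick_primary()
    # the real call goes through the provider's SDK; swapping a vendor
    # means editing the two pick_* functions, nothing else in the graph
    return f"{handle.provider}:{handle.model_id} -> {prompt[:20]}"
```

The design choice is that the graph never names a provider directly, so neutrality is enforced by the call path rather than by policy.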
What leave-behinds arrive in our repo at the end of a multi agent orchestration engagement?
Four artifacts checked in on main: the orchestration definition itself (graph.py, pipeline.py, or equivalent), an eval harness with ragas plus a case-specific rubric running in your GitHub Actions, a failure playbook that names the fallback model and the audit-log schema, and a runbook keyed to your on-call rotation. A 90-minute handoff session with the on-call team, then we leave. Same engineer stays available for paid 2-hour consults at a capped rate for 12 months.
Adjacent guides
More on shipping production AI
The 6-week FDE engagement model
Week 0 scoping, week 2 prototype gate, week 6 handoff. The rubric your board can hold us to.
Shipped systems, cited on the record
Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt. Named clients, production metrics, per-system stacks.
Why PIAS exists
AI pilots fail at the handoff. We embed a named engineer, ship in weeks, leave the eval and runbook in your repo.