The framework debate is the wrong debate. Pick the shape, not the vendor.
Every vendor page for multi agent orchestration tells you to pick their framework. We are framework neutral by contract, and we have shipped five named production systems using three different orchestration patterns. This guide reverse-engineers those five decisions into a selection matrix you can run yourself. Anchor example included: why OpenArt’s multi-scene video pipeline is a custom DAG and not a LangGraph.
LangGraph, Pydantic AI, custom DAGs. Whatever your graph wants.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
Why vendor pages cannot answer this question
Search multi agent orchestration and the top results will each recommend their own framework. LangChain pages push LangGraph. CrewAI pages push CrewAI. Microsoft pushes AutoGen. n8n pushes its visual builder. None of them can publish a cross-framework selection matrix, because admitting that another framework is the right fit for some problem shape is against their business model.
We can. We do not sell a framework. We sell a named senior engineer who ships the system, checks the framework rationale into your repo next to graph.py, and leaves. That lets us say the quiet part out loud: the framework is downstream of the problem shape.
“Shipped production multi agent systems since 2024, named on /wins. Three used LangGraph or Pydantic AI. Two used custom DAGs. The decision was always a function of the problem shape, never the framework's roadmap.”
PIAS case studies: Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt
The decision matrix, as a single question: what shape is your state?
Two columns. Left: the multi agent orchestration pattern you want when the state is a tree of sub-states (custom-DAG-shaped). Right: the pattern you want when the workflow is one linear typed state (LangGraph-shaped). Pydantic AI fits when neither of those is the interesting question and you just need a scored pipeline with typed tools.
| Feature | DAG-shaped problem | LangGraph-shaped problem |
|---|---|---|
| Shape of the workflow | Tree of sub-workflows with per-branch gates | Linear, one auditable conversation per case |
| State model you want | Per-scene or per-tenant state, merged later | Single typed state, mutated across nodes |
| Fallback path | Retry-and-repair loop scoped to one scene | Conditional edge to a second model on SLO breach |
| Observability need | Offline eval on scene quality, not per-turn | Every transition logged to Postgres for compliance |
| Failure domain | A bad scene is retried; a failed render is rejected | A wrong email is a regulatory incident |
| Who reads the graph | Pipeline engineer during business hours | On-call engineer at 2am |
| Recommended pattern | Custom DAG with per-node quality gates | LangGraph stateful graph + audit hooks |
The five systems, the five choices
Each card below is a named production system with the framework choice that shipped, and the one-sentence reason we picked it. The client names are on /wins. The stacks are on the public case studies page. The framework rationale is here.
Monetizy.ai: Pydantic AI
Auto-orchestrated email campaign, ~8K per day. Short turn count, deliverability scoring, single-model path with retrieval. We picked Pydantic AI because the orchestration was a scored pipeline, not a conversation with branches. Shipped in 1 week.
Upstate Remedial: LangGraph
Legal-compliance email flow for auto-debt notices. Bedrock primary, OpenAI as conditional-edge fallback, deterministic compliance node before every drafter turn, per-transition row in Postgres. LangGraph's stateful graph was the right model because the workflow is a single auditable conversation. 400K+ emails sent.
OpenLaw: Custom + citation subagent
AI-native law editor ("Cursor for Lawyers"). Domain retrieval, citation verification as a separate agent, red-team eval rubric scored by licensed attorneys. We kept orchestration custom because the citation-verification pass needed its own model and its own failure policy, independent of the drafter.
PriceFox: Automated eval CI, nightly
Multi-tenant retrieval agent, automated ML engineering pipeline. Retrieval tuning, prompt variants, offline eval, canary rollouts run nightly. Human sign-off only on regression-threshold breaches. The orchestration sits in CI, not in a live graph: the problem is the eval loop, not the turn routing.
OpenArt: Custom scene-graph DAG
Multi-scene commercial video auto-generation. Scene-graph generation, per-scene quality gate, prompt-repair retry per scene. LangGraph would have forced the scene tree into turns and lost the per-scene boundary. A custom DAG gave us true per-scene retry semantics without flattening the graph.
Anchor: why OpenArt is a custom DAG and not a LangGraph
OpenArt’s pipeline generates multi-scene commercial video. Every scene is its own sub-problem: draft, quality-gate, maybe prompt-repair, gate again, emit. The gate score, the retry budget, and the repair prompt are per-scene, not per-turn. The scenes can run in parallel and only assemble in post-order.
LangGraph models state as one typed object mutating across nodes. To put the scene tree in that shape you have to flatten it into a sequence of turns, which serializes what should be parallel and conflates per-scene retries with conversation-level retries. The abstraction starts costing you real retries and real latency. That is when you build a small custom DAG instead, in about thirty lines of Python, and model the problem exactly.
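That thirty-line DAG is easy to gesture at; here is a minimal sketch of the shape. All names, the scoring logic, and the retry budget are illustrative stand-ins, not OpenArt's actual code: each scene carries its own budget and gate score, and the runner assembles only the scenes that pass.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scene:
    prompt: str
    retries_left: int = 2              # per-scene retry budget, not conversation-level
    draft: Optional[str] = None
    gate_score: float = 0.0

def draft_scene(scene: Scene) -> None:
    # stand-in for the real model call
    scene.draft = f"render of: {scene.prompt}"

def quality_gate(scene: Scene) -> float:
    # stand-in scorer; the real gate would be a learned or rubric-based model
    return 0.9 if "repaired" in scene.prompt else 0.4

def repair_prompt(scene: Scene) -> None:
    scene.prompt = f"repaired: {scene.prompt}"

def run_scene(scene: Scene, threshold: float = 0.8) -> bool:
    """Draft, gate, retry with prompt repair -- all scoped to one scene."""
    while True:
        draft_scene(scene)
        scene.gate_score = quality_gate(scene)
        if scene.gate_score >= threshold:
            return True
        if scene.retries_left == 0:
            return False               # budget exhausted: reject the render
        scene.retries_left -= 1
        repair_prompt(scene)

def run_pipeline(scenes: list) -> list:
    # scenes are independent; in production this fan-out runs in parallel
    return [s for s in scenes if run_scene(s)]
```

The point of the sketch is the boundary: nothing above the scene level ever sees a retry, so a per-scene repair can never be conflated with a conversation-level restart.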
Counter-example: why Upstate Remedial is LangGraph
Same company, different problem shape, different framework. Upstate Remedial’s auto-debt email flow is one linear conversation per case: intake, compliance check, drafter, reviewer, end. The state is one typed object. The dominant requirement is that every transition writes an audit row into Postgres that a regulator can read, and that one specific edge (Bedrock to OpenAI) fires when the latency budget breaks. LangGraph makes those two things one-liners.
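A dependency-free sketch of that wiring, with sqlite3 standing in for Postgres. The node names come from the case study; everything else is illustrative. In LangGraph proper the fallback would be a conditional edge and the audit write a per-transition hook:

```python
import sqlite3
import time

# sqlite3 stands in for the Postgres audit sink; schema is illustrative
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit (case_id TEXT, node TEXT, ts REAL)")

def audit(case_id: str, node: str) -> None:
    # one row per transition: this is the property a regulator reads
    db.execute("INSERT INTO audit VALUES (?, ?, ?)", (case_id, node, time.time()))

def drafter(state: dict) -> dict:
    # conditional edge: route to the secondary model on an SLO breach
    breached = state.get("latency_ms", 0) > 2000
    model = "openai-fallback" if breached else "bedrock-primary"
    state["draft"] = f"[{model}] notice for {state['case_id']}"
    return state

def run_case(state: dict) -> dict:
    # one linear conversation per case: intake -> compliance -> drafter -> reviewer
    for node in ("intake", "compliance_check", "drafter", "reviewer"):
        audit(state["case_id"], node)
        if node == "drafter":
            state = drafter(state)
    return state
```

Because the state is one dict mutating across a fixed sequence of nodes, both requirements (the audit row and the named fallback) attach to single, obvious seams.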
The two wiring shapes, drawn
Upstate is the diagram on the left side of every “production agent” deck: a router with a named fallback edge and an audit sink. OpenArt is the diagram that does not fit the deck: sources fan into a hub, but the hub fans out to a set of per-scene agents that each run their own retry loop before the reviewer assembles.
Upstate Remedial (LangGraph): linear router with fallback and audit
OpenArt (custom DAG): scene fan-out with per-scene gate + repair
The five-step decision, in order
This is the literal questionnaire our engineers run on a scoping call. Start at the failure domain and walk down. If the first answer does not settle the framework, the next one will.
1. Name the failure domain
A wrong outbound is a compliance incident. A bad scene is a retried scene. A stale retrieval score is a ranking regression. The shape of the worst case picks the shape of the graph.
2. Draw the state
If the state is one typed object that mutates across nodes, LangGraph is the cheapest abstraction. If it is a tree of sub-states that merge at the end, you want per-branch isolation and a custom DAG pays for itself.
3. Locate the fallback
Model fallback in one place (a conditional edge) is LangGraph territory. Fallback scoped to a sub-branch (per-scene retry with prompt repair) is DAG territory. Fallback implemented as a scored re-rank is Pydantic AI territory.
4. Decide who reads the runbook at 2am
If it is an on-call with pager access and a compliance question, the graph must emit per-transition audit rows your data team can query. If it is a pipeline engineer during business hours, offline eval metrics dominate.
5. Pick the smallest framework that honors the shape
Pydantic AI if the problem is a scored pipeline. LangGraph if the problem is a stateful conversation. A custom DAG if the state is a tree. No ceremony, no license, no platform attached.
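The five steps compress into a few lines. A toy encoding of the questionnaire (the labels and thresholds are ours, not a product; a real scoping call weighs more context than three strings):

```python
def pick_framework(failure_domain: str, state_shape: str, fallback_scope: str) -> str:
    """Toy encoding of the five-step questionnaire above."""
    if state_shape == "tree":
        return "custom DAG"        # per-branch isolation pays for itself
    if failure_domain == "compliance" or fallback_scope == "conditional_edge":
        return "LangGraph"         # typed state, audit hooks, named fallback edge
    return "Pydantic AI"           # scored pipeline with typed tools, no graph ceremony
```

Note the ordering: the state shape is checked first, because a tree settles the question regardless of the other answers.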
Vendor loyalty vs. shape-first
Pick one framework once, force every problem into its shape for two years, rebuild at the next architecture review.
- Scene graph flattened into LangGraph turns
- Scored pipeline rebuilt as a state machine
- Audit hooks bolted onto a framework that did not want them
- One vendor's conference roadmap on your critical path
Anchor fact
Same consultancy, same year, opposite framework choices. Both shipped.
On OpenArt we picked a custom DAG because a multi-scene pipeline is a tree of sub-states with per-scene retries and independent quality gates. On Upstate Remedial we picked LangGraph because a compliance email flow is one typed state with a named Bedrock-to-OpenAI fallback edge and a per-transition audit row. On Monetizy.ai we picked Pydantic AI because the problem was a scored pipeline, not a conversation. The framework is downstream of the shape. If your current vendor picks the same framework for all three, that is the tell.
Receipts
The numbers below are from named production systems. No invented benchmarks, no sector averages. The counts are checkable on /wins.
The 400K+ email count is on the LangGraph system at Upstate Remedial; the ~8K/day throughput is on the Pydantic AI system at Monetizy.ai. Both numbers come from the public case studies.
Want the decision run on your problem, not ours?
Sixty-minute scoping call with the senior engineer who would own the build. You leave with a written one-pager: the problem shape, the framework we would pick, the rationale, and a fixed weekly rate. The framework picked on the call is a recommendation, not a billed commitment.
Book the call →
Multi agent orchestration, answered
When should we reach for LangGraph on a multi agent orchestration project?
When the workflow is a single linear conversation with a typed state, and the dominant requirement is per-transition observability (think compliance, audit, rubric-gated merge). LangGraph's stateful graph plus its conditional-edge model makes fallback routing a first-class citizen, and the transition boundary is the natural place to write an audit row. We use it at Upstate Remedial because the failure domain is a regulatory incident, so every node needs a queryable trace in Postgres.
When is Pydantic AI a better choice than LangGraph?
When the orchestration is a scored pipeline rather than a conversation: retrieve, score, send. Fewer turns, simpler state, no branching on per-message rubrics. Monetizy.ai is the canonical shape: generate, personalize, deliverability-score, send. Pydantic AI gave us typed tool calling and eval without the ceremony of a state graph. We shipped the first production run in one week.
Why would you ever build a custom DAG instead of using LangGraph or CrewAI?
When the state is a tree, not a turn. On OpenArt's multi-scene video pipeline, every scene is its own sub-DAG with its own quality gate and its own prompt-repair retry budget. Modeling that inside a LangGraph would require flattening the scene tree into turns and losing the per-scene boundary; the abstraction mismatch would start costing us real retries and real latency. A custom DAG keeps the per-scene retry semantics intact and lets us run the scene gates in parallel.
Do you have a framework of choice?
No, and that is the whole point. We have shipped LangGraph, Pydantic AI, and custom DAGs. Whichever framework we pick for your system, the selection lives next to a written rationale, so a future engineer who replaces us can re-evaluate and swap without calling us. Framework neutrality is in the MSA; no platform license is signed as part of the engagement.
What is the anchor decision on OpenArt's custom DAG?
We rejected LangGraph because its graph treats state as one object that mutates across nodes. Our scenes are independent sub-states that each carry a retry budget and a gate score, then get assembled in post-order. Flattening that into LangGraph turns would serialize scenes that should run in parallel and would conflate a scene-level retry with a conversation-level retry. The custom DAG is thirty lines of Python, and it models the problem exactly.
Can you take our existing LangGraph or CrewAI prototype and make it production-ready?
Most engagements start that way. We do a week 0 read of your graph, identify nodes with no fallback, no eval, or no audit signal, and commit to production-hardening them against a named rubric. We do not rewrite working code. If the framework you picked was the wrong fit for the problem shape, we will say so on the scoping call, with the reasoning, before you spend another sprint on it.
How do you handle model-vendor neutrality inside the orchestrator?
The model boundary lives behind a small adapter in every system we ship. On Upstate Remedial it is pick_primary() and pick_fallback() returning Bedrock and OpenAI handles. On OpenArt it is a model router keyed to scene type. You keep the keys, you keep the bill, and we never sign a vendor agreement that makes it hard to swap a provider.
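A sketch of that adapter boundary. The `pick_primary()`/`pick_fallback()` names are from the case study; the handle type, model IDs, and routing logic here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ModelHandle:
    provider: str
    model_id: str

def pick_primary() -> ModelHandle:
    # provider names from the case study; model IDs are placeholders
    return ModelHandle("bedrock", "primary-model-id")

def pick_fallback() -> ModelHandle:
    return ModelHandle("openai", "fallback-model-id")

def complete(prompt: str, slo_breached: bool = False) -> str:
    handle = pick_fallback() if slo_breached else pick_primary()
    # the real call goes through the provider's SDK; swapping a vendor
    # means editing the two pick_* functions, nothing else in the graph
    return f"{handle.provider}:{handle.model_id} -> {prompt[:20]}"
```

The design choice is that the graph never names a provider directly, so neutrality is enforced by the call path rather than by policy.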
What leave-behinds arrive in our repo at the end of a multi agent orchestration engagement?
Four artifacts checked in on main: the orchestration definition itself (graph.py, pipeline.py, or equivalent), an eval harness with ragas plus a case-specific rubric running in your GitHub Actions, a failure playbook that names the fallback model and the audit-log schema, and a runbook keyed to your on-call rotation. A 90-minute handoff session with the on-call team, then we leave. Same engineer stays available for paid 2-hour consults at a capped rate for 12 months.
Adjacent guides
More on shipping production AI
The 6-week FDE engagement model
Week 0 scoping, week 2 prototype gate, week 6 handoff. The rubric your board can hold us to.
Shipped systems, cited on the record
Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt. Named clients, production metrics, per-system stacks.
Why PIAS exists
AI pilots fail at the handoff. We embed a named engineer, ship in weeks, leave the eval and runbook in your repo.