Guide: multi agent orchestration frameworks

Every framework page tells you how to ship.

None grades how to leave.

The framework under audit:

Multi agent orchestration framework round-ups grade on speed, token count, or pattern taxonomy. None measure portability, because no vendor can afford to publish the answer. We can. Our engagement ends at week 6 and the engineer walks out, so whatever framework we wire into your repo has to survive us. This guide is the scoring rubric, applied to seven named frameworks, with the anchor code from a shipped production system.

Matthew Diakonov
12 min read
4.9 from named production clients
7 frameworks scored on 5 portability axes
Anchor code from a live 400K+ email system
No platform license, no vendor-attached runtime

Anchored in shipped code, not benchmark tables.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt

Book the portability audit
LangGraph · Pydantic AI · CrewAI · AutoGen · Microsoft Agent Framework · OpenAI Agents SDK · Google ADK · Custom DAG · Bedrock · Vertex · Azure OpenAI · Anthropic · pgvector · ragas · OTEL · LangSmith · Logfire

The metric no vendor can publish

Search "multi agent orchestration frameworks" and the top results all grade on the same two axes: tokens per task and pattern coverage. LangGraph pages cite token efficiency. CrewAI pages cite role ergonomics. Microsoft pages cite event-driven cores. Google ADK pages cite model-agnosticism.

What none of them publish is the real migration cost. When the engineer who wrote the system leaves, or the framework vendor pivots, or your CTO reorganises the AI team, the only metric that matters is portability. And portability is exactly the metric a framework vendor cannot honestly grade their own framework on.

We can. We do not sell the framework. We sell a named senior engineer who ships, commits the rationale next to graph.py, and leaves at week 6. Framework neutrality is a contract line in the MSA, not a slogan. The Portability Index below is the rubric that line enforces.

The Portability Index, five axes

  • Runtime portability: does it run on your infra, or does a SaaS control plane hold the keys?
  • State portability: is the workflow state a typed object you own, or a framework primitive (Crew, GroupChat, RunContext) you cannot export?
  • Observability portability: can you swap tracing for plain OTEL plus your own database, or are you locked to a vendor dashboard?
  • API leakage: does the framework live in one orchestration file, or do decorators and base classes leak into every agent, tool, and task?
  • License and identity coupling: is the production path free and self-hosted, or does a paid tier own retries, memory, or the default identity provider?
Five axes, 25 points total.

Every axis scores 1 to 5. A framework that scores under 3 on any single axis is a framework you should only sign for if you are planning to stay on it for the next two years. Under 15 total, we decline the pick.
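The decision rule above is mechanical enough to write down. A minimal sketch, assuming the five axes described in this guide; the class and method names are ours for illustration, not a published spec:

```python
# Portability Index decision rule (illustrative sketch).
# Five axes, each scored 1-5; total under 15 is a decline,
# and any single axis under 3 demands a multi-year commitment.
from dataclasses import dataclass


@dataclass
class PortabilityScore:
    runtime: int        # runs on your infra vs SaaS control plane
    state: int          # typed object you own vs framework primitive
    observability: int  # plain OTEL vs vendor dashboard
    leakage: int        # one orchestration file vs every file
    license: int        # free self-hosted vs paid-tier production path

    def axes(self) -> list[int]:
        return [self.runtime, self.state, self.observability,
                self.leakage, self.license]

    def total(self) -> int:
        return sum(self.axes())

    def verdict(self) -> str:
        if self.total() < 15:
            return "decline"
        if min(self.axes()) < 3:
            return "only if staying 2+ years"
        return "acceptable"
```

With this rule, a CrewAI-shaped 13/25 comes back `"decline"`, while a 22/25 with no axis below 3 comes back `"acceptable"`.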

PIAS Portability Index, applied across five shipped production systems

Seven frameworks, graded

Each card names the framework, the score, and the one-sentence portability verdict. Scores are drawn from the five-axis rubric and from our actual experience shipping against the framework in production. No vendor relationships, no reseller margin.

Custom DAG (plain Python)

Score 25 / 25. The only "framework" where the port is the rewrite itself: plain Python moves anywhere. We use it on OpenArt because the state is a tree of scenes with per-scene gates and per-scene retry budgets. No imports to replace later.

Pydantic AI

Score 22 / 25. Typed tool calling over Pydantic models you already own. Light leakage through agent glue. Shipped at Monetizy.ai in one week because the orchestration was a scored pipeline, not a conversation.

LangGraph

Score 22 / 25. Contained to graph.py. State class stays framework-free, audit hook stays framework-free, model adapters stay framework-free. LangSmith is a nudge, not a requirement. Shipped at Upstate Remedial on 400K+ emails.

OpenAI Agents SDK

Score 15 / 25. Free runtime and clean handoffs, but model coupling to OpenAI and traces in an OpenAI dashboard mean you pay twice to leave. Fine for a scored single-vendor pilot; a migration cost when the vendor changes pricing.

Microsoft Agent Framework

Score 15 / 25. The AutoGen + Semantic Kernel merger. Strong event-driven core, but the RC happy path leans on Azure Monitor and Azure AD. Clean choice inside an Azure shop; a surprise commitment if you are not one yet.

Google ADK

Score 15 / 25. Model-agnostic on the tin, GCP-first in practice. Cloud Trace is the default observability story, Vertex is the default model story. Good fit for Gemini-heavy estates; a drag on a Bedrock-first shop.

CrewAI

Score 13 / 25. Role-based ergonomics are genuinely nice. But Crew, Agent, Task, Process.hierarchical, and the @tool decorator leak into every file, and the managed cloud tier is the happy path. Replacement means rewriting the whole fleet, not one file.

Portable by default vs lock-in by default

Two columns. Left: the shape of a framework pick that preserves optionality. Right: the shape that quietly takes it. Every engagement starts by reading your current code against this table before we recommend anything.

Feature | Lock-in by default | Portable by default
Where the framework lives in your code | Decorators and base classes sprinkled across every agent file. | One orchestration file. Everything else is your code.
Where state is defined | Framework primitive (Crew, GroupChat, RunContext) you cannot serialize without the framework. | Typed dataclass or Pydantic model with zero framework imports.
Where traces go by default | Vendor dashboard behind a paid tier, API keys in someone else's account. | OTEL collector or plain rows in your Postgres.
What a framework swap costs | Rewrite every file that imports the framework, plus re-home observability. | Rewrite the wiring file. State, audit, adapters unchanged.
What happens when your lead engineer leaves | Next engineer learns a framework-specific mental model or rewrites. | Next engineer reads one file plus a rationale memo.
What happens when the framework vendor pivots | You are on their roadmap, their pricing, and their deprecation cycle. | You do nothing. Your code is yours.

Anchor: the state file that does not import the framework

On Upstate Remedial’s shipped LangGraph system, the state lives in state.py. Nothing in this file imports LangGraph. This is the file that makes the framework portable: replace graph.py with any other orchestrator and the state schema, the types, and the invariants all survive the port.

state.py
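The shipped file is not reproduced here. Below is a minimal sketch of the pattern it follows: a plain dataclass, zero framework imports, with the compliance gate encoded as an invariant. Field names and the 0.82 threshold placement are illustrative, not the client's schema:

```python
# state.py (sketch) - nothing in this file imports langgraph or langchain.
# Replace the orchestrator and this schema survives the port untouched.
from dataclasses import dataclass, field


@dataclass
class CaseState:
    account_number: str
    draft: str = ""
    rubric_score: float = 0.0
    attempts: int = 0
    history: list = field(default_factory=list)  # per-transition notes

    def accepted(self) -> bool:
        # Compliance gate from the reviewer's rubric: reject below 0.82.
        return self.rubric_score >= 0.82
```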

Anchor: the audit hook that does not import the framework either

The per-transition audit row is the compliance requirement that made us pick LangGraph in the first place. The function that writes it takes a CaseState and a node name, and that is all. In the LangGraph build it runs as a wrapper around each node registered in graph.py; in a custom DAG we call it at the end of each step; in Pydantic AI we hook it on the agent’s post-call callback. The caller changes. The file below does not.

audit.py
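Again, the shipped file stays with the client. A minimal sketch of the shape: one function, one row per transition, a DB-API connection passed in so nothing here knows about the orchestrator. The table name and columns are illustrative assumptions, not the production schema:

```python
# audit.py (sketch) - one Postgres row per node transition.
# Imports nothing from any orchestration framework; `conn` is any DB-API
# connection (psycopg in production), `state` is the plain CaseState.
import json
from datetime import datetime, timezone

# Illustrative table and columns, not the shipped audit schema.
INSERT_SQL = (
    "INSERT INTO agent_audit "
    "(recorded_at, node, account_number, rubric_score, payload) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def record_turn(conn, state, node_name: str) -> None:
    """Append one audit row after a node runs. Caller-agnostic by design."""
    row = (
        datetime.now(timezone.utc),
        node_name,
        state.account_number,
        state.rubric_score,
        json.dumps({"attempts": state.attempts}),
    )
    with conn.cursor() as cur:
        cur.execute(INSERT_SQL, row)
    conn.commit()
```

Because the connection and state are both plain arguments, the same function is trivially testable with fakes and survives any framework swap.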

One file vs every file

The same orchestration, written two ways. Left, a CrewAI sketch of the same debt-notice flow: Crew, Agent, Task, and Process primitives and the @tool decorator all leak into each file. Right, the LangGraph wiring we shipped: the framework lives inside graph.py and nothing else. Every other file we write is yours after we leave.

Same flow, two framework surfaces

# crew.py - CrewAI (sketch). Framework primitives leak into every file.
# Replacement means rewriting each agent, each tool, and the Crew definition.
from crewai import Agent, Task, Crew, Process
from crewai.tools import tool

@tool("compliance_check")
def compliance_check(account_number: str) -> dict:
    return {"status": "pass"}

drafter = Agent(
    role="Email Drafter",
    goal="Write a compliant debt-notice email",
    backstory="You draft under regulatory oversight.",
    tools=[compliance_check],
    allow_delegation=False,
)
reviewer = Agent(
    role="Rubric Reviewer",
    goal="Score drafts on a compliance rubric",
    backstory="You reject any draft scoring below 0.82.",
    allow_delegation=True,
)

draft_task = Task(
    description="Draft the notice for {account_number}",
    agent=drafter,
    expected_output="A policy-compliant email body.",
)
review_task = Task(
    description="Score and return only a passing draft",
    agent=reviewer,
    expected_output="One accepted draft or a rejection note.",
)

crew = Crew(
    agents=[drafter, reviewer],
    tasks=[draft_task, review_task],
    process=Process.hierarchical,
    manager_llm="gpt-4o",
    memory=True,
)
24% of files import from the framework.
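The right-hand column of the comparison is the shipped LangGraph wiring, which is not reproduced here. To show that the flow itself needs no framework primitives, here is a plain-Python sketch of the same draft-review loop; the function names and argument shapes are illustrative stand-ins, not the production code:

```python
# pipeline.py (sketch) - the same debt-notice flow, zero framework imports.
# draft_fn and score_fn stand in for the model-backed steps; in the shipped
# system this loop is expressed as graph.py wiring instead, and only that
# file would change in a port.
from typing import Callable

PASS_THRESHOLD = 0.82  # rubric gate from the reviewer's brief


def run_notice_flow(
    account_number: str,
    draft_fn: Callable[[str], str],
    score_fn: Callable[[str], float],
    max_attempts: int = 3,
) -> str:
    """Draft, score, and retry until the rubric passes or attempts run out."""
    for _ in range(max_attempts):
        draft = draft_fn(account_number)
        if score_fn(draft) >= PASS_THRESHOLD:
            return draft
    raise RuntimeError(f"no passing draft for {account_number}")
```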

What we grep for on the scoping call

The audit is usually ten minutes of ripgrep. Count how many files import the framework, and count how many files import nothing from it. The ratio is the Portability Index in one shell session. Below is the real output from the Upstate Remedial audit.

portability-audit.sh
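The shipped script and its Upstate Remedial output are not reproduced here. A minimal portable sketch of the same audit follows; the real one used ripgrep (`rg -l`), while plain `grep` is shown for portability, and the demo repo layout is illustrative:

```shell
# portability-audit.sh (sketch) - count framework-coupled vs total files.
audit() {
  dir="$1"; framework="$2"
  total=$(find "$dir" -name '*.py' | wc -l | tr -d ' ')
  coupled=$(grep -rl --include='*.py' "$framework" "$dir" | wc -l | tr -d ' ')
  echo "$coupled/$total files import $framework"
}

# Demo against a throwaway three-file layout mirroring the pattern in
# this guide: only graph.py names the framework.
demo=$(mktemp -d)
printf 'from langgraph.graph import StateGraph\n' > "$demo/graph.py"
printf 'from dataclasses import dataclass\n'      > "$demo/state.py"
printf 'import psycopg\n'                         > "$demo/audit.py"
audit "$demo" langgraph
```

On the demo layout this prints a 1-out-of-3 coupling ratio; on a CrewAI-shaped repo the numerator climbs toward the denominator.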

What walks out of your repo at week 6

The portable leave-behind is the point. The engineer walks out; the framework-agnostic files stay. Sources flow into the orchestrator file we build with you; outputs are the four artifacts you can hand to the next engineer without handholding.

Portable leave-behinds: what survives a framework swap

Sources: problem shape · rubric · compliance requirement · portability audit
Leave-behinds: graph.py or pipeline.py · state.py + audit.py · eval harness + CI · RATIONALE.md + runbook

The four-step audit we run before picking any framework

Portability audit, in order

  1. Name the leave-behind

     Before picking a framework, write the list of files that must still work after the author leaves: graph.py, state.py, eval harness, audit schema, runbook.

  2. Score the five axes

     Run each candidate framework through runtime, state, observability, leakage, and license. If any axis scores 1 or 2 and cannot be mitigated, drop it.

  3. Write the port plan

     On paper, in commit-ready form: "If we replace LangGraph with X, we rewrite graph.py in ~40 lines. State, adapters, and audit hooks survive." If you cannot write this, the framework is wrong.

  4. Check in the rationale

     Commit a RATIONALE.md next to graph.py. One paragraph: what shape the problem has, why this framework, what the migration trigger would be. Future engineers read this and can disagree with you.

Receipts

Scores are a rubric; shipped production is the check. The numbers below are tied to named systems, not benchmark suites.

7 frameworks graded on the Portability Index
5 production multi agent systems we shipped and walked away from
25 score ceiling: custom DAG with no imports at all
400K+ emails running on the LangGraph system at Upstate Remedial

The 400K+ email count runs on the LangGraph system at Upstate Remedial; the portability audit is a repeatable artifact we run in every week 0, not a one-off.

Anchor fact

On the system we shipped, one file imports LangGraph. One.

At Upstate Remedial, graph.py imports StateGraph from LangGraph. No other file does. state.py defines a plain typed dataclass. audit.py writes Postgres rows with psycopg and no framework handle. Model choices live behind pick_primary() and pick_fallback() returning Bedrock and OpenAI clients. A port to Microsoft Agent Framework or a custom DAG rewrites roughly 40 lines of graph.py. Everything else is untouched. That is the Portability Index, made concrete.

Run the Portability Index on your current orchestrator

Sixty-minute scoping call with the senior engineer who would own the audit. You leave with a written report: the framework score, the port cost, and whether the pick is wrong for the problem shape or merely expensive to leave. If it is right, we say so. No framework commitment on the call.

Book the scoping call

Multi agent orchestration frameworks, answered

What is a Portability Index for multi agent orchestration frameworks?

A five-axis score we run on every candidate framework before we commit to it: runtime portability (does it run on your infra), state portability (is state a typed object you own), observability portability (plain OTEL versus vendor dashboard), API leakage (one file versus every file), and license or identity coupling (free and self-hosted versus paid tier gating production features). Each axis is scored 1 to 5, totaling out of 25. The axis with the lowest score is the one that will hurt during a migration or a re-staffing event, so we treat that axis as the real blast radius of the pick.

Why does portability matter more than token efficiency or speed?

Token efficiency is a one-quarter concern. Portability is a multi-year concern. The average enterprise ships, re-staffs, and restructures its AI team inside 18 months; a framework choice that is 2x faster on benchmarks but leaks into every file becomes a 6-month migration cost the next time the team changes. We have watched a CrewAI prototype become a LangGraph rewrite because the engineer who wrote it left and the next one could not get the memory and delegation behavior to match the ask. The framework that moves fastest when you leave is the one that wins.

Why does CrewAI score lower than LangGraph on your index?

CrewAI's ergonomics are real; the Crew, Agent, Task, and Process primitives make the first sprint feel fast. The cost shows up at the migration boundary. Those primitives are imported everywhere, so a replacement framework cannot just re-wire, it has to re-shape. On LangGraph we contain the framework to graph.py, and state.py plus audit.py plus the model adapters import nothing from LangChain. That gap is the portability gap. CrewAI is a great fit when the team is committing to CrewAI Cloud and expects to stay for the next two years; it is a worse fit when the team explicitly wants an exit lane.

What is the anchor code pattern that keeps LangGraph portable?

Four files of discipline. state.py defines the typed state class with zero imports from langgraph or langchain. audit.py exposes a record_turn(state, node_name) function that writes one Postgres row per transition and imports nothing from the framework. Model adapters live in routes.py, with pick_primary() and pick_fallback() returning raw model handles (Bedrock, OpenAI, Anthropic). graph.py is the only file that imports StateGraph, conditional edges, and hooks. If you want to port off LangGraph, you rewrite graph.py in roughly 40 lines. No other file moves.

What does Microsoft Agent Framework or Google ADK cost on the index?

Both score fine on runtime (both are open source) and on state shape, but they lose points on observability and identity coupling. Microsoft Agent Framework's happy path assumes Azure Monitor and Azure AD; Google ADK's happy path assumes Cloud Trace and Vertex. Neither is wrong inside the matching cloud, and both are the right pick inside one. They are a worse pick when the client has not committed to a single hyperscaler and wants to keep provider optionality, because the observability story is the first thing you would rewrite on a multi-cloud posture.

Why do you sometimes recommend a custom DAG instead of a framework at all?

When the problem shape is a tree of sub-states rather than a linear conversation, a framework pays rent you do not owe. OpenArt's multi-scene video pipeline is the canonical shape: each scene is its own sub-DAG with its own quality gate and its own retry budget. A LangGraph port would flatten that into turns and conflate scene-level retries with conversation-level retries. Thirty lines of plain Python model the problem exactly, score 25 out of 25 on portability, and require zero vendor page reading for the next engineer.
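The shape described above fits in a short sketch. This is illustrative plain Python under the same assumptions as the answer, not OpenArt's code; the scene, render, and gate names are ours:

```python
# pipeline.py (sketch) - a tree of scenes, each with its own quality gate
# and its own retry budget. No framework imports; a port edits this file
# only. Scene-level retries never leak into a conversation-level loop.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    name: str
    retry_budget: int = 2  # per-scene, never shared across scenes


def run_scene(scene: Scene,
              render: Callable[[Scene], str],
              gate: Callable[[str], bool]) -> str:
    for _ in range(scene.retry_budget + 1):
        clip = render(scene)
        if gate(clip):
            return clip
    raise RuntimeError(f"scene {scene.name!r} exhausted its retry budget")


def run_pipeline(scenes: List[Scene],
                 render: Callable[[Scene], str],
                 gate: Callable[[str], bool]) -> List[str]:
    # A late failure never rewinds scenes that already passed their gates.
    return [run_scene(s, render, gate) for s in scenes]
```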

Do you get paid less if you recommend a custom DAG or Pydantic AI over a managed framework?

No. We bill a weekly rate for an embedded senior engineer, not a licensing margin on a framework pick. We have no reseller relationship with any framework vendor, no platform license baked into the MSA, and no vendor-attached runtime. That is why our incentive on the framework pick is clean: the recommendation is whatever minimizes the blast radius when we walk out at week 6, not whatever maximizes our renewal fee.

What leaves your repo when the engagement ends?

Four artifacts checked in on main: the one orchestration file (graph.py, pipeline.py, or equivalent) with the framework rationale as a docstring, an eval harness running against your named rubric in GitHub Actions, a failure playbook naming the fallback model and the audit-log schema, and a runbook keyed to your on-call rotation. Plus a RATIONALE.md next to the orchestration file that names the trigger under which a port would make sense. The engineer then hands off in a 90-minute session and leaves. Paid two-hour consults remain available at a capped rate for twelve months.

Can you run the Portability Index on our existing multi agent orchestration framework and tell us if we should switch?

Yes. That is usually week 0 of a refactor engagement. We read your current wiring, grep for framework imports across the codebase, and return a written report: the index score, the port cost if we were to move, and whether the pick is wrong for the problem shape or merely expensive to leave. If the framework is the right pick and the problem is hardening, we say that and scope the production work. If the framework is the wrong shape, we say that too, on the scoping call, before you spend another sprint.