
Spawning subagents is the easy half. The gate is the hard half.

Every Claude Code guide teaches you how to spawn subagents. None of them teach you how to keep a parallel agent workflow honest in a production repo after the senior engineer who set it up has left. We check a four-file pattern into every client repo that does the gating part: named subagents as Markdown, a CI rubric workflow, the scored rubric, and a runbook. No subagent PR merges without a rubric score.

Matthew Diakonov · 11 min read · 4.9 from named production engagements
- Four files on main, no platform runtime, no vendor registry
- Rubric gate as a required GitHub status check on labelled PRs
- Orchestrator respawns targeted subagents on rubric failure, up to 3 cycles

Four files on main. One required status check. Zero vendor runtime.

.claude/agents/ + claude-eval.yml + rubric.yaml + runbook.md

.claude/agents/reviewer.md · .claude/agents/test-writer.md · .claude/agents/refactor.md · .claude/agents/perf-auditor.md · claude-eval.yml · rubric.yaml · runbook.md · Task tool · CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS · ragas · gh pr checks · CODEOWNERS · GitHub Actions · MCP · Anthropic API · Bedrock · Vertex

The SERP gap: everybody teaches spawn, nobody teaches gate

The top results for this query read almost identically. Claude Code’s own docs describe how Agent Teams and the Task tool work. Community guides walk through parallel versus sequential subagent patterns. Third-party orchestrators wrap the same primitives with a prettier CLI. Each one ends at the point the subagents return and their output lands as a diff in front of a human.

That is the gap. In a solo hobby project the human reviewer is fine. In a shared repo with CODEOWNERS, branch protection, and compliance review, a reviewer asked to scan four subagents’ worth of diff on the same attention budget they used for a single engineer’s PR is not a gate, they are a rubber stamp. The right question is not whether Claude Code can spawn subagents. The right question is what your repo looks like six months later, after five different engineers have kicked off orchestration runs.

Our answer is a CI rubric. Every subagent PR runs through a reviewer subagent that scores it against a checked-in rubric, and the GitHub branch protection rule names that score as a required status check. The rubric lives at evals/rubric.yaml, so the git history records every change to it and a compliance reviewer can diff the rubric over time.

4 files

The entire Claude Code multi agent orchestration pattern we leave behind fits in four files on main. No platform license, no vendor-hosted runtime, no fourth-party orchestrator in the loop. The senior engineer who wrote the rubric is the same engineer who wrote the subagents and the workflow.

FDE-10X engagement model, leave-behind artefact inventory

The four files we leave behind

This is the inventory of what ends up on main at week 6. Each file has one load-bearing job and none of them depend on a SaaS we charge a license for. If we disappear, the pattern still runs, because all four files are sitting in the repo next to the rest of the code the client already owns.

1. .claude/agents/<role>.md

One Markdown file per subagent, checked into the client's repo on main. YAML frontmatter declares name, description, allowed tools, and model. The body is the system prompt. A senior engineer writing these files is the same engineer writing the graph. No SaaS console, no vendor registry, no deploy step — the subagents are source code.

2. .github/workflows/claude-eval.yml

The gate. Triggered on every PR carrying the label `claude-subagent`. Runs the reviewer subagent against the diff, runs the test-writer subagent to regenerate tests, applies the rubric, and posts the score as a PR check. It is a required status check, so the merge button is disabled below threshold.

3. evals/rubric.yaml

The scored rubric. Case-specific; not a vendor benchmark. Sections for correctness, test coverage, style conformance, security, and domain rules. Each section has a scored weight and a pass threshold. The rubric is the product commitment the engagement makes at week 2.

4. runbook.md

How a human operator reruns the orchestrator at 2am. Names the orchestrator command, the four subagents, the rubric threshold, the override procedure, and the rollback branch. Every production multi-agent system we ship has this file. The oncall engineer should be able to resolve a rubric failure without calling the original author.
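The runbook itself is short. A representative sketch, with the rerun command, the override procedure, and the retention window as illustrative assumptions (the real file names the client's own escalation contacts):

```markdown
# Orchestrator runbook (sketch; commands and retention are illustrative)

## Rerun the orchestrator
claude-code --agent orchestrator     # against the issue named in the failing PR

## Subagents
reviewer, test-writer, refactor, perf-auditor (.claude/agents/<role>.md)

## Rubric
Threshold 0.82, defined in evals/rubric.yaml. Do not lower it at 2am.

## Override
A CODEOWNER may merge below threshold only with a PR comment naming the reason.

## Rollback
git revert the merge commit on main; the orchestrator's branches are kept
under fde/<role>/<issue>-<shortsha> for reconstruction.
```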

Anchor file 1: the reviewer subagent

This is the shape of .claude/agents/reviewer.md from the pattern we ship. Forty lines, two of which are load-bearing: the tool allowlist (reads only, no writes, no network tools) and the explicit prohibition on recommending merges below 0.82. The rest is the system prompt that turns a Claude Code session into a disciplined code reviewer.

.claude/agents/reviewer.md
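The production prompt body is engagement-specific, so the sketch below shows only the shape: the frontmatter fields named in the text (name, description, tools, model) and an illustrative fragment of the prompt. The exact tool list and model string are assumptions.

```markdown
---
name: reviewer
description: Scores a PR diff against evals/rubric.yaml. Never writes code.
tools: Read, Grep, Glob   # reads only: no Edit, no Write, no Bash, no web tools
model: claude-sonnet-4-6
---

You are a code reviewer. Read the diff and evals/rubric.yaml, score each
rubric section from 0.0 to 1.0, and emit a single JSON block:

  {"score": <weighted total>, "sections": {...}, "complaints": [...]}

You never edit files. You never recommend merging a PR whose weighted total
is below 0.82, regardless of mitigating context.
```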

The tool allowlist is checked by Claude Code at session start; if the reviewer subagent tries to call a tool outside it, the call fails. That is how we prevent the reviewer from writing code to fix its own complaints, which is the failure mode that makes self-reviewing agents score everything as a pass.

Anchor file 2: the CI gate

This is the GitHub Actions workflow that runs on every PR labelled claude-subagent. It installs Claude Code, runs the reviewer agent in --print --output-format json mode, gates the merge on the score, and posts a PR comment on failure. No orchestration platform sits in front of it; it is a plain YAML file in .github/workflows/.

.github/workflows/claude-eval.yml
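A condensed sketch of the workflow. The install step, action versions, and reviewer prompt are assumptions; the label trigger, the JSON output mode, and the 0.82 gate follow the description above.

```yaml
name: claude-eval
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  claude-eval:
    # Only orchestrator-authored PRs carry this label; the job is
    # skipped for human PRs.
    if: contains(github.event.pull_request.labels.*.name, 'claude-subagent')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0               # reviewer needs the full diff
      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code
      - name: Run the reviewer subagent
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          claude --print --output-format json \
            "Score this PR's diff against evals/rubric.yaml" > review.json
      - name: Gate the merge
        run: |
          score=$(jq -r .score review.json)
          awk -v s="$score" 'BEGIN { exit !(s >= 0.82) }'
      - name: Upload rubric artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: claude-eval
          path: review.json
```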

The required-status-check wiring happens in branch protection, not in the workflow. In the repo’s settings we name claude-eval as a required check for any PR against main, which is what makes the merge button grey out when the rubric fails.

How a subagent PR flows through the gate

One issue in, one merged PR out. The orchestrator is the hub; the subagents write code in their own branches, the reviewer scores the unified diff, and the merge only lands when the rubric passes. Compared to a plain spawn-and-hope flow, the added cost is one extra Claude Code call per PR and a few minutes of Actions runtime.

Orchestrated PR lifecycle, gated by claude-eval

Issue → Orchestrator → (refactor subagent + test-writer subagent, in parallel) → reviewer subagent → claude-eval CI gate → Human approver

Anchor file 3: the rubric

The rubric is the contract the engagement is priced against. It is the same document a compliance reviewer audits, and it is the same document the reviewer subagent reads at CI time. Five sections, a weighted total, per-section thresholds. Clients usually edit the domain section during week 1; everything else we ship as a starting point on day 2.

evals/rubric.yaml
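A starting-point sketch of the rubric. Section wording and weights are illustrative, but the thresholds match the ranges quoted in this article (0.60 style floor, 0.80 security floor, 0.82 total).

```yaml
total_threshold: 0.82

sections:
  correctness:
    weight: 0.30
    threshold: 0.75
    criteria:
      - Diff implements the acceptance criteria in the linked issue
      - No dead code paths introduced
  test_coverage:
    weight: 0.25
    threshold: 0.70
    criteria:
      - New behaviour has a failing-then-passing test
  style:
    weight: 0.10
    threshold: 0.60
  security:
    weight: 0.20
    threshold: 0.80
    criteria:
      - No new secrets, no widened IAM, no raw SQL interpolation
  domain:
    weight: 0.15
    threshold: 0.70
    criteria:
      - Client-specific invariants; edited with the client during week 1
```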

Probing a live run

This is the shape of a passing run on a real client repo. The PR is labelled, the claude-eval check reports the weighted score, and the per-section breakdown lands in the JSON artifact. No custom dashboard, no SaaS; everything a reviewer needs is visible in the GitHub PR page.

probe a claude-subagent PR
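With the GitHub CLI, the whole probe is three commands; the PR number and run ID below are placeholders.

```shell
gh pr view 4127 --json labels,statusCheckRollup  # confirm the claude-subagent label
gh pr checks 4127                                # claude-eval appears with its pass/fail state
gh run download 9981234 --name claude-eval       # pull review.json with per-section scores
```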

The six-step lifecycle, issue to merged PR

This is what one orchestration run looks like end-to-end. We walk new clients through it in the week 1 onboarding session. Each step has a specific artifact in the repo so a future engineer can reconstruct what happened without having to ask.


1. Intent arrives in an issue

A product engineer writes a GitHub issue that names the desired behaviour and the acceptance criteria. The orchestrator agent picks it up, reads surrounding code, and decides which subagents to spawn. One issue, one orchestration run.


2. Orchestrator plans and splits

The orchestrator produces a plan file at .claude/plans/<issue>.md with a list of parallel-safe subagent invocations. Parallel-safe means no two subagents touch the same file. Where the graph has a dependency, the orchestrator serialises it.
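The plan file format is not standardised; this is a sketch of what the orchestrator might write, with the issue number and file paths as placeholders:

```markdown
# .claude/plans/412.md  (412 is a placeholder issue number)

## Parallel-safe
- refactor     -> src/export/limiter.ts        (writes)
- test-writer  -> tests/export/limiter.test.ts (writes)

## Serialised after the above
- perf-auditor -> reads both branches, writes nothing

Invariant: no two subagents above touch the same file.
```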


3. Subagents run via the Task tool

The orchestrator launches each subagent with the Task tool. Each runs in its own context window against its own tool allowlist. Refactor writes, test-writer writes, perf-auditor reads. Outputs land in branches named fde/<role>/<issue>-<shortsha>.


4. A single PR is opened and labelled

The orchestrator merges the subagent branches into one PR against main and applies the label `claude-subagent`. The label is what triggers the eval workflow; unlabelled PRs skip the rubric, because a human authored them.


5. CI runs the reviewer subagent, then the rubric

claude-eval.yml runs, invokes the reviewer subagent on the diff, captures the per-section scores, and posts the weighted total as a required status check. Below 0.82 the merge button is disabled by GitHub branch protection. Above it, the check passes and normal CI continues.


6. A human approves or the orchestrator repairs

If CI passes, a CODEOWNER approves and merges. If CI fails with specific blocking sections, the orchestrator respawns only the subagents that can fix those sections, re-opens the PR, and re-runs the rubric. Three repair cycles max, then it escalates to the named engineer.

Spawn-and-hope vs gated orchestration

Left column is how most Claude Code guides end. Right column is what we actually check in. The right column costs roughly one extra CI minute per PR; the left column costs a production incident every few months on a reviewer-fatigue day.

| Feature | Spawn-and-hope | Gated orchestration (four files) |
| --- | --- | --- |
| What the SERP articles teach | How to spawn subagents. The implicit merge model is that a human reviews whatever the orchestrator produced. Good for a solo side project, a compliance risk in a shared repo. | A CI rubric that gates merges. The orchestrator cannot push to main; it pushes to a PR, and the PR is required to pass claude-eval before the merge button is enabled. |
| Where the subagent definitions live | A vendor console, a SaaS registry, or a private gist. Leaves the repo without the agent definitions, which means the pattern does not survive staff turnover. | `.claude/agents/<role>.md`, checked into main. Git history is the audit trail. The same engineer who writes the orchestrator writes the subagent files. |
| Audit trail for orchestrated work | Terminal scrollback, optionally replayed later. Reproducing a specific run is a problem for future-you. | One PR per intent, one required status check, one rubric JSON attached as an artifact. A compliance reviewer reads PR comments, not OpenTelemetry. |
| Model-vendor lock | A platform orchestrator that owns the API key, the billing, and the retry policy. You trade portability for quickstart. | None. The Markdown agent file names the model (claude-opus-4-7, claude-sonnet-4-6). Swapping to a different Anthropic model is a one-line edit; swapping to Bedrock is an env var. |
| Leave-behind when the senior engineer goes | Vibes and a Notion doc. The first person to leave takes the context with them. | Four files on main, plus CODEOWNERS and branch protection. A new engineer can onboard by reading the four files and running `claude-code --agent orchestrator` once. |
| Cost visibility per orchestration run | A monthly Anthropic bill and a guess. | The orchestrator emits a JSON summary with token usage per subagent, written to `.claude/runs/<id>.json` and uploaded as a CI artifact. FinOps sees per-issue cost. |
| Failure behaviour on a bad subagent output | A human reviewer catches it, or does not. Both options are worse than a rubric. | The rubric blocks the merge. The orchestrator respawns targeted subagents up to three times, then escalates. Nothing bad reaches main. |

Anchor fact

0.82 is the number the merge button turns on at.

The rubric at evals/rubric.yaml has a total_threshold of 0.82 and per-section thresholds that range from 0.60 on style to 0.80 on security. The reviewer subagent emits a JSON block with a weighted total; the "Gate the merge" step in .github/workflows/claude-eval.yml pipes it to jq -r .score, compares it to 0.82 via awk, and exits non-zero if below. Branch protection on main lists claude-eval as required. That means the GitHub UI literally disables the merge button on a PR labelled claude-subagent whose score is below 0.82. That single number is the whole reason this pattern is safer to leave in a client repo than a free-form multi-agent script.
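The comparison itself is two lines of shell. A runnable sketch of the jq-plus-awk gate, with review.json standing in for the reviewer subagent's output (its shape is assumed from the description above):

```shell
# Stand-in for the reviewer subagent's JSON output (shape assumed).
echo '{"score": 0.87, "sections": {"security": 0.84}}' > review.json

# The gate: extract the weighted total, fail the step below 0.82.
score=$(jq -r .score review.json)
if awk -v s="$score" 'BEGIN { exit !(s >= 0.82) }'; then
  echo "claude-eval pass: $score"
else
  echo "claude-eval fail: $score below 0.82"
  exit 1
fi
```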

Numbers in the pattern

These are the counts that define the shape of what we leave behind. All four are parameters of evals/rubric.yaml and runbook.md in the client repo, not marketing averages.

- 4 files we leave behind in the client repo for Claude Code orchestration
- 0.82 minimum rubric score a subagent PR must beat to merge
- 3 repair cycles the orchestrator attempts before it escalates to a human
- 10x feature velocity on owned work when the pattern is in place

Four files on main is the whole inventory. 0.82 is the rubric threshold the merge button respects. Three repair cycles are the most the orchestrator attempts before it escalates to a human. The 10x velocity is the FDE-10X commitment on owned work; it is not a claim about unsupervised agents.

Want the four files wired into your repo, not ours?

Sixty-minute scoping call with the senior engineer who would own the build. You leave with a one-pager naming the subagent roles we would create, the rubric thresholds we would start with, and a fixed weekly rate.

Book a call

Claude Code multi agent orchestration, answered

What does Claude Code multi agent orchestration actually mean in practice?

It means a parent Claude Code session uses the Task tool to spawn specialised child agents, each with its own context window, tool allowlist, and system prompt, and then merges their outputs into one pull request. The pattern only becomes production-safe when every subagent PR runs through a CI rubric that can block the merge. The rubric, the subagent definitions, the workflow, and the operator runbook live as four files inside the client's repo; the orchestrator is a verb, the four files are the noun.

Why do you gate Claude Code subagent output with a rubric in CI instead of trusting the human reviewer?

Because a human reviewer is the wrong bottleneck for parallelised agent work. When one issue spawns four subagents and the orchestrator opens a single PR with a multi-file diff, the reviewer is effectively being asked to scan the output of four agents at once in the same attention budget they used to spend on one engineer's PR. The rubric moves the hardest parts of that judgement (per-section correctness, security, domain invariants) into a structured score the reviewer can trust. The reviewer still approves, but they are approving a diff that has already beaten 0.82 on a rubric they wrote.

What exactly goes in the .claude/agents/<role>.md file?

YAML frontmatter plus a system prompt. The frontmatter has `name`, `description`, `tools` (an allowlist, explicitly scoped), and `model`. The body is the subagent's system prompt, which encodes what it is allowed to do, what it is explicitly not allowed to do, and the output format it must produce. The reviewer subagent we ship is about forty lines and ends with an explicit prohibition on recommending merges below the rubric threshold. Having the prohibition in the prompt is load-bearing; it is the part a compliance reviewer reads first.

How is this different from Claude Code Agent Teams, the new experimental feature?

Agent Teams lets multiple Claude Code sessions run in parallel and talk to each other directly, with a team lead coordinating. It is a runtime feature. Our pattern is an engineering pattern: subagents as Markdown files, a CI workflow as the gate, a rubric as the contract, a runbook as the handoff artifact. The two are compatible — you can run Agent Teams behind the rubric gate, you just enable CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS in settings.json and treat the team lead as the orchestrator. The pattern here is what you would leave behind in the repo regardless of which Claude Code runtime is enabled.
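If you do combine the two, enabling Agent Teams is a one-key settings change. The exact shape below is an assumption based on Claude Code's env pass-through in settings.json:

```json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```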

Does the rubric gate run for every PR or only for orchestrated ones?

Only for PRs labelled `claude-subagent`. The label is applied by the orchestrator when it opens a PR; a human-authored PR does not get the label and skips the eval workflow. This keeps the rubric focused on the failure mode it was designed for, which is multi-agent diff sprawl, and it keeps human PRs on their existing review path. Branch protection names claude-eval as a required status check on main; on unlabelled PRs the job's label condition skips it, and GitHub counts the skipped job as satisfying the requirement.

What happens when a subagent PR fails the rubric?

The workflow posts a PR comment naming the score, the failing sections, and the specific complaints from the reviewer subagent. The orchestrator reads the comment, decides which of its subagents can target the failing sections, respawns only those, and re-opens the PR with the new diff. Three repair cycles are allowed. If the third cycle still fails, the orchestrator pings the named engineer via the runbook's escalation entry and stops. No PR merges on a repaired-but-still-failing rubric; the 0.82 threshold is the same whether it is the first or the third attempt.

Why is the rubric stored in the repo and not in a SaaS evaluation platform?

Portability and auditability. The rubric describes what a correct change looks like in this product; it is a higher-value artifact than most unit tests because it encodes judgment, not just mechanics. Keeping it in the repo means the git history records who changed the rubric and why, the pull request that changed the rubric is reviewable like any other, and an engineer who leaves the team cannot take the rubric with them. It also means we do not sell our clients a platform dependency; the whole Claude Code orchestration pattern runs on GitHub, Claude Code, and the Anthropic API, with no fourth vendor.

How does the cost of running the rubric compare to the cost of the orchestration itself?

The rubric invocation is a single call on the reviewer subagent, and that call typically runs against a diff rather than the whole repo, so it is a small fraction of the orchestration budget. In our client engagements, the reviewer pass usually runs between 20K and 60K input tokens and produces a few thousand output tokens of JSON. The orchestration itself is usually 5-15x that, depending on how many subagents the orchestrator spawned and how much code each one read. FinOps sees both numbers on the per-run JSON artifact.

Can you run Claude Code multi agent orchestration against our private repo without exfiltrating code?

Yes. The CI workflow runs in the client's GitHub Actions runner, inside the client's VPC if they use self-hosted runners, with their Anthropic API key (or Bedrock endpoint, via the Bedrock integration). The code leaves GitHub Actions only to hit the model endpoint, which is the same network boundary the rest of the client's Claude Code usage already crosses. No third-party orchestration platform sits in the middle. For clients on Bedrock with PrivateLink, traffic does not leave the VPC.

What is the leave-behind when an FDE-10X Claude Code orchestration engagement ends?

The four files on main: `.claude/agents/<role>.md` for each subagent role, `.github/workflows/claude-eval.yml`, `evals/rubric.yaml`, and `runbook.md`. Plus CODEOWNERS on the four paths, branch protection requiring claude-eval as a status check on labelled PRs, and a 90-minute handoff session with the on-call rotation. The named engineer who wrote them stays available for 12 months after the engagement for paid two-hour consults at a published rate. That is the whole inventory; there is no vendor-hosted runtime to cancel and no platform subscription to renew.