Browser MCP, FDE engagement notes

Browser MCP to a production desktop agent: the seven services Claude Desktop hides from you

Your Playwright MCP demo works because Claude Desktop is doing seven jobs you never wrote: stdio supervision, OS keychain secrets, tool approval UI, conversation compaction, model and billing wiring, signed updates, and identity. Ship the same MCP server inside your own downloadable binary and all seven disappear. The agent loop is not the work. The host is the work.

Matthew Diakonov, Written with AI

Published May 6, 2026. 11 min read.

Direct answer (verified 2026-05-06)

You ship a browser MCP prototype as a real desktop agent by replacing Claude Desktop as the host. The prototype gets seven runtime services from Claude Desktop for free: stdio process supervision over the MCP transport, OS keychain storage of sensitive config, the per-call tool approval dialog, conversation compaction when the model context ceiling is hit, the model API key and billing surface, signed binary updates, and the signed-in user identity. Production rebuilds all seven in your own host process, then adds the leave-behind harness (eval cases, merge gate, runbook). Plan six weeks. The runtime-services list is verifiable against Anthropic's desktop extensions documentation; the token-cost shape behind the compaction step is from arXiv:2511.19477.

Where the host sits, before and after

In the prototype, Claude Desktop sits between the user and the MCP server. You wrote the MCP server. You did not write the host. When you ship a downloadable desktop agent, you write both, and the agent loop turns out to be the smaller of the two.

Prototype: Claude Desktop is the host

Production: your binary is the host

The seven services Claude Desktop quietly does for you

Open Claude Desktop's manifest documentation and read the field list. Each one corresponds to a runtime service the host is responsible for. The MCP spec does not cover any of them. The MCP server you wrote does not contain any of them. When Claude Desktop stops being the host, all seven become files in your repo.

Prototype host vs production host

Same MCP server, two different hosts. The right column is what week 6 of an FDE engagement ships.

Feature	Claude Desktop as the host	Your downloadable desktop agent
Process supervision (stdio transport, restart on crash)	Claude Desktop's helper supervises the MCP server over stdio. You write nothing.	host/runtime.ts spawns the MCP child, watches exit codes, EPIPEs, hangs, and rate-limits restarts so a crash loop does not eat the user's CPU.
Sensitive config storage	manifest.json field marked sensitive: true is auto-encrypted to macOS Keychain or Windows Credential Manager.	Direct keytar (or platform-native API) call. Per-user keychain entries, plus a recovery flow when the user changes their OS password and the keychain re-prompts.
Tool approval UI	Claude Desktop pops the per-tool approval dialog before each unsafe call. The user is the safety boundary.	Your decision: per-call dialog, allowlist by tool, or fully sandboxed with programmatic constraints. The choice belongs in the host, not in the agent loop.
Conversation compaction	Claude Desktop summarizes silently when the model context ceiling is hit. The user never sees it.	host/compactor.ts. A 20-action Playwright MCP workflow accumulates 200k+ tokens of accessibility-tree snapshots if you do nothing. The compactor is the difference between a $0.40 workflow and a $4.00 workflow.
Model API key + billing	User pays Anthropic for Claude Desktop. Your extension never touches an API key.	Two real choices: bring-your-own-key (every user signs up, you collect zero margin) or hosted-key (you proxy and meter per user). Both add a billing surface that did not exist in the prototype.
Signed binaries + autoupdate	Claude Desktop self-updates and pushes new manifest versions to extensions automatically.	electron-updater (or Sparkle on macOS), code signing, notarization for the macOS gatekeeper, an updates server, a release channel scheme, and a rollback story when v0.4.1 corrupts a user's profile directory.
Identity layer	Claude Desktop knows the signed-in user. Identity is implicit in the seat.	Nothing until you build it. OAuth, magic link, or a license server. Observability cannot attribute a session, entitlements cannot be enforced, and support cannot reproduce a bug without an identity primitive.

What the prototype's manifest.json gave you for free

Five lines of JSON in a Desktop Extension manifest replace what takes hundreds of lines in your own host. The snippet below is the file Anthropic ships in their mcpb examples directory. sensitive: true is the entire keychain integration. mcp_config is the entire stdio supervision. user_config is the entire configuration UI.

manifest.json (prototype, runs inside Claude Desktop)

The file below is the equivalent code that has to exist in your shipped agent's host. Seven responsibilities, none of them the agent loop. Identity is the seventh, and the only one Claude Desktop provides implicitly through the signed-in seat; we make it explicit in the host because without it the other six cannot attribute a session.

host/runtime.ts (production, your own binary)

The token-burn trace that decides whether you can ship at all

Standard browser MCP servers (Playwright MCP, Chrome DevTools MCP) retain the full conversation history per turn. Each accessibility-tree snapshot is roughly 10,000 tokens. A 20-action workflow accumulates over 200,000 tokens of model context. In Claude Desktop the user is paying this on their seat plan, so the number is invisible. In your shipped desktop agent it lands on your Anthropic invoice. At Claude Sonnet input pricing of about $3 per million tokens, that is roughly $0.60 per workflow before output, and before the user runs the agent twice in a row.
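The arithmetic above can be checked in a few lines. This is a minimal sketch: the 10,000-token snapshot size and the 20-action count come from the text, while the rate of $3 per million input tokens is an assumption chosen because it reproduces the $0.60 figure. Substitute the rate on your own invoice.

```typescript
// Snapshot size and action count are from the text above.
// The price per million input tokens is an assumption, not a quoted rate.
const TOKENS_PER_SNAPSHOT = 10_000;
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3;

// Context size after `actions` tool calls when every snapshot is retained.
function accumulatedContextTokens(actions: number): number {
  return actions * TOKENS_PER_SNAPSHOT;
}

// Input cost of sending `tokens` once at the assumed rate.
function inputCostUsd(tokens: number): number {
  return (tokens / 1_000_000) * ASSUMED_USD_PER_MILLION_INPUT_TOKENS;
}

const contextTokens = accumulatedContextTokens(20);
console.log(contextTokens, inputCostUsd(contextTokens).toFixed(2)); // prints: 200000 0.60
```

This prices the context as it stands after the last action. If every turn re-sends the growing history without caching, each request is billed for everything before it, so the real bill would be higher than this single-request figure.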

mcp-token-trace.mjs (one workflow, no compactor)

host/compactor.ts is the file that turns the linear curve flat. We usually land it in week 3 of the engagement and watch the average tokens-per-workflow drop 60 to 80 percent on the same eval/cases.yaml that ran the prototype. The eval gate blocks any PR that improves the token count by breaking the agent.
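One way a compactor can flatten the curve is to keep only the newest few snapshots verbatim and replace older tool results with a short stub. The sketch below illustrates that single strategy under stated assumptions; it is not the contents of host/compactor.ts. The `Message` shape, the `compact` and `estimateTokens` names, and the four-characters-per-token estimate are all hypothetical.

```typescript
// Hypothetical message shape; a real host would use its model SDK's types.
type Message =
  | { role: "user" | "assistant"; text: string }
  | { role: "tool"; tool: string; text: string };

// Rough token estimate (about 4 characters per token). An assumption for the sketch.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.text.length, 0);
  return Math.ceil(chars / 4);
}

// Keep the newest `keepLast` tool results intact and replace older ones with a stub.
function compact(messages: Message[], keepLast: number): Message[] {
  const toolIndexes = messages
    .map((m, i) => (m.role === "tool" ? i : -1))
    .filter((i) => i >= 0);
  const cutoff = Math.max(0, toolIndexes.length - keepLast);
  const stale = new Set(toolIndexes.slice(0, cutoff));
  return messages.map((m, i) =>
    stale.has(i) && m.role === "tool"
      ? { ...m, text: `[${m.tool} result elided by compactor]` }
      : m
  );
}
```

Eliding old snapshots is the bluntest option; summarizing them with a model call preserves more detail at extra cost. Either way, the eval cases decide whether the agent still completes the workflow after compaction.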

The six-week sequence we run on FDE engagements

The order matters. Process supervision and secrets land first because nothing else works without them. The agent loop comes second because it lets you reproduce the prototype inside your own binary. Compaction comes third because the eval gate has to exist before the compactor can prove it did not regress the agent. Identity, approval UI, and the billing surface land in weeks 4 and 5. Signed updates land in week 6 if they were scoped, or slip to v0.5 if they were not.

Week 0 -- scoping call and one-pager

A 30-minute call. The output is a written one-pager that names which of the seven services your shipped agent has to rebuild, which the prototype already needs (almost always compaction and identity), and which are deferred to a later release (almost always signed updates, which wait for v0.5).

The one-pager also names the senior engineer who joins your repo on day 7, fixes the model provider for v1 (vendor neutral, but pinned for the engagement), and lists the leave-behind artifacts: host/runtime.ts, host/compactor.ts, eval/cases.yaml, .github/workflows/pilot-gate.yml, runbook.md.

Week 1 -- first PR, host scaffold and stdio supervision

First PR ships the host process and the stdio supervisor. The MCP server you were running by hand under Claude Desktop now runs as a child process under your code. Crash, exit, EPIPE, hang are all handled. No agent loop yet; the host can list tools and route a single tool call.
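The supervision described above can be sketched with Node's `child_process` module. This is a minimal illustration, not the engagement's host/runtime.ts: `RestartLimiter`, `supervise`, and `onGiveUp` are hypothetical names, and hang detection (for example a heartbeat with a timeout) is left out to keep the sketch short.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Sliding-window limiter: allow at most `maxRestarts` within `windowMs`.
class RestartLimiter {
  private stamps: number[] = [];
  private maxRestarts: number;
  private windowMs: number;

  constructor(maxRestarts: number, windowMs: number) {
    this.maxRestarts = maxRestarts;
    this.windowMs = windowMs;
  }

  tryAcquire(now: number = Date.now()): boolean {
    this.stamps = this.stamps.filter((t) => now - t < this.windowMs);
    if (this.stamps.length >= this.maxRestarts) return false;
    this.stamps.push(now);
    return true;
  }
}

// Spawn the MCP server as a stdio child and restart it when it goes down,
// unless the limiter says we are in a crash loop.
function supervise(
  command: string,
  args: string[],
  limiter: RestartLimiter,
  onGiveUp: () => void
): { stop: () => void } {
  let child: ChildProcess | undefined;
  let stopped = false;

  const start = (): void => {
    const proc = spawn(command, args, { stdio: ["pipe", "pipe", "inherit"] });
    child = proc;
    let handled = false;
    const onDown = (): void => {
      if (handled || stopped) return;
      handled = true;
      if (limiter.tryAcquire()) start();
      else onGiveUp();
    };
    proc.once("error", onDown); // e.g. the command could not be spawned
    proc.once("exit", onDown);
    // Writes to a dead child surface as EPIPE stream errors; ignore them here
    // and let the exit handler decide whether to restart.
    proc.stdin?.on("error", () => {});
  };

  start();
  return {
    stop: () => {
      stopped = true;
      child?.kill();
    },
  };
}
```

Passing `now` into `tryAcquire` keeps the limiter testable without real timers. The limiter is what stops a crash loop from eating the user's CPU: once it refuses, the host surfaces an error instead of respawning.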

If we miss week 1, billing pauses. The point of week 1 is not the agent. It is removing Claude Desktop from the picture without the agent breaking.

Week 2 -- prototype in your staging, on the rubric

The prototype runs the same workflow your Claude Desktop demo ran, but inside your own host. Eval cases live in eval/cases.yaml. The merge gate (pilot-gate.yml) blocks any PR that drops the agreed rubric. If the prototype misses week 2 on the agreed rubric, the engagement refunds and exits.

The prototype rubric is five axes: faithfulness, helpfulness, completeness, tone, policy. Floor scores are agreed in writing in week 0. The rubric.yaml file is the contract.
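The gate logic itself is small. The sketch below assumes scores and floors have already been parsed into plain objects; the five axis names come from the text, while the `Scores` type, the `failingAxes` function, and every numeric value are placeholders rather than the actual rubric.yaml contract.

```typescript
// The five axes named in the text.
const AXES = ["faithfulness", "helpfulness", "completeness", "tone", "policy"] as const;
type Axis = (typeof AXES)[number];
type Scores = Record<Axis, number>;

// Return the axes whose score is below the agreed floor.
// An empty result means the gate passes.
function failingAxes(scores: Scores, floors: Scores): Axis[] {
  return AXES.filter((axis) => scores[axis] < floors[axis]);
}

// Placeholder numbers for illustration only.
const exampleFloors: Scores = { faithfulness: 4, helpfulness: 3, completeness: 3, tone: 3, policy: 5 };
const exampleRun: Scores = { faithfulness: 4.5, helpfulness: 2.5, completeness: 3, tone: 4, policy: 5 };
console.log(failingAxes(exampleRun, exampleFloors)); // prints: [ 'helpfulness' ]
```

A CI step in a workflow such as pilot-gate.yml would run a check like this over the eval results and exit non-zero when the list is not empty.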

Weeks 3 to 5 -- compaction, identity, approval UI, billing surface

The four services that the v1 ship date depends on. Compaction reads the trajectory and decides what to summarize before the next turn. Identity is a thin OAuth or license check. Approval UI lands as either a dialog or an allowlist. Billing surface is bring-your-own-key plus a hosted-key fallback.

Each PR ships behind the same merge gate. The week 3 PR that adds compaction usually drops average tokens-per-workflow by 60 to 80 percent on the same case set; that is the eval the gate runs.
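The approval choice (dialog, allowlist, or sandbox) can live in one small host function so the agent loop never decides for itself. This is a hedged sketch: the mode names, the `ApprovalPolicy` shape, and the `decide` function are hypothetical, and treating sandboxed mode as deny-unless-listed is one interpretation of "programmatic constraints", not the only one.

```typescript
type ApprovalMode = "per-call" | "allowlist" | "sandboxed";
type Decision = "allow" | "ask-user" | "deny";

interface ApprovalPolicy {
  mode: ApprovalMode;
  allowlist: string[]; // tool names that may run without a prompt
}

// Decide what the host does with a tool call before it reaches the MCP child.
function decide(toolName: string, policy: ApprovalPolicy): Decision {
  const listed = policy.allowlist.includes(toolName);
  switch (policy.mode) {
    case "per-call":
      return "ask-user"; // every call goes through the dialog
    case "allowlist":
      return listed ? "allow" : "ask-user";
    case "sandboxed":
      return listed ? "allow" : "deny"; // unattended: nobody is there to ask
  }
}
```

For example, with `{ mode: "allowlist", allowlist: ["browser_snapshot"] }`, a snapshot call runs silently while a click still raises the dialog. The tool names here are only examples.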

Week 6 -- production handoff, runbook, transfer session

The leave-behind: host/runtime.ts, host/compactor.ts, eval/cases.yaml, eval/judges/*.md, .github/workflows/pilot-gate.yml, runbook.md, an architecture doc, plus a 90-minute transfer session with the engineer who shipped it. The client owns the repo, the eval harness, and the runbook. No platform license. No vendor-attached runtime.

Signed updates and the rollback story usually slip to week 7 or 8 unless the engagement scoped them in week 0. We say so on the call.

Two ways to call the demo shipped

The temptation is to ship the .mcpb extension your prototype already runs against, plus a one-page setup guide. The user installs Claude Desktop, drags your extension in, pastes their Anthropic key, and the demo works. You called it shipped. You shipped a Claude Desktop tutorial.

Users have to install and pay for Claude Desktop themselves
You cannot brand the experience or measure usage in your own analytics
The day Anthropic changes manifest schema, your install instructions break
You cannot run the agent unattended; the user has to be in front of Claude Desktop

Why the gap is six weeks, not six days

The reason a working prototype in Claude Desktop does not become a shipped desktop agent in a weekend is that the seven services are not features you can put behind a flag. They are runtime invariants. The host either supervises the MCP child or it does not; the keychain integration either survives a macOS keychain re-prompt or it crashes the app on first launch; the compactor either holds the token curve flat across a 50-action workflow or you find out at the end of the month from your Anthropic invoice. Every one of the seven has a failure mode in the wild that the prototype never showed you, because Claude Desktop was catching it.

Six weeks is the timebox we agree in week 0 because the eval gate forces the rebuild to happen in graded slices, not as one terrifying merge. Week 1 PR replaces the host process. Week 2 PR puts the agent loop on the rubric. Week 3 PR adds compaction and measures it. Weeks 4 and 5 add identity, approval UI, and the billing surface, each behind the same gate. Week 6 is the handoff. If a slice misses, the engagement refunds and exits; we do not extend a missed week 2 into a sympathy week 7.

Have a browser MCP prototype that needs to ship as a real desktop agent?

A 30 minute scoping call gets you a written one-pager naming which of the seven services your agent has to rebuild, which can defer to v0.5, and what the week 2 rubric would look like for your specific workflow.

Frequently asked questions

Why does my browser MCP demo work in Claude Desktop but break the moment I try to ship it?

Because Claude Desktop is the host, and the host does far more than launch your server. It supervises the stdio transport to your MCP server, encrypts your API key in the OS keychain, prompts the user before each tool call, compacts the conversation when the context ceiling is hit, pays the model bill out of the user's seat, ships signed binary updates, and knows which user is signed in. None of those services are part of the MCP spec or your server code; they live in the host. When you ship the same MCP server in your own desktop binary, you inherit all seven jobs plus the agent loop. Plan a six-week engagement to do it well, not a weekend to wrap the .mcpb.

What is the token math people miss when they move from Playwright MCP in Claude Desktop to a production desktop agent?

Playwright MCP retains the full conversation history per turn, and each accessibility-tree snapshot runs roughly 10,000 tokens. A 20-action workflow accumulates over 200,000 tokens of model context (see arXiv:2511.19477 on browser agent architecture). In Claude Desktop the user pays this on their seat plan. In your shipped desktop agent, you pay it; on the Claude Sonnet input price, that is roughly $0.60 per 20-action workflow before output. The compactor in host/compactor.ts is the difference between a sustainable workflow and a gross-margin-negative one. We usually drop tokens-per-workflow by 60 to 80 percent in the week 3 PR; the eval gate measures it on every commit thereafter.

Can I just ship my MCP server as a Desktop Extension (.mcpb) and call that production?

Only if your distribution model is 'the user installs Claude Desktop'. The .mcpb format gets you keychain-backed sensitive config, automatic process supervision, autoupdates, and a configuration UI for free, all paid for by Anthropic shipping Claude Desktop. It is the right answer for an internal tool or a published extension where Claude Desktop is the brand. It is the wrong answer for a downloadable agent that is supposed to be your product. The user wanting to install your agent would need to install Claude Desktop first, get an Anthropic key, paste it, then install your extension. That is a Claude Desktop tutorial, not a desktop agent.

What does the agent loop look like once Claude Desktop is no longer the host?

Thinner than people expect. The host does the hard part. The loop reads the next user message (or the next scheduled trigger), pulls the compacted conversation from host/compactor.ts, sends it to the model, parses the tool calls, asks host.approve() for any unsafe call, dispatches to the MCP child over stdio, gets the result, and appends to the conversation. The whole loop is usually 80 to 200 lines. The host services around it (supervision, secrets, compaction, billing, identity, updates, approval) are the four to six thousand lines.
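The loop described above can be sketched against injected host services. Everything named here is hypothetical: the `Host` interface, `runTurn`, the string-based history, and the `maxSteps` cap are illustrative choices, not a real SDK or the engagement's code.

```typescript
// Hypothetical host interfaces; names are illustrative, not a real SDK.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}
interface ModelReply {
  text: string;
  toolCalls: ToolCall[];
}
interface Host {
  compacted(history: string[]): string[];
  callModel(history: string[]): Promise<ModelReply>;
  approve(call: ToolCall): Promise<boolean>;
  callTool(call: ToolCall): Promise<string>; // dispatches to the MCP child
}

// Run one user turn: call the model, execute approved tool calls, repeat
// until the model stops asking for tools or the step cap is reached.
async function runTurn(
  host: Host,
  history: string[],
  userMessage: string,
  maxSteps = 10
): Promise<string[]> {
  const next = [...history, `user: ${userMessage}`];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await host.callModel(host.compacted(next));
    next.push(`assistant: ${reply.text}`);
    if (reply.toolCalls.length === 0) break; // model is done
    for (const call of reply.toolCalls) {
      const ok = await host.approve(call);
      const result = ok ? await host.callTool(call) : "denied by host";
      next.push(`tool ${call.name}: ${result}`);
    }
  }
  return next;
}
```

Because every service arrives through the `Host` interface, the loop can be exercised in the eval harness with fakes, which is part of why it stays short.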

How do you handle the browser session lifecycle when the desktop agent is the host?

The MCP server (Playwright MCP, Chrome DevTools MCP, your own) owns the browser process. The host decides when the MCP server starts, restarts, or is killed; the MCP server decides whether to reuse a browser context across tool calls. We default to one persistent browser context per agent session, with profile data under app.getPath('userData') so the user's logins and cookies survive across runs. Tear down the context when the host shuts down, not between tool calls; restarting the browser per action burns three to five seconds and a Cloudflare challenge.

Which of the seven services should I do first if I cannot do all of them in week 1?

Process supervision, then secrets, then the agent loop. With those three, your demo runs end to end inside your own binary, against your bundled MCP server, with the user's API key safely on disk. Compaction is week 3 because it requires the eval harness to know it actually saved tokens without breaking the agent. Approval UI and identity are weeks 4 and 5. Signed updates are the work item people defer to v0.5 most often because they gate production distribution but do not gate an internal demo. The runbook spells out the exact order.

What is in the leave-behind harness an FDE engagement ships at week 6?

Six files. host/runtime.ts (process supervisor, secrets injector, approval, identity hooks). host/compactor.ts (the conversation compactor that keeps the per-workflow token bill flat instead of linear). eval/cases.yaml (15 to 30 cases drawn from the prototype's real traces). eval/judges/*.md (the rubric prompts). .github/workflows/pilot-gate.yml (the merge gate that blocks any PR dropping the agreed rubric). runbook.md (the on-call doc covering MCP child crashes, keychain re-prompts, autoupdate failures, and rollback). The client owns all six. No platform license, no vendor-attached runtime.

Are you tied to a specific model provider or framework?

No. We pick one in week 0 with the client (Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open weight) and stay on it for the engagement, but the host is not coupled to it. host/modelClient.ts is the only file that knows which provider the agent is talking to; swapping providers in week 4 is a single-file change with the eval gate as the safety net. Same for the MCP servers: Playwright MCP, Chrome DevTools MCP, or a custom one your team writes. The host treats them all as stdio children that answer the tools/list and tools/call methods.
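Treating every MCP server as a stdio child comes down to writing JSON-RPC 2.0 messages to its stdin, one per line. The sketch below builds the two requests using the MCP method names `tools/list` and `tools/call`; the helper names and the module-level id counter are hypothetical, and the `initialize` handshake that must precede these calls is omitted.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

let nextId = 1;

// Ask the server which tools it exposes.
function listToolsRequest(): JsonRpcRequest {
  return { jsonrpc: "2.0", id: nextId++, method: "tools/list" };
}

// Invoke one tool by name with its arguments.
function callToolRequest(name: string, args: Record<string, unknown>): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// stdio transport framing: one JSON message per line, no embedded newlines.
// JSON.stringify escapes newlines inside strings, so the only raw newline is the terminator.
function frame(req: JsonRpcRequest): string {
  return JSON.stringify(req) + "\n";
}
```

A host would write `frame(callToolRequest("browser_navigate", { url: "https://example.com" }))` to the child's stdin and match the response by `id`. The tool name and URL are examples only.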

The companion pieces an FDE engagement pulls into the same six weeks.

Keep reading

Companion piece

FDE Week 2 prototype rubric: the five axes and the merge gate

The rubric.yaml the prototype ships against in week 2. Five axes, weighted floors, the pilot-gate.yml workflow that blocks merges below the line.


Adjacent stall

AI agent prototype to production: the regression tail

Once the desktop agent is shipped, the next failure mode is the long tail of real-user prompts the prototype rubric never saw.


Pairs with

Production AI agent monitoring system

What the host writes to your observability layer once identity is in. Per-session traces, per-tool latency, the approval-deny ratio.

