Build, comparison: AX tree vs screenshots for production agents

Two perception strategies, one numeric clause in your rubric.yaml that decides which one a case runs on.

Every other guide on this topic explains that the accessibility tree is cheaper and the screenshot is more robust to layout obfuscation. That is true and not enough. The part that survives engineer turnover is a per case perception_mode field in your eval rubric, a CI gate that refuses inconsistent flips, and a runbook section a different on call engineer can read in week 7. Below is the file shape, the contract clauses, and the public benchmarks the defaults are pinned to.

Matthew Diakonov
14 min read

Direct answer, verified 2026-05-04

Should AI agents use the accessibility tree or screenshots?

  • Accessibility tree first. A clean snapshot of a typical page is roughly 4,000 tokens. The equivalent screenshot is roughly 50,000 tokens. On a 20 step workflow that gap compounds into both latency and bill.
  • Reliability gap is 12 to 17 percentage points on common browser tasks. Public benchmarks put DOM driven stacks (Playwright plus Claude, Stagehand, Browserbase) at 89 to 92 percent task success. Vision native stacks (Anthropic Computer Use, OpenAI CUA) sit at 75 to 78 percent.
  • Screenshot is the right answer for canvas only apps, badly marked up forms that fail a WCAG audit, image driven dashboards, and anti bot screens that scramble or strip the DOM.
  • Production answer is hybrid: ax_tree primary, vision fallback per case, with the decision scored as a numeric field in your eval rubric and read by a required CI gate.

Public benchmark sources verified 2026-05-04: Firecrawl best browser agents 2026 (token costs and reliability), Anthropic on evals for AI agents.

4.9 from Six file leave-behind shipped on every fde10x agent engagement
rubric.yaml carries perception_mode per case from week 1
pilot-gate.yml refuses inconsistent flips, fires the refund webhook on a Monday miss
Runbook names the selector and failure signature on every screenshot_only case

The numbers, drawn from public benchmarks

These four numbers are the spine of every defensible perception decision in 2026. They come from public reproductions of common browser tasks across the major DOM driven and vision native stacks. We pin our rubric defaults to them and re run the comparison every quarter on the client's actual case mix.

Roughly 4,000 tokens: AX tree snapshot, typical page
Roughly 50,000 tokens: screenshot, same page, useful detail
12 to 17 pp: reliability gap, DOM over vision
3 to 5x: latency advantage, AX tree on a 20 step task

Sources for the gap: public reproductions of common browser automation tasks put Playwright plus Claude at 92 percent, Stagehand at 89, Browserbase at 90, Anthropic Computer Use at 78, and OpenAI CUA at 75. Per task cost on the DOM stack is $0.02 to $0.10, on the vision stack $0.20 to $0.40. Both ranges are reproducible against the same scenario suite.

Side by side, on the dimensions that actually matter for production

The comparison sets a production agent that runs vision only, with the choice baked into agent.py, against one that runs accessibility tree primary with a hybrid fallback, scored per case in CI. Same workflow, same model provider, same eval cases.

| Feature | Screenshot only, vision native stack | AX tree primary, hybrid fallback (fde10x default) |
| --- | --- | --- |
| Tokens per page snapshot | Screenshot, roughly 50,000 tokens for the same page when sent at a useful detail tier. The model spends most of its budget on pixels that have nothing to do with the task. | Accessibility tree, roughly 4,000 tokens for a typical content page. The model reads ARIA roles, labels, and DOM hierarchy as text. No image tokens, no detail tier surcharge. |
| Reliability on common tasks | 75 to 78 percent for vision native stacks (Anthropic Computer Use 78, OpenAI CUA 75). The 12 to 17 point gap is consistent across reported task suites. | 89 to 92 percent task success in public benchmarks for DOM driven stacks (Playwright plus Claude 92, Stagehand 89, Browserbase 90). |
| Cost per task | $0.20 to $0.40 per task for Computer Use class agents. A 20 step run regularly clears 1M input tokens before tool calls. | $0.02 to $0.10 per task on Playwright plus Claude at scale, because the input is tiny. A 20 step run rarely crosses 100k input tokens. |
| Latency per step | Slower per step, with most of the wall clock spent on the vision pass. The pattern shows up clearly when you put the two stacks behind the same scenario in your eval harness. | 3 to 5x faster on average. The agent skips image encode, decode, and the vision model preroll on every step. |
| Where it breaks | Visually busy pages where many regions look the same, single page apps that re render mid action and shift the visible coordinates, OCR errors on small text. Vision is robust to layout obfuscation but still wrong on its own failure modes. | Canvas only apps (rich whiteboards, design tools), pages whose interactive nodes have no role or label, anti bot screens that strip or randomise the DOM, image inside iframe content. On those, ax_tree fails silently or returns a useless tree. |
| How the choice survives handoff | Lives in a comment, a Notion doc, or oral history. When the original author leaves, every flip back and forth costs another debugging cycle. | Wired into rubric.yaml as a per case perception_mode field. CI gate reads it. A new engineer in week 7 can git blame which case forced vision and why, without asking the original author. |

Anchor fact: the perception block in rubric.yaml

The standard fde10x leave-behind is six files on the client's main branch: rubric.yaml, eval/cases.yaml, .github/workflows/pilot-gate.yml, .github/workflows/production-gate.yml, runbook/<agent>.md, and flags/<agent>.yaml. On a UI driving agent, rubric.yaml gains the perception block below. The names of the fields are stable across engagements, so an engineer who has read one client's rubric can read another client's cold.


Four fields that turn the AX tree vs screenshot debate into a file you can git blame

  1. perception.default_mode. The mode an unmarked case runs on. Default ax_tree_only.
  2. perception.fallback_mode. What the agent does when the tree comes back empty. Default hybrid (try ax_tree, then screenshot in the same step).
  3. perception.budgets. The max_input_tokens_per_step a case is allowed to spend in each mode. The CI gate asserts case.token_budget matches the budget for the case's mode.
  4. perception.rules. The closed list of perception_reason values that justify flipping a case to screenshot_only. A pull request that flips a case without one of those reasons fails the lint-rubric step in pilot-gate.yml.

The rubric.yaml shape

Below is a verbatim excerpt from a fde10x engagement rubric.yaml (client redacted, agent name kept). The perception block is the new addition for UI driving agents in 2026. The rest of the file is the same shape we have published before for non UI agents.

rubric.yaml (excerpt)
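As a rough illustration only (this is not the engagement file), the perception block might look like the structure below once loaded into Python. The field names default_mode, fallback_mode, budgets, and rules, the three mode names, and the four reason values come from this page, and the budget figures are the approximate ones quoted in the FAQ. The nesting, the allowed_reasons key, the case "id" field, and both helper functions are assumptions made for the sketch.

```python
from __future__ import annotations

# Hypothetical in-memory form of the rubric.yaml perception block.
# Key nesting and the "allowed_reasons" name are assumptions for illustration.
PERCEPTION = {
    "default_mode": "ax_tree_only",
    "fallback_mode": "hybrid",
    "budgets": {  # max_input_tokens_per_step per mode (approximate figures)
        "ax_tree_only": 6_000,
        "hybrid": 60_000,
        "screenshot_only": 80_000,
    },
    "rules": {
        "allowed_reasons": [
            "canvas_only_ui",
            "anti_bot_dom_stripped",
            "iframe_image_payload",
            "wcag_unmarked_form",
        ],
    },
}


def resolve_mode(case: dict, perception: dict = PERCEPTION) -> str:
    """Return the case's perception_mode, or the default for an unmarked case."""
    return case.get("perception_mode", perception["default_mode"])


def check_budget(case: dict, perception: dict = PERCEPTION) -> list[str]:
    """Return error strings if token_budget does not match the mode's budget."""
    mode = resolve_mode(case, perception)
    if mode not in perception["budgets"]:
        return [f"{case['id']}: unknown perception_mode {mode!r}"]
    expected = perception["budgets"][mode]
    actual = case.get("token_budget")
    if actual != expected:
        return [f"{case['id']}: token_budget {actual} != {expected} for {mode}"]
    return []
```

An unmarked case resolves to ax_tree_only, so a row with no perception_mode and a token_budget of 6000 passes, while a screenshot_only row that keeps the 6000 budget is reported.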

How the CI gate enforces it

pilot-gate.yml runs on every PR that touches rubric.yaml, eval/cases.yaml, or agents/. It also runs on a Monday 09:00 UTC cron snapshot, and a failure on the cron is what fires the refund webhook. The perception_consistency job is the part that makes the ax_tree vs screenshot decision durable. It refuses three patterns: a flip without a perception_reason from the closed list, a flip without a paired token_budget edit, and a screenshot_only case whose runbook section does not name the selector and the failure signature.

.github/workflows/pilot-gate.yml (excerpt)
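The three refused patterns can be sketched as a small lint function that a workflow step might call. This is a hypothetical stand-in, not the perception_consistency job itself: the function name, the dict shape of a case row, and the runbook_entry argument (a parsed runbook sub-block with selector and failure_signature keys) are all assumptions. It also assumes a "flip" means any change to screenshot_only from another mode.

```python
from __future__ import annotations

# Closed list of reasons named on this page.
ALLOWED_REASONS = {
    "canvas_only_ui",
    "anti_bot_dom_stripped",
    "iframe_image_payload",
    "wcag_unmarked_form",
}


def lint_flip(old: dict, new: dict, runbook_entry: dict | None) -> list[str]:
    """Return the violations for one case row, comparing base (old) to PR (new)."""
    errors = []
    flipped = (
        old.get("perception_mode") != "screenshot_only"
        and new.get("perception_mode") == "screenshot_only"
    )
    if flipped:
        # Pattern 1: the flip must carry a reason from the closed list.
        if new.get("perception_reason") not in ALLOWED_REASONS:
            errors.append("flip without an allowed perception_reason")
        # Pattern 2: the flip must come with a token_budget edit.
        if new.get("token_budget") == old.get("token_budget"):
            errors.append("flip without a paired token_budget edit")
    if new.get("perception_mode") == "screenshot_only":
        # Pattern 3: the runbook must name the selector and failure signature.
        entry = runbook_entry or {}
        if not entry.get("selector") or not entry.get("failure_signature"):
            errors.append("runbook section missing selector or failure signature")
    return errors
```

A CI step built on this would exit non-zero whenever the returned list is non-empty for any changed case row.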

Same agent, two repo states

Compare what a buyer's repo looks like under each choice. The agent does the same job on the same workflow. The cost line and the regression triage time differ because the perception decision lives in different places.

Same agent, two perception strategies

Same agent, same workflow, screenshot pipeline only. The eval harness has no perception_mode field, so the choice lives in a comment in agent.py. Production p95 sits around 14 seconds per step. Cost is $0.32 per task at the current case mix. When a regression hits, two engineers spend a day deciding whether to switch a case to DOM, because nothing in the repo says which cases were screenshot only on purpose.

  • p95 14s per step, 3 to 5x slower than ax_tree
  • $0.32 per task at the current mix
  • perception choice lives in comments, not in CI
  • regression triage requires the original author
  • no clean way to budget tokens per case

The runbook section that pays for itself in week 7

The whole point of writing perception_mode and perception_reason into the rubric is that a different on call engineer can decide, cold, whether a regression is in the agent or in the upstream UI. The runbook section below is the cheapest way to make that work. Each screenshot_only case gets a sub block: selector, why screenshot, failure signature, and what to verify when the case starts failing.

runbook/discharge_summary_drafter.md (excerpt)
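One way to keep those sub blocks uniform is to generate them from a small record. The sketch below is an assumption about layout, not the runbook format used on engagements: the class name, the Markdown heading style, the case id, and the selector value are hypothetical, while the four fields and the failure signature wording follow this page.

```python
from dataclasses import dataclass


@dataclass
class VisionFallbackEntry:
    case_id: str
    selector: str
    why_screenshot: str
    failure_signature: str
    verify_on_failure: str

    def to_markdown(self) -> str:
        """Render the four fields as one runbook sub block."""
        return "\n".join([
            f"### {self.case_id}",
            f"- selector: {self.selector}",
            f"- why screenshot: {self.why_screenshot}",
            f"- failure signature: {self.failure_signature}",
            f"- verify when failing: {self.verify_on_failure}",
        ])


entry = VisionFallbackEntry(
    case_id="partner_portal_login",  # hypothetical case id
    selector="iframe#portal",  # hypothetical selector
    why_screenshot="anti_bot_dom_stripped",
    failure_signature="element_not_actionable five steps in a row",
    verify_on_failure="check for vendor layout drift before flipping the case",
)
print(entry.to_markdown())
```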

Four fields in rubric.yaml (default_mode, fallback_mode, budgets, rules), one CI job that refuses inconsistent flips, one runbook section per screenshot_only case. That is the operational answer to the AX tree vs screenshot question. Anything less than this lives in a code comment, and a code comment loses the engineer who wrote it.

fde10x rubric.yaml v2 perception block, applied across UI driving agents in production

When screenshot is actually the right default

We default to ax_tree because most case mixes are 70 to 90 percent ax_tree friendly, and the cost and reliability math is decisive on those. We do not default to ax_tree on every project. There are three case mixes where screenshot_only is the right baseline.

First, design tools and creative apps. If the workflow lives inside a canvas (Figma, Miro, a 3D viewer, a charting tool) the accessibility tree is empty by construction. There is nothing to reason over. The right baseline is screenshot_only with hybrid allowed only for the toolbar steps that do live in the DOM.

Second, mature anti bot deployments. A handful of vendor portals randomise role and id attributes per page load and ship a layout update on a monthly cadence. The accessibility tree comes back as role=presentation only, so ax_tree always fails. We make the case screenshot_only and write the failure signature into the runbook so on call does not waste two days re trying ax_tree.

Third, visual diff workloads. If the job is "tell me what changed on this page since last Monday," the ax_tree is the wrong primitive. The model wants two screenshots and a question. That is its own perception_mode, screenshot_only, with a max_steps_per_task of 1 in the budget block.

Want this perception block landed on your main branch in week 1?

60 minute scoping call with the senior engineer who would own the build. You leave with a one pager naming the agent, the rubric thresholds, the perception_mode default, the cases that need screenshot, and the named engineer who will deliver.

Accessibility tree vs screenshots, the questions teams actually ask in 2026

Should production AI agents use the accessibility tree or screenshots?

Accessibility tree first, screenshot only as a fallback. The token math drives the default: a clean accessibility tree snapshot is roughly 4,000 tokens for a typical page, and the equivalent screenshot at a useful detail tier is roughly 50,000. The reliability math reinforces it: public benchmarks put DOM driven stacks (Playwright plus Claude, Stagehand, Browserbase) at 89 to 92 percent task success on common workflows, and vision native stacks (Anthropic Computer Use, OpenAI CUA) at 75 to 78 percent, a 12 to 17 point gap. Use screenshot only for canvas only apps, badly marked up forms, anti bot screens, and image content inside iframes. The right architectural answer is hybrid: ax_tree primary, vision fallback per eval case, with the choice scored as a numeric field in your rubric and enforced by a CI gate.

Why is the accessibility tree so much cheaper than a screenshot?

A page that has 200 interactive nodes serialises into about 4,000 tokens of role, name, value, and hierarchy text. The same page rendered as an image at a detail tier good enough for the model to read body copy is roughly 50,000 input tokens, and that count grows with viewport size and density. On a 20 step workflow the difference is the difference between an 80,000 token run and a 1 million token run. The model also skips image encode, decode, and the vision branch on every step, which is why DOM driven agents are 3 to 5x faster per step in practice.
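The arithmetic behind those two run totals, as a minimal sketch. It assumes exactly one snapshot per step and counts perception tokens only, ignoring conversation history and tool output; the constant and function names are illustrative.

```python
AX_TREE_TOKENS = 4_000      # approximate tokens per accessibility tree snapshot
SCREENSHOT_TOKENS = 50_000  # approximate tokens per screenshot at a useful detail tier
STEPS = 20


def run_tokens(tokens_per_snapshot: int, steps: int = STEPS) -> int:
    """Input tokens spent on perception alone, one snapshot per step."""
    return tokens_per_snapshot * steps


ax_total = run_tokens(AX_TREE_TOKENS)
shot_total = run_tokens(SCREENSHOT_TOKENS)
ratio = shot_total / ax_total

print(ax_total, shot_total, ratio)  # 80000 1000000 12.5
```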

When does screenshot beat the accessibility tree?

Four real cases. (1) Canvas only UIs: a charting library, a whiteboard, a 3D viewer. There are no semantic nodes for the model to reason over. (2) Forms that fail a WCAG audit so badly the tree is empty or full of role=presentation elements; common in vendor portals. (3) Anti bot pages that strip or randomise role and id attributes between loads. (4) Image content inside an iframe the page intentionally protects from scraping. Outside those four, screenshot is paying 12x in tokens and 12 to 17 percentage points in reliability for no benefit.
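Cases (2) and (3) both surface as a tree that is empty or made up only of presentation nodes, which is something an agent can test for before deciding what to send to the model. The sketch below is a hypothetical illustration: it assumes nodes arrive as dicts with a "role" key, takes the two capture functions as injected callables rather than naming any real library API, and treats ax_tree_only as strict (it raises instead of silently continuing), which is a design assumption rather than documented behaviour.

```python
from __future__ import annotations

from typing import Callable


def tree_is_usable(nodes: list[dict]) -> bool:
    """Usable if at least one node has a role other than presentation/none."""
    return any(n.get("role") not in (None, "", "presentation", "none") for n in nodes)


def perceive(
    mode: str,
    get_ax_tree: Callable[[], list[dict]],
    get_screenshot: Callable[[], bytes],
) -> tuple[str, object]:
    """Return (source, payload) for one step under the given perception_mode."""
    if mode == "screenshot_only":
        return "screenshot", get_screenshot()
    nodes = get_ax_tree()
    if tree_is_usable(nodes):
        return "ax_tree", nodes
    if mode == "hybrid":
        # Fall back to vision within the same step.
        return "screenshot", get_screenshot()
    raise RuntimeError("ax_tree_only case returned an unusable tree")
```

Raising on an unusable tree in ax_tree_only mode turns the silent failure into one that shows up in the eval run, where it can be triaged against the runbook.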

How do you make the perception choice survive engineer turnover?

We do not write it in a code comment. We add a perception_mode field to every row in eval/cases.yaml, with allowed values ax_tree_only, hybrid, and screenshot_only. Each screenshot_only case must carry a perception_reason from a closed list (canvas_only_ui, anti_bot_dom_stripped, iframe_image_payload, wcag_unmarked_form). pilot-gate.yml is a required check that refuses to merge a flip from ax_tree_only to screenshot_only without a paired token_budget edit, and refuses to land any screenshot_only case unless the runbook has a vision-fallback section that names the selector and the failure signature. A new engineer in week 7 can git blame the row and read why the case is on vision, without asking the original author.

Where does this fit in a forward deployed engagement at fde10x?

It is one section in rubric.yaml and one block in pilot-gate.yml, both files we already publish under the standard six file leave-behind (rubric.yaml, eval/cases.yaml, .github/workflows/pilot-gate.yml, .github/workflows/production-gate.yml, runbook/<agent>.md, flags/<agent>.yaml). On a UI driving agent, the perception block is part of week 1 scoping. The first PR usually adds the perception_mode field, the budgets table, and the lint-rubric action. From week 2, the eval threshold and the perception consistency rule both run on the Monday 09:00 UTC snapshot, and the refund clause fires if either of them misses.

Does Playwright MCP actually use the accessibility tree?

Yes. Playwright MCP captures accessibility snapshots rather than screenshots by default, which is why it works with non vision models and is faster and cheaper to run at scale. Browser-use captures both representations and lets the model reason over each, which is the reference implementation of the hybrid pattern. Stagehand sits in the same family. The split between the DOM driven family and the vision only family (Anthropic Computer Use, OpenAI CUA) is the architectural fault line that produces the 12 to 17 point reliability gap.

What about anti bot pages? Does ax_tree even work?

Often it does not, which is the case the vision fallback exists for. A real partner portal we worked behind randomises ids and strips role attributes on every page load, so the accessibility tree comes back as roles=presentation only. The case sits in eval/cases.yaml with perception_mode=screenshot_only and perception_reason=anti_bot_dom_stripped. The runbook section names the iframe selector and the failure signature (tool call returns element_not_actionable five steps in a row). When the vendor ships a layout pass on the second Tuesday of the month, on call opens a tracking PR labeled vendor-drift the next day. The case stays on screenshot. We do not try to flip it back, because the roles will not return.

How do you set the token budget per perception mode?

Three numbers in rubric.yaml under perception.budgets. ax_tree_only sits at max_input_tokens_per_step around 6,000, which is enough headroom for a noisy page plus the conversation. hybrid sits around 60,000, which is one screenshot at a useful detail tier plus the tree. screenshot_only sits around 80,000, which allows two screenshots in a single step for diff style reasoning. The CI gate asserts that case.token_budget matches the budget for the case's perception_mode. If a developer wants a higher budget on a single case, they edit the case row, not the global default. The audit trail is in git, not in Slack.

What does this look like in the runbook?

A vision-fallback section that lists each screenshot_only case, its selector, the reason it is on vision, and the failure signature an on call engineer should look for. The point is that a different engineer in week 7, on call for an alert, can read the runbook cold and decide whether the regression is in the agent (flip a case, regenerate fixtures) or in the upstream UI (open a vendor-drift PR, do not flip the case). Without that section, every regression triage costs an hour pinging the original author. With it, the median triage on these cases drops to about 15 minutes.

What is fde10x and how is this different from buying Stagehand or Computer Use?

fde10x is the forward deployed engineering practice that ships these gates, this rubric, and this runbook into your repo. We do not sell Stagehand or Computer Use; we use whichever stack your eval rubric chooses, on whichever model provider you prefer (Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open weight). What we leave behind is the operational glue that makes the perception choice durable: the rubric file, the lint-rubric action, the pilot-gate workflow, and the runbook structure. Two named senior engineers ship those into your main branch in two to six weeks, and the agent keeps shipping after we leave because no part of the runtime depends on us.
