AI Agent Blast Radius: Why Most Production Failures Are Privilege-Scope Bugs, Not Reasoning Bugs
When an AI agent deletes the wrong row, fires the wrong API call, or pushes the wrong commit, the model's reasoning is rarely the root cause. The agent inherited a write scope a junior engineer would never get on day one. The blast radius scales with the privilege envelope, not with the reasoning quality of the model. This guide walks through how to bound that envelope, gate irreversible actions behind a dry-run plan, and design approval workflows that survive scale.
1. The bug is the privilege scope, not the reasoning
The post-mortems on agent incidents almost always frame the problem as a model error. The model misread the request, the model hallucinated a column name, the model retried a destructive call. The framing is wrong. A junior engineer would also occasionally misread a request, hallucinate a column name, or retry a destructive call. The reason juniors do not regularly nuke production is that nobody hands them DROP TABLE on day one.
Agents routinely get production write scopes that a human onboarding flow would never grant. The blast radius scales with the size of the privilege envelope. A 30 percent improvement in reasoning quality does not change the worst case if the worst case is unbounded.
This reframing matters because it changes what gets fixed. Better prompts and bigger models do not bound the worst case. Better privilege scopes do.
2. The four axes of agent blast radius
Useful blast-radius bounding looks at four axes simultaneously:
- Reversibility. Can this action be undone in under 5 minutes? Most reads are reversible. Most writes are not. Bulk deletes, schema changes, and outbound emails are essentially never reversible.
- Reach. Does this action touch one row, one user, or every customer in the database? A “delete one stale session” flow that accidentally widens to “delete every session” is the canonical incident shape.
- Externality. Does the action send something a third party will see (email, Slack, payment, public commit)? External effects multiply blast radius because they cannot be retracted.
- Cadence. Is this a one-off or part of a repeated loop? Loops compound risk. A 1 percent failure rate on a 10,000-iteration loop is 100 incidents.
Every irreversible action with global reach, external effects, and loop cadence should require human approval, full stop. The interesting design space is the action that is risky on two of those axes, not all four.
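To make the classification concrete, here is a minimal sketch of how the four axes could be encoded as a per-tool profile, with a conservative default that escalates anything risky on two or more axes. The names (ActionProfile, requires_human_approval) and the threshold are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    ONE_OFF = "one_off"
    LOOP = "loop"


@dataclass(frozen=True)
class ActionProfile:
    """Blast-radius profile for a single tool or action, along the four axes."""
    reversible: bool   # undoable in under ~5 minutes
    reach: str         # "row", "user", or "global"
    external: bool     # visible to a third party: email, payment, public commit
    cadence: Cadence   # one-off call vs. part of a repeated loop


def risk_axes(p: ActionProfile) -> int:
    """Count how many axes sit at their worst setting."""
    return sum([
        not p.reversible,
        p.reach == "global",
        p.external,
        p.cadence is Cadence.LOOP,
    ])


def requires_human_approval(p: ActionProfile) -> bool:
    # All four axes at their worst always requires a human.
    # As a conservative default, two or more is escalated as well.
    return risk_axes(p) >= 2


# Example: a bulk "clean up stale sessions" job that also emails affected users.
job = ActionProfile(reversible=False, reach="global", external=True, cadence=Cadence.LOOP)
assert requires_human_approval(job)
```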
3. Dry-run plans for irreversible actions
The pattern that consistently works across production agent deployments: every irreversible action gets gated behind a dry-run plan that prints the actual SQL, API call, or shell command before it fires.
- Plan first, execute second. The agent generates a structured plan describing what it intends to do. The plan is rendered as plain text or structured JSON, not as natural-language description.
- Plan diffs against current state. For database operations, the plan includes the exact rows that would be affected. For file operations, the unified diff. For API calls, the request body.
- Approval is on the plan, not on the agent's reasoning. Approvers look at what will actually happen, not at the chain of thought that led there. This decouples the audit from the model's explanation quality.
- Execution is gated on a stable plan id. The agent cannot regenerate a different plan and execute it. Approval signs the plan; execution requires the signed plan id.
The dry-run pattern moves the trust boundary from “trust the model” to “trust this specific generated plan,” which is a much smaller surface to verify.
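A minimal sketch of the plan-first pattern, assuming the plan id is a hash of the plan contents and the executor refuses anything without a signed id. run_action is a hypothetical placeholder for the dispatcher into the agent's real tools.

```python
import hashlib
import json
from dataclasses import dataclass


def run_action(action: str, payload: str) -> None:
    """Placeholder dispatcher; a real system would call the underlying tool here."""
    print(f"executing {action}: {payload}")


@dataclass(frozen=True)
class Plan:
    """What will actually run: the exact statement or request plus the resolved targets."""
    action: str      # e.g. "sql.delete"
    payload: str     # the literal SQL, request body, or shell command
    affected: tuple  # rows, files, or recipients the dry run resolved

    @property
    def plan_id(self) -> str:
        # Stable id derived from the plan contents: regenerating a different plan
        # yields a different id, so a stale approval cannot authorize it.
        raw = json.dumps(
            {"action": self.action, "payload": self.payload, "affected": list(self.affected)},
            sort_keys=True,
        )
        return hashlib.sha256(raw.encode()).hexdigest()


class GatedExecutor:
    """Execution is keyed on signed plan ids, not on the agent's reasoning."""

    def __init__(self) -> None:
        self._signed: set = set()

    def approve(self, plan: Plan) -> str:
        self._signed.add(plan.plan_id)  # approval signs this exact plan
        return plan.plan_id

    def execute(self, plan: Plan) -> None:
        if plan.plan_id not in self._signed:
            raise PermissionError("unapproved plan; execution requires the signed plan id")
        run_action(plan.action, plan.payload)


# Usage: render the plan for review, approve it, then execute that same plan.
plan = Plan("sql.delete", "DELETE FROM sessions WHERE id = 42", ("session:42",))
executor = GatedExecutor()
executor.approve(plan)
executor.execute(plan)
```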
4. Approval flows that actually scale
Approval flows that require a human to read every plan eventually fail, because fatigued reviewers start rubber-stamping. The approval-flow design that scales:
- Auto-approve low blast radius actions. Reversible reads, single-row writes within a known partition, and idempotent calls do not need a human in the loop. Just log them.
- Batch medium blast radius actions. Hundreds of similar small writes get approved as a batch with a sample of the diffs and a count of total rows.
- Block on high blast radius actions. Schema changes, bulk deletes, outbound emails to customers, payment-side effects all require explicit approval per action.
- Track approval rate per agent. An agent that has 95 percent of its plans approved as written gets more autonomy. An agent at 60 percent gets stricter gating until the underlying issue is fixed.
The approval workflow is itself a system that needs metrics, dashboards, and improvement targets. Treat it as production software, not a side feature.
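A sketch of the tiered routing plus the per-agent approval-rate tracking, assuming the risk score is the axis count from the classifier sketch in section 2; class and function names are illustrative.

```python
from collections import defaultdict
from enum import Enum


class Tier(Enum):
    AUTO = "auto_approve"  # low blast radius: just log and run
    BATCH = "batch"        # medium: approve N similar plans with sampled diffs
    BLOCK = "block"        # high: explicit human approval per action


def route(risk: int) -> Tier:
    """risk is the count of worst-case axes from the classifier sketch above."""
    if risk == 0:
        return Tier.AUTO
    if risk == 1:
        return Tier.BATCH
    return Tier.BLOCK


class ApprovalStats:
    """Per-agent approval rate: the input to the autonomy dial described above."""

    def __init__(self) -> None:
        self._plans_reviewed = defaultdict(int)
        self._approved_as_written = defaultdict(int)

    def record(self, agent_id: str, approved_unchanged: bool) -> None:
        self._plans_reviewed[agent_id] += 1
        if approved_unchanged:
            self._approved_as_written[agent_id] += 1

    def approval_rate(self, agent_id: str) -> float:
        reviewed = self._plans_reviewed[agent_id]
        return self._approved_as_written[agent_id] / reviewed if reviewed else 0.0
```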
5. Observability that catches what dry-run misses
Even with dry-run plans and tiered approvals, things still go wrong in production. The observability layer that catches the residual failures:
- Action-level tracing. Every action the agent takes is logged with the originating plan, the approval state, and the actual database/API result.
- Anomaly detection on action volume. An agent that normally writes 20 rows per hour suddenly writing 20,000 should trip an alert before the hour is over.
- Side-effect counters. Outbound emails, payments, and external API calls have hard counters with daily caps that fail closed if exceeded.
- Per-tenant rate limits. An agent that runs across many tenants needs per-tenant rate limits, not just global ones, so a single-tenant failure cannot cascade. A sketch combining fail-closed counters with per-tenant keys follows this list.
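A minimal sketch of a side-effect counter with a daily cap that fails closed, keyed per tenant so one tenant cannot burn the whole budget. This version is in-process only and purely illustrative; a production version would back the counter with shared storage so the cap survives restarts and multiple workers.

```python
import time
from collections import defaultdict


class SideEffectBudget:
    """Hard daily cap on one class of external side effect (emails, payments,
    external API calls), keyed per tenant. Fails closed: once a tenant's cap is
    hit, every further attempt raises instead of silently proceeding."""

    def __init__(self, daily_cap_per_tenant: int) -> None:
        self.daily_cap = daily_cap_per_tenant
        self._counts = defaultdict(int)  # keyed by (day, tenant_id)

    def charge(self, tenant_id: str, n: int = 1) -> None:
        key = (time.strftime("%Y-%m-%d"), tenant_id)
        if self._counts[key] + n > self.daily_cap:
            raise RuntimeError(
                f"daily side-effect cap exceeded for tenant {tenant_id}; failing closed"
            )
        self._counts[key] += n


# Usage: charge the budget *before* firing the side effect, never after.
emails = SideEffectBudget(daily_cap_per_tenant=500)
emails.charge("tenant-a")  # raises once tenant-a exceeds 500 sends today
```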
6. Getting started: a one-week implementation
A practical one-week implementation for an existing agent in production:
- Day 1-2: classify all actions by blast radius. Map every tool the agent has into the four axes. Be honest about what is actually irreversible.
- Day 3: separate planning from execution. Refactor the agent loop so every action passes through a plan generator before the executor.
- Day 4: build the approval queue. Even a simple Slack-based approval flow works. Approvals are on plan ids, not free text.
- Day 5: add the side-effect counters and rate limits. Per-tenant, per-agent, per-action-class.
- Day 6-7: backfill the action trace store and build the volume anomaly detector; a minimal sketch of the detector follows this list.
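A minimal sketch of the volume anomaly detector, assuming hourly action counts are already available from the trace store; the window size, multiplier, and floor are illustrative thresholds, not tuned values.

```python
from collections import deque


class VolumeAnomalyDetector:
    """Rolling-baseline detector: alert when this hour's action count is far
    above the median of the trailing window."""

    def __init__(self, window_hours: int = 24, factor: float = 10.0) -> None:
        self._history = deque(maxlen=window_hours)
        self._factor = factor

    def observe_hour(self, action_count: int) -> bool:
        ordered = sorted(self._history)
        baseline = ordered[len(ordered) // 2] if ordered else 0
        self._history.append(action_count)
        # A small floor stops a near-idle agent from alerting on its first writes.
        return action_count > max(baseline, 5) * self._factor


# An agent that normally writes ~20 rows per hour tripping on a 20,000-row hour:
detector = VolumeAnomalyDetector()
for _ in range(24):
    detector.observe_hour(20)
assert detector.observe_hour(20_000)
```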
The team that ships this scaffolding has a fundamentally different operational posture than the team that ships an agent and hopes the prompt is good enough. The first team can sleep through a model regression. The second cannot.