Guide, topic: AML agent eval harness for a regulatory audit, May 2026

What an AML agent eval harness needs for a regulatory audit, after SR 11-7 was rescinded on April 17 2026.

OCC Bulletin 2026-13 replaced SR 11-7 with a principles-based framework on April 17 2026 and explicitly put generative and agentic AI out of scope. The FFIEC BSA/AML Examination Manual still drives the exam. Five artifacts bridge the gap, each one a file or a git command, not a vendor dashboard.

Matthew Diakonov, Written with AI

Published May 5, 2026 · 9 min read

Direct answer (verified 2026-05-05 against occ.treas.gov and bsaaml.ffiec.gov)

Five files an examiner can read from your repo

Per-alert evidence pack in eval/_artifacts/alert-<id>.yaml, one per disposition, pinned to agent_version_sha and model_primary_pin.
Frozen model pin as a single line in rubric.yaml, never a moving alias.
Quarterly independent-testing report at indep-testing/<quarter>.md, mapped to the FFIEC manual independent-testing section, with the explicit overall-compliance statement.
Override-log query that tests for review rubber-stamping and review paralysis, both explicit BSA/AML failure modes.
Board-channel deficiency report at indep-testing/deficiencies/, tracked with corrective actions and landing dates.

Sources verified on 2026-05-05: OCC Bulletin 2026-13, FFIEC BSA/AML Manual: Independent Testing.

“Generative AI and agentic AI models are novel and rapidly evolving. As such, they are not within the scope of this guidance.”

OCC Bulletin 2026-13, April 17 2026

That sentence is the load-bearing one. The new model risk framework is principles-based and explicitly does not cover the system you shipped. The FFIEC BSA/AML Examination Manual independent-testing section was not rescinded. The harness on this page exists to satisfy the surviving authority directly.

Why this page exists

The April 17 2026 rescission was not a small editorial change. SR 11-7, issued in April 2011, had been the spine of bank model validation for fifteen years, and the 2021 Interagency Statement extended it to BSA/AML systems specifically. The replacement bulletin is shorter, principles-based, and excludes generative and agentic AI on the first page. The agencies have said an RFI on AI model risk management is coming. They have not said when.

In the meantime, FFIEC examiners still walk into BSA/AML exams with the same examination manual they used last quarter. The independent-testing pillar of the BSA program is unchanged. What is new is that the AI agent your team shipped six months ago no longer sits inside a model risk framework that names it.

This is what an AML agent eval harness has to do during the gap: produce evidence the examiner can read directly, mapped to the manual, without leaning on a model risk framework that is gone.

Five artifacts an examiner asks for

Each one is a file or a git command. None of them require a vendor account. The harness produces them as a byproduct of running the pipeline, not as a separate audit-prep exercise.

Artifact 1: per-alert evidence pack

One YAML file per disposition in eval/_artifacts/, committed on every pipeline run. Carries alert_id, agent_version_sha, model_primary_pin, the input_snapshot, the agent_output, the human_review block, and a one-command rerun_command. The examiner reads any one file and reconstructs the decision. Independent testing draws its sample from the same directory; it does not need a vendor export.
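A minimal sketch of the pack's shape, assuming the seven fields named above are top-level keys of the parsed YAML. The field types and the `missing_fields` helper are illustrative assumptions, not the harness's actual schema.

```python
from dataclasses import dataclass, fields
from typing import Any


# Field names come from the description above; the types are assumptions.
@dataclass(frozen=True)
class AlertPack:
    alert_id: str
    agent_version_sha: str
    model_primary_pin: str
    input_snapshot: dict[str, Any]
    agent_output: dict[str, Any]
    human_review: dict[str, Any]
    rerun_command: str


REQUIRED_FIELDS = tuple(f.name for f in fields(AlertPack))


def missing_fields(raw: dict[str, Any]) -> list[str]:
    """Return the required keys that are absent or empty in a parsed pack."""
    return [name for name in REQUIRED_FIELDS if raw.get(name) in (None, "", {})]
```

A CI step could call `missing_fields` on every pack written during a pipeline run and fail the run if any list is non-empty, so an incomplete pack never reaches the directory the testing party samples from.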

Artifact 2: model_primary_pin in rubric.yaml

Exactly one line in rubric.yaml: running grep -c '^model_primary:' rubric.yaml returns 1. The pin is a frozen model snapshot ID (claude-sonnet-4-5-20250929, gpt-4-1-2026-04-22, the Bedrock model ARN), not a moving alias. At agent version SHA 9f3c2a7, the model the agent actually called is whatever this line held at that commit. python eval/replay.py reads that line and reproduces the decision against the same snapshot.
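The single-line rule can also be enforced in CI rather than checked by hand. The sketch below mirrors the grep count and adds a heuristic alias check. The heuristic (a frozen ID embeds a date and never ends in "latest") is an assumption drawn from the example IDs on this page, and `check_pin` is a hypothetical helper, not part of the harness.

```python
import re

# Assumption: snapshot IDs embed a date as YYYYMMDD or YYYY-MM-DD.
DATED = re.compile(r"\d{4}-?\d{2}-?\d{2}")


def check_pin(rubric_text: str) -> str:
    """Return the pinned model ID, or raise ValueError if the pin is
    missing, duplicated, or looks like a moving alias."""
    # Same test as: grep -c '^model_primary:' rubric.yaml
    lines = [ln for ln in rubric_text.splitlines() if ln.startswith("model_primary:")]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one model_primary line, found {len(lines)}")
    # Split on the first colon only, so IDs such as "...-v1:0" stay whole.
    value = lines[0].split(":", 1)[1].split("#", 1)[0].strip().strip("'\"")
    if value.endswith("latest") or not DATED.search(value):
        raise ValueError(f"model_primary looks like a moving alias: {value!r}")
    return value
```

The date heuristic is deliberately conservative: it would reject a legitimate snapshot ID that carries no date, which is the safer failure direction for an audit artifact.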

Artifact 3: independent-testing report mapped to the FFIEC manual

indep-testing/2026-Q2.md, generated quarterly from eval/_artifacts/. Names the testing party, asserts they were not involved in policy or training, states the testing frequency rationale tied to the institution's risk profile, gives the sample size and selection method, and ends with the explicit statement on overall BSA/AML compliance the manual asks for. Reports directly to the board's audit or risk committee. This is the document the examiner asks for first.

Artifact 4: human-override log as a first-class column

Every alert pack carries human_review.decision, override_reason, and time_to_review_seconds. The aggregate query (python eval/override_rate.py --since 2026-04-01) returns the override rate, the modal override reason, and the time-to-review distribution. The examiner uses this to test for two failures the manual cares about: review-rubber-stamping (override rate near zero with high alert volume) and review-paralysis (time-to-review climbing because alert quality is dropping).
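The aggregate could be computed along these lines. This is a sketch, not the contents of eval/override_rate.py: it assumes each human_review block has already been loaded into a dict, that an override is recorded as the decision value "override" (a hypothetical encoding), and that a nearest-rank percentile is an acceptable summary of the distribution.

```python
import math
from collections import Counter
from statistics import median


def nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_reviews(reviews):
    """Summarize a non-empty list of human_review dicts."""
    if not reviews:
        raise ValueError("no reviews in the selected period")
    # Assumption: overrides are marked with decision == "override".
    overrides = [r for r in reviews if r["decision"] == "override"]
    reasons = Counter(r["override_reason"] for r in overrides)
    times = [r["time_to_review_seconds"] for r in reviews]
    return {
        "override_rate": len(overrides) / len(reviews),
        "modal_override_reason": reasons.most_common(1)[0][0] if reasons else None,
        "time_to_review_p50": median(times),
        "time_to_review_p95": nearest_rank(times, 95),
    }
```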

Artifact 5: board-channel deficiency report

indep-testing/deficiencies/<YYYY-MM-DD>-<short-id>.md, one per finding, committed when the testing party files it. Carries the deficiency, severity, the corrective action committed by management, and the date the corrective action lands. The board minutes link to the file by SHA. The examiner checks that the channel exists, that deficiencies are tracked, and that corrective actions are documented (the FFIEC manual checklist almost word for word).
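As an illustration of how the tracking could be made checkable, the sketch below models one finding and lists corrective actions that have passed their landing date. The `closed_on` field and the `overdue` helper are hypothetical additions for the example; the page only specifies the deficiency, severity, corrective action, and landing date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Deficiency:
    filed_on: date
    short_id: str
    deficiency: str
    severity: str
    corrective_action: str
    lands_on: date                      # date the corrective action is due to land
    closed_on: Optional[date] = None    # hypothetical: set when the action has landed

    def path(self) -> str:
        """File path following the <YYYY-MM-DD>-<short-id>.md convention."""
        return f"indep-testing/deficiencies/{self.filed_on.isoformat()}-{self.short_id}.md"


def overdue(items, today):
    """Open deficiencies whose corrective action was due before `today`."""
    return [d for d in items if d.closed_on is None and d.lands_on < today]
```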

What artifact 1 looks like in the repo

One file per disposition. The examiner reads any one file and reconstructs the decision. The independent testing party draws a random sample from the same directory.

eval/_artifacts/alert-2026-05-04-000142.yaml
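The sampling step is simple enough to sketch. The version below assumes packs follow the alert-<id>.yaml naming shown above and that the testing party records a seed in its report so the selection can be re-drawn; the seed convention and the `draw_sample` name are assumptions for illustration.

```python
import random
from pathlib import Path


def draw_sample(artifact_dir, sample_size, seed):
    """Draw a reproducible random sample of alert packs from a directory."""
    # Sort first so the result depends only on the seed, not on filesystem order.
    packs = sorted(Path(artifact_dir).glob("alert-*.yaml"))
    if sample_size > len(packs):
        raise ValueError(f"asked for {sample_size} packs, only {len(packs)} available")
    return random.Random(seed).sample(packs, sample_size)
```

Calling `draw_sample("eval/_artifacts", 25, seed=20260630)` twice returns the same 25 paths, which lets an examiner confirm the sample named in the report was not hand-picked.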

What the examiner runs to verify reproducibility

The pin in rubric.yaml plus the alert pack in eval/_artifacts/ make this a four-command exercise. We have run this in a real exam-prep session; the verification took less than ten minutes per sampled alert.

examiner-replay-session

Before and after the April 17 rescission

The harness shape does not change much; the framing in the independent-testing report changes a lot. The rescission is not a reason to ship less evidence; it is a reason to ship the evidence directly to the BSA/AML manual, not to a model-risk framework that no longer exists.

Feature	What relied on SR 11-7	What survives the rescission
Operative model risk framework	SR 11-7 / OCC 2011-12 / FIL-22-2017 are still cited on most vendor pages. They were rescinded April 17 2026.	OCC Bulletin 2026-13 (April 17 2026), explicitly principles-based and explicitly out-of-scope for generative and agentic AI. The agencies have signalled an RFI on AI MRM is coming.
What an FFIEC examiner brings to the exam	A check that an SR 11-7-flavored model validation document exists, with a vendor logo on the cover. No mapping to the BSA/AML manual sections the examiner actually reads.	The BSA/AML Examination Manual independent-testing section. It has not been rescinded. It still drives sample selection, board-reporting checks, and conflict-of-interest review for the testing party.
How the agent decision is reproducible	Agent decisions live in a vendor case-management database. The replay path goes through a vendor API that may or may not pin to the same model snapshot the agent saw.	Every alert disposition is a YAML file in eval/_artifacts/, pinned to an agent_version_sha and a model_primary_pin. python eval/replay.py --alert <id> re-runs the same model on the same input.
Independent testing artifact	A vendor PDF marked 'model validation' with no statement on overall BSA/AML compliance. Often dated, often produced by the same team that built the agent (which fails the conflict-of-interest test).	indep-testing/2026-Q2.md, generated from eval/_artifacts/, naming the testing party, scope, sample size, frequency rationale, and the explicit conclusion the manual asks for. Reports directly to the board committee.
What happens during the agentic-AI MRM gap	Wait for the RFI. Tell the examiner the framework is being updated. Ship a deck.	The harness produces evidence the FFIEC examiner can read without a model risk framework: reproducibility, sample, board-channel deficiency log. The void is bridged by being more concrete, not by guessing at the future RFI.

What the quarterly independent-testing report looks like

Generated by python eval/run_independent_testing.py at the end of each quarter. The eight sections map to what the FFIEC BSA/AML Examination Manual independent-testing section says the report should contain. Section 7 is the explicit statement the manual asks for.

indep-testing/2026-Q2.md

What this is not

This is not a substitute for legal review or for the institution's BSA officer's judgement on what their examiner expects. It is also not a forecast of the AI MRM RFI. We do not know what the agencies will publish. We know what the FFIEC manual already says, and we know what the replacement bulletin explicitly excludes. The harness on this page is the conservative bridge between those two facts.

When the RFI lands, the harness gets a new section in the independent-testing report. The alert packs and the rubric pin do not change.

Bridge the SR 11-7 gap with named engineers in your repo

A 30-minute scoping call returns a one-pager. Week 1 ships rubric.yaml with the model_primary_pin and the first alert pack landing in CI. The week 2 prototype runs the quarterly independent-testing generator against your own production traffic.

Frequently asked questions

What is the literal answer: what does an AML agent eval harness need for a regulatory audit?

Five artifacts the examiner can read from your repo without a vendor export. (1) A per-alert evidence pack: one YAML file per disposition, committed to git, carrying agent_version_sha, model_primary_pin, input snapshot, agent output, and human review. (2) A model_primary_pin in rubric.yaml that is a frozen model snapshot, not a moving alias. (3) A quarterly independent-testing report mapped to the FFIEC BSA/AML Examination Manual independent-testing section, with the explicit statement on overall BSA/AML compliance the manual asks for. (4) A human-override log queried by a script, used to test for review rubber-stamping and review paralysis. (5) A board-channel deficiency report tracked under indep-testing/deficiencies/ with corrective actions and dates. Each one is a file or a git command, not a dashboard.

What changed on April 17 2026 and why does it matter for an AML agent?

OCC Bulletin 2026-13 rescinded SR 11-7, OCC 2011-12, and FIL-22-2017 and replaced them with a principles-based interagency framework. The bulletin contains one sentence that is load-bearing for AI teams: 'Generative AI and agentic AI models are novel and rapidly evolving. As such, they are not within the scope of this guidance.' The agencies have said they will issue an RFI on AI model risk management later. AML agents that took dispositions on production traffic are now in an interim governance void: the framework they were validated against is rescinded, and the new framework explicitly does not cover them. The FFIEC BSA/AML Examination Manual is the surviving authority an examiner brings to the exam, and the harness has to satisfy it directly.

Why pin model_primary in rubric.yaml as a snapshot ID and not a moving alias?

Two reasons. First, reproducibility: when the examiner asks why alert-2026-05-04-000142 was dispositioned the way it was, the answer requires re-running the agent against the same model the agent actually called. A moving alias (claude-sonnet-latest, gpt-4-latest) silently rewrites history. The pin freezes the snapshot ID so python eval/replay.py reproduces the decision. Second, independent testing: the testing party draws a sample from eval/_artifacts/ across an SHA range. If the underlying model alias rotated mid-quarter, the sample is testing two different models. The pin makes the sample homogeneous.
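The homogeneity claim in the second reason is easy to test mechanically. A sketch, assuming each sampled pack has been parsed into a dict with the alert_id and model_primary_pin keys described earlier; both function names are hypothetical.

```python
from collections import defaultdict


def pins_in_sample(packs):
    """Map each model_primary_pin in the sample to the alert IDs that used it."""
    by_pin = defaultdict(list)
    for pack in packs:
        by_pin[pack["model_primary_pin"]].append(pack["alert_id"])
    return dict(by_pin)


def is_homogeneous(packs):
    """True when every pack in the sample was produced against the same pin."""
    return len(pins_in_sample(packs)) <= 1
```

When the pin was deliberately changed mid-quarter, `pins_in_sample` gives the testing party the split it needs to report results per snapshot rather than blended across two models.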

Does the FFIEC BSA/AML Examination Manual independent-testing pillar still apply if the agent is brand new?

Yes. The manual does not condition independent testing on the underlying technology. It conditions frequency on the institution's ML/TF risk profile, requires the testing party to be independent of policy and training functions, requires the report to reach the board or a designated board committee, and requires an explicit statement on overall BSA/AML compliance. A new agent triggers an elevated cadence in the frequency-rationale section, not an exemption. The harness produces the report quarterly until the agent's risk profile drops to where annual or 12 to 18 month testing is defensible.

How does the harness test for review rubber-stamping and review paralysis?

Two queries against the override log. Rubber-stamping: override rate stays near zero while alert volume rises. The harness flags it when the override rate over a rolling 90-day window drops below the institution's calibrated floor, because that pattern means the human reviewer is approving everything the agent says without independent thought. Paralysis: time-to-review p95 climbs while alert quality (proxied by override-reason distribution) shifts toward agent-cited-rule-does-not-apply. That pattern means the agent is generating noisier alerts and the reviewer is spending longer on each. Both are explicit failure modes the BSA/AML manual cares about. The override log is the place they are visible.
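Both checks can be sketched as comparisons over 90-day windows. The code below is an illustration under stated assumptions, not the harness's queries: `reviewed_on` is a hypothetical date field on each review row, an override is encoded as decision == "override", "climbing" is reduced to a comparison against the immediately preceding window, and the rubber-stamping check tests only the floor, not alert volume.

```python
import math
from datetime import date, timedelta

WINDOW = timedelta(days=90)
NOISE_REASON = "agent-cited-rule-does-not-apply"


def in_window(reviews, end):
    """Rows reviewed in the 90 days ending on `end` (inclusive)."""
    return [r for r in reviews if end - WINDOW < r["reviewed_on"] <= end]


def override_rate(rows):
    return sum(r["decision"] == "override" for r in rows) / len(rows)


def p95(rows):
    """Nearest-rank 95th percentile of time-to-review, in seconds."""
    ordered = sorted(r["time_to_review_seconds"] for r in rows)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def noise_share(rows):
    """Share of overrides whose reason is the noisy-alert reason."""
    overrides = [r for r in rows if r["decision"] == "override"]
    if not overrides:
        return 0.0
    return sum(r["override_reason"] == NOISE_REASON for r in overrides) / len(overrides)


def rubber_stamping(reviews, end, floor):
    """Flag when the rolling 90-day override rate is below the calibrated floor."""
    recent = in_window(reviews, end)
    return bool(recent) and override_rate(recent) < floor


def paralysis(reviews, end):
    """Flag when p95 review time and the noisy-reason share both rose versus the prior window."""
    recent = in_window(reviews, end)
    prior = in_window(reviews, end - WINDOW)
    if not recent or not prior:
        return False
    return p95(recent) > p95(prior) and noise_share(recent) > noise_share(prior)
```

The floor is passed in rather than hard-coded because the text above ties it to the institution's own calibration.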

Who can be the testing party for an AML agent independent test?

Per the FFIEC manual, internal audit, outside auditors, outside consultants, or qualified bank staff who are not involved in the function being tested. The party reports directly to the board or a designated board committee. The conflict-of-interest constraint is the binding one for AI agents: the team that built the agent (and wrote the prompts and the rubric) cannot also be the testing party. We separate the engagement: the build team ships rubric.yaml, eval/cases.yaml, and the alert pack format. The testing party owns indep-testing/ and runs the quarterly report against the artifacts. If the institution does not have an independent function, the manual allows qualified bank staff outside the BSA function to perform the test.

What does the FinCEN proposal of April 7 2026 change for an AML agent in production today?

Comments on the proposal close on June 9 2026. As proposed, financial institutions get 12 months after the issuance of a final rule to comply, and examiners shift focus toward whether AML programs are operating effectively (with material or systemic failures driving enforcement) rather than isolated technical issues. The harness becomes more important under the proposal, not less, because operating effectiveness is exactly what the per-alert evidence pack and the override-rate query measure. We are tracking the proposal week by week; if the final rule lands with material changes, the indep-testing report template gets a new section. We will not regenerate the agent, the rubric, or the alert packs.

What if my AML agent is on Bedrock with a Claude snapshot, not a direct provider call?

Same five artifacts. The model_primary_pin is the Bedrock model ARN with the explicit version (anthropic.claude-sonnet-4-5-20250929-v1:0), not the alias (anthropic.claude-sonnet-4-5). python eval/replay.py speaks Bedrock through the same provider abstraction the agent uses in production. The independent-testing report names Bedrock as the inference layer in the scope section. The alert pack carries the Bedrock invocation ID alongside the agent_version_sha so the AWS-side audit trail (CloudTrail, Bedrock invocation logs) can be cross-checked against the harness-side audit trail. We have shipped this shape on Bedrock, on Vertex, on Azure OpenAI, and on direct Anthropic. The pin syntax is the only thing that changes.

What does PIAS leave behind on day 42 that makes the audit story durable?

Five files, one cron, no vendor account. rubric.yaml with the model_primary_pin (week-1 PR). The alert pack format and eval/replay.py (week-2 PR). The eval/run_independent_testing.py generator and the indep-testing/<quarter>.md template (week-3 PR). The override-rate and time-to-review queries (week-4 PR). The .github/workflows/aml-evidence-cron.yml that re-verifies a random sample of the prior quarter's packs every Friday (week-6 PR). The 90-minute transfer session walks through each file with the named engineer and the named member of the audit function on your side. After day 42, none of these files reference PIAS; they reference your engineers, your CI, and your board committee.
