Guide, topic: AI governance, data quality, EU AI Act, financial, insurance
The 76 percent data-quality gap, and the artifact list that closes it.
Thomson Reuters reported a 76 percent data-quality gap in enterprise AI deployments. In mid-market financial and insurance shops, the gap is the project. The eval harness and the lineage documentation eat 30 to 40 percent of the engagement before any model code ships, and once EU AI Act high-risk classification kicks in, the artifact list stops being optional. This guide gives the concrete shape: which documents are required, where the data-quality constraint actually binds, and what a model-risk review checklist looks like in a real repo.
Why data quality is the binding constraint, not model quality
In a financial or insurance shop, the model is mostly a commodity. The same foundation models are available to every competitor. The thing that decides whether the deployment ships is whether the data the model sees is fit for the decision the model is making. Underwriting, credit, claims, and fraud decisions all live or die on data quality. The Thomson Reuters number (76 percent of enterprise deployments report a meaningful data-quality gap) is not a measurement problem; it is a structural one. The data is bespoke per company, the lineage is under-documented, the freshness pipelines have no clear owner, and the bias measurements have not been run.
Closing the gap is not a model project. It is a data engineering project plus a documentation project plus an eval project, in that order. The model code is the last 20 percent. The first 80 percent is the lineage graph, the per-source quality scorecards, the bias slices, the change-detection jobs, and the artifact list. That 80 percent is also what the EU AI Act asks for in writing once the system is classified high-risk, which is why the regulatory cost and the engineering cost converge: the same artifacts satisfy both the regulator and the operational requirement to ship a system that does not fail in production.
That 30 to 40 percent of the engagement is not overhead. It is the foundation. Skipping it is what produces the gap.
Six artifacts the EU AI Act high-risk classification requires
Each of these maps to a section of the EU AI Act high-risk requirements. Each one is also an operational artifact the team would benefit from having even if regulation did not apply. The artifacts are the same; the regulator's vocabulary is just a renamed view of the engineering work.
System documentation, with versioning
Every model deployment has a stored system-documentation artifact: what the model is, what it does, what data it was trained on, what data it operates on, what its inputs and outputs are, what its known failure modes are, what its accuracy and robustness scores are, and what its operational envelope is. The artifact is generated from the repo and the eval harness and versioned alongside the model. The EU AI Act calls this 'technical documentation'. The phrase is regulatory; the artifact is engineering.
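A minimal sketch of what 'generated from the repo and the eval harness' can look like in practice. The eval-results schema, the output path, and the field names here are assumptions for illustration, not a prescribed layout:

```python
# Sketch: build a versioned system-documentation artifact from eval output.
# The JSON schema and paths are illustrative assumptions.
import json
import subprocess
from pathlib import Path

def generate_system_doc(eval_results_path: str, out_dir: str = "docs/system") -> Path:
    results = json.loads(Path(eval_results_path).read_text())
    # Pin the document to the exact repo state the scores came from.
    version = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    lines = [
        f"# System documentation: model {results['model_id']} @ {version}",
        f"- Intended use: {results['intended_use']}",
        f"- Training data: {results['training_data_summary']}",
        f"- Operational sources: {', '.join(results['sources'])}",
        "## Accuracy and robustness",
        *(f"- {s}: {score:.3f}" for s, score in results["slice_scores"].items()),
        "## Known failure modes",
        *(f"- {fm}" for fm in results["known_failure_modes"]),
    ]
    out = Path(out_dir) / f"model-{results['model_id']}-{version}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")
    return out
```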
Data governance plan, with measurable axes
Completeness, freshness, consistency, lineage coverage, and bias measurement, per source. Each axis has a threshold and an owner. The plan is in the repo. It runs on a schedule. The bias measurement is the most-skipped part and the most-required for high-risk classification, especially in insurance and lending. The plan also names how missing data is handled, how outliers are handled, and how the team knows when to retrain.
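A sketch of what 'each axis has a threshold and an owner' can look like when the plan runs as code on a schedule. The source, the scores, the thresholds, and the owner names below are invented for illustration:

```python
# Sketch: one source's scorecard on the five governance axes.
# Axis values, thresholds, and owners are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AxisCheck:
    name: str
    score: float      # measured by the scheduled job
    threshold: float  # below this, the owner gets paged
    owner: str

    def passes(self) -> bool:
        return self.score >= self.threshold

claims_feed = [
    AxisCheck("completeness",     0.994, 0.99, "data-eng"),
    AxisCheck("freshness",        0.97,  0.95, "data-eng"),
    AxisCheck("consistency",      0.988, 0.98, "data-eng"),
    AxisCheck("lineage_coverage", 1.00,  1.00, "platform"),
    AxisCheck("bias_parity",      0.91,  0.90, "risk"),
]

for axis in claims_feed:
    if not axis.passes():
        print(f"PAGE {axis.owner}: {axis.name} = {axis.score} < {axis.threshold}")
```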
Risk management plan, tied to the eval rubric
The rubric file in the repo is the operational form of the risk management plan. It names the slices the team grades on, the thresholds for each axis, the high-stakes hard gate, the override policy, and the escalation path. Risk management is not a separate Word doc; it is the rubric, with a cover note that maps the rubric sections to the regulator's vocabulary.
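As a sketch, the rubric can be an ordinary checked-in structure that both the harness and the reviewer read. The slice names, thresholds, and roles below are illustrative assumptions:

```python
# Sketch: the rubric as a machine-readable file the harness enforces.
# Slice names, thresholds, and override roles are illustrative assumptions.
RUBRIC = {
    "slices": {
        "underwriting_standard":   {"min_score": 0.92},
        "underwriting_high_value": {"min_score": 0.97, "hard_gate": True},
        "claims_fraud_flag":       {"min_score": 0.95, "hard_gate": True},
    },
    "override_policy": {
        "who": ["chief_risk_officer", "senior_underwriter"],
        "requires": ["timestamp", "reason", "reviewer"],
    },
    "escalation": "risk on-call -> CRO -> board risk committee",
}

def gate(slice_scores: dict) -> list:
    """Return slices that block deployment. Hard-gated slices cannot be
    waived; the rest can be overridden under the override policy."""
    blocked = []
    for name, spec in RUBRIC["slices"].items():
        if slice_scores.get(name, 0.0) < spec["min_score"]:
            blocked.append(name if spec.get("hard_gate") else f"{name} (overridable)")
    return blocked
```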
Human-oversight plan, with an actual queue
Human oversight is the part most teams write a paragraph about and never operationalize. The artifact that satisfies the regulator is a documented intervention queue, an audit log of every intervention, and a documented escalation path. The override policy in the rubric is the same artifact viewed from a different angle. Every override gets logged with a timestamp, a reason, and a reviewer.
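A minimal sketch of that log as an append-only record; the file path and field names are assumptions:

```python
# Sketch: append-only override log. Every intervention carries a timestamp,
# a reviewer, and a reason. Path and fields are illustrative assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("audit/override-log.jsonl")

def log_override(decision_id: str, reviewer: str, reason: str, new_outcome: str) -> None:
    if not reason.strip():
        # An override without a reason is a hole in the audit trail.
        raise ValueError("override requires a reason")
    LOG.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "decision_id": decision_id,
        "reviewer": reviewer,
        "reason": reason,
        "new_outcome": new_outcome,
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```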
Accuracy, robustness, and cybersecurity records
The eval harness and the regression set produce the accuracy and robustness records on a continuous basis. The cybersecurity records come from the security review of the deployment: prompt-injection slice scores, secret-leak slice scores, data-exfiltration slice scores, and the audit of the agent's privilege envelope. All three record sets are stored as part of the deployment PR. None of them are reconstructed after the fact.
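One way 'stored as part of the deployment PR' can look: the harness writes each record set into the PR's artifact directory as it runs. The scores, slice names, and paths below are illustrative:

```python
# Sketch: persist accuracy/robustness and security slice scores into the
# deployment PR's artifact directory. Scores and paths are illustrative.
import json
from pathlib import Path

def record_scores(scores: dict, out: str) -> None:
    path = Path(out)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scores, indent=2, sort_keys=True) + "\n")

# Accuracy and robustness from the regression set:
record_scores({"underwriting_standard": 0.94, "ocr_noise_robustness": 0.91},
              "deploy-artifacts/eval/slice-scores.json")
# Security from the red-team slices:
record_scores({"prompt_injection": 0.99, "secret_leak": 1.00, "data_exfiltration": 0.98},
              "deploy-artifacts/security/slice-scores.json")
```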
Conformity assessment and registration, kept current
For high-risk systems, the conformity assessment and the EU registration are not one-time events. They are kept current as the system changes. A model swap, a corpus refresh, or a slice expansion is a re-assessment. The artifact list in the repo includes the date of the last assessment and the diff that triggered the next one. The registration is updated when the assessment changes.
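A sketch of how 'a model swap, a corpus refresh, or a slice expansion is a re-assessment' can be enforced mechanically, assuming the repo keeps those under predictable paths (the path prefixes here are assumptions about the layout):

```python
# Sketch: flag a conformity re-assessment from the diff. The path prefixes
# are assumptions about the repo layout.
import subprocess

REASSESS_TRIGGERS = ("models/", "corpus/", "eval/slices/")

def reassessment_triggers(base: str = "origin/main") -> list:
    changed = subprocess.check_output(
        ["git", "diff", "--name-only", base, "HEAD"], text=True
    ).splitlines()
    return [p for p in changed if p.startswith(REASSESS_TRIGGERS)]

if __name__ == "__main__":
    hits = reassessment_triggers()
    if hits:
        print("Conformity re-assessment required; triggered by:")
        print("\n".join(f"  {p}" for p in hits))
```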
What the model-risk review checklist looks like in a real repo
The checklist sits at docs/model-risk-checklist.md and runs before every model deployment. The deployment PR has to attach (a CI sketch follows this list):
- per-slice eval-harness scores against the rubric;
- a lineage diff against the previous deployment (which sources changed, and by how much);
- bias-slice results on protected categories (gender, age, race where lawful, geography, income bracket);
- the override log for the previous quarter (what got overridden, who signed off, why);
- the human-in-the-loop coverage rate (what fraction of decisions had a human reviewer, on which slices);
- latency and cost metrics;
- security slice scores (prompt injection, secret leak, data exfiltration, privilege boundary).
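A sketch of the CI gate behind that checklist; the artifact filenames are assumptions to be matched to the real repo:

```python
# Sketch: block the deployment PR if any required artifact is missing.
# Artifact paths are illustrative assumptions.
import sys
from pathlib import Path

REQUIRED_ARTIFACTS = [
    "eval/slice-scores.json",        # per-slice harness scores vs the rubric
    "lineage/diff-vs-previous.md",   # which sources changed, by how much
    "bias/slice-results.json",       # protected-category slices
    "audit/override-log.jsonl",      # previous quarter's overrides
    "oversight/hitl-coverage.json",  # human-in-the-loop coverage rate
    "ops/latency-cost.json",
    "security/slice-scores.json",    # injection, leak, exfiltration, privilege
]

def check(pr_artifact_dir: str) -> int:
    root = Path(pr_artifact_dir)
    missing = [a for a in REQUIRED_ARTIFACTS if not (root / a).exists()]
    for artifact in missing:
        print(f"BLOCKED: missing {artifact}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1] if len(sys.argv) > 1 else "deploy-artifacts"))
```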
The chief data officer or chief risk officer reviews and approves the PR. Every reviewer has a documented role in the checklist; nothing is approved by an unnamed person. The PR is the audit trail. Nothing in the trail is reconstructed from memory after the fact, which is the failure mode that produces six-month consulting engagements after a regulator notice.
Lineage as a first-class artifact
Lineage is the artifact regulators ask about most and the one most teams have least. The shape that satisfies both engineering and regulation: every input the model sees has a documented source, a documented owner, a documented capture date, a documented transformation history, and a documented quality score. The lineage graph is generated by the ingest pipeline and stored in the repo at docs/lineage/{source}.md plus a machine-readable graph that the harness can replay. PRs that touch ingest config also touch the lineage doc; the team treats lineage as code.
The graph also drives change detection. When a source's quality score drops below threshold, or the freshness clock runs past the budget, the lineage owner gets paged. The failure mode that the lineage graph prevents is the silent drift where a source upstream of three models stops updating and the team does not notice until a customer complaint surfaces six weeks later.
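A sketch of that change-detection job, assuming a machine-readable graph with per-source freshness budgets and quality thresholds. The schema and the paging hook are illustrative, and capture timestamps are assumed to be ISO-8601 with a UTC offset:

```python
# Sketch: page the lineage owner on staleness or a quality drop.
# Graph schema and the page() hook are illustrative assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

def page(owner: str, message: str) -> None:
    # Stand-in for the real paging integration.
    print(f"PAGE {owner}: {message}")

def check_lineage(graph_path: str = "docs/lineage/graph.json") -> None:
    graph = json.loads(Path(graph_path).read_text())
    now = datetime.now(timezone.utc)
    for source in graph["sources"]:
        captured = datetime.fromisoformat(source["last_capture"])  # offset-aware
        age_hours = (now - captured).total_seconds() / 3600
        if age_hours > source["freshness_budget_hours"]:
            page(source["owner"],
                 f"{source['name']} stale: {age_hours:.0f}h > "
                 f"{source['freshness_budget_hours']}h budget")
        if source["quality_score"] < source["quality_threshold"]:
            page(source["owner"],
                 f"{source['name']} quality {source['quality_score']} "
                 f"below {source['quality_threshold']}")
```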
Compliant deployment vs ad-hoc deployment, side by side
The third column of the table is the deployment shape that survives a regulator audit and a data-quality incident. The second column is the common pattern in mid-market shops: a launch deck, a verbal sign-off, and a CDO who is asked three questions on the day of deployment. The ad-hoc column is faster on project one. The repo-native column is the only thing that survives an EU AI Act inquiry.
| Feature | Ad-hoc launch deck and verbal sign-off | Repo-native artifact list, lineage graph, model-risk checklist |
|---|---|---|
| What 'data quality' means in practice | An assertion in the launch deck. 'Data is high quality.' No measurement, no owner, no history. | A measured score per source: completeness, freshness, consistency, lineage coverage. Tracked over time, alarmed when any axis drops below threshold, owned by a named role. |
| Where lineage lives | In an engineer's head. Possibly in a Confluence page from 18 months ago that mentions three of the nine sources. | In the repo as docs/lineage/{source}.md plus a machine-readable lineage graph. Generated by the ingest pipeline. Reviewed on PRs that touch ingest config. |
| What an EU AI Act high-risk classification triggers | Panic. A meeting with legal. An external consultant. Six months of consulting fees and the realization that the artifacts have to be reconstructed from incomplete records. | The artifact list (system documentation, risk management plan, data governance plan, transparency documentation, human-oversight plan, accuracy/robustness/cybersecurity records, conformity assessment, registration). Each artifact has a named owner and a stored version in the repo. |
| How a model risk review actually runs | An ad-hoc meeting with the chief data officer right before the deployment. The CDO asks three questions. The engineer answers from memory. The deployment ships. | A checklist in the repo, executed before every model deployment. Eval-harness scores, lineage diff, bias-slice results, override log, and human-in-the-loop coverage are all attached to the deployment PR. |
| What 'human oversight' means | A line item in the launch deck. 'Humans can intervene.' No queue, no log, no documented process. | A documented intervention queue with audit logs. The override policy in the eval rubric names who can override what, and every override is logged with a timestamp and a reason. |
| What happens during a regulator audit | Two months of reconstruction. Legal time. Consulting time. A finding that requires remediation. A second audit six months later. | The repo gets exported. The artifact list maps directly to the regulator's checklist. The audit takes a week, with the team mostly answering clarifying questions. |
| How the bar moves over time | Silently. Each project rationalizes a slightly lower bar. By year three the practice is materially different from year one and nobody can date the change. | The rubric and the artifact list are version-controlled. Threshold changes are PRs, reviewable by risk and compliance. The bar moves deliberately or it does not move. |
Where fde10x fits
fde10x is one option for mid-market financial and insurance teams that want a senior engineer to embed for four to eight weeks and ship the eval harness, the lineage docs, the bias slices, the model-risk checklist, and the EU AI Act artifact set into the client repo. The work is collaborative across engineering, risk, compliance, and legal; the engineer drives the artifact authoring sessions and leaves the team owning the artifact list. fde10x is not the only path; teams can and do build this themselves. The embed is the right call when the team has been shipping by gut feel and rolling back at a measurable rate, when a regulator notice has arrived, or when the artifact list has been "we'll get to it" for two quarters and the data-quality gap keeps showing up as production incidents.
Want a senior engineer to ship your EU AI Act artifact list and the data-quality scorecards?
A 60-minute scoping call with the engineer who would own the build. You leave with a draft of the artifact list against your stack, the lineage graph design, the bias-slice plan, the model-risk checklist, and a fixed weekly rate to ship the eval harness, the regression suite, and the first clean PR run inside your repo.
AI governance and the EU AI Act, answered
Why do the eval harness and the lineage documentation eat 30 to 40 percent of the engagement?
Because in financial and insurance shops the binding constraint is data quality, not model quality. The model is commodity. The data is bespoke. Every project starts with two to three weeks of work on the ingest pipeline, the lineage graph, the per-source quality scorecards, the bias-measurement slices, and the change-detection job that catches drift. Without that work, the eval harness has nothing reliable to score against, the rubric thresholds are guesses, and the regulator's artifact list cannot be generated from the repo. The 30 to 40 percent is not overhead; it is the foundation that the rest of the project sits on. Skipping it is what produces the Thomson Reuters 76 percent gap.
What does EU AI Act high-risk classification actually require?
Roughly nine artifacts. System documentation. Risk management plan. Data governance plan. Transparency documentation for users. Human-oversight plan. Accuracy, robustness, and cybersecurity records. Quality management system documentation. Conformity assessment. Registration in the EU database for high-risk AI systems. Each artifact has specific contents and is auditable. For mid-market financial and insurance use cases (credit scoring, insurance underwriting, fraud detection, claims triage), high-risk classification is common and the artifact list is binding. The artifact list is not optional once the classification applies, and the cost of reconstructing artifacts after the fact is materially higher than the cost of generating them as the system is built.
How do you measure data quality in a way the regulator accepts?
Per-source scorecards on five axes (completeness, freshness, consistency, lineage coverage, bias), each with a threshold, an owner, and a history. The scorecards are generated on a schedule (daily for fresh data, weekly for slow-moving data) and stored in the repo. Lineage coverage is the most-asked-about axis: the regulator wants to know that every input the model uses can be traced back to a source with a documented owner, a documented capture date, and a documented transformation history. Bias is the second-most-asked-about axis: the regulator wants to know that the system has been measured for bias on protected categories at the input layer, the model layer, and the output layer.
What does 'human oversight' look like operationally?
A documented intervention queue plus an audit log plus a defined escalation path. The intervention queue is where flagged decisions land for human review. The audit log records every intervention with a timestamp, a reviewer, a decision, and a reason. The escalation path names who reviews what and how long they have to respond. The override policy in the eval rubric is the same artifact viewed from a different angle. The thing the regulator is checking for is whether oversight is real (with a queue and a log) or hypothetical (with a paragraph in the launch deck).
Where does the model risk review checklist live?
In the repo, at docs/model-risk-checklist.md. It runs before every model deployment. Eval-harness scores per slice, lineage diff against the previous deployment, bias-slice results on protected categories, override log for the previous quarter, human-in-the-loop coverage rate, latency and cost metrics, and security slice scores are all attached to the deployment PR. The chief data officer or chief risk officer reviews and approves. The PR is the audit trail; nothing is reconstructed from memory after the fact.
Where does fde10x fit?
fde10x is one option for mid-market financial and insurance teams that want a senior engineer to embed for four to eight weeks and ship the eval harness, the lineage docs, the bias slices, the model-risk checklist, and the EU AI Act artifact set into the client repo. Common engagement: weeks 1 to 2 collaborating with engineering, risk, compliance, and legal on the rubric and the artifact list; weeks 3 to 4 wiring the lineage pipeline and the bias measurement; weeks 5 to 6 backfilling the regression set from past underwriting or claims decisions; weeks 7 to 8 connecting the harness and the artifact generation to CI. The leave-behind is the artifact list generated from the repo, owned by the team, version-controlled, and ready for a regulator export.