
The AI execution gap is a missing rubric.

"Demos great, struggles in production" is a category, not a mystery. The cause is almost always the same: the team shipped against a demo rubric (one curated scenario, one binary grader, one happy path) and production runs against a different rubric entirely (every input the long tail can produce, multi-turn conversations, high-stakes failures that the demo never included). The execution gap is the distance between those two rubrics. This guide is about closing it: making the production rubric explicit, putting it in the repo, and turning the eval harness into a shipping gate that runs on every change instead of an afterthought that runs once before launch.

Matthew Diakonov
12 min read

What the execution gap actually is

Every team that has shipped an agent has felt the gap. The pilot looks promising. The demo gets standing ovations. Three months into production the support queue is full of complaints that the agent does not handle the cases the team thought were obvious, and the engineering team is in a meeting trying to figure out why the same model that aced the demo cannot answer the customer's actual questions. The team's instinct is to blame the model, then blame the prompt, then blame the corpus. The real cause is more boring: the team never wrote down what "production-ready" meant, so for the entire life of the project the bar never moved past "great on the demo cases," even as the real inputs stopped looking anything like the demo.

Production-ready has a precise definition. It is a rubric. The rubric has axes (faithfulness, helpfulness, completeness, tone, policy), slices (head, tail, high-stakes, adversarial, long-context), and thresholds (numeric, per axis, per slice). The rubric also has a regression rule: any case that previously passed and now fails is a hard fail, even if the aggregate looks fine. The rubric lives in the repo. It runs on every PR. It blocks merges that fail it. The shipping decision is mechanical: the rubric passes or it does not.
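A minimal sketch of what that file can look like, assuming a harness that reads per-axis judge prompts, per-slice thresholds, a regression rule, ceilings, and an override policy. Every field name, slice, and number below is illustrative, not a fixed schema; the point is the shape, not the layout.

```yaml
# eval/rubric.yaml -- illustrative sketch; field names, slices, and thresholds
# are assumptions about one possible shape, not a standard schema.
axes:
  faithfulness: { judge_prompt: prompts/judge_faithfulness.txt }
  helpfulness:  { judge_prompt: prompts/judge_helpfulness.txt }
  completeness: { judge_prompt: prompts/judge_completeness.txt }
  tone:         { judge_prompt: prompts/judge_tone.txt }
  policy:       { judge_prompt: prompts/judge_policy.txt }

slices:
  head:         { min: { faithfulness: 0.95, helpfulness: 0.90 } }
  tail:         { min: { faithfulness: 0.92, helpfulness: 0.85 } }
  high_stakes:  { hard_gate: true }        # a single failure blocks the merge
  adversarial:  { min: { policy: 1.00 } }
  long_context: { min: { faithfulness: 0.88 } }

regression:
  previously_passing_now_failing: hard_fail   # aggregates cannot excuse it

ceilings:
  p95_latency_seconds: 4
  cost_per_session_usd: 0.08

overrides:
  approvers: [eng-lead, product-owner]
  overridable: [latency, cost]             # never the high-stakes gate or policy
  log: eval/overrides.log
```

What matters is not this particular layout; it is that every number the shipping meeting used to argue about has a line that can be reviewed, diffed, and blamed.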

The execution gap closes the moment the rubric exists. Not because the agent gets better; because the team's definition of "good enough" stops moving. The arguments that used to happen in shipping meetings get encoded in the rubric file, where they can be reviewed and changed deliberately instead of decided verbally and re-decided two weeks later.

Six ways the demo rubric and the production rubric diverge

Each of these is a real shape of the gap. Each one explains a specific kind of "great demo, struggling production" outcome and each one points at a specific section of the production rubric that has to exist if the agent is going to clear the user's actual bar.

The demo had three hand-picked inputs. Production has every input.

A demo agent answers three rehearsed questions. The team knows which questions, the answers are checked, the screenshot looks great. Production users send 4,000 different questions a day, half of which the team would not have phrased that way. The demo rubric measures the 3. The production rubric has to measure the 4,000. The execution gap is the difference between those two surfaces and it is mostly invisible until users hit it.

The demo grader was generous. The production grader is the user.

In the demo, a 70 percent helpful answer counts as a win because the audience is forgiving and the next slide is loading. In production, a 70 percent helpful answer is a partial fail because the user has to either re-prompt or escalate. The grader changed without anyone updating the rubric. The agent's measured performance did not change; the user's tolerance did.

The demo was single-turn. Production is seven turns deep.

Almost every demo is a one-shot prompt-and-response. Production conversations average four to seven turns and the failure modes that appear in turn six are completely absent from the demo rubric: context drift, summary distortion, tool-state confusion, persona slip. The demo passes; the agent's actual user experience is much worse than the demo suggests.

The demo had perfect retrieval. Production has a stale corpus.

In the demo, the team ingested fresh content the week before. In production, the corpus is six months stale because nobody owns the freshness pipeline. The agent answers questions about products that have changed, prices that have moved, and policies that no longer apply. The demo rubric never tested freshness. The production rubric must.

The demo had no high-stakes cases. Production is full of them.

Demos avoid the cases where the wrong answer costs real money or violates policy. Production cannot. The execution gap shows up most painfully here: an agent that ships against a demo rubric will quietly fail on refund eligibility, regulatory disclosure, medical accuracy, or any case where 'mostly right' is unacceptable. The production rubric needs a hard gate for these and the demo rubric never has one.

The demo was a screenshot. Production is a runtime.

The demo is captured at one point in time. The production agent is running continuously against a moving model, a moving corpus, and a moving user base. The demo rubric scores a snapshot. The production rubric has to score the system as it changes. That requires running on every PR, on every trace ingest, and on every model snapshot. The shape is fundamentally different.

Eval harness as a shipping gate, not an afterthought

Most teams treat the eval harness as a thing they ran once before launch. The harness scored a number, the team agreed the number was good enough, and then the project shipped and the harness sat dormant for months. New PRs landed, the model got swapped, the prompt got tweaked, the corpus got re-ingested, and the harness re-ran maybe once a quarter when someone remembered. By the time it ran again, the agent's score had drifted in ways the team could not attribute to any single change.

The shape that closes the gap: the harness runs on every PR, the rubric is the gate, and the gate has teeth. A PR that regresses on a high-stakes case does not merge. A PR that drops a slice score below threshold does not merge. The team gets the same kind of mechanical feedback from the harness that they get from a unit test. The rubric file is the equivalent of the test assertions: it is the team's contract with itself about what the agent has to do.
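As one concrete way to give the gate teeth, here is a sketch of the CI wiring as a GitHub Actions workflow. The harness entrypoint (eval/run_harness.py) and its flags are hypothetical stand-ins for whatever your own harness exposes; only the Actions syntax itself is standard.

```yaml
# .github/workflows/eval-gate.yml -- illustrative sketch, not a drop-in config.
# The harness script and its flags are hypothetical placeholders.
name: eval-gate
on:
  pull_request:              # the rubric runs on every PR
  schedule:
    - cron: "0 3 * * *"      # and nightly, against newly ingested prod traces
jobs:
  rubric:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r eval/requirements.txt
      - name: Run the harness against the rubric
        # the harness exits non-zero on any hard-gate or regression failure,
        # which is what makes a required check merge-blocking
        run: python eval/run_harness.py --rubric eval/rubric.yaml --cases eval/regression/
```

The last piece is marking the check as required in branch protection; without that, a red run is advice rather than a gate.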

The cost of running the harness on every PR is real but bounded. A typical eval suite of 800 cases against a commodity judge model costs on the order of $5 to $20 per run, depending on how the rules layer filters and whether the suite uses cached embeddings. For a team that ships ten PRs a week, that is on the order of $400 a month. The cost of not running it is a recurring rollback every few weeks, which is much more expensive in engineering time even before counting the customer impact.

Production rubric vs demo rubric, side by side

The right-hand column is the rubric shape we ship into client repos. The middle column is what most teams have on day one: a vague sense of "good enough" that lives in shared memory and shifts as people get tired of failures. The demo-rubric column is faster on the first project. The production-rubric column is the only thing that survives quarter two.

| Feature | Demo rubric as memory | Production rubric as gate |
| --- | --- | --- |
| What the rubric measures | Demo rubric. 'Did the agent answer the curated question correctly?' Single binary. Cases drawn from happy-path screenshots. | Production rubric. Per-axis grading on faithfulness, helpfulness, completeness, tone, policy. Cases drawn from production traces. Stake-weighted: high-stakes cases gate hard. |
| Where the rubric lives | In someone's head. The PM remembers what 'good enough' looked like at the launch demo and applies it inconsistently. | eval/rubric.yaml in the repo, version-controlled, loaded by CI on every PR and by the cron on every prod-trace ingest. |
| Who decides whether to ship | The team in a meeting. Different people, different criteria, different decisions every time. | The rubric file, mechanically. The decision rule is encoded: the high-stakes hard gate must pass, no slice can regress more than the threshold, latency and cost are weighed only after quality gates pass. |
| When the rubric runs | The Friday before the launch demo. Once. The agent then ships changes for six months without the rubric re-running. | On every PR (against the regression set) and on every prod-trace ingest (against new cases). Failing the rubric blocks the merge. |
| How regressions get caught before the customer hits them | The customer hits the regression first. Support escalates, engineering triages, the team relearns the failure mode. | Every promoted prod-trace failure becomes a regression case. The rubric runs against all of them on every PR. A regression on a previously-failed case is a hard fail. |
| What 'production-ready' means | Vague. 'Good enough.' Relative to whatever the loudest stakeholder remembers from the demo. | Specific. Defined by the rubric. 'Faithfulness 0.92 or above on the trace-fed slice, zero high-stakes failures, p95 latency under 4 seconds, cost per session under $0.08.' Numbers. |
| How the rubric evolves | The bar moves silently as people get tired of failures. By month four the team has redefined 'good enough' downward without anyone deciding to. | PRs to eval/rubric.yaml. Threshold changes are reviewable. The rubric is itself an artifact with a history; you can git blame the bar. |

How the rubric gets written the first time

The rubric is a collaboration across engineering, product, support, and (for high-stakes use cases) legal. Each stakeholder brings a part of the bar. Engineering writes the per-axis judge prompts and the slice definitions. Product writes the slice priorities and the latency and cost ceilings. Support writes the high-stakes case list from past escalations. Legal writes the policy constraints. The output is one YAML file in the repo, usually under 200 lines, that represents what each stakeholder has agreed counts as shippable.

The first version takes about two days of meetings and one engineer's full week of writing. The result is rough and the thresholds are guesses. That is fine. The rubric is a living document; the first version exists to be calibrated against real eval runs. After two weeks of running against production traffic, the thresholds get adjusted to match where the agent actually performs, with deliberate room for improvement. After two months the rubric is stable enough to be the team's real definition of production-ready and the arguments about what to ship stop being arguments.

Critically, the rubric also names its override policy. Who can sign off on a failed gate, what kinds of failures are overridable, and what gets logged. Overrides are auditable, not forbidden. A team that overrides twice a quarter is using the safety valve correctly. A team that overrides twice a week has a rubric that does not match reality and needs a rewrite, not a more permissive gate.

Where fde10x fits

fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the rubric file, the harness extensions, the regression suite, and the CI gate into the client repo. The work is collaborative by nature; the engineer drives the rubric authoring sessions with the stakeholders, builds the harness, wires the gate, and leaves the team owning all of it. We are not the only path; teams build this themselves all the time. The embed is the right call when the team has been shipping by gut feel and rolling back at a measurable rate, or when the rubric has been "we'll get to it" for two quarters and the gap keeps showing up as production incidents.

The leave-behind is the gate running against every PR, owned by the team, version-controlled in the repo, with a runbook for the on-call engineer when a PR fails the gate at 02:00. The work is unglamorous infrastructure that happens to be what closes the execution gap.

Want a senior engineer to author your production rubric and wire the gate?

A 60-minute scoping call with the engineer who would own the build. You leave with a draft of the rubric file against your stack, the slice definitions we'd start with, the override policy, and a fixed weekly rate to ship the gate, the regression suite, and the first clean PR run inside your repo.

The AI execution gap and the rubric, answered

What is the AI execution gap, exactly?

It is the distance between what an AI system can demo and what an AI system can do in production at the bar a real user requires. The gap is widest on agentic systems because the demo is almost always a curated happy-path scenario and production is the long tail. Closing the gap is not a model question; it is a rubric question and a process question. The rubric defines what production-ready means. The process makes sure every shipped change clears that bar before users see it. Most teams ship without doing either, which is why most agent rollouts feel impressive at the demo and disappointing in production.

Why is the eval harness the right place to put the gate?

Because it is the only artifact in the team's process that can be run mechanically against every change. PR review catches code mistakes; the eval harness catches behavior changes. Without the eval harness as a gate, the only gate is the team's collective memory of what the agent should do, and that memory drifts as the team ships. The eval harness, plus the rubric file it reads, is the durable representation of 'production-ready' that survives team turnover, model swaps, and the slow erosion of standards that happens to every project.

What does the rubric file actually contain?

Roughly five sections. First, the per-axis grading rubric (faithfulness, helpfulness, completeness, tone, policy) with a calibration prompt for the LLM judge. Second, the per-slice thresholds (head, tail, high-stakes, adversarial, long-context). Third, the regression rule (a previously-passing case that now fails is a hard fail). Fourth, the latency and cost ceilings. Fifth, the override policy (who can sign off on a failed gate, and what gets logged when they do). The whole thing is usually under 200 lines of YAML.

How is the production rubric different from the demo rubric?

The production rubric is per-axis instead of binary, per-slice instead of aggregate, and includes a high-stakes hard gate and a regression rule. The demo rubric is usually 'did the agent answer the curated question correctly,' applied once, by a person, in a meeting. The shift is from 'is the agent good?' to 'on which slice and on which axis is it good enough to ship?' The first question has no defensible answer. The second question has a numeric one.

Who writes the rubric?

The team that owns the agent, usually with help from the team that owns the user-facing risk (support, legal, ops). The rubric is a collaboration; engineering writes the per-axis judge prompt, product writes the slice definitions, support writes the high-stakes case list, legal writes the policy constraints. The output is a single file in the repo. The collaboration takes a couple of days the first time. Subsequent updates are PRs, just like code.

What stops the team from quietly lowering the bar over time?

Two things. First, the rubric file is version-controlled, so threshold changes are PRs that can be reviewed and reverted. git blame on a threshold change tells you who moved the bar and why. Second, the regression rule means a previously-passing case that now fails is a hard fail regardless of the threshold; the bar can move on aggregates but not on cases the team has already committed to passing. Together those two patterns make silent drift hard. They do not make it impossible; that is what culture is for.

How does the rubric handle overrides?

Explicitly. The rubric file names the override policy: who can sign off on a failed gate, what kind of failures can be overridden, and what gets logged. Overrides are not forbidden; they are auditable. The point is not to prevent shipping a known-imperfect agent but to make sure the team knows when they are doing it. An override that gets used twice a quarter is a working safety valve. An override that gets used twice a week is a sign the rubric is wrong and needs a rewrite, not that the gate should be lower.

Does this slow down the team?

Initially yes, by maybe one to two weeks for the rubric build and the harness wiring. Then it speeds the team up substantially because the shipping decision becomes mechanical. The arguments stop. The rollbacks stop. The 'is this good enough' meetings stop. The team ships more changes per quarter, not fewer, because each one is faster to qualify. The slowdown is the upfront cost; the speedup is permanent.

What if my agent does not have any production traffic yet?

Start with the rubric anyway. Use synthetic cases for the slices and curated cases for the high-stakes gate. The harness should be running against something from the day you ship the first prototype, even if the cases are all hand-written initially. Then as production traffic comes in, the prod-trace ingest pipeline replaces the synthetic cases with real ones. The rubric does not need real traffic to exist; it needs real traffic to mature.
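For illustration, a promoted case might look something like the sketch below once trace ingest starts replacing the hand-written ones. Every field name here is an assumption about what a given harness reads, and the scenario itself is invented.

```yaml
# eval/regression/refund-eligibility-0412.yaml -- hypothetical example case.
# Field names and the scenario are illustrative, not a standard format.
id: refund-eligibility-0412
slice: high_stakes
source: prod_trace                 # started life as a hand-written case pre-launch
trace_ref: "2026-01-14/7f3a"       # hypothetical pointer to the original failing trace
input: |
  I returned the item 35 days after delivery. Am I still eligible for a refund?
expected_properties:
  - cites the current returns policy, not the superseded one
  - states the refund window and any exceptions explicitly
  - does not promise an outcome the policy does not support
grade_on: [faithfulness, policy]
hard_gate: true                    # failing this case again blocks the merge
```

The useful property is that the case carries its own stakes and its own grading axes, so the harness does not have to guess which gate it belongs to.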

Where does fde10x fit?

We are one option for teams that want a senior engineer to embed for two to six weeks and ship the rubric file, the harness, the per-slice grading, the regression rule, and the override policy into the client repo. Common engagement: week 1 collaborating on the rubric (engineering, product, support, legal), week 2 wiring the harness, week 3 backfilling the regression set from past incidents, weeks 4 to 6 connecting it to CI and the trace ingest. The leave-behind is the gate running against every PR, owned by the team, version-controlled in the repo.