Guide, topic: Google Next 2026, AI execution gap, eval harness
The keynote rubric is not the production rubric.
Google Next 2026 had a panel about the AI execution gap. The problem finally has a name on a main stage. The thing nobody said cleanly: the rubric a model gets demoed on is not the rubric production gets judged on, and most teams ship the eval harness as an afterthought instead of a gate. This guide is the concrete version: rubric.yaml in the repo, a regression set checked in with the code, and CI that fails the merge on a slice regression. The gap closes when the harness becomes a merge gate, not when the model gets better.
What the panel actually said, and what it left out
The Google Next 2026 panel framed the AI execution gap as the distance between what a model can demo and what a model can do for a real customer at the bar that customer requires. The panelists pointed at the usual culprits: data quality, change management, integration debt, the difference between a pilot team and a production team. Those are real causes. They are also downstream of a much simpler one: the team has not written down what production-ready means, so the bar moves silently every quarter as the team rationalizes failures.
Production-ready has a precise definition once you decide to give it one. It is a rubric. The rubric has axes (faithfulness, helpfulness, completeness, tone, policy), slices (head, tail, high-stakes, adversarial, long-context), and thresholds (numeric, per axis, per slice). The rubric also has a regression rule: any case that previously passed and now fails is a hard fail, even if the aggregate looks fine. The rubric lives in the repo at eval/rubric.yaml. It runs on every PR. It blocks merges that fail it. The shipping decision is mechanical: the rubric passes or it does not.
The panel did not say this part out loud, probably because it sounds boring on a keynote stage. But it is the entire fix. Every other lever (better models, better prompts, better retrieval) only matters once the rubric exists, because without the rubric the team cannot tell whether any of those levers actually moved production.
Six ways the keynote rubric and the production rubric diverge
Each of these is a real shape of the gap. Each one explains a specific kind of "great keynote, struggling production" outcome, and each one points at a specific section of the production rubric that has to exist if the agent is going to clear the customer's actual bar.
The Google Next demo had three rehearsed prompts. Production has 4,000 a day.
Every demo on a Google Next stage is curated. The prompts are chosen, the answers are pre-checked, the screenshots are reviewed by marketing. Production users send a stream of inputs that the team has never seen, in phrasings the team did not predict, with edge cases the team did not consider. The demo rubric was calibrated against three prompts; the production rubric has to be calibrated against the long tail. The execution gap is the distance between those two surfaces, and it is mostly invisible until the real users start typing.
The keynote grader was generous. The production grader is the customer.
On stage, a 70 percent helpful answer reads as a win because the audience is friendly and the next slide is loading. In production, a 70 percent helpful answer is a partial fail because the customer either re-prompts (frustration) or escalates (cost). The grader silently changed when the agent left the demo environment, and nobody updated the rubric to match. The agent's measured performance did not change; the user's tolerance did.
The keynote was single-turn. Production is seven turns deep.
Demos are almost always one-shot prompt-and-response because that is what fits in a slide. Real production conversations average four to seven turns, and the failure modes that show up in turn six (context drift, summary distortion, tool-state confusion) are completely absent from the demo rubric. The keynote agent looks great; the production agent has a measurably worse user experience because the rubric never tested the conversation shape that real users have.
The keynote had perfect retrieval. Production has a stale corpus.
For the demo, somebody re-ingested fresh content the week before. In production, the corpus is six months stale because nobody owns the freshness pipeline. The agent answers questions about products that have changed, prices that have moved, and policies that no longer apply. The keynote rubric never measured freshness; the production rubric must, with a slice that explicitly checks staleness against a known-fresh source.
The keynote had no high-stakes cases. Production is full of them.
Demos avoid the cases where the wrong answer costs real money or violates policy. Production cannot. The execution gap shows up most painfully here: an agent that ships against a demo rubric will quietly fail on refund eligibility, regulatory disclosure, medical accuracy, or any case where 'mostly right' is unacceptable. The production rubric needs a hard high-stakes gate; the keynote rubric never has one because high-stakes failures do not photograph well on stage.
The keynote was a screenshot. Production is a runtime.
The keynote demo is captured at one point in time. The production agent runs continuously against a moving model, a moving corpus, and a moving user base. The keynote rubric scored a snapshot. The production rubric has to score the system as it changes, which means running on every PR, on every trace ingest, and on every model snapshot. The shape of that evaluation is fundamentally different from the keynote's one-time score.
What rubric.yaml looks like in a real repo
The file is not magical. It has five sections. The per-axis grading rubric names the axes the team grades on (faithfulness, helpfulness, completeness, tone, policy) and the calibration prompt for the LLM judge. The slices section names the cuts of traffic the team grades against (head, tail, high-stakes, adversarial, long-context) and the threshold for each axis on each slice. The regression rule says any case promoted into the regression set that now fails is a hard fail, period. The performance section names the latency p95 ceiling and the cost per session ceiling. The override section names who can sign off on a failed gate, which kinds of failures are overridable, and what gets logged when they are.
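For concreteness, here is a minimal sketch of what those five sections might look like. Everything in it is illustrative: the section names, field names, file paths, and thresholds are assumptions rather than a fixed schema, and the first real version of the file will look different.

```yaml
# eval/rubric.yaml -- illustrative sketch only. Section and field names are
# assumptions, not a standard schema; thresholds are example values.
axes:                       # per-axis grading rubric for the LLM judge
  faithfulness:  {judge_prompt: eval/prompts/faithfulness.txt}
  helpfulness:   {judge_prompt: eval/prompts/helpfulness.txt}
  completeness:  {judge_prompt: eval/prompts/completeness.txt}
  tone:          {judge_prompt: eval/prompts/tone.txt}
  policy:        {judge_prompt: eval/prompts/policy.txt}

slices:                     # per-slice thresholds, per axis
  head:         {faithfulness: 0.92, helpfulness: 0.88, completeness: 0.85}
  tail:         {faithfulness: 0.90, helpfulness: 0.85, completeness: 0.80}
  high_stakes:  {faithfulness: 0.98, policy: 1.00, max_failing_cases: 0}
  adversarial:  {policy: 1.00, tone: 0.95}
  long_context: {faithfulness: 0.90, completeness: 0.85}

regression:                 # the regression rule
  previously_passing_case_fails: hard_fail   # regardless of the aggregate

performance:                # runtime ceilings
  latency_p95_seconds: 4.0
  cost_per_session_usd: 0.10

overrides:                  # who can sign off on a failed gate, and how
  authority: [eng_lead, product_owner]
  overridable_axes: [tone, completeness]     # example: high-stakes and policy stay hard
  log: {require_reason: true, destination: eval/overrides.log}
```

The exact keys matter less than the fact that every number a shipping argument depends on lives in one reviewed, version-controlled file.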
The whole file is usually under 200 lines. It takes about two days of meetings and one engineer's full week to write the first version. The first version is rough; the thresholds are guesses. That is fine. After two weeks of running against production traffic, the thresholds get adjusted. After two months the rubric is stable enough to be the team's real definition of production-ready, and the arguments about what to ship stop being arguments and start being PRs.
The regression set as the team's institutional memory
The regression set is the most underrated artifact in the project. Every promoted prod-trace failure becomes a regression case. The harness runs against all of them on every PR. A previously-passing case that now fails is a hard fail. Over twelve months, the regression set turns into the team's institutional memory of every failure mode the agent has ever had in production, and every PR that lands has been measured against all of them.
The mechanics: the trace ingest job runs nightly, pulls every session that triggered a support escalation or a low CSAT score, asks an LLM to triage the likely failure axis, and queues the candidates for human promotion. A human (engineering on-call, usually) reviews the queue weekly, promotes the real failures into eval/regression/{slice}.jsonl, and discards the noise. The promotion is a PR; the case has a documented reason for being in the set. After six months the regression set is 500 cases deep and growing by 20 a week.
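For illustration, a promoted case might carry fields along these lines. In the repo each case is a single JSON object per line in eval/regression/{slice}.jsonl; it is shown here as YAML for readability, and every field name is an assumption rather than a required schema.

```yaml
# One promoted regression case (illustrative fields only). In the repo this
# would be a single JSON line in eval/regression/high_stakes.jsonl.
id: 2026-03-14-refund-eligibility-017        # stable case ID referenced in CI comments
slice: high_stakes
source_trace: traces/2026-03-12/session-8841 # the prod session it was promoted from
promoted_by: oncall-eng                      # the human who reviewed the triage queue
reason: >                                    # documented reason for being in the set
  Agent asserted refund eligibility for a plan that excludes it;
  the session escalated to support.
input:
  conversation: []                           # the turns leading up to the failure, elided here
expected:
  must_include: ["not eligible under the Basic plan"]
  must_not_include: ["full refund"]
grading:
  axes: [faithfulness, policy]
  pass_threshold: 1.0                        # high-stakes cases are pass/fail
```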
Production rubric vs keynote rubric, side by side
Left column is what most teams actually have on day one: a vague sense of "good enough" calibrated against a keynote demo and preserved in shared memory. Right column is what gets shipped into client repos. The left column is faster on the first project. The right column is the only thing that survives the second quarter.
| Feature | Keynote rubric as memory | Production rubric as merge gate |
|---|---|---|
| Where the rubric lives | In a slide deck from the launch demo. Possibly in someone's notes app. Definitely not version-controlled. | eval/rubric.yaml in the repo. Reviewed on PRs. Tagged with the model, prompt version, and chunk config it was calibrated against. |
| What the regression set looks like | A folder of ad-hoc test prompts the engineer wrote during the prototype phase. Never updated since launch. | eval/regression/*.jsonl, one file per slice (head, tail, high-stakes, adversarial, long-context). Every case is a real prod trace promoted by a human after a failure incident. |
| When the harness runs | Once. Before the keynote. Then never again until a customer complaint forces it. | On every PR via CI. On every prod-trace ingest via cron. On every model snapshot via a manual command that any engineer can fire. |
| What CI does on a regression | Nothing. The PR merges, the regression ships, the customer hits it on Tuesday, support escalates on Wednesday, engineering fights to reproduce it on Thursday. | Fails the merge. Posts a PR comment with the slice diff, the failing case IDs, and a link to the trace replay. The author either fixes the regression or files an override with a documented reason. |
| How the bar moves | Silently. The bar drops one ad-hoc decision at a time as the team gets tired of failures. | Threshold edits are PRs. The team can argue about whether 0.91 is the right faithfulness floor in a code review, not in a Slack channel three months after the fact. |
| What 'production-ready' means | A demo that worked on the keynote laptop for three rehearsed prompts. Not defensible under hostile questioning. | A rubric that returns numbers. 'Faithfulness 0.92, helpfulness 0.88, zero high-stakes failures, p95 latency 3.4 seconds, $0.07 per session.' Defensible on stage. |
| How the team handles a model swap | A multi-week vibe-check by the senior engineers, ending with a meeting and a guess. | Run the harness against the new snapshot, diff against the rubric, decide. The model swap is mechanical; the rubric is the bar. |
Where fde10x fits
fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the rubric file, the harness, the regression suite, and the CI gate into the client repo. The work is collaborative; the engineer drives the rubric authoring sessions with engineering, product, support, and legal. The leave-behind is the gate running against every PR, owned by the team, version-controlled in the repo, with a runbook for the on-call engineer when a PR fails the gate at 02:00. Teams build this themselves all the time; the embed is the right call when the team has been shipping by gut feel and rolling back at a measurable rate, or when the rubric has been "we'll get to it" for two quarters and the gap keeps showing up as production incidents.
Want a senior engineer to write your rubric.yaml and wire the merge gate?
60-minute scoping call with the engineer who would own the build. You leave with a draft of the rubric file against your stack, the slice definitions we'd start with, the override policy, and a fixed weekly rate to ship the gate, the regression suite, and the first clean PR run inside your repo.
Google Next 2026 and the AI execution gap, answered
Why did the AI execution gap show up at Google Next 2026 specifically?
Because the buyer side of the room got loud enough about it to be worth naming on stage. Enterprise buyers have now been through two or three cycles of impressive demos that turned into disappointing pilots, and they have started asking the question: 'what is the rubric you are demoing against, and how does that rubric relate to the rubric I will judge you on once you are in production?' That question does not have a good answer for most teams, which is why the gap got a name. Naming it is step one. Closing it is the rubric file in the repo and the eval harness wired into CI as the merge gate.
What goes in the rubric.yaml file?
Five sections, usually under 200 lines of YAML. First, the per-axis grading rubric (faithfulness, helpfulness, completeness, tone, policy) with a calibration prompt for the LLM judge. Second, the per-slice thresholds (head, tail, high-stakes, adversarial, long-context). Third, the regression rule: a previously-passing case that now fails is a hard fail regardless of the aggregate. Fourth, the latency and cost ceilings. Fifth, the override policy: who can sign off on a failed gate, what kinds of failures are overridable, and what gets logged when they are.
What does 'CI fail on slice regression' actually mean in practice?
The harness runs against the regression set on every PR. For each slice, the harness computes the per-axis scores and the failure count. CI compares those numbers against the rubric thresholds and against the previous main-branch run. If any slice drops below threshold, or if any case that was previously passing now fails, CI marks the build red and posts a PR comment with the diff. The merge button is locked. The PR author either fixes the regression, gets an explicit override from the rubric's named override authority, or closes the PR. There is no fourth option.
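As a sketch of the wiring, assuming GitHub Actions and a hypothetical harness entry point (eval/run_harness.py and its flags are invented names for this example), the gate can be a single required job whose exit code decides the merge:

```yaml
# .github/workflows/eval-gate.yml -- illustrative sketch. The workflow layout,
# harness entry point, and flags are assumptions; the shape is what matters:
# the job exits non-zero on a slice regression, and branch protection on that
# job locks the merge button.
name: eval-gate
on: [pull_request]

jobs:
  rubric-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run harness against the regression set
        run: |
          python eval/run_harness.py \
            --rubric eval/rubric.yaml \
            --regression-dir eval/regression/ \
            --baseline main \
            --report gate-report.md
      - name: Post the slice diff as a PR comment
        if: failure()
        run: gh pr comment ${{ github.event.number }} --body-file gate-report.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Marking the job as required in branch protection is what turns a red run into a locked merge button rather than a warning.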
How big does the regression set need to be?
Smaller than people expect at the start, larger than people expect after a year. A new project ships with maybe 80 to 200 cases hand-curated across the slices. Within six months, the prod-trace ingest pipeline has promoted another 400 to 800 cases from real failure incidents. By year two, a healthy regression set is 1,500 to 3,000 cases. The size matters less than the discipline: every promoted case has a documented reason for being in the set, and every removal is a PR.
What happens when a model swap regresses on a slice?
The merge does not happen. The team has three real options: pin the previous model for the affected slice (a routing change in the rubric), accept the regression with an explicit override and a customer-facing communication plan, or roll back the model swap entirely. None of those options is silent. All of them are documented in the PR. The point of the rubric is not to prevent model swaps; it is to make the cost of a model swap visible before the swap ships, not after.
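As one illustration of the first option, assuming the rubric carries a per-slice routing section (the section and field names here are invented for the sketch), the pin becomes a small, reviewable diff rather than a silent config change:

```yaml
# Illustrative only: pin the previous model for the slice that regressed while
# the rest of the traffic moves to the new snapshot. Model names and field
# names are hypothetical, not part of any fixed rubric schema.
routing:
  default_model: model-snapshot-2026-03       # the new snapshot being rolled out
  slice_overrides:
    high_stakes:
      model: model-snapshot-2026-01           # pinned until the regression is fixed
      reason: "faithfulness dropped below 0.98 on 6 regression cases"
      expires: 2026-07-01                     # forces a revisit; not a silent pin
```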
Where does fde10x fit?
fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the rubric file, the harness, the regression set, the CI wiring, and the override policy into the client repo. The work is collaborative; the engineer drives the rubric authoring sessions with engineering, product, support, and (for high-stakes use cases) legal. Common engagement: week 1 collaborating on the rubric, week 2 wiring the harness, week 3 backfilling the regression set from past incidents, weeks 4 to 6 connecting it to CI and the trace ingest. The leave-behind is the gate running against every PR, owned by the team, version-controlled in the repo.