Guide, topic: LLM bake-off methodology, shipping decisions, 2026

LLM bake-offs: aggregate scores will lie to you. Tail-prompt slices will not.

A bake-off that ranks candidate models by their mean score on a shared eval set will produce a recommendation. That recommendation will be wrong about a third of the time. The reason is structural: the aggregate hides tail-prompt regressions that ship as production incidents the week after the swap. This guide is the bake-off shape we ship into client repos: how to mine the eval set, how to slice it, what counts as a regression, how to keep the judge honest, and the one-page rubric that turns the result into a shipping decision instead of a Slack thread.

Matthew Diakonov
12 min read

Why the aggregate score loses the argument

Pretend you are running a bake-off between an incumbent and a challenger model. You have 1,200 cases in your eval set. The incumbent scores 82.7. The challenger scores 84.2. The slide gets written, the swap goes on the roadmap, and a week after the swap the support team starts seeing complaints about the agent mishandling refund requests for international customers. By the time the team correlates the complaints with the model swap, two weeks have passed and the rollback is its own headache.

The reason this happens is that the 1,200 cases were drawn from a distribution that does not match production. Even if it does, the mean is the wrong statistic for shipping decisions. The right statistic is the per-slice score, because production agent failures are concentrated in slices, not spread evenly across cases. A new model can be one point better on every common case and three points worse on every refund case, and the mean will love it while your refund customers do not.
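To make the arithmetic concrete, here is a toy weighted-mean calculation. The 95/5 traffic split and the per-case deltas are illustrative, not numbers from a real bake-off:

```python
# Toy numbers: the head slice is 95 percent of eval cases, refunds are 5 percent.
head_weight, refund_weight = 0.95, 0.05

# The challenger is +1.0 on every head case and -3.0 on every refund case.
head_delta, refund_delta = +1.0, -3.0

aggregate_delta = head_weight * head_delta + refund_weight * refund_delta
print(aggregate_delta)  # +0.8 -- the mean goes up while every refund case gets worse
```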

The fix is methodological, not technical. The bake-off has to slice. Each slice has to be sized to give a confidence interval that supports a decision. The shipping rubric has to read the slices, not the mean. And the high-stakes slice has to be a hard gate: no new failures, no exceptions, regardless of how the rest of the bake-off looks.

The slices that matter, with numbers from a recent bake-off

These six slices come from a real bake-off run for a customer support agent in late March. The numbers are rounded for anonymity but the shape is genuine. The bottom line was that the challenger model won the aggregate by 1.5 points and lost the shipping decision by a wide margin.

Head slice (high-frequency intents)

About 60 to 70 percent of production traffic. The model that wins here usually wins the aggregate by default. In this bake-off the new model scored 87.4 on this slice vs the incumbent's 86.1, a clean 1.3 point gain. This is the slice that fills the headline number on the slide.

Tail slice (low-frequency intents)

Maybe 5 percent of traffic but 30 percent of the user base over a quarter. The new model in the same bake-off scored 71.2 on this slice vs 78.5 for the incumbent: a 7.3 point regression hidden inside a 1.5 point aggregate gain. The shipping decision was no, on this slice alone.

High-stakes slice (refunds, legal, medical, money)

Maybe 1 percent of traffic. The slice where any new failure is a hard no. In the same bake-off, the new model added two failures here that the incumbent did not have, both on a refund-eligibility prompt that touched currency conversion. Two failures were enough to kill the swap regardless of the headline score.

Adversarial slice (jailbreaks, prompt injection, abuse)

About 2 to 4 percent of traffic in most consumer-facing agents. The new model's jailbreak rate was 0.4 percent higher than the incumbent's, which sounds tiny until you multiply by daily volume. At 100k requests a day that is 400 extra leaked completions per day. Same answer: do not ship without remediation.

Long-context slice (over 32k input tokens)

Often the slice that benefits most from a new model release because long-context handling is where vendors invest. In this bake-off the new model scored 81.0 vs 73.8 on long-context, a 7.2 point gain. That gain alone made the case for using the new model in a specific document-grounded sub-flow even while the rest of the agent stayed on the incumbent.

Latency and cost (the slice the rubric does not score)

The new model was 18 percent faster at p50 but 33 percent slower at p95. Cost per million output tokens was 22 percent lower. The shipping rubric still has to weigh these against the tail and high-stakes regressions. In the same bake-off, the team shipped a hybrid: long-context flows got the new model, everything else stayed on the incumbent.

How to mine the eval set so the slices are real

The single most common reason bake-offs are wrong is that the eval set was hand-written in a sprint two quarters ago and the team never went back to update it. Production traffic moved. New intents emerged. Old intents got resolved upstream and stopped showing up. The eval set ossified around the team's intuition instead of the actual usage. A bake-off against an out-of-date eval set is a bake-off against the team's memory of last summer.

The shape we ship: 70 percent of cases are mined from production traces in the trailing 90 days, sampled to reflect the actual intent distribution. 20 percent are pulled from past incidents, customer escalations, and support tickets. 10 percent are curated adversarial cases (jailbreaks, prompt injection, abuse patterns). Each case is labeled with an intent and a stakes flag (low, medium, high). The labeling takes a day or two for an engineer who knows the product; it does not require an annotation vendor for an eval set under 1,500 cases.

The eval set is then frozen for the duration of the bake-off and re-mined on a quarterly cadence. Mining it more often produces shifting baselines that make month-over-month comparisons noisy. Mining it less often produces drift between the eval distribution and production traffic that quietly invalidates the rubric.
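A minimal sketch of that mining shape, assuming the production traces, incident cases, and adversarial cases have already been collected as dicts carrying an "intent" field. The function and field names are illustrative, not a fixed schema:

```python
import random

def build_eval_set(traces, incidents, adversarial, n_cases=1200, seed=7):
    """Assemble the 70/20/10 mix; stakes labels are added by hand afterwards."""
    rng = random.Random(seed)

    n_traces = int(n_cases * 0.70)       # mined from trailing-90-day production traces
    n_incidents = int(n_cases * 0.20)    # past incidents, escalations, support tickets
    n_adversarial = n_cases - n_traces - n_incidents  # curated adversarial cases

    # Sample traces per intent so the eval set mirrors the production intent
    # distribution instead of the team's memory of it.
    by_intent = {}
    for t in traces:
        by_intent.setdefault(t["intent"], []).append(t)

    sampled = []
    for cases in by_intent.values():
        share = len(cases) / len(traces)
        sampled += rng.sample(cases, min(len(cases), round(share * n_traces)))

    sampled += rng.sample(incidents, min(len(incidents), n_incidents))
    sampled += rng.sample(adversarial, min(len(adversarial), n_adversarial))
    return sampled
```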

The shipping rubric: one page, six numbers, no vibes

The output of a bake-off is not "model X is better." It is a one-page rubric that the engagement owner reads and signs. The rubric has six rows: per-slice score table (incumbent vs challenger with deltas), per-slice pass/fail against the regression threshold (default 2 points), high-stakes hard gate (pass/fail), latency p50 and p95, cost per million input and output tokens, and a categorized list of failure modes for the challenger.

The decision rule is mechanical: if the high-stakes gate fails, do not ship. If any slice regresses by more than 2 points, do not ship. Otherwise, weigh latency and cost. The rubric removes the discretion that turns shipping decisions into Slack arguments. The owner can override the rule, but the override has to be on the page in writing, with the reasoning, so the postmortem in three months has something to read.
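The rule is simple enough to live in the harness as code rather than in anyone's head. A minimal sketch, with illustrative data shapes:

```python
REGRESSION_THRESHOLD = 2.0  # points; the default per-slice gate

def shipping_decision(slice_scores, new_high_stakes_failures):
    # slice_scores maps slice name -> (incumbent_score, challenger_score),
    # e.g. {"tail": (78.5, 71.2)}; new_high_stakes_failures counts high-stakes
    # cases the challenger fails that the incumbent passed.
    if new_high_stakes_failures > 0:
        return "do not ship: high-stakes gate failed"

    regressed = sorted(
        name
        for name, (incumbent, challenger) in slice_scores.items()
        if incumbent - challenger > REGRESSION_THRESHOLD
    )
    if regressed:
        return f"do not ship: regression beyond {REGRESSION_THRESHOLD} points on {regressed}"

    return "quality gates pass: weigh latency and cost, then sign the rubric"
```

Any override of the rule goes on the page in writing, next to the output of this check, not instead of it.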

The hybrid deployment pattern falls out of the rubric naturally. If the challenger wins long-context by 7 points and loses tail intents by 3, the right move is not "ship" or "do not ship"; it is "route long-context to the challenger and keep tail intents on the incumbent." The rubric makes the case for the hybrid because it shows the slice-level deltas that justify it.

Sliced bake-off vs aggregate bake-off, side by side

The sliced bake-off below is the methodology we run; the aggregate bake-off is the one most teams run on the first attempt. The aggregate version is fast and feels rigorous; it is not. The sliced version costs a day or two more per run and produces shipping decisions that survive the next quarter.

What gets compared
Aggregate bake-off: One mean score across all cases. The headline number on the slide.
Sliced bake-off: Per-slice score plus per-slice delta vs the incumbent. Slices include head (high-frequency intents), tail (low-frequency intents), high-stakes (refunds, legal, medical), adversarial (jailbreaks, prompt injection), and long-context (over 32k tokens).

How the eval set is constructed
Aggregate bake-off: 100 to 300 hand-written cases the team came up with in a sprint two quarters ago. Distribution unknown.
Sliced bake-off: 70 percent mined from production traces, weighted to reflect actual intent distribution. 20 percent mined from past incidents and escalations. 10 percent adversarial cases curated by hand. Each case has a category and a stakes label.

What counts as a regression
Aggregate bake-off: The headline mean drops. If it goes up, ship.
Sliced bake-off: Any slice that drops more than 2 points on its rubric, OR any high-stakes case that newly fails. Both gates have to be green to ship. The headline mean is informational only.

How the rubric is judged
Aggregate bake-off: LLM-as-judge with whatever model is convenient this month, no calibration.
Sliced bake-off: LLM-as-judge with a pinned judge model and a per-slice rubric. The judge prompt is versioned. The judge is calibrated against a 50-case human-labeled gold set every two weeks. Drift on the judge is itself a tracked metric.

How the shipping decision gets made
Aggregate bake-off: A Slack message: 'new model scores 84.2 vs 82.7, looks like a win, going to swap on Monday.'
Sliced bake-off: A one-page rubric in the repo: per-slice score table, per-slice delta, high-stakes pass/fail, latency p50 and p95, cost per million tokens, the number of failures categorized by failure mode. Owner reads it and decides.

What happens after the swap
Aggregate bake-off: The swap lands. Two weeks later a customer files a ticket about a refund flow that newly fails. The team learns the model is worse on that slice from a support ticket.
Sliced bake-off: Production traces from the trailing 7 days are graded against the same rubric. Any slice that regresses in production triggers a re-test against the eval set. The bake-off does not end at the merge; it ends at the 7-day production grade.

Vendor lock and reversibility
Aggregate bake-off: The eval set was tuned to the previous vendor's quirks. Going back is a separate research project.
Sliced bake-off: The bake-off is repeatable per vendor change. Same eval set, same rubric, same judge. Switching back is a one-line change with a known cost.

What the bake-off does not replace

The bake-off is a pre-production gate. It is not a substitute for production observability. The cleanest bake-off in the world still misses production failure modes that the eval set did not capture, because the eval set is by definition a sample of past traffic. The right shape is to pair the bake-off with a graded production trace pipeline that re-runs the rubric against the trailing 7 days of real traffic and surfaces any slice that regressed in production. The bake-off says "this model should ship"; the trace pipeline says "this model is, in fact, shipping correctly."
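A minimal sketch of the trailing-7-day grade, assuming a grade_case helper that applies the same per-slice rubric the bake-off used; the helper, the judge handle, and the field names are hypothetical:

```python
from collections import defaultdict
from statistics import mean

def grade_trailing_week(traces, grade_case, judge, baseline_by_slice, threshold=2.0):
    # Grade the last 7 days of production traces with the bake-off rubric,
    # then compare each slice's mean against its bake-off baseline.
    scores = defaultdict(list)
    for trace in traces:
        scores[trace["slice"]].append(grade_case(judge, trace))

    regressed = {}
    for slice_name, vals in scores.items():
        baseline = baseline_by_slice.get(slice_name)
        if baseline is None:
            continue  # a slice the eval set never captured; worth its own look
        delta = mean(vals) - baseline
        if delta < -threshold:
            regressed[slice_name] = round(delta, 1)

    # Any entry here triggers a re-test against the frozen eval set.
    return regressed
```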

The two together are the shipping system. Either alone is half the answer. The bake-off without the trace pipeline ships confident regressions. The trace pipeline without the bake-off catches regressions after they hit users.

Where fde10x sits in this

fde10x is one option for teams that want a senior engineer to embed for two to six weeks and ship the bake-off harness, the sliced eval set, the calibrated judge, and the one-page rubric into the client repo. Common engagement shape: week 1 mining traces and labeling slices, week 2 standing up the harness and calibrating the judge, week 3 running the first bake-off and authoring the rubric, weeks 4 through 6 wiring it into CI and the trace pipeline so monthly bake-offs run hands-free.

Plenty of teams build this themselves. The embed is the right call when the team is currently picking models by gut feel, rolling them back at a measurable rate, or arguing about the same shipping decision in three different Slack channels. Once the rubric exists, the arguments stop, because there is one page that decides.

Want a senior engineer to run your next model bake-off the right way?

60-minute scoping call with the engineer who would own the build. You leave with a draft of the slices we would mine, the rubric we would write, the judge we would pin, and a fixed weekly rate to ship the harness, the first bake-off, and the trace pipeline that keeps the rubric honest in production.

LLM bake-offs and the per-slice shipping decision, answered

Why is the aggregate score so misleading?

Because most production agent traffic is not uniformly distributed. The head dominates volume. A model that gains a point on the head and loses three on the tail will show as a net positive in the mean and as a net negative in the user experience for the customers who hit the tail. The customers who hit the tail tend to be the ones with the most complex use cases, who are also disproportionately the ones who churn or escalate. The aggregate optimizes for the wrong distribution.

How do I know the slices I am picking are the right ones?

Mine them from production. Start with the intent classification you already have (or build a lightweight one from a sample of last quarter's traces) and bucket your eval cases by intent. Then label each case with a stakes flag based on the failure cost: low (mild user friction), medium (escalation), high (money, legal, safety). The slices that matter are the ones that show up in your real traffic and the ones that hurt the most when they fail. Hand-written slices the team brainstormed in a sprint rarely match either.

How big should each slice be?

Big enough that the score has a confidence interval you can act on. For a binary pass/fail rubric, that usually means 80 to 200 cases per slice. For a graded rubric, 50 to 120 is often enough. Smaller than that and the per-slice deltas you see in a bake-off are within noise. Bigger does not hurt but you start paying real money on judge calls. The right way to size it is to bootstrap a confidence interval from a single run and grow the slice until the interval is narrow enough to support the decision.
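A minimal sketch of that sizing step, bootstrapping the interval from one run's per-case scores (standard resampling, nothing specific to any eval framework):

```python
import random
from statistics import mean

def bootstrap_ci(case_scores, n_resamples=2000, seed=7):
    # case_scores: per-case scores for one slice from a single run
    # (0/1 for a binary rubric, 0-100 for a graded one).
    rng = random.Random(seed)
    n = len(case_scores)
    means = sorted(mean(rng.choices(case_scores, k=n)) for _ in range(n_resamples))
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]  # ~95% CI
```

If the interval is wider than the 2-point regression threshold you plan to gate on, grow the slice before trusting its delta.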

What does the high-stakes gate look like in practice?

A list of cases that the model must pass. Not 'must score X average', but must pass each one. Often these are 30 to 80 cases curated from past incidents, escalations, and adversarial testing. The shipping rubric treats any new failure on this slice as a hard no, regardless of how the rest of the bake-off looks. The cost of one wrong refund or one inappropriate medical answer is higher than any aggregate gain.

How do I keep the LLM judge honest?

Pin the judge model and the judge prompt. Do not let either drift. Calibrate against a small human-labeled gold set every two weeks: roughly 50 cases that you re-grade by hand. Compute the agreement rate between the judge and the human labels. If it drops below 0.85, revise the judge prompt or pin a different judge. Treat the judge calibration as its own metric tracked over time, separate from the bake-off output.
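A minimal sketch of the calibration check, assuming a judge_label wrapper around the pinned judge and a gold set whose cases carry a human_label field; both names are illustrative:

```python
def judge_agreement(gold_set, judge_label):
    # gold_set: the ~50 cases re-graded by hand every two weeks.
    matches = sum(1 for case in gold_set if judge_label(case) == case["human_label"])
    return matches / len(gold_set)

# Track the result over time; below 0.85, revise the judge prompt or pin a
# different judge before trusting any new bake-off numbers.
```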

What about cost and latency?

They belong in the rubric but not as the primary axis. The right shape is: pass the quality gates first, then weigh cost and latency as deciding factors between models that both passed. Most teams get this wrong by leading with cost and only bringing in quality at the end. The result is a swap that saves 20 percent on tokens and adds 7 percent to escalation volume, which is a much bigger cost line if you measure it.

How often should I rerun the bake-off?

Every model release from your incumbent vendor, every snapshot rotation, and on any major change to the agent's tool surface or system prompt. In practice that is roughly monthly for an actively maintained agent. The cost of a bake-off is a few hundred dollars in judge calls plus a few hours of an engineer's time if the harness is well built. The cost of skipping one is usually one production incident per quarter that the bake-off would have caught.

What about hybrid deployments where different flows use different models?

This is increasingly the right answer. The bake-off slices map naturally onto deployment slices. If the new model wins long-context by 7 points and loses tail intents by 3, the shipping decision is to route long-context queries to the new model and keep the rest on the incumbent. This requires a thin routing layer in front of the agent, which is a one-day build, but it lets you take the wins without eating the regressions.
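A minimal sketch of that routing layer; the model identifiers, the 32k token threshold, and the tail-intent list are placeholders to be filled in from your own bake-off results:

```python
LONG_CONTEXT_TOKENS = 32_000
TAIL_INTENTS = {"refund_international", "warranty_transfer", "account_merge"}  # placeholders

CHALLENGER = "challenger-model-id"  # placeholder identifiers, not real model names
INCUMBENT = "incumbent-model-id"

def pick_model(intent: str, input_tokens: int) -> str:
    # Long-context flows go to the challenger, which won that slice by 7 points;
    # tail intents and everything else stay on the incumbent.
    if input_tokens > LONG_CONTEXT_TOKENS and intent not in TAIL_INTENTS:
        return CHALLENGER
    return INCUMBENT
```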

Where does fde10x fit?

We are one option for teams that want a senior engineer to set up the bake-off harness and the per-slice rubric inside their repo. Common engagement: two to six weeks to mine the eval set from production traces, build the per-slice rubric, calibrate the judge, run the first bake-off, and leave behind the harness so the team runs it monthly. Plenty of teams do this themselves; the embed is the right call when the team is qualifying model upgrades by gut feel and producing a different shipping decision each time.

What is the smallest version of this I can ship next week?

Three things. First, take 200 production traces from the last 30 days and label each one with an intent and a stakes flag. Second, write a binary rubric per intent that a grader can apply consistently. Third, run the eval against your incumbent model and against one candidate, slice the results, and produce a one-page table with per-slice scores and deltas. That table is enough to start making shipping decisions on something other than a mean.
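A minimal sketch of the third step, assuming the graded results are a flat list of per-case records and every intent was run against both models; the field names are illustrative:

```python
from collections import defaultdict
from statistics import mean

def slice_table(results):
    # results: one record per (case, model) pair, e.g.
    # {"intent": "refund", "model": "incumbent", "score": 1}
    scores = defaultdict(list)
    for r in results:
        scores[(r["intent"], r["model"])].append(r["score"])

    print(f"{'intent':<24}{'incumbent':>10}{'candidate':>10}{'delta':>8}")
    for intent in sorted({intent for intent, _ in scores}):
        inc = 100 * mean(scores[(intent, "incumbent")])
        cand = 100 * mean(scores[(intent, "candidate")])
        print(f"{intent:<24}{inc:>10.1f}{cand:>10.1f}{cand - inc:>+8.1f}")
```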