Alternative, comparison: client-specific rubric vs aggregate leaderboard
The leaderboard stopped predicting production at 88 percent. Your rubric is the gate now.
Every frontier model in May 2026 sits between 88 and 93 percent on MMLU. The band is narrower than the measurement noise. Teams running their own domain evals on legal, medical, and financial workloads measure a 15 to 30 point accuracy gap against whichever public benchmark looks closest. The shipping decision belongs to a rubric you control, in your repo, scored on cases drawn from your traffic. Below is the rubric.yaml shape we ship on every PIAS engagement and the gate workflow that reads it.
Direct answer, verified 2026-05-06
Aggregate benchmarks or client-specific rubric for the shipping decision?
- Use a client-specific rubric for the shipping decision. Public aggregate benchmarks (MMLU, MMLU-Pro, HumanEval, GSM8K) have saturated above 88 percent across every frontier model and no longer differentiate them.
- The gap from leaderboard score to production score on a real workload is 15 to 30 percentage points on legal, medical, and financial tasks, per teams running their own domain evals against the same models.
- Use aggregate benchmarks for one job only: shortlisting three or four candidates from a field of twenty. Then run your client-specific rubric on 100 to 200 representative cases against each candidate and let that decide.
- On a real engagement we wire this hierarchy into one file on your main branch. rubric.yaml carries three layered keys: aggregate_informational, client_slices, and high_stakes. The CI gate reads the latter two.
Sources verified 2026-05-06: TrueFoundry on enterprise LLM benchmarking, Digiteria Labs on benchmark contamination, Kili Technology on 2026 benchmark limits.
The math that broke the leaderboard
Four numbers explain why every PR that proposes a model swap on the basis of an aggregate score is making a decision the data does not actually support.
The first two numbers describe a saturated band. The 88 to 93 range is roughly the noise floor of MMLU itself; reasonable researchers disagree on the right answers to several percent of the questions. The third number is what teams measure when they grade the same models on their own cases instead of academic ones. The fourth is the average lab-to-production drop reported by enterprise teams running real agents. None of these gaps is visible in the leaderboard column the swap PR uses to justify itself.
Side by side, decision by decision
Left: a typical aggregate-driven decision flow on a production agent. Right: the client-specific rubric flow we wire into a PIAS engagement. Every row is a question that gets answered every time someone proposes a model swap.
| Question | Aggregate benchmark | Client-specific rubric (PIAS) |
|---|---|---|
| Headline number on the leaderboard | Quoted in the Slack message that proposed the swap. The PR description has the leaderboard delta and nothing else. | Recorded in rubric.yaml under aggregate_informational. Lands as a field in the shipping report. Never blocks a merge. Never pauses an invoice. |
| How the eval set is built | Whichever public benchmark was easiest to point at. Often a single number from MMLU or HumanEval, occasionally Arena Elo, with no notion of what your traffic looks like. | 70 percent mined from your production traces, weighted to your real intent distribution. 20 percent from past incidents. 10 percent adversarial. Each case carries a slice tag and a stakes tag. |
| What a regression looks like | The leaderboard score went down. If it went up, ship. Tail behavior on your actual workload is invisible to the decision. | Any client_slice that drops more than rubric.client_slice_regression_max points. Any high_stakes case that newly fails. Both gates have to be green to merge. The aggregate moving in either direction is informational. |
| Run-to-run consistency | One pass through the benchmark, one number on the slide. The variance is hidden. The agent appears reliable in evaluation and unreliable in production. | Eight runs per case, p50 and worst-of-eight reported. The gate reads worst-of-eight, not the mean. A 60 percent single-run number that drops to 25 percent across eight runs is a fail, not a pass. |
| Where the rubric lives | In a Notion page that has not been edited since the launch demo. The thresholds drift in someone's head. The next swap quietly loosens them. | rubric.yaml on main, version-controlled, read by .github/workflows/pilot-gate.yml on every PR and on a Monday 09:00 UTC cron. Stakeholders argue by opening a PR to that file. |
| How a model swap is justified | PR description includes the leaderboard delta and a screenshot of the artificialanalysis.ai page. The slice picture is reconstructed from incidents the week after launch. | PR description includes the per-slice delta table, the high-stakes pass count, p50 and p95 latency, cost per million tokens, and the trailing seven-day production snapshot graded against the same rubric. |
| How the refund clause is wired | The MSA has an acceptance review clause. Triggering it requires written notice and 30 to 45 days of negotiation. The leaderboard number does not move it. | pilot-gate.yml posts to BILLING_REFUND_WEBHOOK when rubric_min_score for the client_slices block falls below threshold on the Monday snapshot. Open invoice pauses. A GitHub issue opens with label refund-triggered. No human clicks anything. |
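
The regression row above reduces to two checks, which can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual pilot-gate.yml logic: the `SliceResult` type, the `gate` function, the case IDs, and all scores are hypothetical, and `regression_max` stands in for rubric.client_slice_regression_max.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    name: str
    baseline: float   # incumbent score on this slice, 0 to 100
    candidate: float  # candidate score on this slice, 0 to 100


def gate(slices, baseline_high_stakes, candidate_high_stakes, regression_max):
    """Return (ok, reasons). Both checks must pass for the gate to be green."""
    reasons = []
    # Check 1: no client slice may drop by more than regression_max points.
    for s in slices:
        drop = s.baseline - s.candidate
        if drop > regression_max:
            reasons.append(f"slice {s.name} dropped {drop:.1f} points")
    # Check 2: no high-stakes case may newly fail (passed before, fails now).
    newly_failing = sorted(
        case
        for case, passed in baseline_high_stakes.items()
        if passed and not candidate_high_stakes.get(case, False)
    )
    for case in newly_failing:
        reasons.append(f"high-stakes case {case} newly fails")
    return (not reasons, reasons)


# Illustrative numbers only: one slice improves, one regresses past the limit.
slices = [
    SliceResult("head_intents", 92.0, 93.5),
    SliceResult("long_context", 84.0, 70.0),
]
ok, reasons = gate(
    slices,
    baseline_high_stakes={"refund_amount_01": True},
    candidate_high_stakes={"refund_amount_01": True},
    regression_max=3.0,
)
print(ok, reasons)
```

Note that the aggregate leaderboard score never appears as an input to `gate`; that absence is the point of the design.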
Anchor fact: rubric.yaml has three layered keys
On every PIAS engagement the same rubric.yaml shape lands on the client's main branch. Three top-level keys make the hierarchy explicit so a new engineer reading the file cold cannot accidentally treat MMLU-Pro as load-bearing.
Three layered keys: one is informational, two are the gate
- aggregate_informational. MMLU-Pro, HumanEval, Arena Elo, SWE-bench Verified. Recorded for context. Weight 0.0. Read by no required gate. Visible in the per-PR report at the top so reviewers see the headline next to the slice numbers.
- client_slices. Head intents, tail intents, long context, adversarial. Each slice has its own rubric_min_score and its own weight; the weights sum to 1.0. The weighted score is the gate.
- high_stakes. Refund-amount accuracy, medication dosage tracing, legal disclaimer presence, anything where a single newly failing case is a release-blocker. Pass-fail. The hardest gate in the file.
The three keys are not interchangeable. Promoting an aggregate score to client_slice weight is the failure mode this shape is designed to prevent. We have seen teams drift into it by accident, which is why rubric.yaml is reviewed by the engineer who owns the gate, not by whoever opened the PR.
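
To make the three-key hierarchy concrete, here is a hedged sketch of the shape as a Python dict, with a validator that rejects the drift described above. Only the three top-level key names, the 0.0 aggregate weight, and rubric_min_score come from the text; the nested field names, slice weights, thresholds, and high-stakes case names are illustrative assumptions, not the contents of any real rubric.yaml.

```python
import math

# Hypothetical in-memory mirror of the rubric.yaml shape described above.
RUBRIC = {
    "aggregate_informational": {
        "weight": 0.0,  # recorded for context, read by no required gate
        "benchmarks": ["MMLU-Pro", "HumanEval", "Arena Elo", "SWE-bench Verified"],
    },
    "client_slices": {
        "head_intents": {"weight": 0.4, "rubric_min_score": 90},
        "tail_intents": {"weight": 0.3, "rubric_min_score": 80},
        "long_context": {"weight": 0.2, "rubric_min_score": 80},
        "adversarial": {"weight": 0.1, "rubric_min_score": 70},
    },
    "high_stakes": ["refund_amount_accuracy", "legal_disclaimer_presence"],
}


def validate(rubric):
    """Return a list of structural errors; an empty list means the shape holds."""
    errors = []
    if rubric["aggregate_informational"]["weight"] != 0.0:
        errors.append("aggregate_informational must carry weight 0.0")
    total = sum(s["weight"] for s in rubric["client_slices"].values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        errors.append(f"client_slices weights sum to {total}, expected 1.0")
    if not rubric["high_stakes"]:
        errors.append("high_stakes must list at least one case")
    return errors


print(validate(RUBRIC))
```

Running a check like this in CI before the gate itself is one way to stop an aggregate score from quietly acquiring weight in a later PR.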
The file, verbatim
A redacted excerpt from the rubric.yaml shipped on a discharge summary drafter agent. The aggregate block is present so reviewers see the headline. The client slices and high-stakes block are what the gate reads. The billing webhook hook at the bottom is the refund clause expressed as code instead of as an MSA paragraph.
One swap, two stories
A real (redacted) PR. The first column is the Slack thread that proposed the swap on the basis of the leaderboard. The second is the same swap viewed through rubric.yaml. The first version would have shipped. The second version did not.
Same agent, six weeks later
Compare what each flow leaves on the client's repo at week 6. The model under consideration was the same. The leaderboard column was the same. The decision rule was different, and so the artifact, the incident count, and the rollback story are different.
Two decision rules, one production system
Week 6 of an engagement that picked the model from artificialanalysis.ai. The router runs the candidate that ranked first on MMLU-Pro at the time of the swap. Production traces show a 14 point drop on long-context cases, a quietly higher refund-amount hallucination rate, and three escalations the on-call engineer caught in the last 14 days. The PR that flipped the model has no slice numbers in the description; the rollback PR is now blocked because nobody can point at a file that says what good looks like for this agent.
- no client_slices block in any rubric file
- swap PR description was a leaderboard screenshot
- incidents found by users, not by CI
- rollback blocked because thresholds are a meeting, not a file
- next swap is on the same path
“The aggregate benchmark weight in rubric.yaml is 0.0 on every PIAS engagement. The number is recorded so a reviewer can see it. The gate ignores it. The shipping decision belongs to client slices and high-stakes cases scored on prompts mined from the client's traffic, run worst-of-eight, with the refund webhook wired to the slice threshold.”
rubric.yaml shape, applied across Monetizy.ai, Upstate Remedial Management, OpenLaw, PriceFox, OpenArt
When the leaderboard is still the right call
The argument is not that aggregate benchmarks are useless. They are useful for one job: narrowing twenty candidates to four. A shortlist is a coarse signal task and the leaderboard is fine at it. Past the shortlist, the leaderboard stops carrying information; the candidates are within the noise band and your workload is what separates them.
Three other cases where aggregate is genuinely the right input. First, when a checkpoint your agent depends on is deprecated and the next version from the same provider is the only path forward; the rubric becomes a confirmation, not a decision. Second, when a new modality opens up (long-context, vision, audio) and your client corpus has no cases that exercise it yet; the leaderboard is the only signal until the corpus catches up. Third, when cost or latency moves an order of magnitude and your rubric is comfortably above the threshold; the aggregate score is a useful tiebreaker.
Outside those three, defaulting to the leaderboard is the failure mode the rubric.yaml shape exists to prevent. The Monday cron, the refund webhook, and the worst-of-eight metric are all there because we have watched aggregate-driven swaps quietly degrade agents that the leaderboard kept calling improvements.
Want this rubric.yaml shape in your repo, with the gate workflow and the refund clause wired to your slice threshold?
A 60-minute scoping call with the senior engineer who would own the build. You leave with a one-pager: the agent, the slice list, the rubric_min_score per slice, the high-stakes cases, and the named engineer who will deliver it.
Client-specific rubric vs aggregate benchmarks, the questions buyers actually ask
If frontier models all score 88 to 93 on MMLU, can we just pick the cheapest one?
Cheapest is usually a fine starting point, but only after your client-specific rubric has run on at least 100 representative cases of your real workload. The 88 to 93 band is the headline mean across thousands of academic questions. Your traffic is not academic questions. We have seen cheaper models lose by 4 to 6 points on tail intents and high-stakes slices while looking equivalent on the leaderboard, and we have seen them win for the same reason. The aggregate score does not predict it. Run the client rubric on the cheap model, run it on the expensive model, and let the slice deltas decide. If they are within noise, ship the cheap one and record both sets of numbers in rubric.yaml so the next swap has a baseline.
We already track MMLU-Pro and Arena Elo internally. Why add another rubric?
Public benchmarks are useful for shortlisting and for trend lines on raw model capability. They are not useful for deciding whether a specific model produces correct medical, legal, or financial output on your prompts at your latency budget. The honest split is two layers, not one. Keep MMLU-Pro and Arena Elo as the funnel from twenty candidates to four. Then run a client-specific rubric on those four against 100 to 200 cases mined from your production traces. The shipping decision belongs to layer two. We treat layer one as informational in rubric.yaml so the headline is visible in PR review without ever blocking a merge.
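
The two-layer split can be sketched as two small functions. This is an illustration under assumptions: the model names, leaderboard scores, rubric scores, and thresholds are invented, and `shortlist` and `decide` are hypothetical helper names.

```python
def shortlist(leaderboard, k=4):
    """Layer one: keep the top-k candidates by aggregate score."""
    return sorted(leaderboard, key=leaderboard.get, reverse=True)[:k]


def decide(shortlisted, rubric_scores, thresholds):
    """Layer two: keep candidates that clear every client-slice threshold.

    The aggregate score is deliberately not an input here.
    """
    return [
        model
        for model in shortlisted
        if all(rubric_scores[model][s] >= t for s, t in thresholds.items())
    ]


# Invented numbers: six candidates inside the saturated band.
LEADERBOARD = {"m1": 92.1, "m2": 91.8, "m3": 91.5, "m4": 90.9, "m5": 89.7, "m6": 88.4}
RUBRIC_SCORES = {
    "m1": {"tail_intents": 71, "long_context": 78},
    "m2": {"tail_intents": 84, "long_context": 86},
    "m3": {"tail_intents": 79, "long_context": 88},
    "m4": {"tail_intents": 85, "long_context": 83},
}
THRESHOLDS = {"tail_intents": 80, "long_context": 80}

candidates = shortlist(LEADERBOARD, k=4)
survivors = decide(candidates, RUBRIC_SCORES, THRESHOLDS)
print(candidates, survivors)
```

In this made-up example the leaderboard leader `m1` makes the shortlist and then fails both slices, which is the situation the two-layer split exists to surface.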
What does run-to-run variance have to do with this?
Public benchmarks usually report one pass per question. Real agents make eight, ten, or fifty calls before completing a task and the variance compounds. A model that scores 60 percent on a single pass against your eval can drop to 25 percent across eight consecutive runs of the same case, and that is the number that ships. We set runs_per_case to 8 in rubric.yaml and the gate reads worst-of-eight, not the mean. The leaderboards do not measure this. A model that ranks first on a single-pass benchmark and second on a worst-of-eight evaluation is a model that will look great in evaluation and break in production. Worst-of-eight is the metric the on-call engineer cares about.
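
A small sketch shows how a single-pass number and a worst-of-eight number diverge on the same cases. It assumes that "worst-of-eight" means the minimum score across eight runs of a case and that a case passes only if that minimum clears the pass mark; the case IDs, scores, and the 0.8 pass mark are illustrative.

```python
from statistics import median


def summarize_case(run_scores):
    """p50 and worst-of-n for one case across repeated runs."""
    return {"p50": median(run_scores), "worst": min(run_scores)}


def pass_rates(cases, pass_mark):
    """Return (single-run pass rate, worst-of-n pass rate).

    cases maps case_id -> list of per-run scores. The single-run rate looks
    only at the first run; the worst-of-n rate needs every run to pass.
    """
    n = len(cases)
    single = sum(runs[0] >= pass_mark for runs in cases.values()) / n
    worst = sum(min(runs) >= pass_mark for runs in cases.values()) / n
    return single, worst


# Four illustrative cases, eight runs each.
CASES = {
    "a": [0.9] * 8,                                   # stable pass
    "b": [0.9, 0.9, 0.5, 0.9, 0.9, 0.9, 0.9, 0.9],    # one bad run
    "c": [0.4] * 8,                                   # stable fail
    "d": [0.85, 0.7, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85],
}

single, worst = pass_rates(CASES, pass_mark=0.8)
print(single, worst)  # 0.75 0.25
```

Cases `b` and `d` look fine on the first run and fail under worst-of-eight, so the gate number is a third of the single-pass number here.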
Where in the repo does this rubric live, and what reads it?
rubric.yaml lives at the root of an eval directory on the client's main branch (we use eval/rubric.yaml or rubric.yaml at root depending on existing layout). Two GitHub Actions workflows read it. .github/workflows/pilot-gate.yml runs on every PR that touches the agent code or the rubric, and on a Monday 09:00 UTC cron. .github/workflows/production-gate.yml runs only on a promotion PR with the title prefix promote: <agent> pilot -> production and verifies the seven-day trailing nightly history. Both gates read rubric.yaml directly. There is no separate config in a vendor system, no Notion page, no spreadsheet that has to be reconciled.
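
The promotion check described above can be sketched as a title match plus a history check. Several details are assumptions: that "verifies the seven-day trailing nightly history" means all seven most recent nightly runs were green, that the agent name contains no whitespace, and that the history arrives as a list of booleans. The function name and the agent name in the example are hypothetical.

```python
import re

# Matches the title prefix "promote: <agent> pilot -> production".
PROMOTE_TITLE = re.compile(r"^promote: (?P<agent>\S+) pilot -> production")


def promotion_ready(pr_title, nightly_green):
    """Return (agent, ready).

    nightly_green lists nightly gate results, oldest first. The PR is ready
    only if the title matches and the last seven nightly runs were all green.
    """
    match = PROMOTE_TITLE.match(pr_title)
    if match is None:
        return (None, False)
    last_seven = nightly_green[-7:]
    ready = len(last_seven) == 7 and all(last_seven)
    return (match.group("agent"), ready)


print(promotion_ready("promote: discharge-drafter pilot -> production", [True] * 7))
print(promotion_ready("promote: discharge-drafter pilot -> production", [True] * 6 + [False]))
```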
How is the refund clause wired to the rubric?
The Monday 09:00 UTC cron in pilot-gate.yml runs the eval set against the agent in production. If the weighted score on the client_slices block drops below rubric_min_score, a refund-signal job posts the failing snapshot to a billing webhook configured at billing_refund_webhook in rubric.yaml. The webhook pauses the open invoice. The same job opens a GitHub issue against the engagement owner with labels refund-triggered and billing-paused. No human triggers anything; the gate is mechanical. If the next Monday is green, billing resumes with a 7-day prorated credit. The clause is wired to the client slice, not to the aggregate benchmark, because aggregate movement does not map to user-visible regressions.
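
The refund-signal decision can be sketched as a weighted score compared against a threshold. This illustration assumes a single block-level rubric_min_score for the weighted client_slices score; the payload field names, slice weights, and scores are hypothetical, and the sketch only builds the payload rather than posting it anywhere. The two label names are the ones given above.

```python
def weighted_score(slice_scores, weights):
    """Weighted sum of per-slice scores; weights are expected to sum to 1.0."""
    return sum(slice_scores[name] * w for name, w in weights.items())


def refund_signal(slice_scores, weights, rubric_min_score):
    """Return a payload for the billing webhook, or None if the gate is green."""
    score = weighted_score(slice_scores, weights)
    if score >= rubric_min_score:
        return None
    return {
        "event": "refund-triggered",  # hypothetical field name and value
        "weighted_score": round(score, 2),
        "rubric_min_score": rubric_min_score,
        "labels": ["refund-triggered", "billing-paused"],
        "slice_scores": dict(slice_scores),
    }


# Illustrative Monday snapshot.
WEIGHTS = {"head_intents": 0.4, "tail_intents": 0.3, "long_context": 0.2, "adversarial": 0.1}
SNAPSHOT = {"head_intents": 90, "tail_intents": 70, "long_context": 60, "adversarial": 80}

payload = refund_signal(SNAPSHOT, WEIGHTS, rubric_min_score=85)
print(payload["weighted_score"])  # 77.0
```

The aggregate benchmark is not an input to either function, which matches the clause being wired to the client slices alone.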
Are there cases where the aggregate benchmark should still drive a swap?
Yes, but they are narrower than they look. Three real cases. (1) The model you are using has been formally deprecated and the only path forward is the next checkpoint from the same provider; the rubric is then a confirmation, not a decision. (2) A new modality opens up (long-context, vision, audio) and your client rubric has no cases that exercise it; the leaderboard is the only signal until you grow the corpus. (3) Cost or latency moves an order of magnitude and your rubric is comfortably above the threshold; the headline number tells you which candidate to grade first. Outside those, defaulting to the leaderboard is the failure mode this whole rubric is built to prevent.
MMLU has saturation and contamination problems. Does that mean we should drop it from rubric.yaml entirely?
No, but treat it as informational. Independent audits in 2025 to 2026 confirmed that MMLU, GSM8K, and ARC-Challenge questions appear verbatim or near-verbatim in training corpora for several major model families, and the contamination on MMLU has been estimated to inflate scores by 8 to 15 points for some models. That is exactly why aggregate_informational has weight 0.0 in our rubric.yaml shape. We still record the score so a reviewer can see whether the candidate is in the same ballpark as the incumbent on academic knowledge, but the gate ignores it. MMLU-Pro, GPQA, and the harder successors are slightly more useful for shortlisting because contamination is lower, but the same rule applies: informational only.
How big does the client rubric need to be to be trusted?
100 to 200 graded cases is the working minimum across every engagement we have shipped. Below 80 cases the per-slice signal is too noisy to set a regression threshold tighter than 5 points, which is wider than most real differences between candidates. At 200 you can set per-slice thresholds at 2 to 3 points and trust them. We split the cases roughly: 70 percent mined from production traces, weighted to actual intent distribution; 20 percent from incidents and escalations; 10 percent adversarial cases curated by hand. Cases are versioned in eval/cases.yaml. The corpus grows by 10 to 20 cases per week through a tail-mining workflow and the gate reads the current corpus on every run.
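
A rough way to see why corpus size limits threshold tightness is the binomial standard error of an accuracy estimate. The sketch below is a simplified model, not the method used to set the thresholds above: it assumes independent pass/fail cases and an illustrative 85 percent accuracy, and the function names are hypothetical. It also applies the 70/20/10 split to a 200-case corpus.

```python
import math


def standard_error_points(accuracy, n_cases):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100 * math.sqrt(accuracy * (1 - accuracy) / n_cases)


def corpus_split(total_cases):
    """Apply the 70/20/10 split using integer arithmetic."""
    traces = total_cases * 70 // 100
    incidents = total_cases * 20 // 100
    return {
        "production_traces": traces,
        "incidents": incidents,
        "adversarial": total_cases - traces - incidents,
    }


for n in (80, 200):
    print(n, round(standard_error_points(0.85, n), 1))
# 80 4.0
# 200 2.5

print(corpus_split(200))
# {'production_traces': 140, 'incidents': 40, 'adversarial': 20}
```

Under these assumptions a single estimate on 80 cases carries about 4 points of standard error, and 200 cases brings it to about 2.5, which is in line with the 5 point and 2 to 3 point figures quoted. Per-slice counts are smaller than the corpus total, so per-slice noise is larger than these numbers suggest.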
What happens to my MMLU-Pro and HumanEval numbers in the shipping report?
They render at the top of the per-PR report, labelled informational, with a one-line note that they did not factor into the gate decision. We keep them visible for two reasons. First, reviewers expect to see them; hiding them creates more friction than recording them at weight 0.0. Second, a sustained trend across many candidates (for example, every new candidate is ten points worse on HumanEval) is a signal worth investigating outside the gate, even if it does not block any single merge. The slice numbers sit directly below the aggregate block in the report, so the gate-relevant data is the next thing a reviewer reads after the headline.
We are about to do an LLM bake-off across providers. How do we run this without doubling the work?
Take the aggregate score for shortlisting from a public source you trust (artificialanalysis.ai, lmsys.org for Arena Elo, MMLU-Pro from Hugging Face). Pick three or four candidates. For each, run the client rubric in a single CI job that scores all candidates against the same rubric.yaml in parallel. The output is one shipping report with four columns and the per-slice deltas between every pair. Total wall-clock is 30 to 60 minutes for 200 cases and four models on most production setups. The bake-off ends when one candidate clears every threshold and the others miss at least one. We have a separate guide on how the bake-off methodology slices head, tail, and high-stakes cases, linked at the bottom.
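
The report's pairwise deltas and the stopping rule can be sketched as follows. The model names, slice names, scores, and thresholds are invented, and the helper names are hypothetical; the sketch only assumes that every candidate was scored on the same slices.

```python
from itertools import combinations


def clears(scores, thresholds):
    """True if a candidate meets or beats every slice threshold."""
    return all(scores[s] >= t for s, t in thresholds.items())


def pairwise_deltas(results):
    """results: model -> {slice: score}. Returns {(a, b): {slice: a minus b}}."""
    return {
        (a, b): {s: results[a][s] - results[b][s] for s in results[a]}
        for a, b in combinations(sorted(results), 2)
    }


def bakeoff_winner(results, thresholds):
    """The bake-off ends only when exactly one candidate clears every threshold."""
    passing = [m for m in sorted(results) if clears(results[m], thresholds)]
    return passing[0] if len(passing) == 1 else None


RESULTS = {
    "model_a": {"head": 93, "tail": 84},
    "model_b": {"head": 94, "tail": 76},
    "model_c": {"head": 89, "tail": 85},
}
THRESHOLDS = {"head": 90, "tail": 80}

deltas = pairwise_deltas(RESULTS)
winner = bakeoff_winner(RESULTS, THRESHOLDS)
print(deltas[("model_a", "model_b")], winner)
```

If two candidates both clear every threshold, `bakeoff_winner` returns `None`; in that situation cost and latency are the natural tiebreakers.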
Adjacent reading
More on the eval-as-gate engagement model
LLM bake-offs: aggregate scores will lie to you, tail-prompt slices will not
The bake-off methodology that runs the same client rubric across four candidate models in one CI job and reports the per-slice delta table.
The AI execution gap is a missing rubric, the eval harness is the gate
Why the demo bar is not the production bar, and what it looks like when the rubric is a file in the repo instead of a slide in a deck.
Forward deployed engineer vs consultant: a 7-point rubric
The contract-clause rubric that distinguishes forward deployed engineering from advisory work, including the six file paths the engagement leaves behind on main.