Engagement rubric / Week 2 prototype gate
FDE Week 2 prototype rubric: the file, the five axes, and the 30-minute decision meeting.
Most pages on forward-deployed engineering describe the role. None of them ship the rubric. This is the file we use on every fde10x engagement: a single rubric.yaml on main, five graded axes, 15 cases drawn from real production traces, thresholds 15 percent below the Week 6 production bar, a ratchet that climbs four points each Monday, and a refund webhook wired to the calendar-day-14 snapshot.
Direct answer (verified 2026-05-01 against fde10x.com/how-it-works)
A Week 2 prototype rubric for an FDE engagement is a single rubric.yaml file on main that scores the prototype against five weighted axes (faithfulness 0.30, helpfulness 0.25, completeness 0.20, tone 0.15, policy 0.10), against at least 15 cases drawn from real production traces, with thresholds set ~15 percent below the Week 6 production bar (rubric_min_score 0.78, p95 6000ms, policy floor 1.00). It runs on every PR and on a Monday 09:00 UTC cron. The calendar-day-14 snapshot is the input to a 30-minute decision meeting between the engagement owner and the client lead: continue, or refund and exit.
The whole file
Here is the rubric.yaml we land on main on calendar day 7 of every engagement. One file, no wrappers, no helper repo. The shape is the value: five axes with floors, the week-2 row of the ratchet, the case set pointer, the refund webhook env name. Edit it via PR.
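A sketch of that file, assembled from the numbers quoted on this page. Key names and the env var name are illustrative; the weights, floors, and week-2 thresholds are the ones this page states.

```yaml
# rubric.yaml — sketch; field names illustrative, values as quoted on this page
agent: <agent>
cases: eval/cases.yaml                  # 15+ production-trace-derived cases by day 14
refund_webhook_env: REFUND_WEBHOOK_URL  # env name only, never a hardcoded URL

axes:
  faithfulness: { weight: 0.30, floor: 0.74 }  # floor climbs to 0.92 by week 6
  helpfulness:  { weight: 0.25, floor: 0.70 }
  completeness: { weight: 0.20, floor: 0.65 }  # lifts fastest with a prompt edit
  tone:         { weight: 0.15, floor: 0.60 }
  policy:       { weight: 0.10, floor: 1.00 }  # hard floor, no ramp-up

ratchet:
  week2:
    rubric_min_score: 0.78        # ~15% below the week-6 production bar
    ragas_faithfulness_min: 0.74
    p95_ms: 6000
    policy_floor: 1.00
  # rows for weeks 3-6 live in this same file; the bar climbs ~0.04 each Monday
```

The weights sum to 1.00, and the per-axis floors are what make a single policy fail block the gate regardless of the weighted average.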
Why five axes, not one number
The single hardest mistake in a Week 2 rubric is collapsing the score to one number. A weighted average that reads 0.84 hides the case where the agent is faithful, helpful, complete, on-tone, and writes a claim the EU AI Act would flag. The five-axis shape catches the failure modes the average will not.
Per-axis floors plus a hard policy floor mean a single high-stakes fail on policy blocks the gate even if every other case is perfect. That is what makes the rubric an actual gate, not a dashboard.
1. Faithfulness (weight 0.30, floor 0.74)
The single most-watched axis. Did the output stay grounded in the inputs the agent was actually given? Ragas-judged on retrieval-augmented work, rubric-judged elsewhere. A weighted-average score that hides one hallucinated claim per ten cases is worse than a lower score that does not.
Floor 0.74 at week 2 means: across 15 cases, the agent can be unfaithful on at most ~4 of them (15 × 0.26 ≈ 3.9). By week 6 the floor is 0.92, so the allowance drops to ~1.
2. Helpfulness (weight 0.25, floor 0.70)
Did the output answer the actual ask, or did it answer an adjacent ask that was easier? Helpfulness is where prototype agents most often fool the team in a demo. The case set has to include real production traces (not demo questions) for this axis to mean anything.
Look for the case where the agent is faithful (every claim is true), the helpfulness score is 0.4, and the demo-day reviewer would have called it a win. That case is why this axis exists.
3. Completeness (weight 0.20, floor 0.65)
Are the required fields populated? For a discharge-summary drafter, that is the five named sections. For an underwriter assistant, the schedule of exclusions. For a sales draft assistant, the next step. Completeness is the easiest axis to grade and the most often skipped because it is boring.
Lower floor at week 2 because completeness lifts fastest with a prompt edit. Lifting it from 0.65 to 0.85 in week 3 is a normal week.
4. Tone (weight 0.15, floor 0.60)
Does the output sound like the client's brand or the model's brand? Graded against a one-page tone guide the client owns. The lowest-weight axis, but the one most often behind a cancelation when it is ignored.
If the client has no tone guide, the engagement owner writes a one-pager in week 1 and the client lead signs it. If neither happens, the axis goes unscored and the gate runs on four axes.
5. Policy (weight 0.10, hard floor 1.00)
Compliance lines the agent must not cross. PHI handling, advice that requires a human, cases the EU AI Act flags as high-risk. Per-axis floor of 1.00. One fail across the case set blocks every PR until the case passes again. The weight of 0.10 in the average is symbolic; the gate is the floor, not the weight.
Policy is the only axis where week 2 thresholds equal week 6 thresholds. There is no graceful ramp-up on a policy fail.
The ratchet: same rubric, climbing bar
A flat threshold from day 14 onward fails one of two ways. Set it at the Week 6 production bar and the prototype misses every Monday until week 5; the team learns to ignore the gate. Set it at the prototype bar and the team never feels the bar move; week 6 arrives, production gate fires, surprise miss. The ratchet is the third option: same file, same axes, threshold climbs four points each Monday.
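In the file, the ratchet is just a row per week. A sketch using the minimum scores quoted on this page; note the final step is slightly smaller than 0.04 so week 6 lands on the 0.92 production bar:

```yaml
# rubric.yaml (excerpt) — the bar climbs under the same axes and case set
ratchet:
  week2: { rubric_min_score: 0.78 }  # prototype gate, ~15% below production
  week3: { rubric_min_score: 0.82 }
  week4: { rubric_min_score: 0.86 }
  week5: { rubric_min_score: 0.90 }
  week6: { rubric_min_score: 0.92 }  # production bar
```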
The week-2 walkthrough
Five steps from rubric.yaml landing on main to the Monday-morning decision meeting. The shape is the same on every engagement; the numbers in the rubric are the only thing that changes.
Day 7: rubric.yaml lands on main
First PR is in. The rubric file is one of the first artifacts. Five axes, five floors, the week-2 row of the ratchet, the case set pointer, the refund webhook env name. The PR title is rubric: <agent> v0. Two reviewers: engagement owner, client lead.
Days 7 to 14: cases.yaml grows from 5 to 15
The senior engineer pulls cases from the client's production traces (or, for a green-field product, from interviews with the operators who will use the agent). Each case has expected fields, stakes tag, and per-axis ground truth. Day 14 minimum is 15 cases. 17 to 20 is normal.
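One entry in eval/cases.yaml might look like this. Field names and the expected-field list are a sketch; what the page requires is the trace-derived input, the stakes tag, and per-axis ground truth, with stakes:high cases weighted 2.0.

```yaml
# eval/cases.yaml (one entry, sketch)
- id: case-011
  source: trace        # pulled from a production trace, not written for the demo
  stakes: high         # at least 4 of the 15 carry this tag
  weight: 2.0          # stakes:high cases are double-weighted
  input: "<redacted production prompt>"
  expected:
    fields: [summary, exclusions, next_step]  # illustrative required fields
  ground_truth:
    faithfulness: 1.0
    helpfulness: 1.0
    completeness: 1.0
    tone: 1.0
    policy: 1.0        # a fail here blocks every PR until the case passes again
```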
Day 14, 09:00 UTC: pilot-gate.yml fires
The Monday-morning snapshot runs. Score is computed. .pilot-gate/latest.json is written. If the score misses, the refund webhook fires before anyone is at their desk. The artifact is there for the 09:30 meeting either way.
Day 14, 09:30 UTC: 30-minute decision meeting
Engagement owner and client lead. Two files on screen: today's snapshot and day 7's snapshot. They look at the slope, not the absolute number. If the slope is achievable against next week's row of the ratchet, gate stands. If not, the refund clause fires for real and the engineer pairs with the client team to ship the leave-behind that week.
Days 15 to 21: ratchet climbs to week-3 row
Same rubric file. Same axes. Higher floor. The team feels the bar move up by ~4 points and adjusts. By week 4 the prototype is converging on production thresholds; by week 6 the production gate is the same numbers, just one step further up the ratchet.
The Monday morning, in terminal output
What pilot-gate.yml writes on calendar day 14 when the score misses, and what the 30-minute meeting reads to decide whether to keep building or fire the refund clause for real.
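A sketch of the snapshot the meeting reads. Every score below is invented for illustration; only the thresholds come from this page.

```json
{
  "week_row": "week2",
  "rubric_score": 0.74,
  "rubric_min_score": 0.78,
  "axes": {
    "faithfulness": 0.76,
    "helpfulness": 0.71,
    "completeness": 0.70,
    "tone": 0.68,
    "policy": 1.00
  },
  "p95_ms": 5400,
  "pass": false,
  "refund_webhook_fired": true
}
```

The 09:30 meeting puts this file next to the day-7 copy and reads the slope, not the absolute score.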
The gate file
pilot-gate.yml is the only required CI check on the prototype repo in weeks 2 to 5. It reads the row of the ratchet that matches the current calendar week, runs the rubric, and writes the snapshot. The refund-signal job fires only on the Monday cron, only on failure.
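A sketch of the workflow shape, assuming GitHub Actions. The eval runner script, job names, and secret name are hypothetical; the triggers and the refund-signal condition are the ones this page describes.

```yaml
# .github/workflows/pilot-gate.yml — required check on every PR, weeks 2-5
name: pilot-gate
on:
  pull_request:
  schedule:
    - cron: "0 9 * * 1"  # Monday 09:00 UTC snapshot

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Score rubric against the current ratchet row
        run: python eval/run_gate.py --rubric rubric.yaml --out .pilot-gate/latest.json
      - uses: actions/upload-artifact@v4
        if: always()  # keep the snapshot even on a miss
        with: { name: pilot-gate-snapshot, path: .pilot-gate/latest.json }

  refund-signal:
    needs: gate
    if: failure() && github.event_name == 'schedule'  # Monday cron only, miss only
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with: { name: pilot-gate-snapshot }
      - name: Fire refund webhook (env name lives in rubric.yaml)
        run: curl -fsS -X POST "$REFUND_WEBHOOK_URL" --data @latest.json
        env:
          REFUND_WEBHOOK_URL: ${{ secrets.REFUND_WEBHOOK_URL }}
```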
What goes wrong vs what we ship
A side-by-side of the failure modes we see in the scoping call and what the file we ship looks like instead. Six rows, all of them things we have watched a real engagement land on.
| Feature | What we see in scoping calls | The shape we ship |
|---|---|---|
| What the rubric measures | One number. Pass or fail per case. Hides the case where the agent is faithful but useless, or helpful but off-tone, or both but writes a date in the wrong format. | Five axes: faithfulness, helpfulness, completeness, tone, policy. Weighted average plus a per-axis floor and a hard policy floor. The shape catches failure modes the weighted average would hide. |
| Where the cases come from | 3 to 5 cases someone wrote on a Tuesday. The agent is good at exactly those 3 to 5. The first real user input is the first real eval case nobody graded. | 15 to 20 cases pulled from production traces (or, on a brand-new product, from real user interviews). At least 4 tagged stakes:high. No screenshots, no curated demo questions. |
| Where the threshold lives | In a Notion doc, a Linear ticket, or the engagement owner's head. Different stakeholders carry different numbers. Refund clause has nothing to bind to. | rubric.yaml on main. Week 2 row of the ratchet sets rubric_min_score 0.78, ragas_faithfulness_min 0.74, p95 6000ms, policy floor 1.00. Same file holds weeks 3 to 6. Edits are a PR. |
| What is harder at week 2 than week 6 | Week 6 production bar applied at week 2. The prototype is held to a standard the team has not had two weeks of fixes to hit. Failure looks structural; it is just early. | Nothing about the rubric is harder. Thresholds are 15 percent below week 6. Regression budget is unbounded because the prototype is meant to churn. p95 ceiling is 6000ms not 4500ms. |
| Who decides on Monday day 14 | A 60-minute meeting with 8 stakeholders. The score is interpreted differently by each. The slope is never computed. The decision drifts to next week, then the week after. | Engagement owner and client lead. 30 minutes. Two artifacts on the table: .pilot-gate/latest.json from day 14, and the same file from day 7. They review the slope, not just the score. |
| What happens on a miss | An email thread. Three weeks later the engagement is still going, the prototype is still missing the rubric, the invoice is still being sent, and the refund clause is still a sentence. | Workflow fires the refund webhook. Invoice pauses. Issue opens with .pilot-gate/latest.json attached. Decision meeting still happens; if the slope says next week is achievable, the gate stands and the refund issue closes with a note. |
Week 2 prototype rubric checklist
- rubric.yaml on main by calendar day 7
- Five axes with weights summing to 1.00 and per-axis floors
- Hard policy floor at 1.00 from week 2 (no ramp-up on policy)
- Ratchet rows for weeks 2 through 6 in the same file
- eval/cases.yaml with at least 15 production-trace-derived cases
- At least 4 cases tagged stakes:high
- pilot-gate.yml as a required check on every PR + Monday 09:00 UTC
- Refund webhook env name in rubric.yaml, not hardcoded
- 30-minute Monday day-14 meeting on the calendar
- Day 7 snapshot retained so day 14 can read the slope
What you can copy off this page
The rubric.yaml shape, the ratchet schedule, the five-axis weighting, and the pilot-gate.yml workflow are not proprietary. We use them on every engagement and the leave-behind in week 6 is the same set of files in your repo. Copying them today is the right move whether you ever talk to us or not.
What is hard to copy is the case set. 15 to 20 cases pulled from real production traces, with per-axis ground truth and stakes tags, takes a senior engineer about three days to assemble for a typical agent. That is most of what we are doing in week 1 of an engagement. If you want to write a rubric without the engagement, write it in this shape and then spend the three days assembling cases. Most teams skip the three days and the rubric never bites.
Want senior engineers in your repo writing this rubric next week?
Sixty minutes with the engineer who would own the build. You leave with a written one-pager: the outcome, the data sources, the rubric, and a fixed fee. The week-2 cancel-and-refund clause is in the MSA.
Frequently asked questions
What is a Week 2 prototype rubric in an FDE engagement?
A single rubric.yaml file on main that scores the prototype against five graded axes (faithfulness, helpfulness, completeness, tone, policy), against 15+ cases pulled from real production traces, with thresholds set ~15 percent below the Week 6 production bar. It runs on every PR and on a Monday 09:00 UTC cron. The Monday-of-day-14 snapshot is the cancel-and-refund decision input. We use this exact shape on every fde10x engagement.
Why is the Week 2 threshold lower than the Week 6 threshold?
Because a prototype scored against a production threshold fails every Monday in weeks 2 and 3. The team stops trusting the gate, the refund clause becomes theater, and the bar never gets felt. We use a ratchet: 0.78 at week 2, climbing ~0.04 each Monday to 0.92 at week 6. Same file, same axes, same case set. The rubric does not move; the bar moves up under it.
How many cases should a Week 2 prototype rubric have?
15 minimum on calendar day 14, 17 to 20 in practice. They have to be drawn from real production traces (or, on a green-field build, real interviews with the operators who will use the agent), not happy-path demo questions. At least 4 of the 15 should be tagged stakes:high and weighted 2.0. By week 6 the case set is at 30+ rows.
What's the difference between a Week 2 prototype rubric and a generic AI eval harness?
An eval harness is the machinery: ragas, the runner, the snapshot writer, the PR comment bot. The Week 2 prototype rubric is the configuration that decides what passes and what triggers a refund. The same eval harness with a different rubric file produces a different decision. Most teams ship the harness and skip the rubric file; the rubric file is what makes the engagement contractually enforceable.
Does the rubric live in the client's repo or fde10x's?
The client's. rubric.yaml, eval/cases.yaml, and .github/workflows/pilot-gate.yml all land on main in the client's GitHub. We hand them over with the runbook in week 6. The client owns the rubric, the case set, and the gate. We do not maintain a parallel copy or a vendor-attached runtime. That is the leave-behind.
What happens if the Week 2 rubric misses on Monday morning?
The pilot-gate workflow fires the refund webhook before anyone is at their desk. The invoice pauses, an issue opens with .pilot-gate/latest.json attached and assigned to the engagement owner and client lead. The 30-minute Monday meeting still happens. If the slope from day 7 to day 14 puts next week's row of the ratchet in reach, the gate stands and the refund issue closes with a note. If not, the refund clause fires for real and we ship the leave-behind that week.
Can I write my own Week 2 prototype rubric without an FDE engagement?
Yes, and you should. The shape is the value. Five axes with a hard policy floor, 15 trace-derived cases, a ratchet that climbs ~4 points per week, the rubric file on main, the gate as a required CI check, the snapshot retained so the slope is readable on day 14. None of those depend on us being in the repo. If you want help wiring it for a specific agent, the scoping call is free.
Other rubric and eval pages on fde10x.com
Related guides
AI pilot to production: the two gates and the promotion PR
What goes in pilot-gate.yml at week 2 and production-gate.yml at week 6, plus the one-line promotion PR that flips the rollout flag from 10 to 100 percent.
AI execution gap: the rubric is the missing artifact
Why teams ship against the demo rubric and miss the production rubric, and what the rubric file looks like when it is the shipping gate, not an afterthought.
Production AI agent eval harness
The harness that runs the rubric: ragas, case-specific graders, regression set discipline, PR-time scoring, and the failure-mode catalogue every shipped agent ends up needing.