Week 3 stall, FDE engagement notes

Why long horizon agents stall in week 3 on the same rubric that passed week 2

The week 2 prototype rubric was horizon-blind. It scored final outputs on short, single-decision traces, so error accumulation across turns, illegal transitions, and step-budget burn never penalized the score. Week 3 is when the same rubric is run against full multi-turn tasks and flatlines. The fix is not prompt tuning. It is rebuilding the rubric to grade trajectories instead of strings.

M
Matthew Diakonov
9 min read

Direct answer (verified 2026-05-06)

A long horizon agent stalls in week 3 because the rubric you wrote in week 2 grades the final answer string on short traces, so error accumulation, illegal transitions, and step-budget burn never moved the score. Run the same agent on full multi-turn tasks and a 0.86 week 2 score flattens to 0.45 in week 3 without the agent doing anything different. Add three axes to rubric.yaml: step_validity (per-transition legality and groundedness), budget_burn (steps consumed per case class against a fixed cap), and partial_credit_floor (proportional credit for half-finished trajectories). Re-grade the week 2 cases as full trajectories before you set the new floors. The pattern is consistent with the failure-mode taxonomy in the HORIZON benchmark paper (process-level risks dominate at 72.5% of long-horizon failures); the calendar-week pattern below is what we see on FDE engagements.

Monday morning of week 3: the score flips

The pilot-gate workflow on the merge queue fires at 09:00 UTC on the first Monday of week 3. The same eval/cases.yaml the team graded against on Friday of week 2 runs again, but this time the runner is no longer truncating trajectories at 4 turns. The full multi-turn rollout lands. The scorecard looks like this.

eval/_artifacts/scorecard.md

The temptation in the next 30 minutes is to blame the model release, blame the prompt, or quietly extend the prototype. None of those is the cause. The cause is that the rubric scored an agent that took four decisions per case, and is now scoring an agent that takes twenty-three. The five axes you have are not built to see the difference.

What the week 2 rubric actually scores

Open the rubric file you shipped in week 2 and read it. Every axis is scored on the final output: faithfulness on the final answer string, helpfulness on the final answer string, completeness on whether the final payload has the required fields, tone on the final answer string, policy on the final answer string. Nothing in the file knows the agent took twenty-three turns to get there.

Week 2 rubric: every axis is final-output graded

  • faithfulness scored only on the final answer string
  • helpfulness scored only on the final answer string
  • completeness scored only on whether required fields exist in the final payload
  • tone scored only on the final answer string
  • policy scored only on the final answer string

On a 4-turn trace, a final-output rubric is fine. There are not many transitions to be wrong about. On a 23-turn trace, the rubric is blind to the eighteen transitions in the middle, so an agent that makes four illegal tool calls, two non-progressing retries, and one ungrounded parameter substitution scores the same as an agent that executed twenty-three clean transitions, as long as the final string looks similar. The rubric is not lying; it is answering the wrong question.

The three axes a horizon-aware rubric carries

The fix is three new graded axes. Same rubric.yaml, same merge gate, same eval/cases.yaml. Floors are set against the week 2 cases re-graded as full trajectories, not against the prototype scorecard. That re-grade is non-negotiable; without it the new floors are still measured against numbers the short slice puffed up.

Horizon-aware rubric: eight axes total

  • step_validity: each transition is legal given prior tool state and observation history
  • budget_burn: steps consumed per task class against a fixed cap, with refund for early valid termination
  • partial_credit_floor: a half-finished trajectory earns proportional credit, never zero, never the full score
  • faithfulness, helpfulness, completeness, tone, policy: all five carried over, scored on the final state
  • termination: the agent stopped because the task was done, not because it ran out of steps

The rubric.yaml diff

This is the file the named senior engineer on the engagement edits in week 3, in a single PR. The merge gate (pilot-gate.yml) reads the file by name and does not need a workflow change. The judge prompt for step_validity is short enough that the merge reviewer reads it in one screen.

rubric.yaml
eval/judges/step_validity_prompt.md

What changes on Monday morning

Helpfulness fell from 0.88 to 0.42. The temptation is to rewrite the system prompt, add a few-shot example for the failing case class, and ship a PR. The score nudges back to 0.51. Next week the same trap finds you on a different case class, and the rubric still cannot tell you which transition is going wrong.

  • Score moves the wrong direction on a different slice
  • No signal on whether the agent is failing on plan, tool call, or termination
  • Loses week 3, often week 4, and sometimes the engagement

The four-step unstall sequence we run on FDE engagements

Week 3 of a forward-deployed engagement is where the rubric becomes horizon-aware. The four steps below are the exact sequence the named senior engineer runs on the day the prototype slice flips. The order matters: re-grade before you change the rubric, change the rubric before you wire it into the gate, wire it into the gate before you set the ratchet.

1

Re-grade the week 2 cases as full trajectories

Take the same eval/cases.yaml that scored 0.86 in week 2 and re-run it with full multi-turn rollouts capped at the realistic step ceiling for each case class. Expect the average to drop. The drop is the size of the horizon blindness, not a regression.

If the prototype was scored on truncated traces (max 4 turns), re-run with the same agent at the production ceiling (typically 16 to 32 turns). The delta between the two runs is your horizon-blindness budget; it tells the team how much of the 0.86 was coming from short traces, not from the agent.
2

Add step_validity, budget_burn, partial_credit_floor as graded axes

Append the three axes to rubric.yaml with floors set against the re-graded week 2 baseline, not against the prototype scorecard. The judge prompt for step_validity is short (15 lines) and lives at eval/judges/step_validity_prompt.md so the merge reviewer can read it in one screen.

3

Wire the rubric into the same merge gate that ran in week 2

No new CI. The pilot-gate.yml workflow that ran the week 2 rubric reads the new axes automatically because rubric.yaml is the input. Floors block merge. The first PR after the change will fail; that is the rubric finding what the prototype slice hid.

4

Set a one-week ratchet, not a permanent floor

On day 21 of the engagement, lift step_validity floor from 0.80 to 0.84. The ratchet schedule lives in rubric.yaml so the gate enforces it without a Slack reminder. Week 6 production bar is the third ratchet; the engagement does not extend if the third ratchet does not hold for five consecutive PRs.

Horizon-blind rubric vs horizon-aware rubric

Same eval harness, same cases, same merge gate. Different rubric file.

FeatureHorizon-blind (week 2) rubricHorizon-aware rubric
What is scoredFinal answer string onlyWhole trajectory: each transition, the budget burn, the partial credit, plus the final state
When error accumulation surfacesNever. A trace with five legal-looking turns and four illegal-looking turns averages out the sameStep 4 of a 23-turn rollout via step_validity dropping below 0.80
What a half-finished task scoresZero, because the final summary is missing; or 1.0, because the partial output looks fine in isolationProportional to subtasks completed, capped at 0.85 so it cannot disguise itself as a full pass
How step exhaustion is treatedUntracked. A 32-turn trace and a 4-turn trace score identically if the final string matchesbudget_burn: 1.0 if used under 50% of cap, 0 if exhausted, linear in between
What week 2 to week 3 looks likeApparent regression from 0.86 to 0.45, no signal on which axis broke, prompt-tuning theater for two weeksRe-graded baseline drops from 0.86 to 0.62; floors are set against 0.62 + room to ratchet
Where the judge prompt livesInline in run.py or absenteval/judges/step_validity_prompt.md, in the repo, edited by PR
What a senior engineer reads first when the merge failsAggregate score and a list of cases sorted by dropPer-axis floor breach + first failing transition + the judge's reason field

Why prompt tuning is the trap, not the fix

On the first Monday of week 3, helpfulness fell from 0.88 to 0.42. The natural reaction is to rewrite the system prompt and add a few-shot example. The score nudges back to 0.51. The team feels unstuck. A week later the same trap finds you on a different case class, and the rubric still cannot tell you which transition is going wrong, so you write more prompt. By week 5 the engagement is two weeks behind and the rubric still grades final strings.

The reason prompt tuning feels like progress is that helpfulness measured on the final answer string is genuinely sensitive to prompt edits. A few-shot example will move it 5 to 10 points. The reason prompt tuning is not the fix is that the underlying failure (the agent making illegal transitions or burning the step budget) does not move with the prompt. The score moves; the agent does not. The second time the gate flips, the prompt edit you ship next will move the score even less.

A horizon-aware rubric makes the underlying failure visible, which makes the prompt edit a one-bullet PR instead of a seven-paragraph essay. The shape of the fix changes from "what should I write next" to "step_validity is 0.62 on the research case class because the planner is retrying 4xx without modifying the request, and that is two lines in the agent loop, not the prompt."

Stuck on week 3? We have run this unstall on five engagements.

A 30-minute scoping call gets you a written one-pager naming the three axes your current rubric is missing and what re-grading the week 2 baseline as full trajectories would tell you about the agent.

Frequently asked

Frequently asked questions

Why does week 3 of a long horizon agent stall?

The rubric you wrote in week 2 was horizon-blind. It scored final outputs on short, single-decision traces, so error accumulation across turns, illegal transitions, and step-budget burn never penalized the score. Week 3 is when the same rubric is run against full multi-turn tasks and flatlines: the same 12 cases that scored 0.86 on 4-turn rollouts come back at 0.45 on 23-turn rollouts. The fix is not prompt tuning. The fix is adding step_validity, budget_burn, and partial_credit_floor as graded axes and re-grading the week 2 baseline as full trajectories before setting new floors.

Is this a regression, or did the agent never work?

Almost always neither. The agent is doing about what it did in week 2; the prototype slice was just rewarding it for not having to make many decisions. A 4-turn trace can have one bad transition averaged into seven good final tokens and still pass. A 23-turn trace cannot. Re-grade the week 2 cases as full trajectories before you decide whether the team broke something. In our engagements, that re-grade lands the prototype baseline somewhere between 0.58 and 0.72, not 0.86. Set floors against the re-grade, not the original scorecard.

What are the three axes I am missing on a horizon-blind rubric?

step_validity grades each transition (state, action, next_state) for legality, progress, and groundedness using a small judge prompt at eval/judges/step_validity_prompt.md. budget_burn measures steps used against a per-case-class cap; the formula is max(0, 1 - (steps_used / step_cap)) so an exhausted trace scores zero on this axis even if the final string looks fine. partial_credit_floor lets a half-finished trajectory earn proportional credit (capped at 0.85) so a good 70% completion does not score zero alongside a bad 0% completion. The three together catch the failure shapes a final-output rubric cannot see.

Do I need a separate eval harness for horizon-aware scoring?

No. The same harness that ran the week 2 rubric runs the horizon-aware one. rubric.yaml is the only file you change. The runner reads the new axes by name, calls the judge prompt for step_validity, computes budget_burn from the trajectory length and the per-case-class cap, and folds partial_credit_floor in as a weighted axis. The merge gate (typically pilot-gate.yml in .github/workflows/) does not need an edit because it consumes the rubric output, not the rubric definition. The first PR after the rubric change will fail; that is the gate finding what the prototype slice hid.

How much does the budget_burn axis cost me?

Almost nothing. budget_burn is computed from data the runner already has: steps_used is the length of the trajectory, step_cap is a constant per case class kept in rubric.yaml under budget_burn.floor_per_case_class. No judge call, no extra inference, no API spend. step_validity is the expensive axis because each transition is graded by a judge model, but the slice you grade does not have to be the full case set on every PR; pair it with active recall scheduling and you grade about 25 cases per PR with maybe 20 transitions each, which is roughly 500 judge calls at fractional cents per call.

What if my agent has variable trajectory length and I cannot set a step_cap?

You can, you just have to set it per case class instead of per agent. A research case is 16 to 32 steps; a write-email case is 4 to 8; a procurement case is 32 to 64. Put the caps under budget_burn.floor_per_case_class in rubric.yaml. Cases without a class fall back to a default cap, which we set to the 90th percentile trajectory length over the prototype week's traces, plus 25% headroom. The cap is a PR-reviewable constant, not a learned parameter; an engineer raises it when the case genuinely got harder, not when the agent regressed.

Is this what an FDE engagement actually does in week 3?

Yes. On a forward-deployed engagement (week 0 rubric, week 1 first PR, week 2 prototype, week 6 production handoff), week 3 is where the rubric becomes horizon-aware. The named senior engineer in the rubric ownership block runs the four-step unstall on the same day the prototype slice flips: re-grade week 2 cases as full trajectories, append the three axes to rubric.yaml, set floors against the re-grade, schedule the first ratchet for day 21. The eval harness, the rubric.yaml diff, and the judge prompts are leave-behind IP. Client owns them, no platform license, no vendor-attached runtime. Week 6 ships if the third ratchet holds for five consecutive PRs; the engagement does not extend if it does not.

What if the score still flatlines after I add the three axes?

Then the agent has a real architectural problem the rubric is now correctly surfacing, and you have a name for it. step_validity 0.62 on a research case class with budget_burn 0.20 means the agent is making illegal transitions and exhausting its step budget on those transitions; the planner is the suspect, not the prompt. step_validity 0.85 with helpfulness 0.40 means the transitions are legal but the agent is solving the wrong task; the system prompt or the case definition is the suspect. The horizon-aware rubric does not fix the agent. It tells you which one of three or four buildable things to fix next, which is the part the prototype rubric never gave you.

How did this page land for you?

React to reveal totals

Comments ()

Leave a comment to see what others are saying.

Public and anonymous. No signup.