Guide, topic: active recall eval harness, 2026

Active recall, but for your eval harness. Cases live in 5 boxes, the slice is what you are least sure about.

Most AI agent eval harnesses run every case on every PR. That buries regressions in cost and noise, and it silently treats “passed once” as “still passing.” We borrow the Leitner-box schedule from active recall (Anki, spaced repetition) and apply it to a per-case schedule file. Cases that pass repeatedly move to longer intervals (1 day, 3, 7, 14, 30) but always come back. Cases that fail reset to box 1 and re-test tomorrow. The slice in any one run is what the harness is least confident about. The files are eval/recall.yaml, scripts/select-recall-batch.py, and .github/workflows/eval-recall.yml.

Matthew Diakonov
13 min read
5-box Leitner schedule, applied across 5 named production agents
Box 5 cases (passed roughly 7 times) still come back every 30 days; nothing graduates
PR cap of 25 cases pulls per-PR API spend from ~$32 to ~$1 on an 800-case set
Same shape on Pydantic AI, LangGraph + Bedrock, and a multi-model DAG

Same recall.yaml shape across Pydantic AI, LangGraph + Bedrock, custom orchestration, an automated nightly DAG, and a multi-model pipeline.

Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt


Direct answer (verified 2026-05-05)

What is an active recall eval harness for AI agents?

An active recall eval harness re-tests AI agent cases on a spaced-repetition schedule rather than running every case on every PR. Cases that pass repeatedly move to longer intervals (1 day, then 3, 7, 14, and 30) but always come back; cases that fail reset to a 1-day interval. The pattern is borrowed from Leitner boxes and Anki, applied to a per-case state file (eval/recall.yaml) read by a small selector (scripts/select-recall-batch.py) inside CI.

It exists to fix two problems with run-everything harnesses: cost (an 800-case set on every PR runs into thousands of dollars a month in API spend) and signal (a scorecard of 800 results buries the few that matter). The slice on any one PR is the union of cases that are actually due, all of box 1, and a small random sample of new traffic. A nightly cron drains the overdue tail.

Algorithm reference: spaced repetition (Wikipedia). Leitner boxes are the simplest concrete instantiation; FSRS and SM-2 are the learned-schedule variants used in Anki.

What every other guide on this misses

Open the dozen guides on AI agent eval harnesses that landed in the last quarter. They describe a layered stack (data validation, unit, integration, E2E, adversarial, CI/CD), a guides-and-sensors split, and a four-to-twelve-week build. None of them mention scheduling. Every example runs every case in the set on every PR. The implicit assumption is that re-testing a case the agent has been right about for two months is free.

It is not free. On a typical eval set of 800 cases at $0.04 per case (a fairly cheap unit; many production agents run higher), that is $32 of API spend per PR. At five PRs a day, that is roughly $4,000 a month, before nightly runs or replays. The team starts running the eval less often, gating it behind a label, or skipping it on small PRs. The harness becomes optional. Optional harnesses do not catch regressions.

The other half of the gap is signal. A scorecard with 800 rows and 12 micro-regressions is a scorecard nobody reads. The reviewer skims the headline, sees the score is within threshold, and merges. The micro-regressions land in production. Active recall caps the per-PR slice at about 25 cases, weighted by how confident the harness is in each case. A 25-row scorecard with 1 regression and a failure_note for context is one a reviewer actually reads.

And the third gap is silent rot. A case that has passed for nine weeks under a run-everything harness still passes invisibly: nothing surfaces it. A case in box 5 comes back every 30 days no matter how many sprints have gone by. The slowest cycle in the schedule is still bounded.

The algorithm in five lines

The point of writing this in five lines is that the team has to be able to read the entire scheduling logic in one breath, in code review. If the scheduler needs three classes and a strategy pattern, nobody on the team can defend it during a postmortem, and the schedule starts being treated as a black box.

select-recall-batch.py, the part that picks the slice
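The original listing is not reproduced here, so what follows is a minimal sketch of the five-line core rather than the production script. It assumes each schedule row is a dict with case_id, box, and next_due (a datetime.date). The function name pick_slice, its parameters, and the sample rows are all hypothetical. It also assumes the 25-case cap applies to due cases outside box 1, since box 1 is described as unconditional.

```python
import random
from datetime import date


def pick_slice(rows, unseen_ids, today, max_due=25, new_sample=5, seed=0):
    """Return case_ids for one PR run (illustrative sketch, not the production script)."""
    box1 = [r["case_id"] for r in rows if r["box"] == 1]  # pulled regardless of next_due
    due = sorted((r for r in rows if r["box"] > 1 and r["next_due"] <= today),
                 key=lambda r: r["next_due"])  # oldest next_due first
    capped = [r["case_id"] for r in due[:max_due]]  # PR cap on due cases
    fresh = random.Random(seed).sample(sorted(unseen_ids), min(new_sample, len(unseen_ids)))
    return box1 + capped + fresh


# Hypothetical rows: one recent failure, one overdue case, one settled case.
rows = [
    {"case_id": "a", "box": 1, "next_due": date(2026, 5, 20)},
    {"case_id": "b", "box": 3, "next_due": date(2026, 5, 1)},
    {"case_id": "c", "box": 5, "next_due": date(2026, 6, 1)},
]
slice_ids = pick_slice(rows, {"new-1"}, today=date(2026, 5, 5))
print(slice_ids)  # ['a', 'b', 'new-1']
```

Case c stays out of the slice because its next_due is still in the future; it comes back once that date passes.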

eval/recall.yaml, the per-case state file

One row per case in eval/cases.yaml. The runner edits this file in place at the end of every CI run and commits the diff in the same PR. case_id is the join key; everything else is schedule state. last_pr and last_model are denormalized on purpose so a triage engineer can answer “when did this case last pass and on which model” without leaving the file.

eval/recall.yaml
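The YAML listing itself is not reproduced here. As a stand-in, this sketch models one row as a Python dataclass, using the field names mentioned across this guide (case_id, box, interval_days, pass_streak, last_tested, next_due, last_pr, last_model, failure_note, failure_pr). The field types, the defaults, the class name RecallRow, and the example values for case_id and last_model are assumptions, not the real file contents.

```python
from dataclasses import dataclass, asdict
from datetime import date, timedelta
from typing import Optional


@dataclass
class RecallRow:
    """One schedule row per case; case_id joins against eval/cases.yaml."""
    case_id: str
    box: int = 1                        # new cases start in box 1
    interval_days: int = 1
    pass_streak: int = 0
    last_tested: Optional[date] = None
    next_due: Optional[date] = None
    last_pr: Optional[int] = None       # denormalized for triage
    last_model: Optional[str] = None    # denormalized for triage
    failure_note: Optional[str] = None
    failure_pr: Optional[int] = None


# A settled box-5 row; the identifiers are placeholders.
row = RecallRow(
    case_id="example-case-001",
    box=5,
    interval_days=30,
    pass_streak=7,
    last_tested=date(2026, 4, 12),
    next_due=date(2026, 5, 12),
    last_model="example-model",
)

# Schedule invariant: next_due is last_tested plus the box interval.
assert row.next_due == row.last_tested + timedelta(days=row.interval_days)
print(asdict(row)["next_due"])  # 2026-05-12
```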

scripts/select-recall-batch.py, the selector

Eighty lines, one dependency (pyyaml), two modes. PR mode caps the slice at 25 due cases (oldest next_due first), pulls all of box 1 unconditionally, and adds 5 random new-traffic cases the schedule has not seen yet. Nightly mode drops the cap and drains every overdue case in the set. The selector writes eval/_artifacts/today.yaml; the runner reads only that file.

scripts/select-recall-batch.py
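The selector listing is not reproduced here. The sketch below illustrates the two modes under stated assumptions: rows are dicts with case_id, box, and next_due; the cap applies to due cases outside box 1; nightly mode returns only overdue rows. The function name select_batch, its signature, and the seed parameter are hypothetical. Reading the YAML files and writing eval/_artifacts/today.yaml are left out so the example runs on the standard library alone.

```python
import random
from datetime import date

MAX_DUE_CASES = 25      # PR cap on due cases (from the text)
NEW_TRAFFIC_SAMPLE = 5  # unseen cases sampled per PR run (from the text)


def select_batch(rows, unseen_ids, today, mode="pr", seed=0):
    """Pick case_ids for one run; mode is "pr" or "nightly" (hypothetical signature)."""
    if mode not in ("pr", "nightly"):
        raise ValueError(f"unknown mode: {mode}")
    overdue = sorted((r for r in rows if r["next_due"] <= today),
                     key=lambda r: r["next_due"])  # oldest next_due first
    if mode == "nightly":
        # Nightly drops the cap and drains every overdue case.
        return [r["case_id"] for r in overdue]

    # PR mode: box 1 is unconditional, then up to MAX_DUE_CASES other due cases.
    selected = [r["case_id"] for r in rows if r["box"] == 1]
    due_rest = [r["case_id"] for r in overdue if r["box"] != 1]
    selected += due_rest[:MAX_DUE_CASES]
    # Seeded sample of unseen cases, so the same date and seed give the same slice.
    rng = random.Random(seed)
    pool = sorted(unseen_ids)
    selected += rng.sample(pool, min(NEW_TRAFFIC_SAMPLE, len(pool)))
    return selected


today = date(2026, 5, 5)
rows = [{"case_id": f"c{i:02d}", "box": 3, "next_due": date(2026, 5, 1)} for i in range(40)]
rows += [{"case_id": f"f{i}", "box": 1, "next_due": date(2026, 5, 6)} for i in range(2)]
unseen = {f"n{i}" for i in range(10)}

pr_slice = select_batch(rows, unseen, today, mode="pr", seed=7)
nightly_slice = select_batch(rows, unseen, today, mode="nightly")
print(len(pr_slice), len(nightly_slice))  # 32 40
```

With 40 overdue cases, PR mode takes the 2 box-1 rows, 25 of the overdue rows, and 5 unseen cases; the 15 overdue rows the cap skipped are what the nightly run drains.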

.github/workflows/eval-recall.yml, the CI gate

Two triggers, two scopes. PR mode runs the bounded slice and posts a scorecard to the PR. Nightly cron drops the cap and drains the tail; it posts to a tracking issue instead of a PR. Both modes update recall.yaml in place and commit the diff with an eval-bot identity. The PR engineer reviews the recall.yaml diff alongside their code change, which is the only way the schedule ever stays accurate.

.github/workflows/eval-recall.yml

How the pieces wire together

Three inputs feed the selector. The selector picks today's slice. The runner scores it and writes back to recall.yaml. Two surfaces consume the scorecard: the PR thread on PR mode, the tracking issue on cron mode. Nothing vendor-hosted is in the diagram on purpose; the schedule, the state, and the gate all live in the repo.

cases.yaml + recall.yaml + rubric.yaml -> selector -> scorecard

eval/cases.yaml
eval/recall.yaml
rubric.yaml
select-recall-batch.py
eval/run.py
PR comment
tracking issue
recall.yaml commit

The 5 boxes, with intervals

Box 1 is a one-day interval, the most attention. Box 5 is 30 days, the least. The schedule is deliberate, not learned: any change is a PR with a why. The cases the harness is least confident about (new arrivals, recent failures, anything in box 1) surface in every PR.

Leitner boxes for an eval set

1

Box 1, interval 1 day

New cases land here. So do cases that just failed. Box 1 is pulled into every PR run unconditionally, regardless of next_due. The harness is least confident about box 1, so it pays the most attention.

2

Box 2, interval 3 days

First promotion. The case passed once after landing or after a failure reset. Still cheap to re-test, still surfaces under most PRs because next_due is short.

3

Box 3, interval 7 days

Two consecutive passes. The case is reasonably stable; weekly re-test catches most prompt-and-model drift. A failure here resets to box 1 and writes failure_note with the PR number.

4

Box 4, interval 14 days

Stable for a sprint. The case still comes back every two weeks, but a typical PR will not include it unless next_due aligns. The nightly drain catches the rest.

5

Box 5, interval 30 days

Settled. A case reaches box 5 on its fourth consecutive pass, and a typical settled row shows a pass_streak of roughly seven. Cases here still come back monthly. They cannot graduate. The whole point of a forgetting-curve schedule is that 'permanently passing' is not a state you can reach.
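To make the box intervals concrete, here is a small simulation of a case that passes every time it comes due. It assumes the case is tested exactly on its due date, that a pass promotes it by one box, and that next_due is the test date plus the interval of the new box. The helper name passing_timeline and the start date are hypothetical.

```python
from datetime import date, timedelta

INTERVAL_DAYS = {1: 1, 2: 3, 3: 7, 4: 14, 5: 30}  # box -> interval, from the text


def passing_timeline(first_run, runs):
    """Return (test_date, box_after_pass) pairs for a case that always passes."""
    box, when = 1, first_run
    timeline = []
    for _ in range(runs):
        box = min(box + 1, 5)  # promote one box per pass; box 5 is the ceiling
        timeline.append((when, box))
        when = when + timedelta(days=INTERVAL_DAYS[box])  # next_due for the new box
    return timeline


first = date(2026, 1, 1)
timeline = passing_timeline(first, runs=7)
offsets = [(tested - first).days for tested, _ in timeline]
boxes = [box for _, box in timeline]
print(offsets)  # [0, 3, 10, 24, 54, 84, 114]
print(boxes)    # [2, 3, 4, 5, 5, 5, 5]
```

Under these assumptions the case enters box 5 on its fourth pass (day 24) and records its seventh pass on day 114, after which it keeps returning every 30 days.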

Four failure modes this catches that a run-everything harness misses

None of these are theoretical. Each one is a pattern that surfaced on at least one engagement before we ported the recall.yaml shape into the leave-behind on every PIAS engagement.

Silent regression in long-tail cases

Run-everything harnesses surface regressions in cases the team is paying attention to. Cases the team last looked at six weeks ago drift down quietly because nothing forces a re-test. Box 5 cases coming back monthly is the explicit fix: a regression that lands on a Tuesday surfaces within 30 days at the latest, and sooner for cases in lower boxes, with the PR that introduced it inside the next_due window.

PR cost ceiling

An eval that runs 800 cases per PR at $0.04 per case is $32 of API spend per PR, or roughly $4,000 a month at 5 PRs a day. Capping the slice at 25 cases per PR pulls that to about $1 per PR. Nightly drain handles the tail at the same cost ceiling, just on cron instead of on a developer waiting on CI.
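The arithmetic behind those figures, as a short sketch. The per-case cost, set size, cap, and PR rate come from the text. The 25 PR days per month is an assumption, chosen because it reproduces the roughly $4,000 figure; 22 working days would give $3,520 and 30 calendar days $4,800. Amounts are kept in integer cents until the final division to avoid floating-point drift.

```python
COST_PER_CASE_CENTS = 4   # $0.04 per case (from the text)
CASES_IN_SET = 800
PR_CAP = 25
PRS_PER_DAY = 5
PR_DAYS_PER_MONTH = 25    # assumption; reproduces the ~$4,000 monthly figure


def per_pr_dollars(cases):
    """API spend for one PR run over `cases` cases."""
    return cases * COST_PER_CASE_CENTS / 100


def monthly_dollars(cases):
    """API spend per month at PRS_PER_DAY runs on PR_DAYS_PER_MONTH days."""
    return per_pr_dollars(cases) * PRS_PER_DAY * PR_DAYS_PER_MONTH


print(per_pr_dollars(CASES_IN_SET), monthly_dollars(CASES_IN_SET))  # 32.0 4000.0
print(per_pr_dollars(PR_CAP), monthly_dollars(PR_CAP))              # 1.0 125.0
```

The capped figure covers only the 25 due cases; box-1 pulls and the new-traffic sample add a few cases on top.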

Alarm fatigue in scorecards

When a scorecard reports 800 cases and 12 micro-regressions, reviewers stop reading it. When it reports 22 cases and 1 regression, with the offending case_id and the failure_note for context, the reviewer reads it. Active recall surfaces less, more pointedly.

Frozen case sets that stop predicting prod

A case set that has not surfaced any of its rows in three months is a benchmark the team has stopped grading on. A schedule that returns even box-5 cases to the top of the slice every 30 days keeps the set load-bearing instead of decorative.

Active recall vs run-everything, side by side

Left: the run-everything baseline most existing playbooks describe. Right: the schedule. Neither column is wrong. The run-everything column is incomplete: it has no notion of per-case confidence, no schedule, and no defense against silent rot on stable cases.

| Feature | Run-everything harness | Active recall harness |
| --- | --- | --- |
| Cases run on a typical PR | Every case in the set, often 200 to 800 rows | 22 to 25 cases: due (oldest first) + all of box 1 + 5 from new traffic |
| What surfaces under any one run | Everything, weighted equally; signal buried in volume | Cases the harness is least confident about, by design |
| Catching silent regressions on stable cases | Stable cases pass invisibly; a regression has to be obvious to surface | Box 5 cases come back every 30 days, no exceptions |
| Per-PR API spend | Linear in the case count; grows every time the team adds a row | Bounded at max_due_cases x per-case cost (about 25 x your unit) |
| Where per-case schedule state lives | Nowhere; the harness has no notion of forgetting curve or per-case confidence | eval/recall.yaml, in the repo, edited by CI in the same PR that ran the slice |
| Nightly cron | Either runs the same job again or does not run at all | Drains overdue tail (cases the PR cap skipped); posts to a tracking issue |
| Adding a new case | Append to eval/cases.yaml; the row joins the next 800-case run with no special status | Append to eval/cases.yaml; box defaults to 1; first PR after pulls it in |
| What a failure does | Logs a regression in the scorecard; nothing changes about how often the case is tested | Resets the case to box 1, writes failure_note + failure_pr to recall.yaml |
25 cases / PR

Five-box Leitner schedule. Eighty lines of Python. Two triggers in CI. The cases the harness is least sure about always surface; the cases it has been right about for weeks still come back, just less often. Same recall.yaml shape across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom orchestration on OpenLaw, an automated nightly DAG on PriceFox, and a multi-model pipeline on OpenArt.

PIAS leave-behind across 5 named production agents, model-vendor neutral

Counts that anchor the schedule

Engagement-level facts, not invented benchmarks. Per-client production metrics live on /wins.

5 Leitner boxes; intervals 1d, 3d, 7d, 14d, 30d
25-case PR cap on the due-case slice; nightly cron drains the tail
80 lines in scripts/select-recall-batch.py
1 day from a failure to the next re-test (box 1)

5 boxes. 25 cases per PR. 80 lines of selector. 1 day from a failure to the next re-test. The four numbers that make the schedule reviewable.

When not to use this

A case set under 60 rows. The whole point of the schedule is to ration attention across many cases; with 50 rows the run-everything harness costs cents and surfaces everything in one scorecard. Add the schedule when the set crosses ~150 rows or the per-PR cost crosses a number the engineer would notice on the bill.

Pre-launch agents in their first two weeks. Until the cases.yaml shape and rubric thresholds have stabilized, the schedule is grading a moving target. Run everything, churn the cases, and add the schedule once the set settles. Most engagements introduce recall.yaml around week 4.

Purely deterministic agents (rules-based systems, narrow rule engines, ETL with no LLM in the loop). They do not have a forgetting curve in the relevant sense. The schedule still works; it just buys you less, because regressions in deterministic systems land at code-change time, not on a hidden interval.


Active recall, the schedule, the files, answered

What is an active recall eval harness for AI agents?

An active recall eval harness re-tests AI agent cases on a spaced-repetition schedule rather than running every case on every PR. Cases that pass repeatedly move to longer intervals (1d to 3d to 7d to 14d to 30d) but always come back, so the harness catches silent regressions and cuts CI cost without freezing the case set. The state lives in eval/recall.yaml, the selector in scripts/select-recall-batch.py, and the CI gate in .github/workflows/eval-recall.yml. The pattern is borrowed from Leitner boxes and Anki, applied to per-case eval scheduling.

Why not just run every case on every PR?

Two reasons that show up on every engagement. First, cost: 800 cases at $0.04 per case is $32 a PR, and at 5 PRs a day that is roughly $4,000 a month in API spend that nobody on the team has named. Second, signal: when a scorecard lists 800 results and 12 micro-regressions, reviewers stop reading it. Active recall caps the per-PR slice at about 25 cases, surfaces the cases the harness is least confident about, and pushes the rest to a nightly cron run that posts to a tracking issue. The case set still grows; the cost does not.

How is this different from sampling cases at random per PR?

Random sampling has no memory. A case can be sampled twice in two days, then not for two months. Active recall has memory in eval/recall.yaml: pass_streak, box, last_tested, next_due. A pass promotes the case one box and the next re-test gets pushed out; a fail resets the case to box 1 and the next re-test is tomorrow. The slice in any one run is the union of cases that are actually due plus all of box 1, not a random subsample. Two PRs in a row will not waste API spend re-testing the same case the harness is already confident about.

Why a 5-box Leitner schedule and not FSRS or SM-2?

Three boxes is too coarse: a case that just passed once skips three days of attention. Seven boxes is too fine: the schedule starts encoding judgements the team has not made. Five boxes at 1d, 3d, 7d, 14d, 30d covers the common drift cycles (a same-day prompt edit, a multi-day model upgrade, a sprint-long evaluation drift, a monthly upstream model release) without inviting tuning every week. FSRS and SM-2 are better for human learners with thousands of reviews per card; for an eval set of 80 to 1,200 cases on a 30-day horizon, deliberate boxes you can read off in YAML beat a learned schedule that needs justification.

What does the runner do at the end of a slice?

Three things. It scores each case in the slice. It updates eval/recall.yaml in place: pass_streak ticks up and box promotes by one on a pass, pass_streak resets to zero and box resets to 1 on a fail; next_due is set to today plus interval_days for the new box; failure_note + failure_pr are written on a fail. It commits the diff to recall.yaml in the same PR that ran the slice, so the schedule state is in the same review thread as the change that produced it. The eval-bot identity in .github/workflows/eval-recall.yml does the commit; the engineer never edits recall.yaml by hand.
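The write-back step described above can be sketched as one function. This is an illustration under assumptions, not the runner's actual code. The name update_row and its signature are hypothetical. last_pr and last_model are updated only on a pass, because the text describes them as recording when the case last passed. A pass leaves any earlier failure_note untouched. The PR numbers and model name in the example are placeholders.

```python
from datetime import date, timedelta

INTERVAL_DAYS = {1: 1, 2: 3, 3: 7, 4: 14, 5: 30}  # box -> interval, from the text
TOP_BOX = 5


def update_row(row, passed, today, pr_number, model, note=None):
    """Return a new schedule row after one scored run (illustrative sketch)."""
    new = dict(row)  # leave the input row unmodified
    new["last_tested"] = today
    if passed:
        new["pass_streak"] = row.get("pass_streak", 0) + 1
        new["box"] = min(row.get("box", 1) + 1, TOP_BOX)  # box 5 never graduates
        new["last_pr"] = pr_number
        new["last_model"] = model
    else:
        new["pass_streak"] = 0
        new["box"] = 1  # a failure always resets to box 1
        new["failure_note"] = note
        new["failure_pr"] = pr_number
    new["interval_days"] = INTERVAL_DAYS[new["box"]]
    new["next_due"] = today + timedelta(days=new["interval_days"])
    return new


today = date(2026, 5, 5)
row = {"case_id": "example-case", "box": 2, "pass_streak": 1}

after_pass = update_row(row, True, today, pr_number=101, model="example-model")
after_fail = update_row(after_pass, False, today, pr_number=102,
                        model="example-model", note="placeholder failure note")
print(after_pass["box"], after_pass["next_due"])  # 3 2026-05-12
print(after_fail["box"], after_fail["next_due"])  # 1 2026-05-06
```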

Does this make it harder to debug a CI failure?

It makes a CI failure shorter and more pointed. The slice the runner used is in eval/_artifacts/today.yaml, the scorecard names every failed case, and recall.yaml records pass_streak, last_tested, and last_pr for each. To reproduce, run scripts/select-recall-batch.py --mode pr with the same date and the same seed, and the same slice falls out. The scorecard has fewer rows than a run-everything harness, and each row carries more state, so triage is faster, not slower.

Where do new cases enter the schedule?

Two doors. New cases appended to eval/cases.yaml that have no row yet in recall.yaml are pulled in via the new_traffic_sample knob (5 random cases per PR run by default), and the runner appends them to recall.yaml at box 1 after the first run. Cases reset by a failure also live in box 1, so the box-1 unconditional pull catches both fresh cases and recently broken ones. The case file and the schedule file are independent: deleting a case row in eval/cases.yaml is fine, because the selector skips schedule rows whose case_id no longer exists.
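The join between the two files comes down to a set difference in each direction. A minimal sketch, assuming the case ids and schedule rows are already loaded into memory; the function name reconcile and the sample ids are hypothetical.

```python
def reconcile(case_ids, recall_rows):
    """Split the join on case_id into unseen cases and live schedule rows."""
    known = set(case_ids)
    scheduled = {r["case_id"] for r in recall_rows}
    # In cases.yaml but not yet in recall.yaml: candidates for the new-traffic sample.
    unseen = sorted(known - scheduled)
    # In recall.yaml but deleted from cases.yaml: skipped, not an error.
    live_rows = [r for r in recall_rows if r["case_id"] in known]
    return unseen, live_rows


case_ids = ["a", "b", "d"]  # "c" was deleted from cases.yaml, "d" is new
recall_rows = [
    {"case_id": "a", "box": 2},
    {"case_id": "b", "box": 5},
    {"case_id": "c", "box": 3},
]
unseen, live_rows = reconcile(case_ids, recall_rows)
print(unseen)                             # ['d']
print([r["case_id"] for r in live_rows])  # ['a', 'b']
```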

What if the model_primary changes? Does the schedule reset?

By default, no, but the engineer can reset it. A model swap PR is the right time to either keep the schedule (and trust the harness to surface what regressed under the new model) or reset every case to box 1 with one CLI flag (--reset-on-model-swap). The default is keep, because most model swaps regress on a small fraction of cases, and the schedule lets the harness pinpoint which cases drifted instead of forcing a 30-day cooldown over the entire set. The decision lives in the same PR as the model_primary edit in rubric.yaml; both are reviewed by the named engineer in the rubric ownership block.
