Guide, topic: AI POC to production, the regression tail

Your POC passed at 0.86. The regression tail is a weekly pull request you have not written yet.

Every honest AI agent in production has a long tail of inputs the POC rubric never saw. The PIAS week-6 leave-behind turns that tail into ordinary code review. A scheduled job reads the last seven days of traces, clusters the spans that would fail your own rubric right now, and opens a PR with the cases already written in YAML. The tail shrinks, or you get paged. No evaluation platform license. No vendor runtime. File names below.

Matthew Diakonov
15 min read

  • 5 named production agents, tail-mining loop live on each
  • agents/tail_miner.py + .github/workflows/tail-mine.yml, cron 0 9 * * 1
  • eval/tail/<ISO-week>.yaml promoted into eval/cases.yaml on merge
  • Same 0.82 rubric and 0.78 ragas thresholds as the week-2 gate

Shipped on Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt. Model-vendor neutral; your trace store, your keys, your CI.

eval/tail/2026-W17.yaml -- 17 cases, 3 regulatory-weighted, PR #631


The quiet month-three problem

Everyone publishes the same advice about moving an AI proof-of-concept into production: write a solid eval, ship it, monitor it. Fine as far as it goes, but every honest production team arrives at the same month-three surprise. The POC passed at 0.86 against 80 hand-written cases. Ten weeks in, users are typing things the rubric author never imagined, combining features in one turn, quietly drifting the input distribution one complaint at a time. The agent is worse than its rubric would say it is, and nobody quite knows by how much.

POC eval passes at 0.86 on 80 hand-written cases. Agent ships. Ten weeks later the on-call engineer gets a user complaint, opens the trace store, scrolls, guesses at a reproducer, writes one test if there is time. Most complaints never become tests. The rubric is frozen at week-2 shape. The regression tail keeps growing.

The mistake is treating the POC eval as a static document. The fix is a weekly job that promotes real failures into the rubric the next model qualification will be evaluated against. The discipline is making it a pull request, not a dashboard view.

Where regressions actually hide

Five clusters we see repeatedly across the five shipped agents. None of these categories are theoretical; each appears in a real eval/tail/ file on main.

Inputs phrased in a way the rubric author did not anticipate

Your eval writer imagined users saying "summarize this contract". Real users say "tldr this, but only the parts about indemnity". The intent is the same; the surface form is outside the rubric's coverage until the tail miner drops it in.

Inputs that combine two supported features in one turn

One ticket asks for a draft AND a citation check AND a tone rewrite. The POC evaluated each of those in isolation. The combined failure mode only shows in production traffic.

Adversarial drift: users learn the agent's shape and probe it

Month three, users start including "ignore previous instructions" for real reasons (they are copying from another tool). The rubric did not include that surface form until a user's frustrated retry got scored low and landed in the tail file.

Upstream data drift: the documents your retriever sees change

Your retriever indexes a new doc class on a Wednesday. By Friday, 2 percent of answers are wrong in a specific, repeatable way. The tail miner catches it on Monday morning, not in a post-mortem three weeks later.

Rare but high-cost cases: the 0.2 percent that are regulatory

On Upstate Remedial, a miscited statute is not "one bad answer", it is a compliance event. The tail miner weights these at 10x when clustering, so rare but expensive cases cannot get crowded out by higher-volume noise.

Anchor fact: the three files that make the tail a weekly PR

The week-6 leave-behind adds three artifacts to main that together make the regression tail a first-class part of the repo. None of them are hosted by us, none of them require a platform license, and all three are named the same way on every shipped agent.

Anchor fact

Three files, one cron, one reviewer.

  • agents/tail_miner.py. Reads the last seven days of spans from your trace store, filters by rubric_score < 0.82 or ragas < 0.78 or explicit negative feedback, clusters by intent embedding, weights regulatory/safety spans at 10x, writes eval/tail/<ISO-week>.yaml.
  • .github/workflows/tail-mine.yml. Scheduled with cron: "0 9 * * 1". Runs tail_miner, runs a dry-run rubric against the mined cases, opens a pull request labeled tail-miner, eval with the engineering lead as reviewer. If two consecutive weeks escalate, pages the engagement owner.
  • eval/tail/. A directory of weekly YAML files, named by ISO year and week, reviewable in PRs, promoted into eval/cases.yaml on merge. Rejected cases stay in eval/tail/ with a rejection_reason, because "we decided this is not a regression" is itself load-bearing documentation for the engineer who inherits the rubric two quarters later.
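For concreteness, here is a minimal sketch of what one week's eval/tail/ file could look like. Only the source, cost_weight, and rejection_reason field names come from the text; the ids, inputs, and remaining fields are illustrative.

```yaml
# eval/tail/<ISO-week>.yaml -- illustrative shape, not the shipped schema
- id: W17-003                  # hypothetical case id
  source: tail_miner
  cluster_size: 9
  cost_weight: 1
  input: "tldr this, but only the parts about indemnity"
  expected: "A summary restricted to the indemnity clauses of the contract"
  status: promoted             # folds into eval/cases.yaml on merge
- id: W17-007
  source: tail_miner
  cluster_size: 2
  cost_weight: 10              # regulatory-labeled cluster, 10x weight
  input: "cite the out-of-state statute that applies here"
  expected: "An answer citing the correct statute, or an explicit refusal"
  status: rejected             # stays in eval/tail/ with its reason
  rejection_reason: "duplicate of an already-promoted case"
```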

agents/tail_miner.py, the exact shape we ship

This is the miner we put on main in the week-6 leave-behind. It reads from an adapter, not a specific trace vendor, so it works with OTEL, Langsmith, Arize, or a plain Postgres table. The thresholds match rubric.yaml exactly. The cost weighting keeps rare high-stakes cases from getting drowned out.

agents/tail_miner.py
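The shipped miner is adapter-driven and client-specific, so what follows is only a minimal sketch of its selection-and-weighting core under the thresholds named above. The span field names (rubric_score, ragas, user_feedback, labels, intent) and the function names are assumptions; the real miner clusters by intent embedding, for which a plain intent key stands in here.

```python
from collections import defaultdict

RUBRIC_MIN = 0.82   # same thresholds as the week-2 gate
RAGAS_MIN = 0.78
COST_WEIGHTS = {"regulatory": 10, "safety": 10}  # rare high-cost spans count 10x

def in_tail(span: dict) -> bool:
    """A span belongs in the tail if any one of the three triggers fires."""
    return (
        span.get("rubric_score", 1.0) < RUBRIC_MIN
        or span.get("ragas", 1.0) < RAGAS_MIN
        or span.get("user_feedback") == "negative"  # always pulled in
    )

def cluster_weight(spans: list[dict]) -> int:
    """Cluster size with regulatory/safety-labeled spans counted at 10x."""
    return sum(
        max((COST_WEIGHTS.get(label, 1) for label in s.get("labels", [])), default=1)
        for s in spans
    )

def mine(spans: list[dict], top_n: int = 20) -> list[tuple[str, list[dict]]]:
    """Group failing spans by intent and return the top-N weighted clusters."""
    clusters: dict[str, list[dict]] = defaultdict(list)
    for span in filter(in_tail, spans):
        clusters[span["intent"]].append(span)
    ranked = sorted(clusters.items(), key=lambda kv: cluster_weight(kv[1]), reverse=True)
    return ranked[:top_n]
```

The 10x weight is what lets a 2-span regulatory cluster outrank a 40-span tone cluster, as in the Upstate Remedial example later in the piece.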

.github/workflows/tail-mine.yml, one Monday at a time

The cron is the discipline. Once this file is on main, Monday 09:00 UTC produces a pull request whether or not anyone remembers it is supposed to. The escalation path at the end makes a growing tail impossible to hide.

.github/workflows/tail-mine.yml
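The full workflow is repo-specific; below is a minimal sketch of its shape. The cron line and the PR labels come from the text, while the script flags and the use of the peter-evans/create-pull-request action are assumptions.

```yaml
# .github/workflows/tail-mine.yml -- sketch, not the shipped file
name: tail-mine
on:
  schedule:
    - cron: "0 9 * * 1"   # Monday 09:00 UTC, whether or not anyone remembers
jobs:
  mine:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # mine the last 7 days of traces into this week's tail file
      - run: python agents/tail_miner.py --window-days 7 --out eval/tail/
      # dry-run the current rubric against the mined cases for the PR body
      - run: python agents/tail_miner.py --dry-run-rubric eval/tail/
      - uses: peter-evans/create-pull-request@v6
        with:
          labels: tail-miner, eval
          title: "tail-miner: weekly mined cases"
```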

A literal Monday morning, traced

Captured from a recent run. No human kicked this off. No dashboard was opened. The engineering lead saw the PR in their morning inbox and started reviewing.

tail-mine.yml -- 2026-04-21 09:00 UTC

Promotion: eval/tail/ folds into eval/cases.yaml

The same file that every model qualification PR is scored against gains the promoted tail rows on merge. Notice the source: tail_miner and cost_weight fields; they are how a reviewer two quarters later can see where a case came from and why it is weighted the way it is.

eval/cases.yaml
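A sketch of how promoted rows could sit alongside the original POC cases; only the source: tail_miner and cost_weight fields are named by the text, everything else is illustrative.

```yaml
# eval/cases.yaml -- illustrative excerpt
- id: poc-012                  # original hand-written POC case
  input: "summarize this contract"
  expected: "A faithful plain-language summary of the contract"
- id: W17-003                  # promoted by the tail miner
  source: tail_miner           # provenance for the reviewer two quarters later
  cost_weight: 1
  input: "tldr this, but only the parts about indemnity"
  expected: "A summary restricted to the indemnity clauses of the contract"
```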

The loop, step by step

The checklist the engineering lead internalizes during the week-6 handoff, so the weekly review runs without PIAS in the room.

1

Monday 09:00 UTC -- the scheduled workflow fires

GitHub Actions runs .github/workflows/tail-mine.yml. No human opens a dashboard. No engineer remembers to kick it off. The cron is the discipline.

2

tail_miner.py reads the last 7 days of production traces

It pulls spans from whatever trace store you already run: OTEL collector, Langsmith, Arize, a managed Postgres table. No dependency on a PIAS-hosted runtime. Your keys, your data, your rate limits.

3

Spans below rubric_min_score 0.82 or ragas 0.78 are selected

The same thresholds that gated the week-2 prototype. If a span would have failed the original rubric, it belongs in the tail. User-negative-feedback spans are always pulled in regardless of score.

4

Failing spans cluster by intent embedding, weighted by cost label

Clusters are the unit of review, not individual spans. A cluster of 9 "tldr but only the indemnity parts" spans becomes one case, not nine. Regulatory and safety-labeled spans get a 10x weight so rare high-cost cases cannot be crowded out by higher-volume noise.

5

eval/tail/<ISO-week>.yaml is written and a PR is opened

The file name is load-bearing: searching the repo for eval/tail/2026-W17.yaml shows you exactly which cases were mined that week. The PR is labeled tail-miner, eval and assigned to the engineering lead. The PR body includes the dry-run scorecard: how many of the mined cases your current model_primary already handles (usually 10-20 percent of them, not zero, not all).
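The <ISO-week> part of the file name falls straight out of the run date's ISO calendar; a sketch, with the function name assumed:

```python
from datetime import date

def tail_file(run_date: date) -> str:
    """eval/tail/<ISO-week>.yaml name for the week containing run_date."""
    year, week, _ = run_date.isocalendar()
    return f"eval/tail/{year}-W{week:02d}.yaml"
```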

6

Reviewer edits cases, then merges into eval/cases.yaml

The reviewer reads the cluster representative, edits the expected-output field if the auto-written one missed a nuance, and marks cases as promoted or rejected. Promoted cases become rows in eval/cases.yaml and are now part of every future regression check, including every new-model qualification PR. Rejected cases are preserved in eval/tail/ with a rejection reason, because "we decided this is not a regression" is itself load-bearing documentation.

7

The tail shrinks, or you get paged

If two consecutive weeks produce fewer than three new clusters, the tail is closing and the agent is stable on its current scope. If the tail keeps growing week over week, agents.tail_miner.escalate fires a GitHub issue assigned to the engagement owner and a 30-minute scoping call is triggered. An unresolved tail is the earliest honest signal that the rubric itself needs rework, not a prompt tweak.
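The two stopping conditions in this step reduce to comparisons over weekly cluster counts; a sketch with assumed function names:

```python
def should_escalate(weekly_cluster_counts: list[int]) -> bool:
    """True when the cluster count grew in each of the last two weeks."""
    if len(weekly_cluster_counts) < 3:
        return False
    a, b, c = weekly_cluster_counts[-3:]
    return b > a and c > b

def tail_is_closing(weekly_cluster_counts: list[int]) -> bool:
    """True when two consecutive weeks produced fewer than three new clusters."""
    return len(weekly_cluster_counts) >= 2 and all(
        n < 3 for n in weekly_cluster_counts[-2:]
    )
```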

The wiring: traces in, tail PR out, nothing vendor-hosted in between

Left: everything the miner reads. Center: the hub, which is the same rubric thresholds the week-2 gate was scored against. Right: what the miner produces. Every arrow is a concrete file or API call in your repo, not a hosted service.

7-day traces -> rubric thresholds -> weekly PR

  • In: OTEL / Langsmith / Arize / Postgres; rubric_score, ragas, user_feedback; regulatory and safety labels
  • Hub: the rubric thresholds, the same 0.82 / 0.78 the week-2 gate was scored against
  • Out: eval/tail/<ISO-week>.yaml; dry-run scorecard on the PR; merge into eval/cases.yaml

What the common playbook gets wrong

Left: the PIAS week-6 leave-behind. Right: the shape most teams end up with when the POC eval is treated as a one-time artifact. Every row is a difference we have watched matter on a real Monday morning.

  • How regressions in rare inputs are discovered
    POC eval, shipped as-is: An on-call engineer notices a complaint, opens the trace store, guesses at a reproducer, writes a test by hand if they have time.
    PIAS leave-behind: A scheduled job samples the last 7 days of traces with rubric_score < 0.82 or ragas < 0.78, clusters them, and opens a PR with the cases already written in YAML.
  • Where the tail lives
    POC eval, shipped as-is: In a vendor observability dashboard, behind SSO, not version-controlled, invisible to the engineer writing the next feature.
    PIAS leave-behind: eval/tail/<ISO-week>.yaml in your repo. One file per week, reviewable in a PR, promoted into eval/cases.yaml on merge.
  • What enforces the discipline
    POC eval, shipped as-is: A recurring calendar event that gets moved twice, then skipped, then nobody remembers the ritual existed.
    PIAS leave-behind: A GitHub Action with cron: "0 9 * * 1" that the engineering lead cannot forget to run. If it fails to open a PR, the workflow fails and paging rules fire.
  • Threshold for pulling a trace into the tail
    POC eval, shipped as-is: Whatever an analyst hand-picks while scrolling a UI, usually the top of the chart, not the long tail.
    PIAS leave-behind: rubric_score < 0.82 OR ragas_faithfulness < 0.78 OR explicit user negative feedback. Same thresholds that gated the week-2 prototype.
  • Relationship to the model qualification PR
    POC eval, shipped as-is: One-off bug reports that disappear from tribal memory when the engineer who found them leaves.
    PIAS leave-behind: Every mined case joins eval/cases.yaml, so the next time model_primary changes, the harness evaluates the new model against the tail too. The tail becomes a permanent asset.
  • What happens when the tail does not shrink over two consecutive weeks
    POC eval, shipped as-is: Nothing. The tail quietly keeps growing until a user-facing incident surfaces one of its cases.
    PIAS leave-behind: The scorecard comment escalates, the engagement owner is paged by GitHub, and a 30-minute scoping call is triggered to rework the rubric or the orchestration.
  • Vendor dependency
    POC eval, shipped as-is: Usually requires a specific eval platform or a specific trace vendor, renewal leverage included.
    PIAS leave-behind: Reads whatever trace store you already have (OTEL, Langsmith, Arize, your own Postgres). No PIAS-hosted runtime, no platform license.

The same loop on five shipped agents

Each card is a named agent with the tail miner running on main. Stacks, trace stores, and cadences differ. The three-file shape does not.

Monetizy.ai -- ~8K outbound emails per day

The tail miner caught a cluster around "retries after a dropped session" that the POC rubric had no row for. Four new cases landed in eval/cases.yaml on the 2026-W08 PR. Per-case regression has held green since.

Upstate Remedial -- 400K+ legal-compliance emails

Regulatory-labeled spans get the 10x weight. On 2026-W03 a single 2-span cluster about out-of-state statute citations outranked a 40-span cluster about tone, and the scoping call was triggered the same afternoon.

OpenLaw -- citation verification on every draft

The tail miner runs twice, once for the drafter and once for the verifier. Cases promoted to the verifier's eval/cases.yaml have blocked two model qualification PRs that otherwise would have cleared the drafter's rubric alone.

PriceFox -- multi-tenant retrieval agent

Because traffic is higher, the tail miner runs on the last 24 hours instead of 7 days, and each tenant has its own eval/tail/<tenant>/<ISO-week>.yaml. Cross-tenant cases land in the shared file. No tenant sees another's traces.

OpenArt -- per-scene DAG, multi-model inference

The tail is mined per node class: prompt-repair, shot-plan, continuity-check. A single low-score scene becomes a case in the node's eval/tail/, so a bad continuity pass cannot silently pollute prompt-repair's rubric.


Seven days of traces, twenty clusters at most, one PR. The regression tail becomes a file you can read, not a meeting you have to schedule. That is the entire difference between an AI agent that decays after launch and one that compounds.

PIAS leave-behind pattern on Monetizy.ai, Upstate Remedial, OpenLaw, PriceFox, OpenArt

Receipts

File counts and cadence facts, not invented benchmarks. Per-client production metrics are on /wins.

3 files a week-6 leave-behind adds for tail mining: tail_miner.py, tail-mine.yml, eval/tail/
7 days of production traces each weekly run samples
0.82 rubric threshold below which a span drops into the tail
0 USD platform license cost for the tail-mining loop

The three-file rule is the same discipline that makes new-model qualification a one-PR job. Everything else is 0 USD in platform license and 0 vendor-hosted runtimes, on purpose.

Want your regression tail to be a Monday PR on your repo, not a quarterly firefight?

60-minute scoping call with the senior engineer who would own the build. You leave with a one-page week-0 memo: the rubric thresholds, the trace adapter, the tail-mining schedule, and the weekly rate.

Book the scoping call

AI POC to production, the regression tail questions

What exactly is the "regression tail" in an AI proof-of-concept taken to production?

It is the set of production inputs that your POC eval did not cover and that your agent now handles worse than its rubric would pass. The POC was evaluated on 80 or so hand-written cases. Production sees thousands of prompts per day with a much wider surface: different phrasing, combined features, adversarial drift, upstream data drift, and rare-but-high-cost regulatory cases. The regression tail is the portion of that traffic where the rubric, if you could run it live, would score below your own threshold. The PIAS leave-behind makes that scoring a weekly PR instead of an annual postmortem.

Why a weekly PR? Why not just evaluate every trace in real time?

Two reasons. First, real-time eval on every trace is expensive and noisy; you end up with an alert-fatigued on-call channel and no signal. Second, the unit that matters for regression testing is the cluster, not the single trace. A single low-score span could be a user typo. Nine similar low-score spans clustered by intent are a real rubric gap. The weekly cadence is how long it takes to accumulate enough spans to cluster meaningfully on most mid-scale production agents (tens of thousands of spans per week). On higher-volume agents like PriceFox we run it daily on the last 24 hours; on smaller agents monthly is fine. The PR shape does not change, only the window.

Where do the mined cases actually live in the repo? Can I see the file names?

Yes, by design. Weekly mined cases land at eval/tail/<ISO-week>.yaml (for example, eval/tail/2026-W17.yaml). The mining script is agents/tail_miner.py. The cron-triggered workflow is .github/workflows/tail-mine.yml with cron: "0 9 * * 1" (Monday 09:00 UTC). Promoted cases fold into eval/cases.yaml alongside the original POC cases. Rejected cases stay in eval/tail/ with a rejection_reason field. All four artifacts are in the repo the engagement hands over on day 42, not in a dashboard we own.

How do you decide a trace belongs in the tail and not in normal traffic?

Three triggers, any of which is sufficient. (1) rubric_score below 0.82 when the trace is scored against the rubric that lives on main. (2) ragas_faithfulness below 0.78 if the agent is retrieval-augmented. (3) explicit negative user feedback (thumbs down, regeneration, user-provided correction) regardless of score. The thresholds match exactly the thresholds that gated the week-2 prototype and that gate every new-model qualification PR. Using different thresholds for the tail and the rubric is the most common way this loop degrades over time; we keep them synchronized.

What stops the tail miner from drowning the team in low-value cases?

Three mechanisms. (1) Clustering by intent embedding collapses similar spans into one case; a cluster of 50 paraphrases of the same gap is one PR row, not 50. (2) Cost-weighting: regulatory- and safety-labeled spans are weighted 10x so rare but high-cost clusters outrank high-volume low-cost noise when the miner picks its top N. (3) A per-run cap of 20 clusters, so even a bad week is a reviewable PR. If you hit the cap two weeks in a row, that is itself a signal that the rubric or orchestration needs rework, and an escalation is triggered automatically.

How does this relate to the model-qualification PR and the week-2 gate?

The tail-mining loop and the qualification loop share the same eval/cases.yaml. The week-2 gate establishes the initial case set and the thresholds. The week-6 leave-behind ships both tail_miner.py and the qualification PR shape. Every case the tail miner promotes becomes a permanent row, so the next time model_primary changes in rubric.yaml (new Anthropic, OpenAI, Bedrock, or Vertex release), the new model is evaluated against the tail too, not only the original 80 POC cases. A model that passes the POC but regresses on the accumulated tail cannot merge. This is how the rubric grows into a moat.

What if we do not have a good trace store yet? Is this loop still usable?

Yes. The tail miner reads from an adapter interface, not a fixed vendor. The adapters we ship on the leave-behind cover OTEL collectors, Langsmith, Arize, and a plain Postgres table with input, output, labels, user_feedback columns. The minimum viable trace store is a Postgres table the agent writes to on every request. If you have nothing today, week 1 of the engagement sets up that table and writes the adapter; the tail miner starts being useful from week 3 onward with modest sample sizes, and hits its stride around week 6 when the cron begins firing on production traffic.

What does a reviewer actually do when the Monday PR opens?

Three things. (1) Read the dry-run scorecard in the PR body: how many of the mined cases the current model_primary already handles. That sets expectations; usually 10 to 20 percent pass already, which is a healthy signal that clustering is finding real gaps and not random variance. (2) Open each cluster, read the representative input and the auto-written expected-output, and edit the expected field if the auto-writer missed a nuance. This takes about a minute per cluster on a mature engagement. (3) Mark cases as promoted (fold into eval/cases.yaml) or rejected (stay in eval/tail/ with a rejection_reason). A full review runs 15 to 25 minutes in a typical week.

What happens if the tail keeps growing week over week and does not shrink?

That is the loudest honest signal an AI agent in production gives you. It means the rubric is missing a whole capability your users are exercising, or the orchestration itself has a systemic gap (wrong retriever, wrong tool, wrong decomposition), or the underlying model is drifting on your workload. The escalation path is automatic: after two consecutive weeks where the cluster count goes up, agents.tail_miner.escalate opens a GitHub issue assigned to the engagement owner and triggers a 30-minute scoping call. At that point we are not editing the tail file; we are rewriting part of the rubric or the agent graph, with a named engineer and a new week-0 style memo.

Is this tied to a specific model vendor, a specific eval framework, or a specific orchestration library?

No to all three. The five shipped agents on the leave-behind use different primaries (Anthropic direct, OpenAI direct, Bedrock, Vertex, multi-provider inference), different orchestration (Pydantic AI, LangGraph, custom DAG), and either ragas or a homegrown rubric scorer. The tail-mining loop is the same file shape on every one. model_primary is a string in rubric.yaml, the trace adapter is a class in agents/tail_miner.py, and the workflow file is the same YAML. Model-vendor neutrality is part of the leave-behind contract.