Guide, topic: production AI agent evals, 2026
Production agent evals are two clocks, not one. The second one is where the regressions get caught.
Most guides on this stop at the offline harness in CI. That harness only knows about the cases the team has already seen, and the regressions that hit production almost always come from inputs the case set never anticipated. The second clock is what catches them: a 60-line live judge that samples 1 percent of traffic, scores each sampled call async against the same rubric.yaml the offline harness uses, raises a 5-point drift alert against a 7-day rolling baseline, and feeds the worst rows back into eval/cases.yaml every Friday. This guide ships both clocks and the loop between them.
Same shape across Pydantic AI, LangGraph + Bedrock, Anthropic with custom orchestration, an automated nightly DAG, and a multi-model image pipeline.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
What every other guide on this gets wrong
Open the dozen pages that show up first when an engineer Googles this topic. Half of them are platform comparison lists (top five, top five again, top five with the order reshuffled). The other half describe online evaluation as a concept: you should sample production traffic, you should use an LLM-as-judge, you should compare a rolling window to a baseline. All of that is correct. None of it is committable. None of it tells you what the sampling percent is, what the drift threshold is, what the cost per call is at your traffic, what the file at scripts/friday_triage.py looks like, and how a sampled production failure becomes a row in eval/cases.yaml the next Friday.
That gap is exactly where shipped agents fail. The offline harness in CI grades the cases the team thought of. The cases the team thought of always lag the inputs real users send. So the harness gates merges, the agent ships, and the regression that hits users at 3am Saturday comes from an input nobody had seen before. The articles that describe Clock 1 alone do not catch that regression. The articles that describe Clock 2 abstractly do not commit anything to your repo to catch it.
This guide is the inverse. Two clocks, named in code. One rubric. The Friday loop between them, with a 30-minute time budget. The same shape we have shipped across five named production agents on five different stacks, model-vendor neutral, no platform license, no vendor-attached runtime.
The two-clock model, in one diagram
Clock 1 is the harness. Clock 2 is the judge. The Friday loop is the line that connects them. Both clocks read the same rubric.yaml, which is the load-bearing point: the model_primary, thresholds, and expected_traits taxonomy do not get to disagree between CI and live traffic.
Clock 1. Pre-merge harness.
Runs in CI on every PR that touches agents/, eval/, or rubric.yaml. Reads eval/cases.yaml (80 rows: 68 from real traffic, 12 adversarial). Scores against rubric.yaml. Posts the scorecard to the PR with gh pr comment. Blocks merge on regression > 3 points. Cadence: per PR, blocking, sub-12-minute. Catches: regressions on cases the team has already seen.
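What that wiring can look like as a file: a minimal sketch, assuming a harness entrypoint at eval/run_harness.py that exits non-zero on a more-than-3-point regression and writes scorecard.md. The entrypoint and its flags are illustrative, not the shipped workflow.

```yaml
# .github/workflows/eval.yml — illustrative sketch. The trigger paths,
# 12-minute budget, and gh pr comment step follow this guide; the
# eval/run_harness.py entrypoint and its flags are assumptions.
name: eval
on:
  pull_request:
    paths: ["agents/**", "eval/**", "rubric.yaml"]
jobs:
  harness:
    runs-on: ubuntu-latest
    timeout-minutes: 12                     # the sub-12-minute budget, enforced
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r eval/requirements.txt
      - run: >
          python eval/run_harness.py --cases eval/cases.yaml
          --rubric rubric.yaml --fail-on-regression 3 --out scorecard.md
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - if: always()                        # post the scorecard even on failure
        run: gh pr comment ${{ github.event.pull_request.number }} --body-file scorecard.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```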
Clock 2. Post-deploy judge.
Runs as a worker against the agent_traces table. Samples 1 percent of live calls with a 50/day floor per agent. Scores each sampled row async with a cheap LLM judge against the same rubric.yaml. Writes judge_scores rows. Compares the 24h rolling score to a 7-day baseline; pages on a 5-point absolute delta. Cadence: continuous, non-blocking, 5 to 30 minute lag. Catches: regressions on inputs the case set never anticipated.
The Friday loop.
Every Friday the engineer pulls the 30 worst-scored sampled rows from judge_scores, triages them in 30 minutes, and lifts the worst 5 to 20 into eval/cases.yaml as new live-YYYY-MM-DD-NNN rows. The next CI run grades against the new case set. The harness, which yesterday only knew about the rows the team had already seen, now knows about the rows production found.
Why two and not three.
A third clock (load tests, red-team simulators) is sometimes useful but not load-bearing. Two clocks is the minimum that closes the loop. Anything less leaks regressions into production; anything more without first closing this loop is overhead.
Same rubric, both clocks.
Both clocks read rubric.yaml. The same model_primary, the same thresholds, the same expected_traits taxonomy. If the offline harness scores a case 0.92 and the online judge scores a near-identical live call 0.61, the rubric did not change; the input distribution did. That gap is the whole signal.
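A minimal sketch of the shared file, under assumptions: the guide commits to model_primary, thresholds, and the expected_traits taxonomy, and every other key below is invented to make the sketch self-contained.

```yaml
# rubric.yaml — sketch of the shared file. model_primary, thresholds, and
# expected_traits are the fields this guide names; the rest is assumed.
model_primary: "<your agent's model id>"
thresholds:
  block_merge_delta: 3.0        # Clock 1: a >3-point regression blocks merge
  triage_floor: 0.7             # Clock 2: rows under this feed Friday triage
expected_traits:                # the taxonomy both clocks score against
  - no_fabricated_values
  - cites_sources_by_section
  - answers_in_scope
ownership:
  engineer: "<named senior engineer>"
```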
Audit asserts both.
The 6-question shelfware audit (covered in the eval-harness guide) only checks Clock 1. A Clock-2 audit adds three commands: judge.yaml present, judge_live.py runs in last 24h, drift alert ever fired. Together they form a 9-question audit.
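Those three checks as shell one-liners, sketched under assumptions: the judge_runs and drift_pages tables are placeholders for whatever your cron log and page log actually are.

```bash
# Q7: judge.yaml present and pointed at the shared rubric
grep -q 'rubric_file: "rubric.yaml"' eval/judge.yaml && echo "Q7 pass"
# Q8: judge_live.py ran in the last 24h (assumes a judge_runs audit table)
psql -tAc "SELECT count(*) FROM judge_runs WHERE started_at > now() - interval '24 hours'" | grep -qv '^0$' && echo "Q8 pass"
# Q9: a drift alert has ever fired (assumes pages land in a drift_pages table)
psql -tAc "SELECT count(*) FROM drift_pages" | grep -qv '^0$' && echo "Q9 pass"
```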
Anchor: the numbers Clock 2 actually ships with
Anchor fact
1% sampled. 50/day floor. 5-point drift. $24/month. 30-minute Friday.
- sampling.rate = 0.01 with floor_per_agent_per_day = 50 and cap_per_agent_per_day = 500. At 8K calls per day this lands ~80 sampled rows; on quiet days the floor guarantees 50.
- always_judge tags (tool_error, max_turns_hit, user_negative_feedback) are sampled at 100 percent no matter the rate. Rare, high-signal events do not get diluted.
- drift.page_delta = 5.0 pages on-call. drift.warn_delta = 3.0 warns silently. baseline_window_days = 7 is the window.
- Judge model is claude-haiku-4-5 at ~$0.01 per call. 80 sampled per day ~ $0.80/day ~ $24/month per agent.
- Friday triage budget: 30 minutes. Output: one PR appending 5 to 20 rows to eval/cases.yaml as live-YYYY-MM-DD-NNN ids.
How a sampled production call becomes a judge_scores row, second by second
One real timeline from the Monetizy outbound agent. The agent reply lands in agent_traces at 14:31:58. The next judge tick (every minute) selects rows the sampler picked, scores them with Haiku against rubric.yaml, writes one judge_scores row, and runs check_drift. In this pass the agent dropped 8 points off baseline, which crossed page_delta and posted to #agent-drift.
agent_traces -> sampler -> judge -> judge_scores -> drift
eval/judge.yaml, the config that wires Clock 2
Lives next to eval/cases.yaml in your repo. Read by judge_live.py on every run. Names the same rubric_file the offline harness reads. The sampling, drift, storage, and always_judge blocks below are the exact shape we ship; tune the rate and the deltas, do not invent the schema.
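A reconstruction of that shape from the numbers this guide ships. The block names and values are the guide's; the exact key spellings in the storage and alerting blocks are assumptions.

```yaml
# eval/judge.yaml — reconstructed sketch. The sampling, always_judge, and
# drift values are the ones this guide ships; exact key spellings in the
# storage and alerting blocks are assumptions, not the shipped file.
rubric_file: "rubric.yaml"          # the same rubric Clock 1 reads
judge_model: claude-haiku-4-5       # ~$0.01 per judge call

sampling:
  rate: 0.01                        # 1 percent of live calls
  floor_per_agent_per_day: 50
  cap_per_agent_per_day: 500

always_judge:                       # judged at 100 percent regardless of rate
  - tool_error
  - max_turns_hit
  - user_negative_feedback

drift:
  page_delta: 5.0                   # absolute points off baseline: page on-call
  warn_delta: 3.0                   # silent warn band
  baseline_window_days: 7

storage:                            # assumed keys; the table names are the guide's
  traces_table: agent_traces
  scores_table: judge_scores

alerting:                           # assumed keys
  slack_webhook_env: SLACK_DRIFT_WEBHOOK
  channel: "#agent-drift"
```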
eval/judge_live.py, the worker that does the scoring
About sixty lines. Runs every minute via cron (or every five via a worker queue). select_rows pulls recent agent_traces that have not been judged yet, applying the sampling rate plus the always_judge tag override. score does one Haiku call per row, returning strict JSON. write_score writes one judge_scores row. check_drift compares the 24h rolling mean against the 7d baseline and pages on a 5-point delta. No platform, no vendor runtime, no SaaS sign-in.
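A hedged sketch of the worker under stated assumptions: psycopg against your Postgres, the anthropic SDK for the judge call, and invented column names (created_at, scored_at, tags) standing in for your schema. The four functions and the flow are the guide's; the details are not the shipped file.

```python
# eval/judge_live.py — hedged sketch, not the shipped file. select_rows,
# score, write_score, and check_drift are the guide's four steps; the SQL
# column names and the judge prompt are assumptions about your schema.
import json
import os
import random

import anthropic
import psycopg
import yaml
from psycopg.types.json import Jsonb

CFG = yaml.safe_load(open("eval/judge.yaml"))
RUBRIC = open(CFG["rubric_file"]).read()
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env


def select_rows(conn):
    """Recent un-judged traces: the sampled 1 percent plus every always_judge tag."""
    rows = conn.execute(
        """SELECT t.id, t.agent, t.input, t.output, t.tags
           FROM agent_traces t
           LEFT JOIN judge_scores s ON s.trace_id = t.id
           WHERE s.trace_id IS NULL
             AND t.created_at > now() - interval '1 hour'"""
    ).fetchall()
    forced = set(CFG["always_judge"])
    return [r for r in rows
            if forced & set(r[4] or []) or random.random() < CFG["sampling"]["rate"]]


def score(row):
    """One cheap judge call per sampled row, strict JSON out."""
    _, _, inp, out, _ = row
    msg = client.messages.create(
        model=CFG["judge_model"],
        max_tokens=300,
        messages=[{"role": "user", "content":
                   f"Rubric:\n{RUBRIC}\n\nInput:\n{inp}\n\nOutput:\n{out}\n\n"
                   'Reply with JSON only: {"score": 0.0-1.0, "violations": [...], "notes": "..."}'}],
    )
    return json.loads(msg.content[0].text)


def write_score(conn, row, verdict):
    conn.execute(
        "INSERT INTO judge_scores (trace_id, agent, judge_score, violations, judge_notes)"
        " VALUES (%s, %s, %s, %s, %s)",
        (row[0], row[1], verdict["score"], Jsonb(verdict["violations"]), verdict["notes"]),
    )


def check_drift(conn, agent):
    """24h rolling mean vs the prior 7-day baseline, in points; page at 5.0."""
    recent, baseline = conn.execute(
        """SELECT avg(judge_score) FILTER (WHERE scored_at > now() - interval '24 hours'),
                  avg(judge_score) FILTER (WHERE scored_at <= now() - interval '24 hours'
                                             AND scored_at > now() - interval '8 days')
           FROM judge_scores WHERE agent = %s""",
        (agent,),
    ).fetchone()
    if recent is None or baseline is None:
        return
    delta = abs(recent - baseline) * 100  # scores are 0-1; drift deltas are in points
    if delta >= CFG["drift"]["page_delta"]:
        # Fire-and-forget Slack page, sketched in its own section further down.
        print(f"PAGE {agent}: {delta:.1f}-point drift off the 7-day baseline")


def main():
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        rows = select_rows(conn)
        for row in rows:
            write_score(conn, row, score(row))
        for agent in {r[1] for r in rows}:
            check_drift(conn, agent)


if __name__ == "__main__":
    main()
```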
Where the live signal flows
Every live agent call lands in agent_traces. The sampler picks 1 percent plus anything tagged always_judge. The judge worker scores against rubric.yaml. The output is one judge_scores row, optionally a Slack page, and a row that is now eligible for the Friday triage queue. Nothing here is vendor-hosted on purpose; the Clock-2 audit fails if any of these surfaces lives behind a third-party login.
agent_traces -> judge -> drift signal -> Friday cases.yaml
What the judge worker prints when drift fires
One real run, edited only for trace ids. Three agents in the same pass. One within band, one with a recent improvement (negative delta), one over the page threshold. The worker exits 0; the page is fire-and-forget through Slack so it cannot back up the judge loop.
What hits Slack when the page fires
One message in #agent-drift. Recent vs baseline, sample sizes, page_delta, on-call rotation, runbook link, the URL to the worst rows in judge_scores, and the recent deploys that could have caused the shift. The on-call engineer needs those lines to know whether to roll back, re-tune the rubric threshold, or wait for more data.
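The page itself is one webhook call. A minimal sketch, assuming a SLACK_DRIFT_WEBHOOK env var; the short timeout and the swallowed exception are the parts that keep the page from backing up the judge loop.

```python
# Fire-and-forget Slack page, a sketch. The webhook env var name and the
# worst-rows URL are assumptions; the load-bearing parts are the 3-second
# timeout and the swallowed exception, so a Slack outage can never back
# up the judge loop.
import os

import requests


def page_slack(agent: str, recent: float, baseline: float, delta: float) -> None:
    text = (
        f":rotating_light: {agent} drifted {delta:.1f} points off baseline\n"
        f"recent 24h: {recent:.2f} | 7d baseline: {baseline:.2f}\n"
        "runbook: ops/drift_runbook.md\n"
        "worst rows: <your judge_scores query URL>"
    )
    try:
        requests.post(os.environ["SLACK_DRIFT_WEBHOOK"], json={"text": text}, timeout=3)
    except requests.RequestException:
        pass  # never let a paging failure block the judge worker
```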
The Friday loop, minute by minute
The whole point of Clock 2 is that the sampled rows that scored low this week become rows the offline harness scores next week. Without the Friday loop, the judge is a dashboard. With it, the judge is a feedback mechanism that turns production failures into committed test cases the team can never accidentally break again.
16:00 Friday. Pull the 30 worst rows.
python scripts/friday_triage.py prints 30 sampled rows from the last 7 days where judge_score < 0.7, ordered ascending. The output is pre-formatted YAML with id, source (production_log:trace_id), input, must_not_include (auto-populated from violations), rubric_weight 1.0, plus the judge_notes for context.
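The printed shape, with illustrative placeholders rather than real trace data:

```yaml
# One printed row — illustrative placeholders, not real trace data.
- id: live-2026-04-26-001
  source: production_log:<trace_id>
  input: "<the sampled user input, verbatim>"
  expected_traits: []               # engineer fills 2 to 4 traits at 16:10
  must_not_include:                 # auto-populated from the judge's violations
    - "<violation text from judge_scores>"
  rubric_weight: 1.0
  judge_notes: "<why the judge scored this low (context; delete before committing)>"
```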
16:05. Skim for false positives.
Roughly 6 of 30 rows are judge errors. Delete them. We tune the judge against a 100-row calibration set every 30 days; the false-positive rate has held at 18 to 22 percent on cheap-model judges. The Friday triage budget assumes you will throw out a fifth of what the judge surfaced.
16:10. Fill expected_traits on the survivors.
For each surviving row worth committing (typically ~14, inside the 5-to-20 band) the engineer writes 2 to 4 expected_traits in plain language: 'no fabricated dollar amounts', 'cites the contract clause by section', 'does not start with a greeting'. This is the slow step: 90 seconds per row, 14 rows, ~21 minutes of focused writing.
16:35. Open one PR.
git checkout -b friday-triage-2026-04-26, append the 14 rows to eval/cases.yaml, commit with a message that names the drifting agent (if any) and links the Slack page. The PR triggers Clock 1, which scores the full case set including the new rows. If the new rows fail, the PR also includes a fix to the agent.
Saturday morning. Verify Clock 1 graded the new rows.
The CI scorecard comments on the PR. The new rows show up on the per-case table with their first scores. If the agent has not been fixed yet, those scores are red and the PR is blocked. The harness now grades against rows that did not exist 24 hours ago. That is what production evals look like, mechanically.
30 days later. Recalibrate the judge.
Once a month the engineer pulls 100 random sampled rows, hand-scores them, and compares to the judge model. If the judge agreement drops below 0.78 we either re-prompt the judge or upgrade to a stronger model. Cost per call goes up; signal goes up faster. This is the only Clock-2 maintenance task that does not happen weekly.
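What counts as agreement is left open by the guide. A minimal sketch, assuming agreement means the judge lands within 0.1 of the hand score; the tolerance is an assumption, the 0.78 floor is the guide's.

```python
# Monthly judge calibration, a sketch. "Agreement" here means the judge
# lands within 0.1 of the human score; that tolerance is an assumption,
# the 0.78 floor is the guide's number.
def judge_agreement(pairs: list[tuple[float, float]], tol: float = 0.1) -> float:
    """pairs: 100 rows of (hand_score, judge_score) pulled from judge_scores."""
    hits = sum(1 for human, judge in pairs if abs(human - judge) <= tol)
    return hits / len(pairs)


# if judge_agreement(pairs) < 0.78: re-prompt the judge or upgrade the model
```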
scripts/friday_triage.py, the 30-minute script
The output is pre-formatted YAML. The engineer reviews, fills expected_traits per row, deletes the false positives, and appends survivors to eval/cases.yaml in one PR. The id format is live-YYYY-MM-DD-NNN, which is also how Q9 of the audit checks that the loop is alive: grep for live-2026-04 in eval/cases.yaml and confirm the most recent matches a Friday in the last 30 days.
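A hedged sketch of the script, reusing the column-name assumptions from the judge_live.py sketch above:

```python
# scripts/friday_triage.py — hedged sketch. Prints the 30 worst sampled rows
# from the last 7 days as appendable cases.yaml YAML. Column names are the
# same assumptions as the judge_live.py sketch, not the shipped schema.
import datetime
import os

import psycopg
import yaml

QUERY = """
SELECT s.trace_id, t.input, s.violations, s.judge_notes
FROM judge_scores s JOIN agent_traces t ON t.id = s.trace_id
WHERE s.scored_at > now() - interval '7 days' AND s.judge_score < 0.7
ORDER BY s.judge_score ASC
LIMIT 30
"""


def main():
    today = datetime.date.today().isoformat()
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        rows = conn.execute(QUERY).fetchall()
    cases = []
    for n, (trace_id, inp, violations, notes) in enumerate(rows, start=1):
        cases.append({
            "id": f"live-{today}-{n:03d}",
            "source": f"production_log:{trace_id}",
            "input": inp,
            "expected_traits": [],          # engineer fills 2 to 4 traits per survivor
            "must_not_include": violations or [],
            "rubric_weight": 1.0,
            "judge_notes": notes,           # context only; delete before committing
        })
    # Pre-formatted YAML: review, drop false positives, append survivors to eval/cases.yaml.
    print(yaml.safe_dump(cases, sort_keys=False, allow_unicode=True))


if __name__ == "__main__":
    main()
```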
Receipts: what this guide commits vs what other pages describe
Left: the two-clock system, file by file. Right: the platform-comparison or abstract-online-evals pattern that dominates existing pages on this topic. Neither column is wrong; the right column is incomplete.
| Feature | Platform comparison + abstract online evals | Two-clock system |
|---|---|---|
| What this guide describes | Top 5 platform comparisons, or abstract advice to 'sample production traffic' | Two clocks: pre-merge harness + post-deploy judge, with the file-by-file feedback loop between them |
| Sampling rate | Mentions sampling as a concept; never names a percent or a floor | 1 percent of traffic with a 50/day floor and a 500/day cap, plus 100 percent on always_judge tags (tool_error, max_turns_hit, user_negative_feedback) |
| Drift signal | Charts and dashboards; no threshold names | 5-point absolute delta from a 7-day rolling baseline pages on-call. 3-point delta warns. Tuned quarterly against incident counts |
| Cost | Vendor pricing pages, no per-call number against a real workload | ~$0.01 per judge call, ~80 sampled per day at 8K traffic, ~$24 per month per agent |
| Feedback into the offline harness | Treats online and offline as separate tools owned by separate teams | Friday triage script prints 30 worst rows; engineer lifts 5 to 20 into eval/cases.yaml as live-YYYY-MM-DD-NNN; CI grades the new rows on the next PR |
| Same rubric, both clocks | Online judge is a SaaS dashboard with its own scoring model, separate from the CI rubric | Both clocks read rubric.yaml. The judge does not have its own scoring code path |
| Vendor lock-in | Platform license, vendor-attached runtime, agent traces stored in vendor DB | 60 lines of Python, 30 lines of YAML, one cron entry, one Slack webhook. Runs on the same Postgres your agent already uses |
| What the audit asks | Defines 'production evals' as the existence of a dashboard | judge.yaml present, judge_live.py ran in last 24h, drift alert has fired in last 90 days, eval/cases.yaml has at least one live-YYYY-MM-DD-NNN row in last 30 days |
The 8-bullet pass-state of a live Clock 2
- eval/judge.yaml exists and references rubric.yaml
- eval/judge_live.py has run in the last 24 hours (cron is wired)
- judge_scores has at least one row per agent in the last 24 hours
- Drift alert has fired in the last 90 days (the alert is real, not theoretical)
- always_judge tags include tool_error, max_turns_hit, user_negative_feedback at minimum
- eval/cases.yaml has at least one live-YYYY-MM-DD-NNN row in the last 30 days
- Friday triage commit is signed by a human, not a bot, and arrives weekly
- Judge calibration was run in the last 30 days against a 100-row hand-scored set
Counts that anchor Clock 2
Engagement-level numbers, the same ones eval/judge.yaml ships with. Per-client production metrics live on /wins.
1% sampled. 50/day floor per agent. 5-point drift threshold. $24 per agent per month. 30-minute Friday triage. The five numbers that distinguish a live second clock from a described one.
“One rubric, two clocks, one Friday loop. eval/judge.yaml + eval/judge_live.py + scripts/friday_triage.py + ops/drift_runbook.md, committed to your repo, running on your Postgres, paging your Slack. Same shape across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom orchestration on OpenLaw, an automated nightly DAG on PriceFox, and a multi-model image pipeline on OpenArt. Model-vendor neutral, no platform license, no vendor-attached runtime.”
PIAS leave-behind across 5 named production agents
Want a senior engineer to wire Clock 2 against your live traffic?
60-minute scoping call with the engineer who would own the build. You leave with a committable eval/judge.yaml tuned to your traffic shape, the drift threshold for your workload, and a fixed weekly rate to land judge_live.py, friday_triage.py, and the runbook in your repo.
The two clocks, the Friday loop, the numbers, answered
What does production AI agent evals actually mean? Is it the same as offline evals?
Production agent evals are two systems running on different clocks. Clock 1 is the offline harness in CI: a fixed case set in eval/cases.yaml, a rubric in rubric.yaml, a workflow in .github/workflows/eval.yml that grades every relevant PR and posts a scorecard comment. Clock 2 is the online judge: a worker that samples a fraction of live traffic, scores each sampled call async with a cheap LLM judge against the same rubric, writes judge_scores rows, and raises a drift alert when the rolling score moves off baseline. Most articles only describe Clock 1, then call it production evals. The regressions that actually hit users in production almost always ship through Clock 1 because the case set was stale; Clock 2 is what catches them and feeds them back.
Why 1 percent sampling? Why not score every call?
Two reasons. First, cost. At 8K calls per day with a Haiku-class judge at roughly $0.01 per call, sampling 1 percent costs about $24 per agent per month. Sampling 100 percent costs about $2,400. The signal-to-cost ratio of full sampling is poor on most agents because rubric scores are stable enough that 80 sampled rows per day, plus 100 percent of always_judge-tagged rows (tool_error, max_turns_hit, user_negative_feedback), already catch a 5-point drift inside 24 hours. Second, latency. The judge is async on a worker, but it shares an LLM provider with the agent. At 100 percent sampling the judge competes for rate limit with the agent itself, which is the worst kind of self-induced incident. 1 percent with a 50/day floor and a 500/day cap is the band we have shipped across five named production agents and never had to retune mid-quarter.
How does the drift signal work? Why 5 points off a 7-day baseline?
Every minute the judge worker computes two averages from judge_scores: the recent 24-hour mean and the prior 7-day mean (excluding the most recent 24 hours). The drift is the absolute point delta between them. Below 3 points is noise, dominated by sample variance. 3 to 5 points warns in #agent-drift-warn but does not page. Above 5 points pages on-call via #agent-drift, links the runbook, and surfaces the recent deploys (agent code, rubric.yaml, model_primary). The 7-day baseline is long enough to average over a single bad batch but short enough to track legitimate distribution shifts. We tune the warn band quarterly against actual incident counts: if a quarter ends with 3 unpaged incidents that the judge saw at 4-point drift, we lower the page threshold to 4 and recheck the false-page rate.
Won't the LLM judge produce false positives? How do you trust it?
It produces false positives at roughly 18 to 22 percent on a cheap-model judge against our rubrics, and we plan for that. Three controls keep the system honest. First, the judge does not block anything. Its only side effect is a row in judge_scores and a Slack page on aggregate drift, never per-call. Second, drift is computed on aggregate, so an 18 percent false-positive rate is washed out by averaging. Third, every 30 days the engineer hand-scores 100 random sampled rows and compares to the judge. If agreement drops below 0.78 we re-prompt the judge or upgrade to a stronger model and accept the higher per-call cost. The judge is calibrated against the rubric, not anointed.
What does the Friday loop actually look like? Walk me through it.
16:00 Friday: scripts/friday_triage.py prints the 30 worst-scored sampled rows from the last 7 days as pre-formatted YAML, ordered ascending by score. 16:05: skim for false positives, delete roughly 6 rows. 16:10: write 2 to 4 expected_traits per surviving row in plain language: 'no fabricated dollar amounts', 'cites the contract clause by section'. This is the slow step, ~90 seconds per row, ~21 minutes for 14 rows. 16:35: open one PR titled friday-triage-YYYY-MM-DD that appends those rows to eval/cases.yaml as live-YYYY-MM-DD-NNN ids. Saturday morning: the offline harness scores the new rows for the first time on the next PR. The case set yesterday did not include those rows; today it does. That is the loop, end to end, in about 30 minutes of human time per week.
How is this different from LangSmith, Arize, Braintrust, Langfuse, Maxim, or DeepEval?
Those are tools the system uses or could use; they are not the system. Langfuse and LangSmith provide an agent_traces equivalent and a UI for sampled scoring; you can swap either of them in as the trace store and a piece of the judge runner. Arize and Braintrust ship LLM-as-judge primitives. DeepEval gives you metrics. None of them ship the rubric.yaml that both clocks share, the eval/cases.yaml schema, the Friday triage script, the drift threshold tuned to your workload, or the runbook that names what on-call does at 3am Saturday. Our stance is: the platform is a runtime detail, the system is what your engineer commits to your repo. Picking a platform does not produce the system; the engineer writing rubric.yaml, judge.yaml, judge_live.py, and friday_triage.py produces the system.
What does an audit of Clock 2 look like? Is it the same as the 6-question shelfware audit?
It is three more questions on top. Q7: does eval/judge.yaml exist and reference rubric.yaml? grep -q 'rubric_file: "rubric.yaml"' eval/judge.yaml. Q8: has eval/judge_live.py run in the last 24 hours? Check the cron log or 'last_run' table. Q9: has a drift alert ever fired? grep the Slack channel history for the page string, or query an audit table. Q9 is the hardest to pass and the most diagnostic: a Clock-2 system that has never raised a single drift signal in 90 days is either running against an agent on rails or, more commonly, a system whose threshold is too loose. We pair Q9 with a pre-prod chaos run: deliberately ship a known-bad model_primary in staging and confirm the judge pages within 60 minutes. If it does not, the threshold or the sampling rate is wrong.
Will this work for non-text agents (vision, audio, code)?
Yes, with a judge swap. The shape of agent_traces, judge_scores, judge.yaml, judge_live.py, the Friday triage script, the drift formula, and the cases.yaml feedback are unchanged. What changes is the judge: a vision-model judge for image outputs, a sandboxed Python judge for code generation that compiles and runs the output against a unit test, an audio-model judge that transcribes and grades. Per-call cost goes up (vision is roughly 5x text Haiku, code is dominated by the sandbox) and you adjust the sampling rate accordingly. We have shipped the shape on a multi-scene image pipeline (cost-tuned to 0.5 percent sampling) and on email personalization (1 percent), with the same Friday loop on both.
What if our agent is high-stakes (legal, medical, financial)? Is sampling enough?
Sampling is enough as a regression detector but not enough as a safety floor. For high-stakes workloads you keep the 1 percent baseline sample for drift and add a 100 percent always_judge filter on the rows the rubric says matter most: every output that includes a numeric value, every output that cites a clause, every output that names a person. The cost goes up but you only judge the high-risk fraction. On a legal-tech engagement we judge 100 percent of citation-bearing outputs and 1 percent of the rest; ~12 percent of all outputs end up sampled, monthly cost lands near $290 per agent. That is the high-stakes shape.
What does this leave behind when the engagement ends? Are we vendor-locked?
Nothing in Clock 2 is vendor-attached. eval/judge.yaml, eval/judge_live.py, scripts/friday_triage.py, and ops/drift_runbook.md live in your repo. The trace store is your existing Postgres (or Snowflake, or Clickhouse) with two new tables: agent_traces and judge_scores. The judge model is named in judge.yaml as a one-line string and can be swapped to any provider in one PR. The Slack channel is yours. There is no SaaS license, no platform sign-in, no agent_traces hosted elsewhere. The engagement leaves a senior engineer named in the rubric.yaml ownership block, the system above, and a 9-question audit you can run on the repo every Friday afternoon. If you fire us tomorrow, your team owns and operates Clock 2 unchanged.
Adjacent guides
More on the two clocks and the leave-behind that defines production-ready
AI Agent Eval Harness: the 6-question shelfware audit and the 5-PR build order
The pre-merge harness this guide assumes. Six shell commands that audit it, five PRs that build it, the same rubric.yaml that the live judge reads.
Evaluating new LLM releases for production agents (April 2026)
When a new model lands, you flip model_primary in rubric.yaml. Both clocks then re-grade against the same rubric: the harness on the next PR, the judge on the next sample.
AI agents in production: the 6-week contract rubric
Week 2 prototype gate, week 6 production rubric, leave-behind. The shape of the engagement that produces both clocks and the loop between them.