Guide: the monitoring contract that lives in the client repo
A production agent monitoring system is what the on-call engineer reads at 2am, not what the dashboard shows.
Most guides about this topic pick a vendor, draw a three-pillar diagram, and stop. The thing that decides whether your agent actually recovers from an incident is whoever the alert pages and what they read when they open the page. This guide is the four-file contract we ship into client repos so that whoever gets paged has, in the same repo as the agent, the alert spec, the runbook, the reproduction notebook, and a place to sign the seal.
In one paragraph
The four-file contract: monitoring/alerts.yaml is the alert spec, with every alert id pointing at a runbook file in monitoring/runbooks/ that has five required sections; every fired incident lands a notebook in monitoring/repro/; and every closed alert commits a signed seal to monitoring/seals/ with the regression case id added to the eval set. A pre-merge lint refuses any PR that adds an alert without a runbook, a runbook that is missing a required section, or a seal that has no regression case. All four directories live in the client's repo. Vendor dashboards stay where they are; the source of truth moves into the repo where the agent lives.
The thing the existing playbooks leave out
Open ten guides about monitoring AI agents and you will read the same diagram. Three pillars: traces, metrics, evals. Pick a vendor (LangSmith, Langfuse, Arize, Braintrust, Maxim, or a roll-your-own OTel stack). Wire up token cost, latency, judge scores. Build a dashboard. Set thresholds. The diagrams are fine. The vendors are fine. The diagrams are also exactly what the on-call engineer is not reading at 2am when an alert fires.
The thing the engineer is reading is whatever document tells them what the alert means and what to do about it. In every client codebase we have walked into where monitoring already existed, that document was either missing, three teams of indirection away in a Notion the engineer did not have access to, or stale because the original author left and nobody owned it. The dashboards were beautiful. The response was tribal knowledge. The agent went down for six hours because the person who knew what threshold meant what was on vacation.
So the part of monitoring nobody writes about is the response layer: the alert spec, the runbook, the reproduction artifact, and the closing seal. The fix is to put all four in the repo, paired with each other, and to make the pairing enforceable at merge time. That is the four-file contract. The rest of this guide is what each file looks like and how the lint that holds them together is wired.
The four files, in the order an incident touches them
The monitoring system is four directories under monitoring/ in the client repo. The order below is the order an incident moves through them: the alert fires from a row in alerts.yaml, the runbook tells the engineer what to do, the reproduction notebook captures the investigation, the seal closes the loop with a regression case in the eval set.
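Laid out as a directory tree (file names other than alerts.yaml follow the patterns described in the parts below):

```
monitoring/
├── alerts.yaml               # one entry per alert: id, trigger, threshold, severity, owner, runbook_path
├── runbooks/
│   └── <alert_id>.md         # five required sections, enforced by lint
├── repro/
│   └── <incident_id>.ipynb   # one reproduction notebook per fired incident
└── seals/
    └── <YYYY-MM-DD>.md       # one signed seal per closed alert
```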
Part 1: monitoring/alerts.yaml — the alert spec, in the repo
Every alert that can fire on the agent is one entry in a YAML file inside the client's repo. Each entry carries an alert_id, a one-line trigger description in human language, the threshold expression, the severity, the named owner on the engagement team, and a runbook_path that points at a markdown file under monitoring/runbooks/. The pre-merge lint refuses any change to alerts.yaml that adds an alert_id whose runbook_path does not resolve to an existing file. There is no such thing as an alert without a runbook in the repo.
Part 2: monitoring/runbooks/<alert_id>.md — five required sections
One markdown file per alert. Five required sections, in this order: What this alert means in human terms. What the user is seeing right now. Three commands to run to reproduce the issue. Three diagnostic queries to localize the cause. The patch protocol and the seal commitment. The lint that runs on PRs against the runbooks directory enforces the section headers exactly. A runbook that is missing the patch protocol section cannot be merged.
Part 3: monitoring/repro/<incident_id>.ipynb — the reproduction notebook
When an alert fires, the on-call engineer's first action is to copy a template notebook into monitoring/repro/<incident_id>.ipynb and pull the failing trace IDs into it from the production trace store. The notebook is committed to the repo, even when the incident is mid-flight. The next on-call engineer who hits the same alert reads the previous incident's notebook before opening a new one. The repro folder becomes the institutional memory of what has gone wrong, in the same place as the code.
Part 4: monitoring/seals/<date>.md — the signed seal
Every fired alert closes with a seal. The seal is a short markdown file with the alert_id, the incident_id, the reproduction notebook link, the patch shipped, the regression test added to the eval set, and the signature of the engineer who closed it. No ticket-system handoff replaces this. The seal lives in the repo because the next person to look at this slice of the agent will need it, and the repo is the only place the next person reliably has access to.
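A sketch of one seal file; every id, date, link, and handle below is a placeholder, and the "what would have caught this in eval" field discussed further down is included:

```markdown
<!-- monitoring/seals/2025-06-14.md — all ids, dates, and links are placeholders -->
## Seal: agent_p95_latency_regression / INC-0042

- alert_id: agent_p95_latency_regression
- incident_id: INC-0042
- reproduction notebook: monitoring/repro/INC-0042.ipynb
- patch shipped: PR #381 (tool retry budget lowered from 3 to 1 for the search tool)
- regression case added to eval set: EVAL-CASE-0197
- what would have caught this in eval: the failing intent was not represented in the eval set; case added
- closed by: @oncall-engineer, 2025-06-14
```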
The alert spec: alerts.yaml
One file. Each alert is a row with an id, a human-language trigger, a severity, a named owner, a runbook_path, and the threshold expression. The shape we ship into client repos looks like this. The alert ids below are the ones we have seen earn their keep on real engagements.
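A minimal sketch of that shape. Thresholds, owner handles, and the latency alert's exact id are illustrative; silent_model_snapshot_rotation and judge_kappa_drift are the ids discussed later in this guide:

```yaml
# monitoring/alerts.yaml — thresholds, owners, and the latency alert id are illustrative
alerts:
  - alert_id: agent_p95_latency_regression
    trigger: "End-to-end agent p95 latency is above the agreed budget for 15 minutes"
    threshold: "p95(agent.request.duration_ms) > 12000 for 15m"
    severity: high
    owner: "@client-platform-oncall"
    runbook_path: monitoring/runbooks/agent_p95_latency_regression.md

  - alert_id: judge_kappa_drift
    trigger: "Agreement between the pinned judge and the human-labeled sample drops"
    threshold: "judge.kappa_vs_human < 0.70 on the weekly calibration sample"
    severity: medium
    owner: "@client-eval-owner"
    runbook_path: monitoring/runbooks/judge_kappa_drift.md

  - alert_id: silent_model_snapshot_rotation
    trigger: "The model snapshot serving production no longer matches the pin in evals/trust_card.md"
    threshold: "production.model_snapshot != trust_card.pinned_snapshot"
    severity: critical
    owner: "@client-agent-owner"
    runbook_path: monitoring/runbooks/silent_model_snapshot_rotation.md
```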
The runbook: five sections, in this order
One markdown file per alert id. The five sections below are the ones we found load-bearing after a year of writing these. The lint enforces the headers exactly. The on-call engineer reads section 1 to understand the alert, section 2 to scope user impact, section 3 to reproduce on their laptop, section 4 to localize, and section 5 to know what they are allowed to ship. A redacted real runbook for the latency alert above looks like this.
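Reconstructed here as a skeleton rather than the original text; the five section headers are the contract, while the commands, script paths, and queries below are placeholders:

```markdown
<!-- monitoring/runbooks/agent_p95_latency_regression.md — commands and queries are placeholders -->
## 1. What this alert means in human terms
Agent responses are taking noticeably longer than the budget the product team agreed to.

## 2. What the user is seeing right now
Requests that normally finish in a few seconds are hanging; some users see timeouts.

## 3. Three commands to run to reproduce the issue
1. `make repro INCIDENT=<incident_id>`  # copies the template notebook into monitoring/repro/
2. `python scripts/fetch_traces.py --alert agent_p95_latency_regression --last 1h`
3. `python scripts/replay_prompt.py --trace <trace_id>`

## 4. Three diagnostic queries to localize the cause
1. Per-step latency breakdown for the slowest 20 traces.
2. Tool call retry counts over the last 24h vs the previous week.
3. Model snapshot id on the slow traces vs the pin in evals/trust_card.md.

## 5. The patch protocol and the seal commitment
Smallest acceptable patch: lower the tool retry budget or roll back the model snapshot.
High severity: the patch ships within 2 hours, and a regression case is added to the eval set before the seal is signed.
```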
The lint: how the four files stay paired
The whole contract collapses without enforcement. A runbook that gets stale, an alert that ships without a runbook, or a seal that closes without a regression case is the failure mode the contract was meant to prevent. So the lint runs on every PR that touches monitoring/, and a PR that tries to add a new alert without writing the runbook first fails at this gate.
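A trimmed sketch of the pairing and section checks behind that gate; file paths and message wording are illustrative, and the full script described later in this guide also covers seals and the snapshot pin:

```python
#!/usr/bin/env python3
"""Trimmed sketch of the monitoring/ lint: alert-runbook pairing plus required runbook sections.

Illustrative only. Assumes monitoring/alerts.yaml with a top-level `alerts:` list and
one runbook per alert at the declared runbook_path.
"""
import sys
from pathlib import Path

import yaml  # PyYAML

REQUIRED_SECTIONS = [
    "What this alert means in human terms",
    "What the user is seeing right now",
    "Three commands to run to reproduce the issue",
    "Three diagnostic queries to localize the cause",
    "The patch protocol and the seal commitment",
]


def main() -> int:
    errors = []
    spec = yaml.safe_load(Path("monitoring/alerts.yaml").read_text())
    for alert in spec.get("alerts", []):
        alert_id = alert["alert_id"]
        runbook = Path(alert["runbook_path"])
        if not runbook.is_file():
            errors.append(f"{alert_id}: runbook_path {runbook} does not exist")
            continue
        text = runbook.read_text()
        # Require the five section headers and check they appear in the declared order.
        positions = [text.find(section) for section in REQUIRED_SECTIONS]
        if -1 in positions:
            missing = REQUIRED_SECTIONS[positions.index(-1)]
            errors.append(f"{alert_id}: runbook is missing required section '{missing}'")
        elif positions != sorted(positions):
            errors.append(f"{alert_id}: runbook sections are out of order")
    for error in errors:
        print(f"LINT FAIL: {error}")
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main())
```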
The CI gate: three checks, one workflow
The lint is wired into a GitHub Actions workflow that runs on every PR that touches monitoring/. Three checks: the alert-runbook pairing, the model snapshot pin against evals/trust_card.md, and the seal completeness check. If any of the three fails, the PR cannot merge.
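A sketch of that workflow, assuming the lint lives at scripts/monitoring_lint.py and exposes the three checks as subcommands; the file path and subcommand names are illustrative:

```yaml
# .github/workflows/monitoring-lint.yml — script path and subcommand names are illustrative
name: monitoring-lint
on:
  pull_request:
    paths:
      - "monitoring/**"
      - "evals/trust_card.md"

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml
      - name: Alert-runbook pairing and runbook sections
        run: python scripts/monitoring_lint.py pairing
      - name: Model snapshot pin against evals/trust_card.md
        run: python scripts/monitoring_lint.py snapshot-pin
      - name: Seal completeness (regression case, notebook, closed alert)
        run: python scripts/monitoring_lint.py seals
```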
Repo-owned response vs vendor-owned response
The comparison below is between the four-file contract and the more common shape where alerts live in a vendor dashboard, runbooks live in a wiki, and the response is tribal knowledge. The vendor stack is fine for traces and metrics; the comparison is about where the alert plus response live, not about which trace store you use.
| Feature | Dashboard-and-wiki monitoring | Four-file contract in the repo |
|---|---|---|
| Where the alert lives | Inside a vendor dashboard, behind a UI that the on-call engineer may not have access to at 2am, configured by whichever engineer set it up six months ago. | monitoring/alerts.yaml in the client's repo, versioned with the agent code, reviewed in the same PR that introduces the metric, paired with a required runbook file at a known path. |
| Where the response steps live | A Notion page, a Confluence article, a stale Google doc, or no document at all. The engineer pages the original author and waits. | monitoring/runbooks/<alert_id>.md, with a five-section template enforced by lint. The on-call engineer reads it directly from the repo, on the same machine where they will ship the patch. |
| What happens when an alert is added | Alert is added through the dashboard, runbook is a TODO that the engineer who added the alert intends to write, and the runbook is never written. | PR fails at lint until a paired runbook file exists at monitoring/runbooks/<alert_id>.md and contains the five required sections. Adding the alert and writing the runbook are the same merge. |
| What happens when the alert author leaves | Alert keeps firing. Nobody knows the threshold rationale. The alert is muted because it 'is noisy' and a real incident slips through six weeks later. | The runbook is in the repo. The owner field in alerts.yaml is reassigned in the next PR. The next on-call has the same access to the runbook the original author had. |
| How a fired alert closes | Slack thread is marked resolved. The eval set is not updated. The same regression fires again three weeks later and the team treats it as a new incident. | monitoring/seals/<YYYY-MM-DD>.md is committed with alert_id, incident_id, the patch PR, and the regression case ID added to the eval set. Lint refuses a seal that lacks a regression case. |
| What the engagement leaves behind | A vendor account, a renewal date, and a dashboard nobody on the client's team configured. When the contract lapses or the vendor pivots, the monitoring goes with them. | alerts.yaml, runbooks/, repro/, seals/, the lint script, and a CI gate that runs the lint on every PR. All four directories continue to be the source of truth after the engagement ends. |
Where this connects to the eval harness
The seal is the seam. Every fired alert closes with a seal file that names the regression case id added to the eval set. Without that field the seal lint flags the alert as unfinished and the alert stays open. Over a year of engagements that loop turns the eval set into a record of every production failure mode the agent has surfaced; the cases come from real incidents, not from synthetic brainstorms. The judge calibration that makes those cases graded consistently lives in the eval set trust card; the silent_model_snapshot_rotation alert in alerts.yaml reads from that trust card at runtime. The trace pipeline that fills monitoring/repro/ notebooks is the same pipeline that grades production for the eval-to-production correlation gate, covered in our trace pipeline guide. The four-file contract is the response layer that sits on top of those two systems.
One small detail that pays for itself: the seal template has a required field for "what would have caught this in eval", even when the answer is honest about failure ("the failing intent was not represented in the eval set; case added"). Saying out loud what the eval set was missing, in writing, in the repo, is the cheapest known mechanism to keep the eval set honest.
What this is not
It is not a replacement for OTel-style traces, dashboards, or vendor alerting. The four files do not store metrics. They do not visualize traces. They do not ingest spans. All of that lives in whatever stack the client already runs (the engagements we have shipped use mixes of OTel collectors, LangSmith, Langfuse, Datadog, vendor-native tracing). The four-file contract sits next to that stack and answers a different question: when the alert fires, what does the engineer read, and where do they write down what they did.
It is also not a substitute for an on-call rotation, a paging tool, or human ownership. The owner field in alerts.yaml is a pointer at a real human or team, not a placeholder. The runbook does not relieve the engineer of judgement; it gives them a place to start. The seal does not write itself; somebody has to sign the markdown file. What the contract does is force every part of the response to be a thing that lives in the repo, so that the next person who touches this can find it.
Where fde10x sits in this
We are a forward deployed ML engineering studio. Senior engineers go inside the client's GitHub, Slack, and standup in week 1, ship a working agent prototype in client staging by end of week 2, and hand off the runbook, the eval harness, the trust card, and the four-file monitoring contract in week 6. The first alerts.yaml is committed in week 3 with the three to five alerts the engineer who built the agent considers high-severity. The runbooks for those alerts are written by the same engineer the same week. The seals for the first incidents that fire during the engagement are authored by the embedded engineer with the on-call engineer as copilot. By week 6 the client has shipped real seals on real incidents and owns the system top to bottom.
The engagement ends. The four directories stay in the repo. The lint stays in CI. The next on-call engineer the client onboards reads the runbooks the same way the embedded engineer wrote them. We are model vendor neutral, so the alert thresholds and snapshot pins work against any mix of Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight. No platform license. No vendor-attached runtime. The team keeps shipping after we leave because the response layer is in the repo, not in our heads.
Want a senior engineer to ship the four-file monitoring contract inside your repo?
60 minute scoping call with the engineer who would own the build. You leave with a draft alerts.yaml for the agent you are running, a runbook template tailored to your on-call rotation, and a fixed weekly rate to ship the lint, the CI gate, and the first signed seals in two to six weeks.
The four-file monitoring contract, answered
Why a YAML file and a directory of markdown instead of a vendor dashboard?
Because the on-call engineer's workflow at 2am is grep, not log in. When an alert fires the engineer has the agent repo open in their editor on their laptop. They have a terminal. They do not necessarily have SSO into the vendor dashboard, especially if they are new to the rotation, on a personal device, or one team away from the original author. Putting the alert spec, the runbook, the reproduction notebook, and the seal in the repo means the workflow is grep, read, run, write, commit. There is no second tool to learn. This also makes the agent's monitoring a thing that survives platform migrations and contract renewals; the repo is the source of truth, not a vendor account.
Are you saying do not use vendor monitoring tools at all?
No. We use OTel-compatible trace stores, dashboards, and judge-grading platforms on every engagement we ship. The distinction is where the source of truth for the alert spec and the response lives. Vendor tools are great for the trace storage and the live dashboard. They are not great for capturing the response runbook in a way the next on-call engineer will reliably find. The four-file contract sits on top of whatever vendor stack the client already runs. The alerts.yaml file references metrics whose names match the vendor's metric names; the runbook tells the engineer which dashboard to open and which query to run. Tools come and go; the runbook is the part that has to outlive the tool.
What is the smallest version I can ship next week without an embed?
Three things, in order. First, create monitoring/alerts.yaml with three to five alerts that the engineer who built the agent considers high-severity. Second, create monitoring/runbooks/<alert_id>.md for each of those alerts with the five required sections; if you cannot write the patch protocol section in fewer than 200 words, the alert is not well-defined yet and that is itself the work. Third, write a 30 line lint script that fails CI when an alert in alerts.yaml does not have a corresponding file under runbooks/. The repro notebook template, the seal directory, and the snapshot pin check can be added in week two.
What goes in the patch protocol section of the runbook?
Three things. The smallest acceptable patch shape (a feature flag flip, a prompt change, a tool retry budget tweak, a model snapshot rollback). The bound on patch latency for this severity tier (critical patches ship within 30 minutes of the alert firing, high within 2 hours, medium within the day). And the rule for when a patch is allowed to ship without a regression test added to the eval set; for high and critical alerts, the rule is never. The patch protocol section is short on purpose. If the runbook author cannot constrain the response in a few hundred words, the alert is doing more than one thing and should be split.
Why is the reproduction step a notebook and not a script?
Because the on-call engineer is going to look at the data, not run a fixed sequence. A notebook lets them pull traces, replay prompts, regrade with a pinned judge, and compare side by side with the last green run, in whatever order the investigation actually goes. A script would force a linear path that is wrong for at least half of the incidents we have worked. The templating is the structure that matters: cells are pre-named for trace fetch, prompt replay, judge regrade, and comparison, so the engineer is not improvising the cell layout in the middle of an incident. Past notebooks in monitoring/repro/ are kept; the next person reads them before starting a new investigation, and that is where most institutional memory comes from.
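A sketch of the template's pre-named cells in jupytext percent format; helper steps are left as comments, and the committed artifact is the .ipynb this renders to:

```python
# %% [markdown]
# # Reproduction: <incident_id> — <alert_id>
# Copied from the template when the alert fired; fill in trace ids before committing.

# %% trace-fetch
# Pull the failing trace ids for this incident from the production trace store.
trace_ids = ["<trace_id_1>", "<trace_id_2>"]  # placeholders

# %% prompt-replay
# Replay the prompts from those traces against the pinned model snapshot.

# %% judge-regrade
# Regrade the replayed outputs with the pinned judge and rubric from the eval harness.

# %% comparison
# Compare side by side with the last green run and note where the behavior diverges.
```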
How does this connect to the eval harness and the trust card?
Tightly, by design. Every closed seal must reference a regression case ID added to the eval set. Without that, the lint flags the seal as unfinished and the alert stays open. The trust card we ship into client repos has a counter that goes up every time a seal commits a new regression case; over time the eval set becomes a record of every production failure mode the agent has ever surfaced. The judge_kappa_drift alert fires off the trust card directly. The silent_model_snapshot_rotation alert is a runtime check against the pin in evals/trust_card.md. Monitoring, eval, and trust are not three teams; they are three views of the same loop.
What is the lint actually checking?
Four things. One, every alert_id in alerts.yaml resolves to a runbook file at the declared runbook_path. Two, every runbook contains the five required section headers, in order, and none of them is empty. Three, every seal under monitoring/seals/ references a closed alert_id, an incident_id with a notebook in monitoring/repro/, and a regression case ID that exists in the eval set. Four, every alert that has fired in the last 30 days according to the trace store either has a closing seal or a still-open incident_id with a notebook checked in. The script is around 200 lines of Python on the engagements we have shipped. The point of keeping it small is so the client can read and own it on day one.
What does this look like for an agent that is multi-step or multi-agent?
The shape stays the same; the alert IDs and runbooks multiply. For a multi-step agent we add per-step latency alerts and per-step error rate alerts, each with its own runbook. For a multi-agent system we add hand-off alerts (one agent never returning to the orchestrator, an agent calling itself recursively past a depth bound) with their own runbooks. The four-file contract scales with the agent's surface area; what does not scale is having one runbook that covers everything, because nobody reads a 30 page runbook at 2am.
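As a sketch, two of the added entries for a multi-agent system might look like this; ids, bounds, and owners are illustrative:

```yaml
# appended under the alerts: key in monitoring/alerts.yaml — ids, thresholds, and owners are illustrative
  - alert_id: retrieval_step_error_rate
    trigger: "The retrieval step fails for more than 5% of runs over 30 minutes"
    threshold: "error_rate(step='retrieval') > 0.05 for 30m"
    severity: high
    owner: "@client-platform-oncall"
    runbook_path: monitoring/runbooks/retrieval_step_error_rate.md

  - alert_id: subagent_recursion_depth_exceeded
    trigger: "A sub-agent calls itself past the depth bound instead of returning to the orchestrator"
    threshold: "max(subagent.call_depth) > 5"
    severity: critical
    owner: "@client-agent-owner"
    runbook_path: monitoring/runbooks/subagent_recursion_depth_exceeded.md
```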
Where does fde10x sit on this?
We are a forward deployed ML engineering studio. Senior engineers go inside the client's GitHub, Slack, and standup in week 1 and ship a working agent prototype in client staging by end of week 2. The four-file monitoring contract is part of the leave-behind IP we ship by week 6: alerts.yaml committed to the repo, the first set of runbooks written by the engineer who built the agent, the lint script in CI, and the first seals authored on the first incidents that fire during the engagement. The client owns the repo, the runbooks, the lint, and the seals. We are model vendor neutral, so the alert thresholds and snapshot pins work against any combination of Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, or open-weight. No platform license, no vendor-attached runtime.
What if our team already has a monitoring stack we like?
Keep it. The four-file contract is not a replacement for traces, dashboards, or alerting infrastructure. It is the response layer that lives in the repo on top of those tools. alerts.yaml can reference your existing metrics by name; the threshold expression is the same condition you would set in your alerting tool. The runbook tells the on-call engineer which dashboard URL to open and which trace query to run. The discipline is that the alert spec and the response are committed to the repo together. Most teams we work with already have a perfectly good vendor for traces and dashboards; what they did not have was a place where alert plus response are versioned together and where new on-call engineers can find both without asking.
The trace pipeline, the trust card, and the regression eval set the four-file contract plugs into
Related guides
Agent eval harness: production traces vs hand-written cases
How the trace pipeline that fills monitoring/repro/ notebooks gets built, and the same-judge-same-rubric grading contract every reproduction depends on.
Agent eval set, model swap, trust: the two-number trust card
The trust card the silent_model_snapshot_rotation alert reads at runtime, and the kappa and Spearman gates that gate every model swap.
Agent regression eval set: the ratchet rule and the deletion ledger
Where every closed seal's regression case lands, and the lint that refuses a PR that drops a case without a recorded rationale.