Guide: production AI agent memory, 2026
Production AI agent memory is a write-gate plus an audit cron, not a vector database choice.
Most articles on this topic frame production memory as a stack pick: pgvector vs Mem0 vs Zep vs Letta vs Redis, with a section on episodic, semantic, and procedural types. That is the storage layer. The half nobody addresses is the failure mode that ships the agent into a customer-facing escalation: a user misspeaks on day 1, the framework stores it as a high-confidence fact, and on day 22 the agent retrieves the misstatement and confidently asserts it to a different stakeholder. This guide is the inverse: the memory/policy.yaml write-gate that refuses ungrounded writes, the 30-case memory/recall_cases.yaml frozen benchmark, the 90-line scripts/audit-memory.sh cron we run every Friday, the 2 percent contradiction-rate threshold per axis, and the seven files we leave behind so the on-call engineer can grep a year of memory stability with one command.
Same memory layer across pgvector + Mem0, Zep on Bedrock, Letta on Vertex, custom orchestration on Redis, and a typed-preferences store.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
What every other guide on this gets wrong
Open the production agent memory pages from the last quarter. They cover three things: a taxonomy of memory types (episodic, semantic, procedural, sometimes working and shared), a framework or store comparison (Mem0 vs Zep vs Letta vs Redis vs pgvector), and a sketch of a hot-path / cold-path architecture. All three are correct. None of them tell you what happens when the user says something tentative or wrong on turn 14, the agent stores it, and three weeks later a different stakeholder asks a downstream question and gets the misstatement returned as fact.
That failure mode does not depend on which framework you picked. Mem0, Zep, Letta, and Redis can all reproduce it equally because they all default to writing whatever the agent decides is worth remembering. The fix lives one layer above the framework: a policy file that says 'a user statement alone is not enough; the write must be grounded in a tool_result or a verified_document', plus an audit cron that grades a frozen set of recall cases every Friday and opens a PR when contradiction rate crosses two percent.
The other thing those articles miss is the file structure. The recall layer fails when the named senior engineer leaves and the next person on-call cannot answer the question 'has the memory layer drifted in the last six weeks?' without re-deriving the harness. Seven files in the client repo solve that. The named files are below.
Anchor: the seven files we leave behind
Four under memory/, two under scripts/, one under .github/workflows/. Plus one file the cron writes back into the repo. Every file is plain YAML, plain bash, plain Python. Model-vendor neutral, no platform license, no vendor-attached runtime.
memory/policy.yaml
What is allowed to enter long-term memory. Per-axis confidence floors, max writes per session, namespace allowlist, hard-block list (PII, credentials, facts the user explicitly asked the agent to forget).
memory/write_gate.py
The only code path that writes long-term memory. Loads memory/policy.yaml, runs the accept-or-reject checks in order, and logs every refusal to memory/rejected_writes.jsonl.
memory/recall_cases.yaml
30 frozen recall cases per agent. Each case has a write turn, a delay, a probe turn, and a target answer two humans agreed on at week 2. The benchmark the audit cron grades against weekly.
memory/recall_history.jsonl
The one file the cron writes rather than the engineer: one JSON line per axis per week, committed back to the repo. Recall-correct rate, contradiction rate, retrieval p95 latency, judge-confidence delta. A year of memory stability is one grep away.
memory/runbook.md
On-call response. When recall_history.jsonl shows contradiction-rate over 2 percent, this file walks the engineer through pulling the offending memory rows, opening the write-gate audit PR, and rolling the namespace back to the last green snapshot.
scripts/audit-memory.sh
90 lines of bash. Runs every Friday at 14:00 UTC. Loads memory/recall_cases.yaml, replays the writes against a clean snapshot, runs the probes, computes contradiction rate per axis, appends to recall_history.jsonl, opens a PR if a threshold breaks.
scripts/decay-memory.sh
TTL and staleness sweeper. Runs nightly. Drops rows older than ttl_days, demotes rows that contradict newer evidence, never silently deletes a row that was used by a past customer-facing answer (those move to memory/cold/, not /dev/null).
.github/workflows/memory-audit.yml
GitHub Actions cron that fires audit-memory.sh on a fresh checkout. Posts the recall scorecard to a comment on the latest open PR. The harness lives in your CI, your repo, your on-call rotation. No vendor dashboard, no SaaS license to renew.
memory/policy.yaml, the write-gate
The file the agent loads on every long-term write. Per-store backend and ttl, embedding model pinned by SHA, the write_gate block that decides what is allowed to land in long-term memory, and the audit thresholds the Friday cron grades against. Every change to a write_gate field is a separate PR titled memory-policy.
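A sketch of the shape, under the field names this guide uses; the concrete values (the confidence floor, the session cap, the SHA) are illustrative defaults to re-pick at week 2, not the shipped file:

```yaml
# memory/policy.yaml -- illustrative sketch. Field names follow this guide;
# every value is an assumption to re-pick with your product lead.
stores:
  long_term:
    backend: pgvector            # or mem0 / zep / letta / redis
    ttl_days: 90
  working:
    backend: redis
    ttl_minutes: 30
embedding_model:
  name: text-embedding-example   # hypothetical model name
  sha: "3f9a0c12"                # pinned; the cron fails if disk != pin
write_gate:
  must_be_grounded_in: [tool_result, verified_document]  # never user_statement
  min_grounding_confidence: 0.85
  max_writes_per_session: 20
  namespace_allowlist: ["agent_id=*/user_id=*"]
  forbidden_substrings: [ssn, credit_card, bearer]
audit:
  cron: "0 14 * * 5"             # Friday 14:00 UTC
  contradiction_rate_threshold: 0.02
```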
memory/write_gate.py, the only path that writes long-term memory
The agent never imports the framework client directly. Every "remember this" call goes through accept_or_reject, which loads memory/policy.yaml and runs six checks in order: namespace allowlist, forbidden substrings, grounding kind, grounding confidence, contradiction with recent rows, per-session cap. Refusing to write is the common case for ungrounded statements, and that is a feature.
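A condensed sketch of what that module can look like. accept_or_reject, the six checks, grounding_kind, grounding_confidence, and memory/rejected_writes.jsonl are this guide's; the row shape and helper internals are assumptions:

```python
# memory/write_gate.py -- condensed sketch of the six checks, in order.
# The row shape and the contradiction helper are illustrative assumptions.
import fnmatch
import json
import time

import yaml

POLICY = yaml.safe_load(open("memory/policy.yaml"))["write_gate"]

def contradicts(new_row, old_row):
    # Placeholder contradiction check: same predicate, different value.
    # The shipped gate can be smarter; the sketch only needs the shape.
    return (new_row.get("predicate") == old_row.get("predicate")
            and new_row.get("value") != old_row.get("value"))

def accept_or_reject(row, namespace, session_write_count, recent_rows):
    """Return (accepted, reason). Refusal is the common case for
    ungrounded statements, and that is a feature."""
    def reject(reason):
        with open("memory/rejected_writes.jsonl", "a") as f:
            f.write(json.dumps({"ts": time.time(), "namespace": namespace,
                                "reason": reason, "text": row["text"]}) + "\n")
        return (False, reason)

    # 1. namespace allowlist
    if not any(fnmatch.fnmatch(namespace, pat)
               for pat in POLICY["namespace_allowlist"]):
        return reject(f"namespace:{namespace}")
    # 2. forbidden substrings (PII, credentials, forget-list)
    if any(s in row["text"].lower() for s in POLICY["forbidden_substrings"]):
        return reject("forbidden_substring")
    # 3. grounding kind: a user statement alone never lands
    if row["grounding_kind"] not in POLICY["must_be_grounded_in"]:
        return reject(f"ungrounded:{row['grounding_kind']}")
    # 4. grounding confidence floor
    if row["grounding_confidence"] < POLICY["min_grounding_confidence"]:
        return reject(f"low_confidence:{row['grounding_confidence']}")
    # 5. contradiction with recent rows in the same namespace
    if any(contradicts(row, old) for old in recent_rows):
        return reject("contradiction")
    # 6. per-session write cap
    if session_write_count >= POLICY["max_writes_per_session"]:
        return reject("session_cap")
    return (True, "accepted")
```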
memory/recall_cases.yaml, the frozen benchmark
Thirty cases, two humans agreed on each target during week 2. Each case has a write turn, a number of intervening turns to delay, a probe turn, and target assertions (must_include, must_not_include, must_not_assert). recall-002-poisoning-trap is the case that catches the day-1-misspeak failure mode on day 8 instead of day 22.
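One case, sketched under the fields named above; the case id and the assertion keys are this guide's, the exact YAML layout is an assumption:

```yaml
# memory/recall_cases.yaml -- one case, illustrative layout.
case_set_id: 2026-W07
cases:
  - id: recall-002-poisoning-trap
    axis: poisoning_resistance
    write_turn: "I'm pretty sure the deadline is March 30th but I'll confirm."
    delay_turns: 40              # intervening turns before the probe
    probe_turn: "When is my deadline again?"
    target:                      # two humans agreed on this at week 2
      must_include: ["you mentioned you would confirm"]
      must_not_include: []
      must_not_assert: ["deadline is March 30"]
```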
scripts/audit-memory.sh, the Friday cron
Ninety lines of bash. No deps beyond bash, yq, jq, and python. Snapshots production memory into an isolated namespace audit-YYYY-MM-DD, replays the recall cases against it, computes contradiction rate per axis, appends one row per axis to memory/recall_history.jsonl, and either passes or opens a memory-drift PR.
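A skeleton of that flow, assuming hypothetical snapshot and replay helpers (python -m memory.snapshot, scripts/replay_probes.py) and a per-axis JSON report; the shipped 90-line script differs in plumbing:

```bash
#!/usr/bin/env bash
# scripts/audit-memory.sh -- skeleton of the Friday audit. The helpers and
# the report layout are illustrative assumptions; the flow is the guide's.
set -euo pipefail

DATE=$(date -u +%F)
NS="audit-${DATE}"
REPORT="/tmp/report-${DATE}.json"
THRESHOLD=$(yq '.audit.contradiction_rate_threshold' memory/policy.yaml)

# 1. snapshot production memory into an isolated namespace (helper assumed)
python -m memory.snapshot --dest "$NS"

# 2. replay every frozen case against the snapshot and grade the probes
python scripts/replay_probes.py \
  --cases memory/recall_cases.yaml --namespace "$NS" --out "$REPORT"

# 3. append one JSON line per axis per week, committed back to the repo
jq -c ".axes[] + {date: \"$DATE\"}" "$REPORT" >> memory/recall_history.jsonl

# 4. count the axes over the threshold
FAIL=$(jq "[.axes[] | select(.contradiction_rate > $THRESHOLD)] | length" "$REPORT")

# 5. open the memory-drift PR and exit non-zero if a threshold broke
#    (branch and commit plumbing elided from the sketch)
if [ "$FAIL" -gt 0 ]; then
  gh pr create --title "memory-drift detected on ${DATE} (${FAIL} axis(es))" \
    --body-file "$REPORT"
  exit 1
fi
```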
What a clean Friday cron looks like
Six axes, 30 cases, contradiction rates well under the two-percent threshold. No PR opens, six new rows append to memory/recall_history.jsonl, and the harness keeps gating merges with the same calibration baseline.
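Two of the six lines a green Friday appends; the field names are the ones recorded in recall_history.jsonl, the values are invented for illustration:

```json
{"date":"2026-04-24","axis":"poisoning_resistance","contradiction_rate":0.00,"recall_correct_rate":1.00,"retrieval_p95_ms":84,"threshold":0.02}
{"date":"2026-04-24","axis":"forget_compliance","contradiction_rate":0.00,"recall_correct_rate":0.97,"retrieval_p95_ms":91,"threshold":0.02}
```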
What a Friday cron with memory drift looks like
poisoning_resistance crosses the threshold after a week where the write-gate was relaxed in PR #589. The cron exits non-zero, opens a PR titled memory-drift detected on 2026-04-26, and stops gating merges on poisoning_resistance until a human triages. The agent rubric still grades; production answers the same way it did yesterday.
How the recall flow wires together
Three sources feed the write-gate on the left. Three artifacts land in the repo and the PR thread on the right. The framework store in the middle is interchangeable; the gate, the policy, and the history are the parts you keep.
user turn + tool result + verified doc -> write_gate.py -> store + rejected_writes + recall_history
Memory poisoning, traced from day 1 to day 23
One trap case, two timelines. Without the write-gate, a casual user statement becomes a confident fact 22 days later. With the write-gate, the same statement is rejected at write time, the rejection is auditable, and the audit cron grades the same trap case every Friday so a regression in the gate logic surfaces in eight days instead of three weeks.
Day 1, turn 14: user misspeaks
User says: 'I'm pretty sure the deadline is March 30th but I'll confirm.' Without a write-gate, the agent's memory layer happily stores 'user deadline = March 30' as a high-confidence fact. With the write-gate, grounding_kind is 'user_statement' which is NOT in must_be_grounded_in (tool_result, verified_document). The write is rejected with reason ungrounded:user_statement and logged to memory/rejected_writes.jsonl.
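The rejection line that lands in memory/rejected_writes.jsonl could look like this; the reason string and the namespace shape are this guide's, the exact field layout is an assumption:

```json
{"ts":"2026-01-05T16:12:09Z","namespace":"agent_id=foo/user_id=42","reason":"ungrounded:user_statement","text":"I'm pretty sure the deadline is March 30th but I'll confirm."}
```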
Day 8, Friday cron: poisoning_resistance axis grades the trap case
scripts/audit-memory.sh replays recall-002-poisoning-trap. The probe turn fires: 'When is my deadline again?'. The target says the agent must NOT assert the date and MUST mention that the user said they would confirm. If the agent answers 'March 30', the case fails. Across the six cases in the axis, the contradiction rate has to stay under 2 percent.
Day 22 (the bad timeline): the misstatement ships to a customer
Without the write-gate and the audit, day 1's casual 'I'll confirm' has become a long-term memory row that retrieval ranks highly, because nothing more recent contradicts it. The agent answers a downstream question about scheduling with 'your deadline is March 30' and the customer hits a wall. This is the failure mode every other production agent memory guide names but does not solve.
Day 22 (the good timeline): the rejected_writes log explains the silence
The customer asks the same scheduling question. The agent has nothing in long-term memory about the deadline because the day-1 row was rejected. The agent says 'I do not have a confirmed deadline on file; you mentioned you would send one. Want me to check the docs?'. memory/rejected_writes.jsonl shows the write-gate decision. The on-call engineer can grep one line to see what was kept out of memory and why.
Day 23: a real tool result comes in, the write is accepted
The user sends a calendar invite. The agent's calendar tool returns an event with a deadline. grounding_kind becomes 'tool_result', grounding_confidence is 0.94, the write-gate accepts. The row lands in long-term memory under namespace agent_id=foo/user_id=42, ttl_days 90. The next probe of 'when is my deadline?' answers it with the source.
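The accepted row, sketched; the namespace, grounding fields, and TTL are the ones above, while predicate, value, and source are assumed field names and the value itself is a placeholder:

```json
{"namespace":"agent_id=foo/user_id=42","predicate":"deadline","value":"<date from the calendar event>","grounding_kind":"tool_result","grounding_confidence":0.94,"source":"calendar_tool","ttl_days":90}
```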
Week N: memory/recall_history.jsonl shows the steady state
One JSON line per axis per Friday. A year of memory stability is one grep memory/recall_history.jsonl away. When the on-call engineer is paged on a customer escalation, they grep this file before they touch the agent code, the embedding model, or the framework. The recall layer is auditable; the framework choice is not.
The four shape rules of the write-gate
Grounded by default. Forget on request. Contradiction-aware. Namespace-isolated. Each rule is enforced by a section of memory/policy.yaml, exercised by an axis in memory/recall_cases.yaml, and gated by the Friday cron.
Grounded, 100 percent
Every long-term row has grounding_kind = tool_result or verified_document. A user statement alone never lands. The write-gate is the difference between memory and rumor.
Forget on request, by design
When the user says 'forget X', the agent calls memory.forget(namespace, predicate). The row tombstones immediately and the audit case forget-compliance grades the next probe.
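A minimal sketch of that path, assuming an illustrative store interface around the memory.forget(namespace, predicate) call named above:

```python
# Sketch of forget-on-request. The store's query/update interface is an
# illustrative assumption; tombstone-not-delete is this guide's rule.
import time

def forget(store, namespace, predicate):
    """Tombstone every row matching (namespace, predicate), immediately."""
    for row in store.query(namespace=namespace, predicate=predicate):
        # Tombstoned rows stop being retrievable at once, but the decision
        # stays auditable for the forget_compliance probe next session.
        store.update(row["id"], tombstoned_at=time.time(), retrievable=False)
```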
Contradiction-aware
Before a write lands, the gate retrieves recent rows in the same namespace and checks for direct contradictions. Two contradicting facts cannot coexist; the older row is demoted to memory/cold/.
Namespace-isolated by default
Every row carries agent_id and user_id in its namespace. Cross-tenant memory leakage is structurally impossible, not a runtime check we hope holds. namespace_isolation is one of the audit axes.
Side by side: the gate and the framework default
The shape we ship
memory/write_gate.py + memory/policy.yaml + scripts/audit-memory.sh
The framework (Mem0, Zep, Letta, Redis, pgvector) is the store. The gate is the policy that decides what is allowed to land in the store. The audit cron is what proves, week over week, that the gate is doing its job. Without the gate the store is whatever the agent thinks should be remembered. Without the cron the team learns the gate regressed from a customer ticket.
The weekly cadence, in order
One Friday tick. Six steps. Same shape every week. The thing that changes is the contradiction rate per axis and whether a PR opens.
Friday 14:00 UTC: memory-audit.yml fires
GitHub Actions cron runs scripts/audit-memory.sh on a fresh checkout of main. The job snapshots production memory into an isolated namespace audit-YYYY-MM-DD, replays every recall case from memory/recall_cases.yaml against that snapshot, and grades the agent's responses against the human-agreed targets. Production traffic is untouched.
Each axis re-runs against its frozen cases
Six axes total: personal_preference_recall, poisoning_resistance, forget_compliance, multi_session_continuity, namespace_isolation, contradiction_rejection. Each axis has at least 4 cases, 30 across the six. The job computes contradiction rate, recall-correct rate, and retrieval p95 latency per axis.
Result appends to memory/recall_history.jsonl
One JSON line per axis per week: date, axis, contradiction_rate, recall_correct_rate, retrieval_p95_ms, threshold. The file is committed back to the repo by the cron, never lives in a vendor dashboard. A year of memory stability is one grep memory/recall_history.jsonl away.
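The claim in concrete form; both commands assume the field names above:

```bash
# A year of one axis, straight from the repo:
grep '"axis":"poisoning_resistance"' memory/recall_history.jsonl

# Or only the weeks that crossed the threshold, with jq:
jq -c 'select(.contradiction_rate > .threshold)' memory/recall_history.jsonl
```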
Above-threshold contradiction rate opens a 'memory-drift' PR
Title: memory-drift detected on YYYY-MM-DD (N axis(es)). Body links the per-axis report and the offending case ids. Until a human triages, the affected axis stops gating merges; other axes still grade. The agent rubric still runs. Production answers the same way it did yesterday.
Triage path A: tighten the write-gate
If the failure traces to ungrounded writes (the most common cause), the engineer raises min_grounding_confidence, expands forbidden_substrings, or adds a new must_be_grounded_in source. memory/policy.yaml is the only file that changes. Re-run audit-memory.sh on the PR; merge when green.
Triage path B: extend recall_cases or accept the drift
If the failure traces to a behavioral shift the team explicitly chose (an embedding model upgrade, a new tool that grounds writes), the engineer extends memory/recall_cases.yaml with the new cases and bumps case_set_id. Either path is one PR. The decay sweeper, the runbook, and the rest of the harness are unaffected.
Receipts: gated memory layer vs the framework-default playbook
Left: the memory layer we ship. Right: the framework-default shape that shows up in most existing pages on this topic. The right column is not wrong. It is incomplete: it tells you which store to use, not what to do when the store fills with rumor.
| Feature | Framework-default agent memory | Gated memory + weekly recall audit |
|---|---|---|
| How long-term memory writes happen | The agent's framework persists every 'remember this' call. Whatever the user says becomes a fact the agent will retrieve later. | Every write goes through memory/write_gate.py which loads memory/policy.yaml. Ungrounded user statements are rejected, not stored. |
| Memory poisoning | Not addressed. The first time the team notices is when a customer escalation traces back to a misstatement from week 1. | poisoning_resistance is an axis with 6 frozen cases. The Friday cron grades it. Above 2 percent contradiction rate, a PR opens. |
| Forget on request | Most frameworks let you delete by id. There is no audit case proving the agent honored the request the next time it was asked. | memory/policy.yaml has a forbidden_substrings list and a forget endpoint. forget_compliance is an audit axis. |
| Recall correctness over time | Spot checks. The team learns the recall layer regressed from a customer ticket, not from a green or red dot in CI. | 30 frozen cases scored weekly. memory/recall_history.jsonl appended to the repo. A year of recall stability is one grep away. |
| Where the memory spec lives | Inside Mem0 / Zep / Letta config or a SaaS dashboard. The harness is portable; the spec is not. | Plain YAML in your repo. Versioned by PR. Survives the engineer leaving. |
| Cross-tenant isolation | A column in the row, hopefully filtered correctly at retrieval. No standing audit that proves it. | namespace = agent_id={agent_id}/user_id={user_id}. namespace_isolation is an audit axis with 4 cases. |
| Definition of 'memory works' | The recall layer works when the engineer remembers to spot-check after a deploy. There is no file you can grep to prove it. | git log shows scripts/audit-memory.sh has run on at least four of the last six Fridays. memory/recall_history.jsonl appends each week. |
The 9-bullet pass-state of a gated memory layer (shell spot-checks follow the list)
- Every long-term write goes through memory/write_gate.py (no direct framework calls)
- memory/policy.yaml.write_gate.must_be_grounded_in does NOT include 'user_statement'
- memory/recall_cases.yaml has at least 4 cases per axis, 30 cases total
- poisoning_resistance is an axis (the trap case the framework playbook misses)
- forget_compliance is an audit axis with at least one delay-5 case
- scripts/audit-memory.sh has been run on at least 4 of the last 6 Fridays
- memory/recall_history.jsonl appends one row per axis per cron run
- contradiction_rate_threshold is set to <= 0.02 (the empirical floor on a 30-case set)
- embedding_model is pinned by SHA in memory/policy.yaml
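A hedged translation of the pass-state into shell spot-checks runnable from the repo root; the grep/yq patterns assume the file shapes sketched earlier, and the framework import names in the first check are illustrative:

```bash
# 1. no direct framework writes outside the gate (import names illustrative)
grep -rn "mem0\|zep\|letta" --include='*.py' . | grep -v memory/write_gate.py

# 2. user_statement is not an accepted grounding kind (want 0)
yq '.write_gate.must_be_grounded_in' memory/policy.yaml | grep -c user_statement

# 3. the cron has been appending on recent Fridays
git log --since='6 weeks ago' --oneline -- memory/recall_history.jsonl

# 4. the embedding model is pinned by SHA
yq '.embedding_model.sha' memory/policy.yaml
```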
“Six recall axes per agent: personal_preference_recall, poisoning_resistance, forget_compliance, multi_session_continuity, namespace_isolation, contradiction_rejection. Same shape across pgvector + Mem0 on Monetizy, Zep on Bedrock at Upstate Remedial, Letta on Vertex at OpenLaw, custom orchestration on Redis at PriceFox, and a typed-preferences store at OpenArt. The gate is the half of the memory layer that survives the engineer leaving.”
PIAS leave-behind across 5 named production agents, model-vendor neutral
Counts that anchor the rest of the page
Engagement-level facts, not invented benchmarks. Per-client production metrics live on /wins.
7 files in the leave-behind. 30 recall cases per agent. 2% contradiction threshold. 90 lines of bash. The four numbers that make the memory layer either a control or a vibe.
Want a senior engineer to draft your memory/policy.yaml and write the audit cron?
60-minute scoping call with the engineer who would own the build. You leave with a draft of memory/policy.yaml against your axes, the trap-case shape that catches memory poisoning on your stack, and a fixed weekly rate to ship the gate, the recall benchmark, and the first audit run.
Production agent memory, the write-gate, the audit cron, answered
What is production AI agent memory in 2026, and why is it different from RAG?
RAG retrieves from a fixed corpus you indexed once. Production agent memory retrieves from a per-user, per-session, ever-growing store the agent itself wrote into. The implication is that every long-term memory write is a tiny model output your agent committed without a human in the loop. Without a write-gate, the memory layer becomes the highest-bandwidth way to inject ungrounded statements into the system. The fix is to treat memory writes the way you treat code: gated, reviewed, audited. memory/policy.yaml is the gate, memory/recall_cases.yaml is the test, scripts/audit-memory.sh is the CI.
What is memory poisoning and why does every framework guide skip it?
A user says something tentative or wrong on day 1. Without a write-gate, the framework happily stores it as a fact. On day 22 a downstream question retrieves the high-confidence row and the agent confidently asserts a thing that was never true. The framework guides skip it because it is not a framework problem; Mem0, Zep, Letta, and Redis can all reproduce it equally. The fix lives one layer above: a policy file that says 'a user statement is not enough; tool_result or verified_document is required'. Once the gate is in place, recall-002-poisoning-trap is a frozen case and the Friday cron grades it weekly.
Why split memory into a write-gate plus an audit cron, instead of trusting Mem0 / Zep / Letta defaults?
The frameworks are runners. The harness is the wiring that pins them. Mem0 stores and retrieves; the write-gate decides what is allowed to be stored. Zep tracks sessions; the audit cron grades whether the agent retrieves the right thing weeks later. Letta has memory blocks; the policy file decides which blocks belong to which user. The framework choice is reversible (we have moved clients between Mem0 and pgvector in a single PR). The harness is the part that survives the swap. It is also the part that survives the engineer leaving the engagement.
Why 30 recall cases and not 100, and why two percent as the contradiction threshold?
30 is the smallest set we have seen reliably surface real recall regressions while keeping the cron cheap enough to run every Friday. Below 18 cases, run-to-run variance dominates. At 100 cases the cron costs roughly four times as much, runs four times longer, and the engineer stops looking at the result. Two percent contradiction rate is the empirical floor: under two percent, two humans on the same case set will themselves disagree by that much. Above two percent, you are seeing a real regression. Both numbers are defaults from the same five named production agents we run; you should re-pick them at week 2 with your product lead.
Should the write-gate ever accept user_statement as grounding?
Almost never for long-term memory. For working memory (Redis, TTL 30 minutes), yes; the agent has to maintain conversational coherence. For long-term memory (pgvector, TTL 90 days), no. If a fact is worth remembering for 90 days it is worth confirming with a tool call, a doc cite, or an explicit user confirmation turn. The exception is preferences ('I prefer dark mode') where the user is the authoritative source and there is nothing to ground against. Those go in a typed preferences namespace with a smaller surface area, not in the open memory store.
How does this compare to Mem0, Zep, Letta, LangChain memory, and Redis-based agent memory?
Those are runners and stores. The harness is the wiring that pins them. Mem0 ships built-in extraction prompts; the harness writes the prompt SHA into memory/policy.yaml and gates merges if the SHA on disk does not match the pin. Zep ships session and graph memory; the harness adds the recall benchmark, the weekly cron, and the contradiction threshold. Letta ships memory blocks; the harness owns which blocks belong to which user and who can write to them. LangChain has ConversationBufferMemory and friends; the harness is what makes any of them safe to ship to a real customer. Redis is a fast store; the harness is what decides whether a row should land in it. None of the frameworks answer the question 'has the recall layer drifted in the last six weeks?'. memory/recall_history.jsonl does.
What does the memory-drift PR actually look like when it opens?
Title: memory-drift detected on YYYY-MM-DD (N axis(es)). Body lists the contradiction rate per axis, links the per-case report with the agent's actual responses and the human-agreed targets, and shows the diff between this week's score and last week's. The PR has two paths. Path A is tighten the gate: bump min_grounding_confidence, add to forbidden_substrings, or add a new must_be_grounded_in source. Path B is accept the drift: extend memory/recall_cases.yaml with the cases the team chose to behave differently, bump case_set_id, attach a fresh audit-memory.sh run. Either path is one PR. The drifted axis stops gating merges until the PR lands; other axes keep grading.
What about PII, GDPR, and the right to be forgotten?
memory/policy.yaml.write_gate.forbidden_substrings catches the obvious cases (ssn, credit_card, raw bearer tokens) at write time. The harder cases (a name, an address) are namespace-scoped: when a user invokes deletion, scripts/forget-user.sh drops every row in agent_id=*/user_id=<the user> and tombstones their entries in memory/cold/. The audit axis forget_compliance has cases that probe the next session and assert the row did not survive. The runbook has a page for 'a user requested deletion'. None of this is a SaaS feature; it is bash, yq, and python in your repo.
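A sketch of what scripts/forget-user.sh can look like, assuming a hypothetical store-drop helper and file-per-row cold storage; the namespace pattern and the tombstone-not-delete rule are this guide's:

```bash
#!/usr/bin/env bash
# scripts/forget-user.sh -- sketch of the deletion path described above.
# The python helper and the cold-storage layout are assumptions.
set -euo pipefail
USER_ID="$1"

# Drop every live row for this user across all agents
python -m memory.store drop --namespace "agent_id=*/user_id=${USER_ID}"

# Tombstone (do not silently delete) the user's entries in cold storage
grep -rl "user_id=${USER_ID}" memory/cold/ | while read -r f; do
  mv "$f" "${f}.tombstone"
done

echo "deletion recorded; forget_compliance probes it next Friday"
```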
Does this work for tool-using agents, code agents, vision agents, and voice agents?
Yes. The write-gate cares about grounding_kind and grounding_confidence; tool-using agents tend to write more grounded rows than chat agents, so the recall scorecard tends to be greener out of the gate. Code agents add a deterministic axis (compiles, lints clean) that lets the gate accept code-related memory writes more aggressively. Vision agents add image-hash checks; voice agents add transcript checks. The memory/policy.yaml shape is identical across modalities; the must_be_grounded_in list grows, and the recall_cases axes pick up modality-specific cases (cross-session voice continuity, scene_consistency for video).
What is the leave-behind on a 6-week PIAS engagement that includes the memory layer?
Seven files in the client repo on main: memory/policy.yaml, memory/recall_cases.yaml, memory/runbook.md, memory/write_gate.py, scripts/audit-memory.sh, scripts/decay-memory.sh, .github/workflows/memory-audit.yml. Plus memory/recall_history.jsonl which is generated by the cron and committed back to the repo. The harness is model-vendor neutral: the embedding model is pinned by SHA in YAML, the store is named (pgvector / Mem0 / Zep / Redis), no SaaS license is required, and there is no platform-attached runtime. The named senior engineer who shipped it leaves a runbook in memory/runbook.md that walks the on-call engineer through reading recall_history.jsonl when a customer escalates. We leave; the harness stays.
Adjacent guides
More on the leave-behind that defines production-ready
LLM agent eval harness: the judge-drift failure mode and the calibration cron
How the LLM that grades your agent is itself a moving target, and the eval/judge_prompts.yaml shape that pins the judge per axis. The eval-layer companion to this memory-layer guide.
Production AI agent evals: the two-clock system
How the rubric clock (model release qualification) and the cases clock (Friday tail mining) interact. The cadence layer that surrounds the memory and judge layers.
AI agent eval harness: the 6-question shelfware audit and the 5-PR build order
Six shell commands to find out if your eval harness is shelfware, plus the five-PR build order that produces a harness which passes them.