Guide, topic: ai agent memory systems, 2026
AI agent memory systems are a per-turn retrieval budget, not a framework choice.
Most articles on this topic line up Mem0, Zep, Letta, Cognee, and pgvector and pick a winner. They cover the storage half of the memory layer. None of them name the file the agent loads on every turn that says how much of the input window each memory tier is allowed to consume, what happens when a tier blows its latency ceiling, and which tier gets dropped first. That file is the part you have to write yourself, and it is the part that survives every framework swap. Below: the memory/retrieval_budget.yaml shape we ship, the four numbers that anchor it, the per turn flow that consumes it, and the Friday cron that keeps it honest.
The piece every existing article on this leaves out
Open the agent memory pages from the last quarter. They cover three things: a taxonomy (episodic, semantic, procedural, sometimes working and shared), a framework comparison (Mem0 vs Zep vs Letta vs Cognee vs Hindsight), and a sketch of a hot path / cold path architecture. All three are correct. None of them tell you what happens on a single turn when short term retrieval finishes in 8ms, working memory in 22ms, long term retrieval blows past its 180ms ceiling because the index just rebalanced, and verified doc retrieval comes back with six grounded passages. Which tier do you wait for? Which do you drop? How much of the input window did each one consume?
Those questions are not answered by the framework. Mem0 returns top K rows. Zep returns sessions and graph hits. Letta returns memory blocks. Each of them is a runner. None of them ship a per turn budget. The agent author is left to count tokens after the fact, hope retrieval finishes, and live with whatever the framework's default timeout does when it does not. That gap is where most production agent memory failures live; they look like hallucinations, but each one is a tier the agent silently answered without.
One file fixes this. It lives in the client repo, not in the framework config. It is read on every turn, before any retrieval call. It treats latency as a first-class constraint of the recall layer, not an afterthought. Below is the shape we have shipped on five named production agents across pgvector, Mem0, Zep, Letta, and a custom Postgres backend. The file is identical across all five; only one field, the per-tier source, flips.
memory/retrieval_budget.yaml, the file no framework ships
Per tier token allotments, per tier k values, per tier similarity floors, per tier latency ceilings. Plus the fallback rule: a tier that blows its ceiling is dropped this turn, not awaited. The agent's input window is governed by this file, not by the framework's defaults.
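A minimal sketch of the shape, using the field names this guide references (tokens, k, similarity_floor, max_latency_ms, source, namespace, reserved_for_output, embedding_model_sha256). The specific allocations, floors, and sources below are illustrative placeholders, not calibrated values; weeks 2 through 4 replace them with your own numbers.

```yaml
# memory/retrieval_budget.yaml -- illustrative sketch, not a drop-in default.
input_cap_tokens: 92000            # asserted by build_prompt() before every model call
reserved_for_output: 28000         # subtracted from the model's input window first
embedding_model_sha256: "<sha256 of the pinned embedding model>"
on_latency_breach: drop_tier       # the fallback rule: drop this turn, never await
drop_log: memory/retrieval_drops.jsonl

tiers:                             # walked in priority order
  short_term:
    source: redis
    tokens: 18000
    k: 12
    similarity_floor: 0.0          # literal recency window, no semantic gate
    max_latency_ms: 12
  working:
    source: redis
    tokens: 12000
    k: 8
    similarity_floor: 0.70
    max_latency_ms: 60
  long_term_retrieved:
    source: pgvector               # the only field that flips on a framework swap
    namespace: per_user
    tokens: 22000
    k: 16
    similarity_floor: 0.78
    max_latency_ms: 180
  rag_verified_docs:
    source: pgvector
    namespace: verified_docs
    tokens: 30000
    k: 10
    similarity_floor: 0.75
    max_latency_ms: 240
```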
memory/build_prompt.py, the only function that assembles a turn
The agent never calls the framework's retrieval method directly. Every prompt assembly goes through build_prompt(), which loads the budget, walks each tier in priority order, gates each tier on its own latency ceiling, summarizes to fit each tier's token allotment, and asserts the assembled prompt is under 92k input tokens before the model call. A blown assertion crashes the turn; truncating silently is not an option.
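A condensed sketch of that loop. retrieve_tier(), summarize_to_fit(), and count_tokens() are hypothetical helpers standing in for your runner and tokenizer; only the control flow (per-tier deadline, drop-and-log, hard assertion on the cap) is the point.

```python
# memory/build_prompt.py -- sketch of the assembly loop described above.
import json
import time
from pathlib import Path

import yaml

BUDGET_PATH = Path("memory/retrieval_budget.yaml")
DROP_LOG = Path("memory/retrieval_drops.jsonl")


def build_prompt(turn_id: str, user_turn: str, system_prompt: str, tools_block: str) -> str:
    budget = yaml.safe_load(BUDGET_PATH.read_text())
    parts = [system_prompt, tools_block]

    for tier_name, tier in budget["tiers"].items():      # priority order = file order
        started = time.monotonic()
        try:
            rows = retrieve_tier(tier_name, tier, query=user_turn)   # your runner call
        except Exception:
            rows = None                                   # errors count as zero_results here
        saw_ms = int((time.monotonic() - started) * 1000)

        if saw_ms > tier["max_latency_ms"] or not rows:
            # Blown ceiling or empty tier: drop it this turn and record why.
            reason = "latency_breach" if saw_ms > tier["max_latency_ms"] else "zero_results"
            with DROP_LOG.open("a") as fh:
                fh.write(json.dumps({"turn_id": turn_id, "tier": tier_name,
                                     "reason": reason, "saw_ms": saw_ms}) + "\n")
            continue

        # Summarize the tier down to its own token allotment before it enters context.
        parts.append(summarize_to_fit(rows, max_tokens=tier["tokens"]))

    parts.append(user_turn)                               # current turn is always the last block
    prompt = "\n\n".join(parts)

    # The cap is asserted, not hoped for: a blown assertion crashes the turn.
    assert count_tokens(prompt) < budget["input_cap_tokens"], "prompt over budget"
    return prompt
```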
One turn, traced through the budget
Every turn involves the same participants: the runtime, build_prompt(), and the four memory tiers. Each tier is gated by its own deadline. The runtime never awaits the slowest tier; it assembles the prompt from whichever tiers came back inside their budget and writes a JSONL line for each tier that did not.
agent.respond -> build_prompt -> per-tier retrieval -> model
What a clean turn through the budget looks like
All four tiers come back inside their latency ceilings. The hot path consumes 75,764 tokens of the 92,000 token budget. Nothing is appended to retrieval_drops.jsonl. The model answers with the full memory stack in context.
What a turn with a dropped tier looks like
long_term_retrieved blows past its 180ms ceiling because pgvector is rebalancing. Instead of stretching the deadline or stalling the user, the tier is dropped, the agent answers from short_term, working, and rag_verified_docs, and the drop appends one JSONL line. The Friday cron grades the cumulative drop rate; if it crosses 1.5 percent for the week, a budget-drift PR opens.
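The appended record carries the tier, the turn_id, the reason, and the saw_ms; the values below are illustrative.

```json
{"turn_id": "turn_000184", "tier": "long_term_retrieved", "reason": "latency_breach", "saw_ms": 412}
```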
Side by side: budgeted retrieval vs framework default
Left column is the shape we ship. Right column is what the framework gives you out of the box, regardless of whether the framework is Mem0, Zep, Letta, Cognee, pgvector, or Redis. The right column is not wrong. It is incomplete; it tells you where rows live, not how much of the input window each row gets to occupy on a single turn.
| Feature | Framework default retrieval | Budgeted memory retrieval |
|---|---|---|
| Per-turn token allocation across tiers | Implicit. Each framework returns 'top K' rows; the agent author is left to count tokens after the fact. Hot-path budget is a runtime surprise. | memory/retrieval_budget.yaml in your repo. Versioned by PR. Loaded by build_prompt() on every turn. Cap is asserted, not hoped for. |
| Per-tier latency ceiling | Most frameworks expose a global timeout. A slow long-term retrieval blocks the whole turn while the user watches a typing indicator. | max_latency_ms per tier in the budget file. A tier that blows the ceiling is dropped this turn, logged to retrieval_drops.jsonl, and the agent answers without it. |
| What happens when a tier returns zero rows | Silent. The agent answers without context from that tier and you find out from a customer ticket six weeks later. | drop_tier and log. The agent does not silently fall back to a worse tier or stall. The drop is auditable in a JSONL line per occurrence. |
| Where the cold tier lives | Cold and long-term blur into one bucket. Old rows compete with recent grounded rows for the same K slots and ranking can favor the older row. | Allocated zero tokens on the hot path. A cold row only enters context after scripts/promote-cold.sh moves it into long-term, in a separate turn. |
| Embedding model swap (text-embedding-3 -> 4) | Re-embed everything, recalibrate K, watch retrieval quality regress for a week, eyeball it. No file you can grep to prove it landed cleanly. | embedding_model_sha256 pinned in retrieval_budget.yaml. A swap is one PR that updates both the SHA and the per-tier similarity_floor. The budget is unchanged. |
| Framework swap (pgvector -> Mem0 -> Zep) | A framework migration. Re-pick K, re-test latency, rewrite the agent's retrieval calls, regress recall for two weeks, hope nobody notices. | long_term_retrieved.source flips. The interface (k, similarity_floor, max_latency_ms, namespace) is identical across runners. Hot path budget is unchanged. |
| Auditability after a customer escalation | A vendor dashboard, sometimes. A pcap of redis calls, hopefully. A retrieval log buried in stdout, if the engineer added one. | grep <user_id> memory/retrieval_drops.jsonl shows every tier that was skipped on every turn for that user. Six commands answer 'why did the agent not know that'. |
“Same retrieval_budget.yaml shape across pgvector + redis, Mem0 + redis, Zep on Bedrock, Letta on Vertex, and custom Postgres retrieval. The framework choice was the easy reversible decision; the per turn budget was the part that survived every swap. None of these clients run a vendor dashboard for the budget, because the budget lives in their repo.”
PIAS leave-behind across 5 named production agents, model-vendor neutral
How we calibrate the budget in 6 weeks
One week to draft the file, one week to measure latency, two weeks to redistribute the allocations from the actual usage histogram, two weeks to lock the budget and let the framework choice float on top. By week 6 the budget is a maintained artifact, graded weekly, not a guess from week 2.
Week 1: pick the input window cap, then back-fill the tiers
Start from your model's input window (200k for the longest, 128k for most production agents). Subtract reserved_for_output (28k is conservative). Subtract system + tools + current turn (about 10k). What is left, around 90k, is the retrieval budget. Allocate it across short_term, working, long_term_retrieved, and rag_verified_docs in roughly that order of trustworthiness.
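The same arithmetic as a sketch. Every number below is an assumption to be replaced by your own model, system prompt, and usage histogram; the tier split is illustrative, not a default.

```python
# Illustrative back-fill for a 128k window.
input_window = 128_000
reserved_for_output = 28_000
system_tools_turn = 10_000            # system prompt + tool schemas + current user turn
retrieval_budget = input_window - reserved_for_output - system_tools_turn   # 90_000

allocation = {                        # roughly in order of trustworthiness
    "short_term": 18_000,
    "working": 12_000,
    "long_term_retrieved": 22_000,
    "rag_verified_docs": 30_000,
}
assert sum(allocation.values()) <= retrieval_budget
```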
Week 2: pick max_latency_ms per tier from production p95
Run 30 representative queries against your actual store, in your actual region, at your actual concurrency. Take p95 latency per tier and add 30 percent slack. That is your max_latency_ms ceiling. Ours land near 12, 60, 180, 240ms for short_term, working, long_term, rag respectively. Yours will differ.
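A small helper of the kind used to derive the ceiling; the quantile index and the rounding follow the rule above (p95 per tier, plus 30 percent slack, nearest 10ms). The sample numbers are hypothetical.

```python
import statistics


def latency_ceiling_ms(samples_ms: list[float]) -> int:
    """p95 of the measured per-tier latencies, plus 30% slack, rounded to 10ms."""
    p95 = statistics.quantiles(samples_ms, n=20)[18]   # 95th percentile cut point
    return max(10, round(p95 * 1.3 / 10) * 10)


# e.g. 30 representative long_term queries with a ~135ms p95 land near a 180ms ceiling
```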
Week 3: turn on retrieval_drops.jsonl and watch for a week
Every dropped tier writes one JSONL line. After a week, group by tier and reason. If long_term_retrieved drops more than 1 percent of turns from latency_breach, raise the ceiling or move the index closer to the agent. If it drops more than 5 percent from zero_results, your similarity_floor is too high.
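A sketch of that week-3 grouping over the drop log. total_turns has to come from your own turn counter, which this guide does not prescribe; the thresholds in the comments are the ones named above.

```python
import json
from collections import Counter
from pathlib import Path


def drop_rates(drop_log: Path, total_turns: int) -> dict[tuple[str, str], float]:
    """Per (tier, reason) drop rate over the turns in the window."""
    counts: Counter = Counter()
    for line in drop_log.read_text().splitlines():
        rec = json.loads(line)
        counts[(rec["tier"], rec["reason"])] += 1
    return {key: n / total_turns for key, n in counts.items()}


rates = drop_rates(Path("memory/retrieval_drops.jsonl"), total_turns=120_000)
# >1% latency_breach on long_term_retrieved -> raise the ceiling or move the index closer.
# >5% zero_results -> the similarity_floor is too high.
for (tier, reason), rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{tier:24s} {reason:16s} {rate:.2%}")
```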
Week 4: re-derive the token allocations from the actual usage histogram
Run scripts/budget-usage.sh to compute, per tier, the p50 and p95 of tokens actually used out of the allotment. If short_term's p95 sits under 9k while you allocated 18k, redistribute the slack to the tier that runs hot. The budget is a guess until the histogram says otherwise.
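scripts/budget-usage.sh is the client's own; a sketch of the same comparison, assuming you collect per-tier token usage per turn somewhere (a hypothetical usage log written by build_prompt() works). Only the p50/p95-versus-allotment computation is shown.

```python
import statistics


def usage_summary(used_tokens: list[int], allotted: int) -> dict[str, int]:
    """p50 and p95 of tokens actually used out of the tier's allotment."""
    p95 = int(statistics.quantiles(used_tokens, n=20)[18])
    return {"p50": int(statistics.median(used_tokens)),
            "p95": p95,
            "allotted": allotted,
            "slack_at_p95": allotted - p95}


# Illustrative: short_term allocated 18k but running at a ~9k p95 -> redistribute the slack.
print(usage_summary([8200, 8900, 9100, 7600, 8800], allotted=18_000))
```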
Week 5: lock the budget and let the framework choice float
By week 5 the budget should be stable enough to survive a framework swap. Flip long_term_retrieved.source from pgvector to Mem0 in a one-line PR; the interface stays the same and the histogram should not move outside its band. If it does, the framework is doing something the budget did not anticipate, and that is the conversation to have, not 'pick a framework'.
Week 6: production rubric grades the budget weekly
scripts/audit-budget.sh runs Friday 14:30 UTC, reads the last 7 days of retrieval_drops.jsonl, computes per-tier drop rate, and opens a 'budget-drift' PR if any tier crosses the threshold (default 1.5 percent). The budget becomes a maintained artifact, not a guess from week 2.
scripts/audit-budget.sh, the cron that keeps the budget honest
Sixty lines of bash. Reads memory/retrieval_drops.jsonl over the trailing seven days, computes per tier drop rate against a 1.5 percent threshold, and opens a budget-drift PR if any tier crosses. Runs Friday 14:30 UTC, fifteen minutes after the recall audit from the production agent memory guide. The two crons together grade the read side and the write side of the memory layer in sequence.
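The schedule itself is wired by the workflow file named in the leave-behind list below. A minimal sketch of that wrapper, assuming GitHub Actions cron syntax; 14:30 UTC on Friday is "30 14 * * 5".

```yaml
# .github/workflows/budget-audit.yml -- minimal sketch of the scheduling wrapper
name: budget-audit
on:
  schedule:
    - cron: "30 14 * * 5"            # Friday 14:30 UTC
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: scripts/audit-budget.sh   # opens a budget-drift PR if any tier crosses 1.5%
```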
Want a senior engineer to draft your memory/retrieval_budget.yaml against your model and store?
60 minute scoping call with the engineer who would own the build. You leave with a draft of memory/retrieval_budget.yaml against your input window and your store, the per tier latency ceilings derived from your actual p95 numbers, and a fixed weekly rate to ship the budget, the build_prompt() function, and the first audit cron run inside your repo.
AI agent memory systems, the per turn retrieval budget, answered
What does 'AI agent memory system' actually mean in 2026?
It is the layer that decides what the agent remembers between turns and between sessions, and what reaches the model on any given turn. It has two halves. The first is the storage half: a vector store, a graph store, a key value store, or some combination, plus the policy that gates writes (see the production-ai-agent-memory guide). The second is the retrieval half: the per turn allocation of the input window across short term, working, long term, and verified document tiers. Most articles cover the first half. The second half is the focus of this guide because it is the half no framework ships.
How is this different from RAG?
RAG retrieves from a fixed corpus you indexed once. Agent memory retrieves from a per user, per session, ever growing store the agent itself wrote into. From the model's perspective both arrive in the input window as text; from the operator's perspective they have totally different failure modes. RAG fails when the index is stale. Memory fails when the writes are ungrounded or the retrieval allocation is wrong. The retrieval_budget.yaml shape covers both because rag_verified_docs is one of the tiers the budget allocates against.
Why is the per turn allocation a separate file from the framework config?
Two reasons. First, the framework choice is reversible (we have moved clients between Mem0, Zep, Letta, and pgvector in single PRs). The budget is not, and you do not want to re-engineer it on every framework swap. Second, the budget is the only place the agent's input window is governed. If the allocation lives inside framework config, the framework owns the answer to 'how much of my context window is this tier allowed to consume?'. That answer should live in your repo, not in a vendor's defaults.
How do I pick the per tier max_latency_ms ceiling?
Run 30 representative queries against your actual store, in your actual region, at your actual concurrency level. Take p95 per tier, add 30 percent slack, round to the nearest 10ms. Our defaults (12 / 60 / 180 / 240ms for short term, working, long term, rag) come from five named production agents on pgvector and Mem0. Yours will differ if you run on Zep cloud, on a colocated index, or behind a regional gateway. Recalibrate at week 4 from real production traffic, not week 2 lab data.
What happens when a tier blows its latency ceiling?
It is dropped from this turn's context. Not retried, not awaited, not silently kept by stretching the deadline. The agent answers from the remaining tiers and writes one JSONL line to memory/retrieval_drops.jsonl with the tier, the turn_id, the reason, and the saw_ms. The Friday cron grades the per tier drop rate against a 1.5 percent threshold. Above that, a PR opens. Below it, the budget is doing what it is supposed to do.
Does this work with Mem0, Zep, Letta, Cognee, pgvector, and Redis?
Yes. The budget file does not name a framework; it names tiers (short_term, working, long_term_retrieved, rag_verified_docs) and gives each tier a source. The source is the only field that changes when the framework changes. We have shipped this shape on pgvector + redis at one client, Mem0 + redis at another, Zep on Bedrock at a third, Letta on Vertex at a fourth, and a custom retrieval over Postgres at a fifth. The budget file is identical across all of them; only the source field flips.
Why allocate zero tokens to the cold tier on the hot path?
A cold tier exists to keep rows that may matter again, without paying ranking cost on every turn. If a cold row enters every retrieval call, the cold tier stops being cold; it is just slow long term storage. Promotion to long term is an explicit step (scripts/promote-cold.sh, in a separate turn) that lets the operator notice that the cold tier is being mined heavily, which is itself a signal that the long term TTL is too aggressive. The hot path budget is for the model; the cold tier is for the operator.
How does the budget interact with prompt caching?
Prompt caching wants stable prefixes. The budget file produces a stable prefix shape because system_prompt and tools are fixed cost, current_user_turn is the last block, and the retrieved tiers sit between. Within that, short_term and rag_verified_docs are the two tiers that change least turn over turn. We sort the parts so the longest stable prefix sits first and gets the cache hit. Without a budget the prefix shape changes every turn and the cache hit rate drops to single digits.
Should I increase the budget if I move from a 128k to a 200k input window?
Increase reserved_for_output first (28k is the floor for most production agents in 2026; 32k or 40k is fine on a 200k window), then increase rag_verified_docs because that tier improves more from extra capacity than long_term_retrieved does. Do not raise short_term beyond 18k unless you have measured that turn over turn coherence is breaking; the marginal turn rarely needs more than 12 turns of literal history once the rolling summary is in place.
What is the leave behind on a 6 week engagement that includes the retrieval budget?
Six files in the client repo on main: memory/retrieval_budget.yaml, memory/build_prompt.py, memory/tiers.py, scripts/audit-budget.sh, scripts/budget-usage.sh, .github/workflows/budget-audit.yml. Plus memory/retrieval_drops.jsonl which the runtime writes and the cron reads. The named senior engineer who shipped it leaves a runbook in memory/runbook.md (shared with the write gate from the production-ai-agent-memory guide) that walks the on call engineer through 'long_term_retrieved is dropping at 4 percent, what now?'. We leave; the budget stays.
Adjacent guides
More on the leave-behind that defines production-ready
Production AI agent memory: the write-gate and the audit cron
The write side of the memory layer. memory/policy.yaml gates every long-term write; this guide is the read side companion that governs the per turn input window.
LLM agent eval harness: the judge-drift failure mode and the calibration cron
How the LLM that grades your agent is itself a moving target, and the eval/judge_prompts.yaml shape that pins the judge per axis. Eval-layer companion to this read-layer guide.
Production AI agent evals: the two-clock system
How the rubric clock (model release qualification) and the cases clock (Friday tail mining) interact. The cadence layer that surrounds the memory and judge layers.