Memory and Context Management: The Hardest Problem in Building with Agents

Part 2 of 5 | Series: What We Learned from the Claude Code Leak

If you have been working with agents, you know that moment when you feel the session starting to drift? You are several hours into your session. The context window fills up. The agent starts forgetting things it knew twenty minutes ago. You patch it with summaries, reintroducing requirements and guidelines. This is so painful and frustrating.

This is still one of the central unsolved problems in agent engineering: context and memory.

This is my second post inspired by the Claude Code leak. As we all dig deeper into it, the memory problem is one of the most technically interesting parts. Anthropic has clearly been working on memory architecture at a depth most others are not close to, and the source makes that clearly visible.

If you are trying to build an engineering organization that can execute with agents in the loop, context and memory management becomes a must-solve problem. An agent that forgets what it was doing is not that useful, because it cannot stay reliable on complex work. Once the context window starts to fill, you cannot just delegate and walk away. You have to break the work into smaller pieces, constantly rebuild context, and re-prompt things the agent already knew. That is why memory matters so much. Memory is not separate from the context window problem. It is the only real way to extend an agent beyond it.

This is also why I wrote recently about Why I Think Dreaming Is a Real Breakthrough for Agent Memory. That post is about OpenClaw, but the underlying problem is the same. Once the working session gets too large, raw context stops being enough. What matters is whether the system can consolidate, curate, and promote the pieces of memory that deserve to survive into future work.

This also lines up with what I have seen while building my own agents and what I see happening with my teams. In longer sessions, once context got very large and compaction kicked in, quality started to degrade in subtle ways first. The agent would repeat work it had already done, miss constraints it already knew, or get more generic on multi-step tasks. That is the real reason I care so much about memory architecture. There is nothing more frustrating than working for hours with an agent and then realizing it forgot what it was doing, or forgot a key guideline. The worst part is that you do not know what it forgot, so you have to assume it forgot everything and rebuild the context from scratch.

If the memory system is good enough, the context window stops being such a problem. Two hundred thousand tokens is already a lot to work with if the agent knows what to keep live, what to compress, and what to recover only when needed.

That is also why one of the most interesting parts of the Claude Code leak is not just the size of the context window. It is the architecture around managing it.

Not a Flat File#

The first thing to understand: Claude Code’s memory system is not just a CLAUDE.md file. It is a 3-layer index, and each layer exists for a specific reason.

Layer 1: The Index (always loaded). Just pointers. About 150 characters per line. Cheap to keep in context at all times. This is the table of contents, not the book.

Layer 2: Topic files (loaded on demand). The actual knowledge. Fetched only when relevant to what the agent is working on in the current turn. If you are debugging an authentication issue, the agent loads what it knows about auth. It does not also load everything it knows about the data pipeline.

Layer 3: Transcripts (never loaded, only searched). Historical session logs are never read directly into context. They are searched for specific information when needed, like a search index, not a source of truth.

The write discipline is equally strict. Always write to a topic file first, then update the index pointer. Never dump content into the index directly. And if a fact can be re-derived from the codebase itself, it is not stored at all.

The real idea here is bandwidth awareness. Most agent memory implementations load everything into context every turn. It is expensive and introduces noise. Claude Code treats the context window as a scarce resource. What they choose not to store matters as much as what they do.

                    MEMORY SYSTEM

    +---------------------------------------------+
    | Index layer                                 |
    | always loaded, tiny pointers, cheap lookup  |
    +---------------------------------------------+
                      |
                      v
    +---------------------------------------------+
    | Topic files                                 |
    | loaded on demand when context is relevant   |
    +---------------------------------------------+
                      |
                      v
    +---------------------------------------------+
    | Transcripts                                 |
    | never loaded whole, only searched when      |
    | the agent needs a specific fact             |
    +---------------------------------------------+

Five Strategies for When Memory Runs Out#

Every long-running agent will eventually overflow its context window. It is not an edge case. It is a guaranteed failure mode in production. Claude Code handles it with five distinct compaction strategies in a hierarchy, each a fallback for the one before it.

Proactive compaction. Token count is monitored each turn. When it approaches the limit, older messages are summarized and replaced with a “compact boundary” marker before the API call is made. The user never sees a failure.

Reactive compaction. If the proactive check misses and the API returns prompt_too_long, this catches the error, compacts retroactively, and retries. The user sees a brief delay, not a crash.

Snip compaction. SDK and headless mode only. Instead of summarizing, it truncates at defined boundaries to keep memory bounded in long automated sessions where scroll history does not matter.

Context collapse (codenamed marble_origami internally). Compresses verbose tool results mid-conversation without triggering full compaction. If a tool returned 500 lines of output three turns ago and it is no longer relevant, this collapses it to a shorter representation. The important detail: collapse commits are persisted as ContextCollapseCommitEntry records, which means they can be selectively un-compacted later if needed.

Memory prefetch during streaming. While the model generates its response, the system prefetches relevant memories from CLAUDE.md files in parallel. By the time tools start executing, the relevant memory is already loaded. This hides the I/O latency of memory retrieval entirely. The session flows without a visible pause.

Another way to see the stack:

normal turn
   |
   +--> proactive compaction
   |
   +--> if estimate misses: reactive compaction
   |
   +--> if headless / long automation: snip compaction
   |
   +--> if old tool output is noisy: context collapse
   |
   +--> in parallel: memory prefetch

AutoDream: The Agent That Learns While You Sleep#

The most striking piece of memory architecture in the entire codebase is a system called AutoDream, currently hidden behind a disabled KAIROS feature flag.

When the user goes idle or manually tells Claude to sleep at the end of a session, AutoDream runs. The prompt Anthropic wrote for it says: “you are performing a dream, a reflective pass over your memory files.”

The dream process does four things:

Scans the day’s transcripts for “new information worth persisting”
Consolidates that information, avoiding “near-duplicates” and “contradictions”
Prunes memories that are overly verbose or newly outdated
Watches for “existing memories that drifted” and corrects them

The goal, in the code’s own words: “synthesize what you’ve learned recently into durable, well-organized memories so that future sessions can orient quickly.”

Two implementation details matter. First, AutoDream runs in a forked subagent with limited tool access, so it cannot corrupt the main context while reorganizing memory. Second, memory is treated as a hint, not as truth. The agent verifies before using stored information. I wish my dreams and memory operated like this 😂

You completed your tasks on Friday. You come back on Monday. The agent was not running new tasks. But it spent the weekend organizing what it already knew, so it can orient to your work faster on Monday morning.

For an engineering org, this matters a lot. One of the hardest parts of scaling execution is that context lives in too many places: docs, PRs, Slack, incident postmortems, tribal knowledge. The interesting thing about this memory design is not just that it helps the agent. It suggests a different way to think about organizational context itself.

I had to do a smaller version of this recently with my ghostwriter agent’s historical memory. Before importing it into Dreaming, I had to verify that the bundle was clean, scoped only to that agent, preview the grounded output, and make sure I was not mixing in other agents or operational junk. That is not a glamorous task, but it is the real work of turning memory into something durable instead of just accumulated data.

What This Means for Teams Building Agents#

The three-layer index and the five-layer compaction stack did not exist because they were elegant. They exist because someone hit every failure mode in production and built a response to each one. This is what time spent working with agents on real workloads looks like.

If you are building agent systems today, the questions worth asking are:

Are you loading all information, all memory into context every turn, or are you treating the context window as a resource to be managed?
What happens to your agent at turn 200 of a long session? Do you have a tested way to recover and verify the work of the agent?
When the session ends, what survives? Does the next session start fresh, or does it know what happened?

The Claude Code source suggests that the harness, not just the model, is where a lot of this problem gets solved. Maybe models will get better here over time. But today, if you want consistent outcomes from agents at scale, you need to build around the limitation.

My takeaway is simple: if you want Agentic-First execution to be real inside an engineering org, you need to invest seriously in your context engine. It cannot be an afterthought, and it cannot be just a vector database bolted on at the end. Memory, compaction, retrieval, and consolidation have to be part of the architecture from day one.

Sources: Haseeb Qureshi, Inside the Claude Code source · Engineer’s Codex, Diving into Claude Code’s Source Code Leak · Ars Technica, Here’s what that Claude Code source leak reveals about Anthropic’s plans