Claude Code's 7-Layer Memory System: A Complete Re-architecture of 'Context Management'
200K tokens of context not enough? Claude Code solves it with a 7-layer 'memory waterfall'—from dumping tool results to disk all the way to background 'dream'-style consolidation. Each layer operates on a 'lowest cost first' principle, intercepting work before it reaches a more expensive layer. What's truly valuable isn't the number of features, but that almost every decision revolves around preserving the prompt cache.
The context window is only so big: a single grep can take 100KB, a cat can take 50KB. Read a few files, run a few commands, and that 200K token budget is gone in an instant.
Claude Code's solution isn't to just increase the window—it built a 7-layer memory system, ranging from sub-millisecond lightweight cleanup to background 'dream'-like long-term integration. Each layer tries its best to prevent the next layer from triggering. The cheap layers block first, the expensive ones serve as a safety net.
**Tool Result Storage (Layer 1)**
Outputs exceeding a threshold are dumped straight to disk; only a 2KB preview plus a label remains in context. The key is ContentReplacementState, which 'freezes' the record of which results were replaced, keeping the prompt prefix byte-for-byte identical across API calls so the cache can still hit.
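A minimal sketch of the idea, assuming a size threshold and an in-memory dict standing in for the on-disk store (the names `store_tool_result`, `DUMP_THRESHOLD`, and the label format are illustrative, not Claude Code's actual API):

```python
# Hypothetical sketch of Layer 1: large tool outputs are spilled out of
# context, leaving only a short preview plus a stable label behind.
import hashlib

PREVIEW_BYTES = 2048          # ~2KB preview kept in context (per the text)
DUMP_THRESHOLD = 10 * 1024    # assumed spill threshold

def store_tool_result(output: str, store: dict) -> str:
    data = output.encode("utf-8")
    if len(data) <= DUMP_THRESHOLD:
        return output  # small results stay inline
    label = "tool-result-" + hashlib.sha256(data).hexdigest()[:12]
    store[label] = output  # stands in for a file written to disk
    preview = data[:PREVIEW_BYTES].decode("utf-8", errors="ignore")
    # The replacement is deterministic: the same output always yields the
    # same label and preview, so the prompt prefix stays byte-identical
    # across API calls and the server-side cache can still hit.
    return f"[{label}]\n{preview}\n…(truncated, {len(data)} bytes total)"
```

The deterministic replacement is the point: if the substituted text varied between calls, the frozen prefix would change and the cache would miss.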
**Micro-compression (Layer 2)**
Old tool results are cleaned up before each round of calls. Three mechanisms: time-triggered (inactivity >60 minutes), cache_edits (server-side deletions and modifications that don't invalidate the local cache), and the native context_management API. It only targets results from tools like FileRead, Grep, and Bash; thinking blocks and user messages remain untouched.
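The time-triggered path can be sketched like this; the message shape, field names, and tool allow-list are assumptions based on the description above, not the real data model:

```python
# Illustrative sketch of Layer 2's time-triggered cleanup: only stale
# results from "cheap" tools are collapsed; everything else passes through.
COMPRESSIBLE_TOOLS = {"FileRead", "Grep", "Bash"}
INACTIVITY_SECS = 60 * 60  # >60 minutes of inactivity triggers cleanup

def micro_compress(messages: list[dict], now: float) -> list[dict]:
    out = []
    for m in messages:
        is_stale_tool = (
            m.get("role") == "tool"
            and m.get("tool") in COMPRESSIBLE_TOOLS
            and now - m.get("ts", now) > INACTIVITY_SECS
        )
        if is_stale_tool:
            out.append({**m, "content": "[cleared by micro-compression]"})
        else:
            out.append(m)  # thinking blocks and user messages are untouched
    return out
```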
**Session Memory (Layer 3)**
Continuously maintains structured Markdown notes under ~/.claude/projects/.../session-memory/.md. When a threshold is crossed, it forks a sub-agent to extract a summary; when compaction is actually needed, the ready-made notes are injected directly—saving one summarizer API call.
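The pre-computation is the trick: the expensive summarization happens ahead of time, so compaction itself is cheap. A sketch under assumed names (the threshold value and the `summarize` callback stand in for the forked sub-agent):

```python
# Illustrative sketch of Layer 3: notes are refreshed in the background once
# a threshold is crossed, so compaction can inject them without a fresh call.
NOTE_THRESHOLD = 100_000  # assumed token threshold, not the real value

class SessionMemory:
    def __init__(self, summarize):
        self.summarize = summarize  # stands in for the forked sub-agent
        self.notes = ""

    def maybe_refresh(self, context_tokens: int, transcript: str) -> None:
        if context_tokens > NOTE_THRESHOLD:
            self.notes = self.summarize(transcript)

    def notes_for_compaction(self) -> str:
        # Ready-made notes are injected directly; no summarizer call here.
        return self.notes
```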
**Full Compression (Layer 4)**
This is the last line of defense. It forks a dedicated summarizer that produces a 9-section structured summary, drafting in a scratch area that is stripped from the final output to improve quality. After compression, key context—recent files, skills, plans—is re-injected, and a CompactBoundaryMessage marker is inserted as a breakpoint. A circuit breaker trips after 3 consecutive failures, preventing infinite retry loops from wasting money.
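The circuit breaker is the simplest piece to illustrate. A minimal sketch (class and method names are hypothetical; only the "3 consecutive failures" rule comes from the text):

```python
# Minimal circuit-breaker sketch: after three consecutive summarizer
# failures, full compression stops retrying; one success resets the count.
class CompactionBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def allow(self) -> bool:
        return self.failures < self.max_failures

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

Counting only *consecutive* failures matters: a transient error shouldn't permanently disable the last line of defense, so any success re-arms the breaker.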
**Automatic Memory Extraction (Layer 5)**
Runs after each full query cycle, extracting cross-session persistent knowledge into the project's memory/ directory. Four memory types: feedback, workflows, environment preferences, tech stack. A mutex mechanism ensures memories already written by the main agent aren't duplicated by the background process. MEMORY.md serves as an index, hard-capped at 200 lines / 25KB.
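A hard cap like 200 lines / 25KB is easy to enforce mechanically. A sketch, assuming whole lines are dropped from the end when either limit is exceeded (the trimming policy is my assumption; the limits come from the text):

```python
# Sketch of the MEMORY.md hard cap: 200 lines / 25KB (limits from the text).
MAX_LINES = 200
MAX_BYTES = 25 * 1024

def enforce_memory_cap(text: str) -> str:
    lines = text.splitlines()[:MAX_LINES]           # drop lines past the cap
    capped = "\n".join(lines)
    while len(capped.encode("utf-8")) > MAX_BYTES:  # then trim whole lines
        lines.pop()
        capped = "\n".join(lines)
    return capped
```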
**Dream System (Layer 6)**
Background cross-session integration, analogous to human memory consolidation during sleep. Four phases: Orient → Gather → Consolidate → Prune. Read-only tool restrictions (ls, grep, cat), file locks prevent concurrency. Users can monitor and terminate via the UI.
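The shape of that loop can be sketched as follows; the phase names and read-only tool list come from the text, while the function names and the callback are illustrative:

```python
# Hedged sketch of the dream loop: four phases run in order, and a tool
# allow-list enforces read-only access throughout.
READ_ONLY_TOOLS = {"ls", "grep", "cat"}
PHASES = ["Orient", "Gather", "Consolidate", "Prune"]

def check_tool(tool: str) -> bool:
    # Any tool outside the allow-list is rejected during a dream run.
    return tool in READ_ONLY_TOOLS

def run_dream(run_phase) -> list[str]:
    completed = []
    for phase in PHASES:
        run_phase(phase)  # each phase may only use read-only tools
        completed.append(phase)
    return completed
```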
**Cross-Agent Communication (Layer 7)**
All background operations run on forked agents: each fork clones mutable state while sharing the prompt-cache prefix. The Agent tool's multi-mode spawning and SendMessage handle memory, compression, approval, and communication.
**Underlying Design Philosophy**
1. **Layered Defense + Lowest Cost First**: Each layer strives to prevent the next from triggering, forming an efficient waterfall.
2. **Extreme Prompt Cache Optimization**: ContentReplacementState freezing, renderedSystemPrompt passthrough, cache_edits, fork prefix byte consistency—nearly every decision revolves around not breaking the cache.
3. **Balance of Isolation and Sharing**: Forking sub-agents prevents state pollution while maximizing cache reuse.
4. **Fault Tolerance & Control**: Circuit breakers, mutex protection, GrowthBook feature flags allow remote disabling of any module, graceful degradation.
5. **Observability**: From UI progress pills to detailed logs, full transparency for background tasks.


For most people, talking about context management means 'how big is the window' or 'how good is the compression algorithm.' Claude Code takes a different path: it focuses not on compression itself but on making compression happen less often, or not at all. Tool results go straight to disk and stay out of context whenever possible; anything micro-compression can clean up never waits for session memory; anything session memory can handle never reaches the full summarizer.
Put simply, the core of this system isn't 'memory' at all—it's **cache protection**. All the fancy mechanisms—ContentReplacementState freezing, cache_edits, fork prefix byte consistency—point to the same goal: not letting that ~1-hour-TTL server-side prompt cache expire. On a miss, re-processing a 200K-token prefix costs roughly 200 times more than serving it from cache. That's the real cost driver.
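Taking the article's ~200x figure at face value, the arithmetic is straightforward. A back-of-envelope sketch (the dollar rate below is a made-up placeholder, not a real API price; only the ratio and the 200K-token prefix come from the text):

```python
# Back-of-envelope cost of a cache miss vs. hit on a 200K-token prefix,
# using the article's ~200x miss/hit ratio. Prices are placeholders.
MISS_COST_PER_MTOK = 3.00                       # hypothetical $/1M uncached tokens
HIT_COST_PER_MTOK = MISS_COST_PER_MTOK / 200    # per the article's ~200x claim

def prefix_cost(tokens: int, cache_hit: bool) -> float:
    rate = HIT_COST_PER_MTOK if cache_hit else MISS_COST_PER_MTOK
    return tokens / 1_000_000 * rate

ratio = prefix_cost(200_000, cache_hit=False) / prefix_cost(200_000, cache_hit=True)
```

At any per-token price, the ratio is what dominates: every layer that keeps the prefix byte-stable is avoiding the expensive branch of this function.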
For an AI coding tool, features are certainly important. But what Claude Code demonstrates is this: true engineering strength isn't about how much you can do, but about knowing where to save.
Published: 2026-04-01 13:35