Building a Persistent Memory

Fred Pope (@fred_pope)
AI agents are great at handling the current turn, but terrible at remembering what matters over time. Most "memory" implementations are just vector search bolted onto a chat history. That works for finding similar text, but it breaks down when you need actual truth, timelines, relationships, and context that changes.
Think about it: if you ask an agent "What car does this user drive?" and they bought a Honda in March and a Tesla in November, vector search might return either answer. Both are semantically similar to the question. But only one is true right now.
This is the problem I set out to solve.
The problems with typical RAG memory
When I started building agent experiences, I kept hitting the same walls:
Similarity isn't truth. Embeddings happily return contradictory facts if they're semantically close. Ask about a user's preferences and you might get answers from six months ago mixed with answers from yesterday.
Time gets ignored. Old facts drown out recent ones. There's no built-in notion of "this information supersedes that information."
Relationships are invisible. Vector search can't answer questions like "Who owns this project?" or "What companies has this user worked with?" Those require understanding how things connect, not just how similar the words are.
Conflicts pile up. When information changes, the system can't reconcile it. You end up with contradictions sitting side by side in your context window.
Every domain is different. A CRM cares about clients and deals. A project management tool cares about tasks and deadlines. Generic memory systems treat everything the same.
The core insight: combine vectors with a graph
The solution I landed on uses two complementary retrieval strategies:
- Vector similarity search for semantic recall ("find me things related to X")
- Knowledge graph queries for precise relationships ("who owns what")
Neither approach works well alone. Vectors are fuzzy but fast. Graphs are precise but require structure. Together, they cover each other's weaknesses.
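To make the split concrete, here's a minimal sketch of a hybrid retrieval call. Everything in it is illustrative: `embed`, `vectorStore`, and `graph` are hypothetical stand-ins for an embedding call, a pgvector-backed store, and a Neo4j driver, not a real API.

```typescript
interface Fact { text: string; score: number }
interface Edge { from: string; rel: string; to: string }

// All dependencies are hypothetical and passed in so the sketch stays self-contained.
async function hybridRetrieve(
  query: string,
  deps: {
    embed: (text: string) => Promise<number[]>;
    vectorStore: { search: (vector: number[], topK: number) => Promise<Fact[]> };
    graph: { query: (cypher: string) => Promise<Edge[]> };
  },
): Promise<{ facts: Fact[]; edges: Edge[] }> {
  const vector = await deps.embed(query);
  const [facts, edges] = await Promise.all([
    // Fuzzy but fast: "find me things related to X"
    deps.vectorStore.search(vector, 10),
    // Precise but structured: "who owns what"
    deps.graph.query("MATCH (c:Client)-[r:OWNS]->(a:Account) RETURN c, r, a"),
  ]);
  return { facts, edges };
}
```

Both lookups run in parallel, so the precise graph answer never delays the fuzzy semantic one.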
How I structured the memory
I store memory at four levels, each serving a different purpose:
1. Raw resources — The original conversation or document, stored in S3. Never modified. This is your audit trail and source of truth for "what was actually said."
2. Extracted facts — Atomic pieces of information pulled out by an LLM. Stored in PostgreSQL with vector embeddings. Things like "User prefers dark mode" or "Project deadline is March 15."
3. Category summaries — Evolving summaries for each type of information. Instead of returning 50 individual facts about user preferences, you get one coherent summary that represents the current state.
4. Graph relationships — Entities and how they connect. Stored in Neo4j. "User WORKS_ON Project" or "Client OWNS Account."
A single conversation might produce: one raw resource, a dozen extracted facts, three category summaries, and six graph edges. Each layer exists because it solves a specific retrieval problem.
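In TypeScript terms, the four layers might look roughly like this. The field names are my shorthand for illustration, not an exact schema:

```typescript
// Layer 1: the immutable original in S3; the audit trail.
interface RawResource { s3Key: string; content: string; storedAt: Date }

// Layer 2: atomic facts in PostgreSQL, each with a vector embedding.
interface ExtractedFact { text: string; embedding: number[]; category: string; createdAt: Date }

// Layer 3: one evolving summary per category, representing current state.
interface CategorySummary { category: string; summary: string; confidence: number; updatedAt: Date }

// Layer 4: entities and how they connect, stored in Neo4j.
interface GraphEdge { from: string; rel: string; to: string }
```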
The write path: how information gets stored
When new content arrives—a conversation, an event, a webhook—here's what happens:
- Store the raw content (never lose the original)
- Use an LLM to extract atomic facts
- Classify each fact into domain-specific categories
- Update the relevant category summaries, handling conflicts
- Extract entities and relationships into the graph
A key design decision: graph extraction happens asynchronously. The system doesn't block waiting for the graph to update. If Neo4j is slow or unavailable, the vector-based memory still works.
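Here's a sketch of that write path. The `s3`, `llm`, `summaries`, and `graph` clients are hypothetical stand-ins for the real services:

```typescript
// Hypothetical clients; the names are illustrative, not a real SDK.
declare const s3: { put: (content: string) => Promise<string> };
declare const llm: {
  extractFacts: (content: string) => Promise<{ text: string; category?: string }[]>;
  classify: (fact: string, categories: string[]) => Promise<string>;
};
declare const summaries: { update: (fact: { text: string; category?: string }) => Promise<void> };
declare const graph: { extractAndLink: (content: string) => Promise<void> };
declare const config: { categories: string[] };

async function ingest(content: string): Promise<void> {
  await s3.put(content); // 1. store the raw content; never lose the original

  const facts = await llm.extractFacts(content); // 2. atomic facts
  for (const fact of facts) {
    fact.category = await llm.classify(fact.text, config.categories); // 3. classify
    await summaries.update(fact); // 4. evolve the summary; new info supersedes old
  }

  // 5. Fire-and-forget: if Neo4j is slow or down, the vector memory above still works.
  graph.extractAndLink(content).catch((err) => console.error("graph update failed", err));
}
```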
The read path: how context gets assembled
When an agent needs context, the system:
- Figures out which categories are relevant to the query
- Checks if category summaries are fresh and confident enough
- If summaries suffice, returns those (fast, concise)
- If not, falls back to vector search for specific facts
- Queries the graph for relationship context
- Applies time decay so recent information ranks higher
- Assembles everything within the token budget
This layered approach means most queries get answered from summaries (cheap), with vector search as a fallback (thorough), and graph queries for relationship questions (precise).
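A sketch of that read path, again with hypothetical helpers. The confidence threshold is an assumption, and decay-based re-ranking is omitted for brevity:

```typescript
// All helpers are hypothetical; the names mirror the steps above, not a real API.
declare function classifyQuery(query: string): Promise<string[]>;
declare const summaries: {
  get: (categories: string[]) => Promise<{ text: string; confidence: number }[]>;
};
declare const vectorStore: {
  search: (query: string, topK: number) => Promise<{ text: string }[]>;
};
declare const graph: { neighborhood: (query: string) => Promise<string[]> };
declare function fitToBudget(chunks: string[], tokenBudget: number): string;

async function assembleContext(query: string, tokenBudget: number): Promise<string> {
  const categories = await classifyQuery(query);
  const cached = await summaries.get(categories);
  const fresh = cached.filter((s) => s.confidence > 0.8); // threshold is an assumption

  // Fast path: one confident summary per relevant category answers the query cheaply.
  if (fresh.length === categories.length) {
    return fitToBudget(fresh.map((s) => s.text), tokenBudget);
  }

  // Fallback: specific facts via vector search, relationship context via the graph.
  const [facts, edges] = await Promise.all([
    vectorStore.search(query, 20),
    graph.neighborhood(query),
  ]);
  return fitToBudget(
    [...fresh.map((s) => s.text), ...facts.map((f) => f.text), ...edges],
    tokenBudget,
  );
}
```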
Teaching the system what matters
Here's where it gets interesting. Every application can define what's worth remembering:
```typescript
const config = {
  // What types of information to track
  categories: ["project_context", "user_preferences", "timeline"],

  // What entities exist in your domain
  entityTypes: ["Client", "Project", "Account"],

  // How entities relate to each other
  relationshipTypes: ["OWNS", "WORKS_ON", "REPORTS_TO"],

  // What to always/never remember
  relevance: {
    always: ["budget", "deadline"],
    never: ["password"],
  },
};
```
This domain awareness is what makes the difference. The system can evolve summaries, extract relationships, and prioritize retrieval based on what actually matters in your context.
Learning and forgetting
The system "learns" by continuously extracting facts and evolving summaries. Over time, those summaries become the stable, long-term memory.
But forgetting is just as important. Without it, memory fills with noise.
Explicit deletion handles GDPR and user requests. Call an endpoint and the information is removed.
Time decay automatically down-ranks stale facts. Something mentioned once six months ago matters less than something mentioned repeatedly last week.
Conflict resolution supersedes old information. When a user updates their preferences, the new version wins. The old version stays in the raw log for auditing, but the summary reflects current state.
Domain rules mark certain information as never-remembered (passwords) or short-lived (session tokens).
This balance—learning what matters and forgetting what doesn't—keeps context accurate and relevant.
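Of these mechanisms, time decay is the easiest to show in code. A minimal sketch, assuming exponential decay with a 30-day half-life (the half-life is my choice here, not a fixed parameter):

```typescript
// Similarity is halved for every `halfLifeDays` of age.
function decayedScore(similarity: number, ageDays: number, halfLifeDays = 30): number {
  return similarity * Math.pow(0.5, ageDays / halfLifeDays);
}

decayedScore(0.9, 180); // ~0.01: mentioned once six months ago, nearly forgotten
decayedScore(0.9, 7);   // ~0.77: mentioned last week, still strong
```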
Why timelines beat similarity
Back to the car example:
```text
Timeline

2024-03-10  [Event] Purchased Honda Civic
     |
     |  (later)
     v
2024-11-22  [Event] Purchased Tesla Model 3  -->  current car
```
With pure vector search, "What car does the user have?" might return either event. Both are semantically relevant.
With timeline-aware summaries, the system knows the Tesla supersedes the Civic. The summary converges on "current car: Tesla Model 3." The Honda is preserved in history but doesn't pollute current context.
This is the difference between "find something similar" and "answer what's true now."
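The conflict-resolution rule behind this is tiny: within a category, the newest fact wins, and older versions survive only in the raw log. A sketch:

```typescript
interface TimedFact { text: string; createdAt: Date }

// The newest fact supersedes the rest; older versions stay in the raw
// log for auditing but never reach the summary.
function currentFact(facts: TimedFact[]): TimedFact | undefined {
  return [...facts].sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())[0];
}

currentFact([
  { text: "Purchased Honda Civic", createdAt: new Date("2024-03-10") },
  { text: "Purchased Tesla Model 3", createdAt: new Date("2024-11-22") },
]); // => the Tesla entry
```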
Maintenance keeps memory healthy
Memory needs upkeep:
- Nightly consolidation merges redundant facts
- Weekly summarization compresses old details into higher-level knowledge
- Monthly re-indexing refreshes embeddings as the corpus changes
Without maintenance, memory bloats and retrieval quality degrades.
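As one example, nightly consolidation might drop near-duplicate facts by comparing embeddings. A minimal sketch; the similarity threshold is an assumption:

```typescript
interface EmbeddedFact { text: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep the first occurrence in each cluster of near-duplicates; the
// originals always remain in the raw resource log.
function consolidate(facts: EmbeddedFact[], threshold = 0.97): EmbeddedFact[] {
  const kept: EmbeddedFact[] = [];
  for (const fact of facts) {
    if (!kept.some((k) => cosine(k.embedding, fact.embedding) >= threshold)) {
      kept.push(fact);
    }
  }
  return kept;
}
```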
Short-term vs long-term
One clarification: this system handles long-term memory—stable knowledge that persists across sessions.
Short-term memory (what was said 30 seconds ago) is a different problem. I handle that with checkpointing in the agent runtime. The agent's state gets captured so you can replay, resume, and debug conversations. That's orthogonal to the persistent memory layer.
What I learned
Building this taught me a few things:
Memory is infrastructure, not a feature. Treat it like a database, not an afterthought.
Summaries are underrated. Most queries don't need every individual fact. A well-maintained summary is faster and more useful.
Forgetting is a feature. Systems that only accumulate eventually drown in noise.
Domain awareness matters. Generic solutions work generically. Knowing what entities and relationships matter in your domain unlocks precision that vectors alone can't provide.
AI agents need more than embeddings to build lasting context. They need memory that understands time, truth, relationships, and domain nuance. That's the problem I was trying to solve, and this architecture is my current answer.