
Research: Knowledge System for AI Agents

Purpose: Deep research into agent memory architectures, academic literature, open source tooling, and design options for Kaze's shared knowledge system — the "Wikipedia for Agents."
Status: Research Document
Last Updated: 2026-02-26


Table of Contents

  1. The Problem We're Solving
  2. Academic Literature Survey
  3. Cross-Cutting Patterns from Research
  4. Open Source Tooling Survey
  5. Architecture Options for Kaze
  6. Recommendation
  7. References

1. The Problem We're Solving

1.1 Why Agent Memory Matters

LLMs are stateless by default. Every conversation starts from zero. For agents that operate continuously across days, weeks, and months — processing invoices, managing SEO campaigns, handling CRM workflows — this is a fundamental limitation.

The problem has multiple dimensions:

  • Within a single agent: How does an agent remember what it did yesterday, what it learned from corrections, and what context it needs for the current task?
  • Across agents in the same vertical: How does one SEO agent's discovery ("this keyword strategy works for e-commerce") benefit all other SEO agents?
  • Across verticals: How do patterns learned in one domain ("clients prefer bullet-point reports") transfer to others?
  • Across time: How does knowledge stay current, get refined, and avoid rot?

1.2 The "Wikipedia for Agents" Vision

We envision a shared knowledge system where:

  • Agents can read knowledge contributed by other agents
  • Agents can write new knowledge based on their experiences and discoveries
  • Knowledge is structured (not just a bag of text) with relationships, categories, and hierarchies
  • Knowledge is versioned — every change is tracked with provenance (who changed what, when, why)
  • Knowledge has access control — some knowledge is private to an agent or client, some is shared across the vertical, some is platform-wide
  • Knowledge improves over time through consolidation, refinement, and feedback loops
  • Knowledge is retrievable through multiple strategies — semantic search, graph traversal, and direct lookup

This is analogous to how Wikipedia works for humans: a collaborative, structured, versioned, ever-improving knowledge base that anyone can read and contribute to, with editorial quality controls.

1.3 Memory Types We Need

Drawing from cognitive science (formalized in the CoALA framework), the knowledge system must support distinct types of memory:

| Memory Type | Cognitive Analog | What It Stores | Example in Kaze |
| --- | --- | --- | --- |
| Semantic Memory | "What is true" | Facts, concepts, relationships, domain knowledge | "Title tags should be under 60 characters for SEO" |
| Episodic Memory | "What happened" | Timestamped records of specific events and experiences | "Client A's site lost 30% traffic on Jan 5 after a Google update" |
| Procedural Memory | "How to do it" | Action sequences, workflows, tool usage patterns, code | "To run a site audit: call Ahrefs API → parse results → generate report" |
| Working Memory | "What I'm thinking now" | Current task context, in-progress reasoning | The agent's active context window during a task |
| Reflective Memory | "What I've learned" | Synthesized insights derived from experience | "International invoices need extra tax validation — learned from 15% error rate in Q1" |

Each type has different storage, retrieval, and lifecycle characteristics. A one-size-fits-all approach won't work.


2. Academic Literature Survey

2.1 MemGPT / Letta — Virtual Context Management

Paper: MemGPT: Towards LLMs as Operating Systems (Packer et al., Oct 2023)
Link: https://arxiv.org/abs/2310.08560
Evolved into: Letta (open source framework)

Core Idea

MemGPT draws a direct analogy to operating system virtual memory. It treats the LLM's context window as "RAM" and external storage as "disk," creating an illusion of unlimited memory within fixed context limits.

Memory Architecture

Three tiers:

┌──────────────────────────────┐
│  CORE MEMORY (in-context)    │  ← Always visible to the LLM
│  Fixed-size writeable block  │  ← Agent persona + key user facts
│  Analogous to registers/L1   │  ← Modified via explicit function calls
├──────────────────────────────┤
│  RECALL MEMORY (external)    │  ← Complete conversation history
│  Searchable via function     │  ← Summarized chunks from evicted context
│  calls                       │  ← Nothing is ever lost
├──────────────────────────────┤
│  ARCHIVAL MEMORY (external)  │  ← General-purpose long-term storage
│  Read-write datastore        │  ← Can use vector DB, graph DB, etc.
│  The agent's "filing cabinet"│  ← Processed, indexed information
└──────────────────────────────┘

How It Works

  • When the context window fills up, a queue manager evicts the oldest messages
  • Evicted messages are recursively summarized and stored in recall memory
  • The LLM itself decides what to page in/out via function calls (archival_memory_search(query), archival_memory_insert(content))
  • The agent actively manages its own memory rather than relying on a passive retrieval pipeline
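The paging flow above can be sketched in a few lines of Python. This is an illustrative model, not the Letta API: the `MemoryManager` class, the token accounting, and the truncation-based "summarization" are all stand-ins; only the tool names `archival_memory_search` and `archival_memory_insert` come from the paper.

```python
from collections import deque

class MemoryManager:
    """Toy model of MemGPT-style tiered memory (illustrative, not the real API)."""

    def __init__(self, context_limit_tokens):
        self.context = deque()   # "RAM": messages currently in the context window
        self.recall = []         # evicted history (paper: recursively summarized)
        self.archival = []       # long-term "filing cabinet" datastore
        self.limit = context_limit_tokens
        self.used = 0

    def add_message(self, msg, tokens):
        # Queue manager: evict oldest messages when the window would overflow
        while self.used + tokens > self.limit and self.context:
            old_msg, old_tokens = self.context.popleft()
            self.used -= old_tokens
            # Real systems summarize with an LLM; we just truncate as a stub
            self.recall.append(old_msg[:40])
        self.context.append((msg, tokens))
        self.used += tokens

    # Tools the LLM invokes explicitly to page memory in/out
    def archival_memory_insert(self, content):
        self.archival.append(content)

    def archival_memory_search(self, query):
        # Placeholder keyword match; real systems use embedding search
        return [c for c in self.archival if query.lower() in c.lower()]
```

The key property is that eviction is automatic but retrieval is agent-initiated: nothing comes back into context unless the LLM calls the search tool.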

Short-term vs Long-term

  • Short-term = core memory (what fits in the context window)
  • Long-term = recall + archival memory stored externally
  • The boundary is managed by a page-in/page-out mechanism, similar to OS virtual memory

Retrieval

Agent-initiated via tool calls. The LLM explicitly decides when to search memory and what to search for. This gives the agent agency over its own memory management, but also means the agent can "forget" to look things up.

2026 Evolution — Letta Context Repositories

As of February 2026, Letta introduced Context Repositories — a major rethinking of agent memory:

  • Agent context is stored as files in a git-backed filesystem
  • Agents can spawn subagents to reorganize their own memory
  • Multi-agent concurrent memory writing with git-based conflict resolution
  • Knowledge is versioned with full history, branching, and merging

This is architecturally very close to a wiki. It demonstrates that git-style version control on knowledge is a viable coordination primitive for multi-agent systems.

Strengths

  • Elegant OS analogy makes the architecture intuitive and composable
  • Agent has agency over its own memory (self-directed read/write)
  • No information is permanently lost (conversation history fully preserved)
  • Extensible — archival storage can be backed by any datastore
  • Context Repositories bring versioning and multi-agent coordination

Weaknesses

  • Relies on the LLM correctly deciding when to page in/out (failure mode: forgetting to retrieve relevant information)
  • Multiple LLM calls for memory management add latency and token cost
  • Core memory has a fixed size — no dynamic expansion
  • No built-in importance scoring or automatic consolidation in the base system

Relevance to Kaze

High. The tiered memory model is a good fit for per-agent memory. The Context Repositories pattern (git-based versioning) is directly applicable to our shared knowledge system — it shows that treating knowledge like code (versioned, branchable, mergeable) works in practice.


2.2 Generative Agents — Stanford (Park et al.)

Paper: Generative Agents: Interactive Simulacra of Human Behavior (Park et al., April 2023)
Link: https://arxiv.org/abs/2304.03442
Published at: UIST 2023

Core Idea

25 AI agents living in a simulated town, remembering experiences, reflecting on them, and forming long-term behaviors. The paper introduced the most influential memory retrieval formula in the field.

Memory Architecture

Centered on a Memory Stream — a comprehensive, append-only log of all agent experiences recorded in natural language.

┌─────────────────────────────────────────┐
│            MEMORY STREAM                 │
│  (append-only log of all experiences)   │
│                                          │
│  Entry 1: "Klaus saw Emily painting"     │
│  Entry 2: "Klaus ate breakfast at cafe"  │
│  Entry 3: "Klaus talked to Sam about..." │
│  ...                                     │
│  Entry N: (latest observation)           │
└───────────┬──────────────────────────────┘

    ┌───────┴───────┐
    │               │
┌───┴───┐    ┌──────┴──────┐    ┌────────────┐
│RETRIEVE│    │  REFLECT    │    │   PLAN     │
│        │    │             │    │            │
│Score & │    │Synthesize   │    │High-level  │
│rank    │    │memories into│    │plans →     │
│memories│    │higher-level │    │detailed    │
│        │    │abstractions │    │actions     │
└────────┘    └─────────────┘    └────────────┘

Three processes operate on the memory stream:

  1. Retrieval — A scoring function surfaces relevant memories on demand
  2. Reflection — Periodically synthesizes memories into higher-level abstractions (e.g., many small observations about Klaus → "Klaus is interested in painting")
  3. Planning — Generates high-level plans, recursively decomposed into detailed action sequences

The Tri-Factor Retrieval Formula

This is the paper's most lasting contribution — the de facto standard for agent memory retrieval:

score = α_recency × recency + α_importance × importance + α_relevance × relevance

Where:

  • Recency — Exponential decay function based on time since last access. Recent memories score higher. Decay factor of 0.995 per game hour.
  • Importance — LLM-assigned integer score (1-10) distinguishing mundane events from significant ones. "Ate breakfast" = 1, "Had a breakup" = 9. Scored once at creation time.
  • Relevance — Cosine similarity between embedding vectors of the memory and the current query/context.

All three scores are min-max normalized to [0,1]. In the implementation, all α weights are set to 1 (equal weighting). Top-ranked memories that fit in the context window are included.
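A minimal sketch of the scoring function, with one simplification: the paper min-max normalizes each factor over the candidate set, whereas this version normalizes importance analytically (recency and relevance already land in [0,1]). The dict layout for `mem` is an assumption for illustration.

```python
import math

def score_memory(mem, query_emb, now_hours, weights=(1.0, 1.0, 1.0)):
    """Tri-factor retrieval score (recency + importance + relevance).

    mem: dict with 'embedding', 'importance' (1-10), 'last_access_hours'.
    Illustrative layout, not from the paper's code.
    """
    a_rec, a_imp, a_rel = weights
    # Recency: exponential decay, 0.995 per hour since last access
    recency = 0.995 ** (now_hours - mem["last_access_hours"])
    # Importance: LLM-assigned 1-10, mapped to [0, 1]
    importance = (mem["importance"] - 1) / 9
    # Relevance: cosine similarity of memory and query embeddings
    dot = sum(a * b for a, b in zip(mem["embedding"], query_emb))
    norm = (math.sqrt(sum(a * a for a in mem["embedding"]))
            * math.sqrt(sum(b * b for b in query_emb)))
    relevance = dot / norm if norm else 0.0
    return a_rec * recency + a_imp * importance + a_rel * relevance

def retrieve(memories, query_emb, now_hours, k=5):
    # Rank all candidates and keep the top-k that fit in context
    return sorted(memories,
                  key=lambda m: score_memory(m, query_emb, now_hours),
                  reverse=True)[:k]
```

Making `weights` a parameter rather than hardcoding equal weighting leaves room for the configurable or learned weights discussed later.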

The Reflection Mechanism

Reflections are higher-order memories synthesized from lower-level observations:

  1. Triggered when the sum of importance scores of recent memories exceeds a threshold
  2. The agent generates "questions that can be answered given the most recent observations"
  3. For each question, retrieve relevant memories using the tri-factor formula
  4. Synthesize retrieved memories into abstract statements (reflections)
  5. Reflections are stored back in the memory stream as first-class entries (with their own importance scores)
  6. Reflections can be retrieved and used to generate even higher-order reflections

This creates a hierarchical abstraction ladder: raw observations → first-order reflections → second-order reflections → beliefs.
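The trigger-and-synthesize loop can be sketched as below. The threshold value, the output importance score, and the `generate_insight` callable are all illustrative stand-ins for the paper's LLM calls (question generation, tri-factor retrieval, synthesis).

```python
def maybe_reflect(recent_memories, generate_insight, threshold=150):
    """Sketch of the Generative Agents reflection trigger (values illustrative).

    recent_memories: dicts with an 'importance' score (1-10 scale).
    generate_insight: stand-in for the LLM pipeline that poses questions,
    retrieves evidence, and writes abstract statements.
    """
    total = sum(m["importance"] for m in recent_memories)
    if total < threshold:
        return None  # not enough significant activity to reflect on
    insight = generate_insight(recent_memories)
    # Reflections re-enter the stream as first-class memories, so they can
    # themselves be retrieved and feed higher-order reflections
    return {"text": insight, "importance": 8, "kind": "reflection"}
```

Because the returned reflection is stored like any other memory, running this repeatedly yields the observation → reflection → higher-order reflection ladder described above.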

Strengths

  • The tri-factor retrieval formula is simple, effective, and widely adopted across the field
  • Reflection mechanism enables emergent higher-order reasoning about experiences
  • Natural language storage is human-readable and debuggable
  • Proved that believable long-term agent behavior is achievable with relatively simple mechanisms
  • The memory stream is append-only, which is great for audit trails

Weaknesses

  • Memory stream grows without bound — no garbage collection, compression, or consolidation
  • Importance scoring relies on LLM judgment (noisy, and requires an LLM call per memory at creation)
  • Equal weighting of α values is a design choice, not learned from data
  • Reflection is triggered by a fixed threshold, not adaptively based on need
  • Single-agent design — no mechanism for shared memory across agents
  • Embedding-based relevance can miss important memories that are conceptually related but lexically different

Relevance to Kaze

The tri-factor retrieval formula is our baseline retrieval mechanism. We should implement recency + importance + relevance scoring for memory retrieval, potentially with learned or configurable weights rather than fixed equal weighting.

The reflection mechanism maps directly to our self-improvement loop — agents synthesizing experiences into reusable knowledge. The key extension we need is making reflections shared (contributing to the vertical knowledge graph) rather than agent-private.


2.3 Voyager — Skill Library as Procedural Memory

Paper: Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al., May 2023)
Link: https://arxiv.org/abs/2305.16291
From: NVIDIA / MineDojo

Core Idea

An AI agent that plays Minecraft open-endedly, building an ever-growing library of reusable skills (as executable JavaScript code). Demonstrates lifelong learning: the agent continuously acquires new capabilities by composing previously learned skills.

Memory Architecture

Voyager's memory is fundamentally procedural — it stores how to do things as executable code:

┌─────────────────────────────────────────┐
│            SKILL LIBRARY                 │
│                                          │
│  Key: embedding(skill_description)       │
│  Value: executable JavaScript program    │
│                                          │
│  "mine_diamond_ore": {                   │
│    description: "Mine diamond ore at...", │
│    code: "async function mineDiamond(){  │
│      await bot.equip('iron_pickaxe');    │
│      await bot.dig(nearestDiamond);      │
│    }"                                    │
│  }                                       │
│                                          │
│  Skills grow monotonically               │
│  Complex skills compose simpler ones     │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│        AUTOMATIC CURRICULUM              │
│                                          │
│  GPT-4 generates exploration objectives  │
│  based on current state + inventory +    │
│  exploration progress                    │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│     ITERATIVE PROMPTING MECHANISM        │
│                                          │
│  Environment feedback + execution errors │
│  + self-verification → refine program    │
│  before adding to skill library          │
└─────────────────────────────────────────┘

Key Mechanism: Self-Verification Before Storage

Before a skill enters the library, it goes through verification:

  1. The agent writes the code
  2. The code is executed in the environment
  3. Success/failure is evaluated
  4. If it fails, the agent iterates (up to 3 times) incorporating error messages
  5. Only verified, working skills are added to the library

This is the equivalent of code review before merging to main — directly analogous to Wikipedia's editorial process.
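The verify-before-store loop reduces to a retry pattern. A minimal sketch, where `run_in_env` (execute and report success/error) and `refine` (the LLM rewrite step) are hypothetical callables standing in for Voyager's environment and prompting machinery:

```python
def add_skill_with_verification(skill_code, run_in_env, refine, library,
                                max_retries=3):
    """Sketch of Voyager's quality gate: only verified skills enter the library.

    run_in_env(code) -> (success: bool, error: str)
    refine(code, error) -> revised code (stand-in for the LLM call)
    """
    code = skill_code
    for _ in range(max_retries):
        success, error = run_in_env(code)
        if success:
            library.append(code)      # merge to "main": skill is now reusable
            return True
        code = refine(code, error)    # iterate, incorporating error feedback
    return False                      # rejected: never pollutes the library
```

The same gate generalizes beyond Minecraft: any shared skill library can require a passing execution (or canary run) before a contribution is accepted.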

Retrieval

Pure embedding similarity — the current task description is embedded, and the most similar skill descriptions are retrieved from the library. No recency or importance weighting (skills don't decay — "how to mine diamond" is always relevant when you need to mine diamonds).

Composability

Complex skills build on simple ones:

  • "mine_diamond" uses "equip_pickaxe" and "find_ore"
  • "build_house" uses "gather_wood", "craft_planks", "place_block"
  • This creates a dependency graph of skills
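A compact sketch of the library itself: embedding-keyed, monotonically growing, with explicit dependency edges. The class and field names are illustrative, and `embed` is a stand-in for a real embedding model.

```python
import math

class SkillLibrary:
    """Sketch of a Voyager-style skill library (names illustrative)."""

    def __init__(self, embed):
        self.embed = embed   # stand-in for an embedding model
        self.skills = {}     # name -> {description, code, deps, vec}

    def add(self, name, description, code, deps=()):
        self.skills[name] = {
            "description": description,
            "code": code,
            "deps": list(deps),              # complex skills compose simpler ones
            "vec": self.embed(description),
        }

    def retrieve(self, task, k=1):
        # Pure cosine similarity: no recency or importance weighting,
        # since "how to mine diamond" never goes stale
        q = self.embed(task)
        def cos(v):
            dot = sum(a * b for a, b in zip(q, v))
            n = (math.sqrt(sum(a * a for a in q))
                 * math.sqrt(sum(b * b for b in v)))
            return dot / n if n else 0.0
        ranked = sorted(self.skills.items(),
                        key=lambda kv: cos(kv[1]["vec"]), reverse=True)
        return [name for name, _ in ranked[:k]]
```

The `deps` list is what makes the library a graph rather than a flat store: retrieving "mine_diamond" tells the caller it must also load "equip_pickaxe".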

Results

  • 3.3x more unique items discovered than baselines
  • Continuous capability growth — skills never forgotten
  • Skills transfer to new environments (the library is portable)

Strengths

  • Procedural memory as code is composable, interpretable, and verifiable
  • Self-verification ensures quality before storage (quality gate)
  • Avoids catastrophic forgetting — skills persist as code, not neural weights
  • Demonstrates lifelong learning with compounding capability
  • Skills have clear input/output contracts

Weaknesses

  • Domain-specific (Minecraft) — the skill format doesn't directly generalize to all domains
  • No episodic or semantic memory — the agent doesn't "remember" events, only procedures
  • Retrieval is embedding-only — no recency, importance, or contextual filtering
  • No mechanism for skill deprecation, versioning, or conflict resolution
  • Single-agent — no shared skill library with access control

Relevance to Kaze

This maps directly to our agent skills model. In our architecture, skills are composable units with inputs, outputs, tool requirements, and quality criteria. Voyager validates this pattern.

Key design takeaways:

  • Skills should be verified before entering the shared library (self-verification / canary)
  • Skills should be composable (complex skills reference simpler ones)
  • Skill retrieval should be by semantic similarity to the current task
  • The skill library should grow monotonically (skills are versioned, not deleted)

2.4 Reflexion — Verbal Reinforcement Learning

Paper: Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., March 2023)
Link: https://arxiv.org/abs/2303.11366
Published at: NeurIPS 2023

Core Idea

Instead of learning from scalar rewards (like RL), agents learn from verbal self-reflection. After each attempt at a task, the agent generates a natural-language reflection on what went wrong and how to improve. These reflections are stored and used to guide future attempts.

Memory Architecture

┌─────────────────────────────────────────┐
│              ACTOR                        │
│  LLM that generates actions              │
│  Conditioned on observations + memory    │
└────────────────┬─────────────────────────┘
                 │ trajectory

┌─────────────────────────────────────────┐
│            EVALUATOR                     │
│  Scores the trajectory                   │
│  Binary or scalar reward signal          │
└────────────────┬─────────────────────────┘
                 │ reward + trajectory

┌─────────────────────────────────────────┐
│         SELF-REFLECTION                  │
│  LLM generates verbal feedback           │
│  "I failed because I didn't check..."    │
│  Stored in episodic memory buffer        │
└────────────────┬─────────────────────────┘


┌─────────────────────────────────────────┐
│       EPISODIC MEMORY BUFFER             │
│  Sliding window of past reflections      │
│  (typically last 3 reflections)          │
│  All loaded into context on next trial   │
└─────────────────────────────────────────┘

Short-term vs Long-term

  • Short-term = current trial's trajectory
  • Long-term = episodic memory buffer of past reflections
  • The buffer uses a sliding window (typically 3 reflections), so older reflections fall off

Retrieval

Simple — all stored reflections within the window are loaded into context. No semantic search or scoring. This works because the buffer is deliberately kept small and curated (only reflections, not raw experiences).
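The whole Actor → Evaluator → Self-Reflection cycle fits in one loop; the sliding window is just a bounded deque. A sketch, with `act`, `evaluate`, and `reflect` as hypothetical callables standing in for the three LLM roles:

```python
from collections import deque

def reflexion_loop(task, act, evaluate, reflect, max_trials=5, window=3):
    """Sketch of the Reflexion trial loop (role callables are stand-ins).

    act(task, reflections) -> trajectory
    evaluate(trajectory)   -> bool (or scalar, thresholded)
    reflect(trajectory)    -> verbal feedback string
    """
    # Sliding window: once full, the oldest reflection silently falls off
    reflections = deque(maxlen=window)
    for trial in range(max_trials):
        # Actor conditions on the task plus every buffered reflection —
        # no retrieval step, the whole buffer is loaded into context
        trajectory = act(task, list(reflections))
        if evaluate(trajectory):
            return trajectory, trial + 1
        # Verbal feedback ("I failed because...") replaces a scalar reward
        reflections.append(reflect(trajectory))
    return None, max_trials
```

The `maxlen=window` line is exactly the weakness noted below: learning number `window + 1` evicts learning number one, with no promotion path for reflections worth keeping.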

Strengths

  • Extremely lightweight — no vector databases, no complex retrieval infrastructure
  • Verbal reinforcement is more informative than scalar rewards ("I failed because X" vs "reward = 0")
  • Learns from failure without any weight updates to the model
  • Reflections are human-readable, interpretable, and debuggable
  • Practical improvement: 91% pass@1 on HumanEval (vs 80% baseline)

Weaknesses

  • Fixed-size sliding window severely limits long-term memory capacity
  • No structured retrieval — as reflection count grows beyond the window, older learnings are lost
  • Reflections can be repetitive or contradictory without curation
  • No mechanism for generalizing across tasks — reflections are task-specific
  • No shared reflections across agents

Relevance to Kaze

The reflection pattern is lightweight and valuable for individual agent self-improvement. In Kaze, each agent could maintain a small reflection buffer (like Reflexion) for its current task context, while the best reflections get promoted to the shared knowledge graph as permanent insights.

The key insight: reflections are the raw material for knowledge graph updates. When an agent reflects "I failed because international invoices need special tax handling," that reflection should be evaluated and potentially promoted to a shared skill improvement.


2.5 CoALA — Cognitive Architectures for Language Agents

Paper: Cognitive Architectures for Language Agents (Sumers, Yao et al., Sept 2023)
Link: https://arxiv.org/abs/2309.02427
Published at: TMLR 2024

Core Idea

CoALA is not a system — it's a unifying framework that organizes all agent architectures into a coherent cognitive model. It provides the vocabulary and taxonomy that the entire field uses.

The Framework

┌─────────────────────────────────────────────────────────┐
│                     AGENT                                │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │              WORKING MEMORY                       │   │
│  │  Short-term scratchpad for current reasoning      │   │
│  │  = LLM context window + in-context state          │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │            LONG-TERM MEMORY                       │   │
│  │                                                    │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │   │
│  │  │  EPISODIC    │ │  SEMANTIC    │ │PROCEDURAL │ │   │
│  │  │              │ │              │ │           │ │   │
│  │  │  Past events │ │  Facts &     │ │ How-to    │ │   │
│  │  │  & experiences│ │  knowledge  │ │ knowledge │ │   │
│  │  └──────────────┘ └──────────────┘ └───────────┘ │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │              ACTION SPACE                         │   │
│  │                                                    │   │
│  │  Internal actions:     External actions:           │   │
│  │  - Reasoning           - Tool use                  │   │
│  │  - Memory retrieval    - API calls                 │   │
│  │  - Memory writing      - Environment interaction   │   │
│  │  - Learning                                        │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  Decision loop:                                          │
│  Observe → Retrieve → Reason → Decide → Act → Store     │
└─────────────────────────────────────────────────────────┘

Memory Type Definitions

Episodic Memory:

  • Records of specific past events and experiences
  • Typically stored as natural language with timestamps
  • Retrieved by recency, relevance, or explicit query
  • Examples: conversation logs, task outcomes, error reports
  • Paper mapping: Generative Agents' memory stream, Reflexion's reflection buffer

Semantic Memory:

  • Factual and declarative knowledge about the world
  • Can be stored as text, embeddings, knowledge graph triples, or structured data
  • Retrieved by semantic similarity, graph traversal, or direct lookup
  • Examples: domain facts, entity relationships, rules
  • Paper mapping: AriGraph's knowledge graph, RAG knowledge bases

Procedural Memory:

  • Knowledge of how to perform tasks — action sequences, policies, code
  • Can be stored as code (Voyager), prompts, tool-use patterns, or workflow definitions
  • Retrieved by task similarity
  • Examples: Voyager's skill library, prompt templates, tool chains
  • Paper mapping: Voyager's skills, ReAct-style action patterns
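As a design checklist, the three stores translate into distinct data shapes with distinct retrieval keys. A sketch in plain dataclasses; all field names are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicRecord:
    timestamp: str
    event: str          # "what happened", in natural language

@dataclass
class SemanticFact:
    subject: str
    relation: str
    obj: str            # facts as triples; text or embeddings also work

@dataclass
class Procedure:
    name: str
    steps: list         # action sequence, prompt template, or code

@dataclass
class LongTermMemory:
    """CoALA's long-term memory split into its three stores."""
    episodic: list = field(default_factory=list)
    semantic: list = field(default_factory=list)
    procedural: list = field(default_factory=list)
```

The point of separating the types is that each store gets its own write policy and retrieval index (episodic by time, semantic by similarity or graph, procedural by task match), instead of one undifferentiated memory blob.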

Key Insights from CoALA

  1. Most current systems are incomplete. Many agents have episodic memory but lack procedural memory. Few have all three types.
  2. Memory writing (learning) is as important as memory reading (retrieval). Most systems focus on retrieval; few have principled mechanisms for what to store and when.
  3. The decision between internal and external actions is fundamental. When should an agent think more vs. act? When should it retrieve from memory vs. use a tool?
  4. Consolidation is the missing piece. Most systems accumulate memory without principled compression or refinement.

Relevance to Kaze

CoALA is our design checklist. Our knowledge system must have distinct stores for all three memory types (episodic, semantic, procedural), with clear mechanisms for both reading and writing. The framework helps us avoid building a system that's strong on retrieval but weak on learning.


2.6 AriGraph — Knowledge Graph + Episodic Memory

Paper: AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents (Anokhin et al., July 2024)
Link: https://arxiv.org/abs/2407.04363
Published at: IJCAI 2025

Core Idea

Combines a structured semantic knowledge graph (entities + relationships) with episodic memory vertices (raw observations), linked together so you can always trace a fact back to its source.

Memory Architecture

Observation: "The key is on the table in the kitchen"

        ┌───────────┴──────────────┐
        │                          │
        ▼                          ▼
┌───────────────┐    ┌──────────────────────────────┐
│EPISODIC VERTEX│    │    SEMANTIC GRAPH UPDATE       │
│               │    │                                │
│ Full text of  │    │  (key)──[on]──▶(table)         │
│ observation   │    │  (table)──[in]──▶(kitchen)     │
│ + timestamp   │    │                                │
└───────┬───────┘    └──────────────┬─────────────────┘
        │                          │
        └──────── EPISODIC ────────┘
                   EDGES
          (link observation to
           extracted triplets
           for provenance)

At each timestep:

  1. A new episodic vertex is appended (containing the full textual observation)
  2. The LLM parses the observation to extract relationship triplets (entity1, relation, entity2)
  3. Triplets update the semantic memory graph (nodes = entities, edges = relationships)
  4. Episodic edges link each episodic vertex to the triplets it produced, preserving provenance
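The per-timestep update can be sketched as a dual store plus a provenance map. The class and method names are illustrative, and `extract_triplets` stands in for the LLM parsing step:

```python
class AriGraphSketch:
    """Sketch of AriGraph's episodic/semantic dual store (names illustrative)."""

    def __init__(self, extract_triplets):
        self.extract = extract_triplets  # stand-in for the LLM triplet parser
        self.episodic = []               # (timestep, raw observation text)
        self.semantic = set()            # (entity1, relation, entity2) triplets
        self.provenance = {}             # triplet -> index of source vertex

    def observe(self, timestep, text):
        # 1. Append the raw observation as an episodic vertex
        idx = len(self.episodic)
        self.episodic.append((timestep, text))
        # 2-4. Extract triplets, update the semantic graph, record the
        # episodic edge linking each fact back to its source
        for triplet in self.extract(text):
            self.semantic.add(triplet)
            self.provenance[triplet] = idx

    def why(self, triplet):
        """Trace a semantic fact back to the raw observation that produced it."""
        return self.episodic[self.provenance[triplet]][1]
```

`why()` is the auditability property called out below: any derived fact answers "because this observation, at this timestep".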

Retrieval

Three strategies combined:

  • Graph traversal — follow edges to reason about multi-hop relationships ("what room is the key in?" → key→on→table→in→kitchen)
  • Embedding similarity — for fuzzy matching when exact graph paths don't exist
  • Provenance lookup — trace any semantic fact back to the raw observation that produced it

Strengths

  • Structured representation enables multi-hop reasoning that flat text retrieval cannot do
  • Dual episodic/semantic storage preserves both raw data and derived knowledge
  • Provenance tracking — episodic edges linking to triplets support full auditability ("why does the system believe X? Because Agent Y observed Z on date W")
  • Outperforms both flat memory and pure RL baselines on complex reasoning tasks

Weaknesses

  • Triplet extraction quality depends heavily on LLM parsing accuracy — noisy extraction leads to noisy graphs
  • Graph can grow very large with no built-in pruning or consolidation
  • Primarily tested on text adventure games — real-world knowledge has more complex relationships than triplets capture
  • No multi-agent or shared graph mechanisms

Relevance to Kaze

High relevance. The dual episodic/semantic model with provenance is exactly what we need:

  • Semantic graph = our vertical knowledge ("SEO best practices", "keyword→topic relationships")
  • Episodic vertices = agent experiences ("Agent A processed Client B's audit on date C")
  • Provenance edges = the link between them ("we know this SEO practice works because Agent A's audit showed these results")

This enables a knowledge system where every fact can be traced back to its source — critical for trust, debugging, and quality control.


2.7 A-MEM — Zettelkasten-Inspired Agentic Memory

Paper: A-MEM: Agentic Memory for LLM Agents (Xu et al., Feb 2025)
Link: https://arxiv.org/abs/2502.12110

Core Idea

Inspired by the Zettelkasten (slip-box) method of note-taking — a system of interconnected, atomic notes with bidirectional links. Each memory is enriched into a structured note with relationships to other notes, creating a knowledge network.

Memory Architecture

┌──────────────────────────────────────────────────┐
│                  NOTE NETWORK                      │
│                                                    │
│  ┌──────────┐      ┌──────────┐                   │
│  │ Note A   │──────│ Note B   │                   │
│  │ Keywords │      │ Keywords │                   │
│  │ Tags     │◀─────│ Tags     │                   │
│  │ Context  │      │ Context  │                   │
│  └────┬─────┘      └──────────┘                   │
│       │                                            │
│       │ causal link                                │
│       ▼                                            │
│  ┌──────────┐      ┌──────────┐                   │
│  │ Note C   │──────│ Note D   │                   │
│  │          │      │          │                   │
│  └──────────┘      └──────────┘                   │
│                                                    │
│  Notes linked by: causal, conceptual,              │
│  semantic, temporal relationships                  │
└──────────────────────────────────────────────────┘

How It Works

  1. Note Construction: When a new memory arrives, the LLM enriches it into a structured note:

    • Core content (the memory itself)
    • Keywords (extracted key terms)
    • Tags (categorical labels)
    • Contextual description (expanded context)
    • Embedding vector (for similarity search)
  2. Link Generation: The LLM analyzes relationships between the new note and existing notes, identifying:

    • Causal links ("A caused B")
    • Conceptual links ("A is related to B conceptually")
    • Semantic links ("A and B discuss the same topic")
    • Temporal links ("A happened before B")
  3. Memory Evolution: Notes can be updated, re-linked, and reorganized over time as understanding deepens.

  4. Retrieval (Spreading Activation): When querying:

    • The query is embedded and matched against note vectors
    • When a note is retrieved, its linked notes are also automatically surfaced
    • This "spreading activation" discovers non-obvious connections that pure embedding similarity would miss
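The note/link/spread pipeline above can be sketched compactly. `enrich` and `find_links` are hypothetical callables standing in for A-MEM's LLM steps (note construction and link generation); the class layout is illustrative:

```python
class NoteNetwork:
    """Sketch of A-MEM's Zettelkasten-style note network (names illustrative)."""

    def __init__(self, enrich, find_links):
        self.enrich = enrich          # stand-in: content -> structured note
        self.find_links = find_links  # stand-in: note x existing -> typed links
        self.notes = {}               # note_id -> enriched note
        self.links = {}               # note_id -> [(other_id, link_type)]

    def add_note(self, note_id, content):
        note = self.enrich(content)                  # keywords, tags, context...
        linked = self.find_links(note, self.notes)   # against existing notes only
        self.notes[note_id] = note
        self.links[note_id] = []
        for other_id, link_type in linked:
            # Links are bidirectional, as in a Zettelkasten
            self.links[note_id].append((other_id, link_type))
            self.links[other_id].append((note_id, link_type))

    def retrieve(self, match_ids):
        """Spreading activation: surface matched notes plus their neighbors."""
        surfaced = set(match_ids)
        for nid in match_ids:
            surfaced.update(other for other, _ in self.links.get(nid, []))
        return surfaced
```

`retrieve` takes the IDs that embedding search matched and widens the result set one hop along the links, which is how linked-but-lexically-different notes get surfaced.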

Strengths

  • Rich interconnected structure enables discovery of non-obvious relationships
  • LLM-driven linking captures nuanced relationships that embedding similarity alone misses
  • Doubles performance on complex multi-hop reasoning tasks compared to flat retrieval
  • Cost-effective despite the multiple LLM calls during note processing
  • The Zettelkasten model is essentially a personal wiki — proven for human knowledge management over decades

Weaknesses

  • Multiple LLM calls per memory operation (note construction + link generation) adds latency
  • Link quality depends entirely on LLM reasoning quality — garbage in, garbage out
  • No built-in access control or multi-agent sharing mechanisms
  • No forgetting or consolidation — the network grows indefinitely
  • Can become computationally expensive for spreading activation as the network grows

Relevance to Kaze

Very high relevance. The Zettelkasten model is the closest existing pattern to a "Wikipedia for agents":

  • Each knowledge article is a note with structured metadata (keywords, tags, context)
  • Articles are interlinked with typed relationships (causal, conceptual, etc.)
  • Retrieval uses spreading activation — finding related knowledge, not just matching keywords
  • The structure supports human browsing (via links and tags) as well as AI retrieval (via embeddings)

The key extension we need: multi-author provenance, access control, and quality gates for shared notes.


2.8 Collaborative Memory — Multi-Agent Shared Memory with Access Control

Paper: Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control (Rezazadeh et al., May 2025)
Link: https://arxiv.org/abs/2505.18279

Core Idea

The first paper to explicitly address shared memory across multiple agents/users with access control. Introduces a two-tier memory system with dynamic permission management.

Memory Architecture

┌──────────────────────────────────────────────────┐
│           COLLABORATIVE MEMORY SYSTEM            │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │            PRIVATE MEMORY TIER             │  │
│  │                                            │  │
│  │  Agent A's private fragments               │  │
│  │  Agent B's private fragments               │  │
│  │  (visible only to originator)              │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │             SHARED MEMORY TIER             │  │
│  │                                            │  │
│  │  Shared fragments with access policies     │  │
│  │  (selectively visible based on permissions)│  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │       DYNAMIC BIPARTITE ACCESS GRAPH       │  │
│  │                                            │  │
│  │  Users ◄──────────► Agents                 │  │
│  │    │                    │                  │  │
│  │    └──── Resources ─────┘                  │  │
│  │                                            │  │
│  │  Time-evolving graph linking users,        │  │
│  │  agents, and memory resources              │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  PROVENANCE: Each fragment carries immutable     │
│  metadata — contributing agents, accessed        │
│  resources, timestamps                           │
└──────────────────────────────────────────────────┘

Key Mechanisms

  1. Two-tier storage: Every memory fragment is either private (agent-local) or shared (published to a collaborative pool). Agents control what they share.

  2. Dynamic Bipartite Access Graphs: Time-evolving graphs that model who can access what. The graph links:

    • Users → Agents (which agents work for which users)
    • Agents → Resources (which memory fragments an agent can access)
    • Users → Resources (what a user has contributed or can view)
  3. Attribute-Based Access Control (ABAC): Permissions are based on attributes (role, project, vertical, client) rather than static role lists. Policies are configurable at system, user, or agent level.

  4. Immutable Provenance: Every memory fragment carries:

    • Which agent(s) contributed it
    • What resources were accessed to create it
    • Timestamps of creation and modification
    • Source attribution chain
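The mechanisms above can be sketched with plain data classes. This is a minimal illustration under stated assumptions, not the paper's implementation: the names (`MemoryFragment`, `Provenance`, `can_access`) and the exact policy-matching rule (every fragment attribute must match the requesting agent's attributes) are ours.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    """Immutable provenance metadata carried by every fragment."""
    contributing_agents: tuple[str, ...]
    accessed_resources: tuple[str, ...]
    created_at: datetime

@dataclass(frozen=True)
class MemoryFragment:
    content: str
    tier: str                               # "private" or "shared"
    owner: str                              # originating agent
    attributes: frozenset[tuple[str, str]]  # e.g. {("vertical", "seo")}
    provenance: Provenance

def can_access(fragment: MemoryFragment, agent: str,
               agent_attrs: dict[str, str]) -> bool:
    """ABAC check: private fragments are owner-only; shared fragments
    require the agent's attributes to satisfy the fragment's policy."""
    if fragment.tier == "private":
        return agent == fragment.owner
    # Every attribute on the fragment must match the requesting agent's.
    return all(agent_attrs.get(k) == v for k, v in fragment.attributes)

frag = MemoryFragment(
    content="E-commerce clients respond best to long-tail keywords.",
    tier="shared",
    owner="seo-agent-1",
    attributes=frozenset({("vertical", "seo")}),
    provenance=Provenance(("seo-agent-1",), ("crm://client-42",),
                          datetime.now(timezone.utc)),
)
assert can_access(frag, "seo-agent-2", {"vertical": "seo"})
assert not can_access(frag, "crm-agent-1", {"vertical": "crm"})
```

The frozen dataclasses make both fragment and provenance immutable after creation, mirroring the paper's append-only provenance requirement.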

Strengths

  • First real solution for multi-agent shared memory with proper access control
  • Dynamic permissions adapt as agents and users change roles
  • Provenance tracking enables full auditability
  • ABAC model is flexible enough for complex organizational structures
  • Private/shared distinction maps naturally to agent-local vs. vertical knowledge

Weaknesses

  • Paper is theoretical — limited real-world validation at scale
  • Access graph can become complex to manage in large deployments
  • No built-in quality gate for shared memory (any agent can publish)
  • No consolidation or contradiction resolution between conflicting shared memories

Relevance to Kaze

Directly applicable. This paper addresses our exact problem — multiple agents sharing knowledge with:

  • Private tier = agent working memory and client-specific knowledge
  • Shared tier = vertical knowledge, cross-vertical patterns
  • Access control = client isolation (Client A's data never visible to Client B's agents), vertical scoping, role-based permissions
  • Provenance = every knowledge update traced to its source agent, client context, and evidence

2.9 MIRIX — Six-Component Multi-Agent Memory

Paper: MIRIX: Multi-Agent Memory System for LLM-Based Agents (Wang & Chen, July 2025)
Link: https://arxiv.org/abs/2507.07957

Core Idea

The most comprehensive modular memory system in the literature, decomposing agent memory into six distinct components, each managed by a dedicated Memory Manager agent.

Memory Architecture

┌────────────────────────────────────────────────────────┐
│                      MIRIX SYSTEM                      │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │               META MEMORY MANAGER                │  │
│  │     Routes queries to appropriate component      │  │
│  └──────────┬───────────────────────────────────────┘  │
│             │                                          │
│  ┌──────────┴───────────────────────────────────────┐  │
│  │                                                  │  │
│  │  ┌──────────────┐  ┌──────────────┐              │  │
│  │  │ CORE MEMORY  │  │   EPISODIC   │              │  │
│  │  │              │  │    MEMORY    │              │  │
│  │  │ Persistent   │  │              │              │  │
│  │  │ personalized │  │ Time-stamped │              │  │
│  │  │ data (user & │  │ user-specific│              │  │
│  │  │ agent        │  │ events       │              │  │
│  │  │ profiles)    │  │              │              │  │
│  │  └──────────────┘  └──────────────┘              │  │
│  │                                                  │  │
│  │  ┌──────────────┐  ┌──────────────┐              │  │
│  │  │   SEMANTIC   │  │  PROCEDURAL  │              │  │
│  │  │    MEMORY    │  │    MEMORY    │              │  │
│  │  │              │  │              │              │  │
│  │  │ General      │  │ Actionable   │              │  │
│  │  │ knowledge &  │  │ workflows &  │              │  │
│  │  │ social graphs│  │ scripts      │              │  │
│  │  └──────────────┘  └──────────────┘              │  │
│  │                                                  │  │
│  │  ┌──────────────┐  ┌──────────────┐              │  │
│  │  │   RESOURCE   │  │  KNOWLEDGE   │              │  │
│  │  │    MEMORY    │  │    VAULT     │              │  │
│  │  │              │  │              │              │  │
│  │  │ Documents,   │  │ Critical     │              │  │
│  │  │ files, and   │  │ verbatim     │              │  │
│  │  │ media        │  │ information  │              │  │
│  │  └──────────────┘  └──────────────┘              │  │
│  │                                                  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│  Each component has its own Memory Manager agent       │
│  Meta Memory Manager coordinates routing               │
└────────────────────────────────────────────────────────┘

The Six Components

| Component | What It Stores | Retrieval | Update Frequency |
|---|---|---|---|
| Core Memory | Agent/user profiles, persistent personalized data | Direct lookup by entity | Infrequent (profile changes) |
| Episodic Memory | Time-stamped user-specific events | Recency + relevance scoring | Per-interaction |
| Semantic Memory | General knowledge, social graphs, domain facts | Graph traversal + similarity | As new knowledge is discovered |
| Procedural Memory | Workflows, scripts, action templates | Task similarity matching | When procedures are learned/updated |
| Resource Memory | Documents, files, media referenced by agents | Metadata + content search | When new resources are ingested |
| Knowledge Vault | Critical verbatim information (addresses, credentials, exact quotes) | Exact match + keyword lookup | Rare (critical data changes slowly) |

Meta Memory Manager

The routing layer that decides which component(s) to query for a given request. Uses intent classification to route:

  • "What's the user's email?" → Knowledge Vault
  • "What happened last Tuesday?" → Episodic Memory
  • "How do I run a site audit?" → Procedural Memory
  • "What does this document say about X?" → Resource Memory
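The routing examples above can be sketched as a dispatch table. A real Meta Memory Manager would use LLM intent classification; the keyword heuristic below merely stands in for it, and both the route names and the keyword lists are illustrative assumptions.

```python
# Keyword-based stand-in for LLM intent classification in a
# MIRIX-style Meta Memory Manager. Routing rules are illustrative.
ROUTES = {
    "knowledge_vault":   ("email", "address", "credential", "phone"),
    "episodic_memory":   ("happened", "yesterday", "last", "when did"),
    "procedural_memory": ("how do i", "steps", "run", "procedure"),
    "resource_memory":   ("document", "file", "manual", "say about"),
}

def route(query: str) -> str:
    """Return the memory component most likely to answer the query,
    falling back to semantic memory for general knowledge."""
    q = query.lower()
    for component, keywords in ROUTES.items():
        if any(kw in q for kw in keywords):
            return component
    return "semantic_memory"

assert route("What's the user's email?") == "knowledge_vault"
assert route("What happened last Tuesday?") == "episodic_memory"
assert route("How do I run a site audit?") == "procedural_memory"
assert route("What does this document say about X?") == "resource_memory"
```

The point of the pattern is that calling agents never need to know which store to query; they submit a request and the router decides.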

Performance

  • 35% improvement over RAG baselines
  • 99.9% storage reduction compared to storing raw conversation history
  • 85.38% accuracy on LOCOMO benchmark (state of the art at time of publication)

Strengths

  • Most granular memory decomposition — each type is optimized for its specific access patterns
  • Memory Manager agents provide intelligent routing (no need for the calling agent to know which component to query)
  • Modular — components can be independently scaled, optimized, or replaced
  • Storage efficiency from proper categorization (not everything needs vector embedding)

Weaknesses

  • Six components + six managers + one meta-manager = significant system complexity
  • Manager agents themselves consume LLM tokens for routing decisions
  • No multi-tenant or shared memory across agents/users
  • Coordination between managers can introduce latency
  • May be over-engineered for simpler use cases

Relevance to Kaze

MIRIX provides a practical template for decomposing our knowledge system. Not all six components may be needed immediately, but the taxonomy is valuable:

  • Core Memory → our agent and client profiles
  • Episodic Memory → our agent execution logs and experience records
  • Semantic Memory → our vertical knowledge graph
  • Procedural Memory → our agent skills library
  • Resource Memory → documents, SOPs, manuals that agents reference
  • Knowledge Vault → client credentials, critical business data (handled by Vault in our stack)

The Meta Memory Manager pattern is interesting — a routing agent that knows which knowledge store to query. This could be a component of our orchestration layer.


2.10 Memory in the Age of AI Agents — Survey Paper

Paper: Memory in the Age of AI Agents (Liu et al., Dec 2025)
Link: https://arxiv.org/abs/2512.13564

Core Contribution

The most comprehensive survey of agent memory, proposing a three-axis taxonomy:

Axis 1: Forms (how memory is stored)

  • Token-level: In-context memory (the conversation itself)
  • Parametric: Fine-tuned into model weights (LoRA, etc.)
  • Latent: Stored as embedding vectors or hidden states

Axis 2: Functions (what memory is for)

  • Factual: Knowledge about the world
  • Experiential: Records of past events and outcomes
  • Working: Active task context

Axis 3: Dynamics (how memory changes)

  • Formation: How new memories are created
  • Evolution: How memories are updated, consolidated, or forgotten
  • Retrieval: How memories are accessed when needed

Key Findings from the Survey

  1. Most systems focus on retrieval, neglecting formation and evolution. Building good retrieval is necessary but not sufficient — you also need principled mechanisms for deciding what to remember and how to update existing knowledge.

  2. Consolidation is the biggest gap. Most systems accumulate memories indefinitely. The episodic-to-semantic consolidation pathway (turning raw experiences into structured knowledge) is the key unsolved problem for scalable shared knowledge.

  3. Evaluation benchmarks are immature. There's no standard benchmark for long-term agent memory quality, making it hard to compare approaches objectively.

  4. Multi-agent memory is nascent. Shared memory, knowledge transfer across agents, and collaborative knowledge building are identified as frontier research problems.


3. Cross-Cutting Patterns from Research

3.1 Memory Taxonomy Convergence

The field has converged on 3-6 memory types rooted in cognitive science:

| Memory Type | Papers That Use It | Storage Approach | Retrieval Approach |
|---|---|---|---|
| Episodic | Generative Agents, AriGraph, CoALA, MIRIX, Collaborative Memory | Timestamped natural language entries | Recency + relevance + importance |
| Semantic | AriGraph, CoALA, MIRIX, A-MEM | Knowledge graph triples, structured notes | Graph traversal + embedding similarity |
| Procedural | Voyager, CoALA, MIRIX | Executable code, workflow definitions | Task similarity matching |
| Working | MemGPT (core memory), CoALA | Context window contents | Always in-context |
| Reflective | Generative Agents, Reflexion | Natural language insights | Loaded directly or by relevance |

3.2 Retrieval Strategy Spectrum

From simple to sophisticated:

| Level | Strategy | Used By | Complexity | Quality |
|---|---|---|---|---|
| 1 | Full context loading | Reflexion | Trivial | Only works with tiny memory |
| 2 | Embedding similarity only | Voyager | Low | Misses temporal/importance signals |
| 3 | Tri-factor scoring (recency + importance + relevance) | Generative Agents | Medium | De facto standard, good baseline |
| 4 | Agent-initiated retrieval (LLM decides when/what to search) | MemGPT/Letta | Medium | More flexible, but depends on LLM judgment |
| 5 | Graph traversal + spreading activation | AriGraph, A-MEM | High | Best for multi-hop reasoning, discovers non-obvious connections |
| 6 | Learned/adaptive retrieval (dynamic weight adjustment) | Emerging research | High | Frontier — not yet proven at scale |

For Kaze: We should implement Level 3 (tri-factor) as baseline, with Level 5 (graph traversal) for the structured knowledge graph, and Level 4 (agent-initiated) for agent autonomy over their own memory.
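The Level 3 baseline can be sketched in a few lines, loosely following the Generative Agents formulation: exponentially decayed recency, an importance score normalized from a 1-10 scale, and embedding relevance, summed with configurable weights. The decay constant (0.995 per hour) and equal default weights are assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score(memory: dict, query_vec: list[float], now_hours: float,
          w_recency: float = 1.0, w_importance: float = 1.0,
          w_relevance: float = 1.0) -> float:
    """Tri-factor score: decayed recency + normalized importance + relevance."""
    recency = 0.995 ** (now_hours - memory["created_hours"])
    importance = memory["importance"] / 10.0   # normalize 1-10 to 0-1
    relevance = cosine(memory["embedding"], query_vec)
    return w_recency * recency + w_importance * importance + w_relevance * relevance

memories = [
    {"created_hours": 0.0,  "importance": 3, "embedding": [1.0, 0.0]},
    {"created_hours": 99.0, "importance": 9, "embedding": [0.9, 0.1]},
]
query = [1.0, 0.0]
best = max(memories, key=lambda m: score(m, query, now_hours=100.0))
assert best["importance"] == 9  # recent, important memory outranks the stale one
```

Tuning the three weights per memory type (e.g. heavier recency for episodic, heavier relevance for semantic) is where most of the practical gains come from.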

3.3 Memory Consolidation Patterns

Three pathways identified across the literature:

  1. Episodic-to-Semantic Consolidation: Raw experiences are distilled into general knowledge.

    • Generative Agents: many observations → reflections → beliefs
    • AriGraph: observations → triplet extraction → knowledge graph
    • For Kaze: Agent experiences → quality-checked insights → vertical knowledge graph updates
  2. Summarization / Compression: Long content is compressed to preserve signal while reducing tokens.

    • MemGPT: recursive summarization of evicted context
    • For Kaze: Conversation histories and execution logs → compressed summaries → retained in episodic memory
  3. Experience Distillation: Interaction trajectories are converted into reusable procedures.

    • Voyager: gameplay → verified skill code
    • Reflexion: trial trajectories → verbal reflections
    • For Kaze: Agent task execution → verified skill improvements → skill library updates
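The episodic-to-semantic pathway can be sketched as a support-counting promotion rule: a claim enters the semantic store once enough independent episodes assert it and no episode contradicts it. This is a toy stand-in for LLM-driven distillation; the support threshold and the negation check are illustrative assumptions, not from any single paper.

```python
from collections import Counter

def consolidate(episodes: list[dict], min_support: int = 3) -> list[str]:
    """Promote a claim to the semantic store when at least `min_support`
    episodes assert it and no episode asserts its negation."""
    support = Counter(e["claim"] for e in episodes)
    contradicted = {e["negates"] for e in episodes if e.get("negates")}
    return [claim for claim, n in support.items()
            if n >= min_support and claim not in contradicted]

episodes = [
    {"claim": "clients prefer bullet-point reports"},
    {"claim": "clients prefer bullet-point reports"},
    {"claim": "clients prefer bullet-point reports"},
    {"claim": "long-tail keywords convert well"},  # only one observation
]
assert consolidate(episodes) == ["clients prefer bullet-point reports"]
```

In a real pipeline the claims themselves would be extracted by an LLM from raw execution logs; the promotion gate is what keeps one-off observations out of shared knowledge.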

3.4 Multi-Agent Knowledge Sharing Patterns

Four emerging approaches:

| Pattern | Description | Strongest Example | Maturity |
|---|---|---|---|
| Shared Blackboard | Agents read/write to a common database | Basic multi-agent systems | Mature but simplistic |
| Provenance-Tracked Fragments | Each memory carries metadata about origin | Collaborative Memory (2025) | Emerging |
| Git-Based Versioned Knowledge | Knowledge versioned like code with branching and merging | Letta Context Repositories (2026) | Emerging |
| Orchestrator Merges | A meta-agent merges specialist contributions into global store | Various multi-agent coding systems | Moderate |

For Kaze: Combine provenance-tracked fragments (Collaborative Memory) with git-like versioning (Letta) for the shared knowledge layer.

3.5 Synthesis: Design Principles for Kaze's Knowledge System

Drawing from the entire literature:

  1. Type your knowledge. Distinct stores for episodic, semantic, procedural, and reflective memory. Each has different storage, retrieval, and lifecycle characteristics.

  2. Layer your retrieval. Tri-factor scoring (recency + importance + relevance) as baseline, graph traversal for structured knowledge, agent-initiated search for autonomy.

  3. Version everything. Every write is a versioned commit with agent identity, timestamp, and source attribution. Treat knowledge like code — branchable, mergeable, diffable.

  4. Track provenance. Every fact traces back to its source observation, agent, and evidence chain. Essential for trust, debugging, and quality control.

  5. Gate quality. New knowledge goes through verification before entering the shared store (Voyager's self-verification, Wikipedia's editorial process). The governance layer reviews proposed knowledge changes.

  6. Consolidate actively. Don't just accumulate — periodically distill experiences into knowledge, compress stale memories, detect and resolve contradictions.

  7. Control access. Private (agent-local) vs. shared (vertical) vs. global (cross-vertical). Dynamic, attribute-based access control per the Collaborative Memory model.
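Principles 3 and 4 together suggest an append-only, hash-chained commit log for knowledge writes. The sketch below is a minimal git-style illustration under our own assumptions; the field names and hashing scheme are ours, not from any of the surveyed systems.

```python
import hashlib
import json
from datetime import datetime, timezone

def commit(log: list[dict], article_id: str, body: str,
           agent: str, reason: str) -> dict:
    """Append a versioned knowledge write. Each record chains to its
    parent by content hash, so history is diffable and tamper-evident."""
    parent = log[-1]["hash"] if log else None
    record = {
        "article_id": article_id,
        "body": body,
        "agent": agent,          # who
        "reason": reason,        # why
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when
        "parent": parent,        # links this version to the previous one
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

log: list[dict] = []
commit(log, "seo/keywords", "Target long-tail terms.", "seo-agent-1", "initial")
commit(log, "seo/keywords", "Target long-tail terms; verify volume.",
       "seo-agent-2", "added verification step")
assert log[1]["parent"] == log[0]["hash"]  # full lineage is recoverable
```

Walking the parent chain recovers the full edit history of an article, which is exactly the provenance needed for debugging and quality review.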


4. Open Source Tooling Survey

4.1 Dedicated Agent Memory Frameworks

Mem0

  • What: Universal memory layer for AI agents. Extracts, consolidates, and retrieves information from conversations. Enhanced variant (Mem0g) uses graph-based memory for relational reasoning.
  • Source: https://github.com/mem0ai/mem0
  • License: Apache 2.0 (open source) + managed cloud offering
  • Self-hostable: Yes — supports Kubernetes, air-gapped servers, private clouds
  • Maturity: High — 186M API calls/quarter (Q3 2025), $24M raised (Oct 2025), AWS partnership (exclusive memory provider for AWS Agent SDK)
  • Key features:
    • Single API: mem0.add(), mem0.search(), mem0.get_all()
    • Automatic memory extraction from conversations
    • Graph memory variant for temporal and relational reasoning
    • SOC 2 & HIPAA compliant, BYOK (Bring Your Own Key)
    • Every memory is timestamped, versioned, and exportable
    • 26% accuracy improvement, 91% lower p95 latency, 90% token savings vs. raw context
  • Limitations:
    • Primarily designed for per-user/per-agent memory, not a shared knowledge graph
    • Graph memory (Mem0g) is newer and less battle-tested
    • Managed cloud has the most features; open source version may lag
  • Fit for Kaze: Strong candidate for per-agent memory layer. Handles the working memory and episodic memory needs well. Not a replacement for the shared vertical knowledge system — it's the memory inside each agent, not the wiki.

Letta (formerly MemGPT)

  • What: Platform for building stateful agents with OS-inspired tiered memory (core/recall/archival). Built by the MemGPT research team.
  • Source: https://github.com/letta-ai/letta
  • License: Apache 2.0
  • Self-hostable: Yes — Docker/Kubernetes
  • Maturity: High — active open source community, 100+ contributors, #1 on Terminal-Bench for model-agnostic coding agents
  • Key features:
    • Three-tier memory: core (always in context), recall (conversation history), archival (long-term)
    • Agent-initiated memory management via tool calls
    • Context Repositories (Feb 2026): git-backed knowledge versioning
    • Conversations API: shared memory across parallel agent interactions
    • Multi-model support (works with any LLM)
  • Limitations:
    • More of a full agent framework than a standalone memory component
    • Adopting Letta means adopting its agent runtime model (may conflict with building our own)
    • Context Repositories are very new (Feb 2026)
  • Fit for Kaze: Strong architectural inspiration, especially Context Repositories. The git-based versioning pattern should inform our knowledge system design. As a direct dependency, it may be too opinionated — we'd be building our agent runtime on top of theirs rather than our own.

Cognee

  • What: Knowledge engine for AI agents. Takes documents, generates knowledge graphs from them, queries via semantic + graph traversal. Auto-learns from feedback.
  • Source: https://github.com/topoteretes/cognee
  • License: Apache 2.0
  • Self-hostable: Yes — containerized, integrates with various graph DBs (Neo4j, Memgraph, FalkorDB) and vector stores
  • Maturity: Medium — growing community, GitHub Secure Open Source certified, active development
  • Key features:
    • Automated knowledge graph construction from documents (5 lines of code)
    • Combines embeddings with graph-based extraction (subject-relation-object triplets)
    • Feedback loop: rated responses update edge weights, improving retrieval over time ("Memify" feature)
    • Lexical chunk retrieval + temporal awareness
    • Pluggable backends (multiple graph DBs and vector stores)
  • Limitations:
    • Less mature than Mem0 or Letta
    • Quality of extracted knowledge graphs depends on source document quality
    • Smaller community and ecosystem
  • Fit for Kaze: Strong candidate as the knowledge graph construction pipeline. It automates the process of turning documents, SOPs, and manuals into structured knowledge graphs — exactly what we need when onboarding a new vertical. The feedback loop (edge weight updates from usage) aligns with our self-improvement goals.

LangChain/LangGraph Memory

  • What: Memory modules for LangChain-based agents — conversation buffer, summary memory, entity memory, vector-backed retrieval.
  • Source: https://github.com/langchain-ai/langchain
  • License: MIT
  • Self-hostable: Yes (library, not infrastructure)
  • Maturity: High ecosystem adoption, but memory modules are relatively basic
  • Key features:
    • ConversationBufferMemory, ConversationSummaryMemory, EntityMemory
    • Vector store-backed retrieval memory
    • LangGraph adds stateful graph-based workflows with persistence
  • Limitations:
    • Memory modules are designed for single-agent conversation memory, not shared knowledge
    • Tightly coupled to LangChain abstractions
    • No knowledge graph construction, no provenance tracking, no access control
    • Basic compared to Mem0, Letta, or Cognee
  • Fit for Kaze: Weak. Too basic and too LangChain-specific. Individual features (vector retrieval, summary compression) are useful patterns but better implemented directly or via more capable tools.

Microsoft GraphRAG

  • What: A pipeline that auto-extracts knowledge graphs from text documents, builds community hierarchies, and generates summaries for RAG-based querying.
  • Source: https://github.com/microsoft/graphrag
  • License: MIT
  • Self-hostable: Yes — Python package, runs anywhere
  • Maturity: High — backed by Microsoft Research, active development, growing adoption
  • Key features:
    • Automated entity and relationship extraction from text
    • Community detection and hierarchical clustering
    • Multi-level summarization (local and global)
    • Designed for complex "global" queries across large document collections
  • Limitations:
    • Knowledge graph extraction costs 3-5x more than baseline RAG (heavy LLM usage)
    • Designed for document collections, not real-time agent memory
    • No built-in agent integration — it's a pipeline, not a runtime
    • No incremental updates — reprocessing the full corpus for changes is expensive
  • Fit for Kaze: Strong as a knowledge graph construction tool, especially for initial onboarding of verticals (converting existing SOPs, manuals, and documentation into structured knowledge). Less suitable for real-time agent memory. Could complement Cognee — use GraphRAG for initial bulk knowledge graph construction, Cognee for ongoing incremental updates.

4.2 Vector Databases

Pgvector (PostgreSQL Extension)

  • What: Adds vector data types, similarity search (cosine, L2, inner product), and indexing (IVFFlat, HNSW) to PostgreSQL.
  • Source: https://github.com/pgvector/pgvector
  • License: PostgreSQL License (permissive)
  • Self-hostable: Yes — it's PostgreSQL
  • K8s: Native via CloudNativePG operator (already in our stack)
  • Maturity: High — widely adopted, well-maintained, battle-tested
  • Performance: Good for up to ~100M vectors. pgvectorscale extension achieves 471 QPS at 99% recall on 50M vectors.
  • Key strength: Zero new infrastructure — adds vector capabilities to the Postgres we already have. SQL joins across relational + vector data.
  • Key weakness: Not as performant as dedicated vector databases at extreme scale. Limited to what Postgres can handle.
  • Fit for Kaze: Best starting point. We already committed to PostgreSQL. Adding pgvector means vector search with no new infrastructure. Sufficient for early-to-mid scale. Can be complemented with a dedicated vector DB later if needed.

Qdrant

  • What: Purpose-built high-performance vector database written in Rust. Designed for production AI workloads.
  • Source: https://github.com/qdrant/qdrant
  • License: Apache 2.0
  • Self-hostable: Yes — Docker, Kubernetes (Helm charts available)
  • K8s: Native support, horizontal scaling
  • Maturity: High — production-proven, growing enterprise adoption
  • Performance: Fastest in benchmarks for pure vector search. Excellent filtering performance with payload indexing.
  • Key strength: Best pure performance, advanced filtering and payload indexing, Rust-native stability
  • Key weakness: Another system to operate. Adds operational complexity over pgvector.
  • Fit for Kaze: Scale-up option. If pgvector becomes a performance bottleneck (likely at >100M vectors or high QPS requirements), Qdrant is the best upgrade path. Keep the architecture compatible from day one.

Weaviate

  • What: Vector database with built-in hybrid search (vector + keyword), graph-like object references, and multi-modal support (text, image, video).
  • Source: https://github.com/weaviate/weaviate
  • License: BSD-3-Clause
  • Self-hostable: Yes — Docker, Kubernetes
  • K8s: Native support
  • Maturity: High — well-funded, large community
  • Key strength: Hybrid search out of the box. Built-in cross-references between objects (graph-like). GraphQL API.
  • Key weakness: Slower than Qdrant in pure vector benchmarks. Additional graph features add complexity. gRPC + GraphQL learning curve.
  • Fit for Kaze: Interesting for its hybrid search and cross-reference features, but adds complexity we may not need if we're using a separate graph database. The cross-reference feature partially overlaps with our knowledge graph needs.

Chroma

  • What: Simplest vector database API. Designed for developer experience and rapid prototyping.
  • Source: https://github.com/chroma-core/chroma
  • License: Apache 2.0
  • Self-hostable: Yes — but Kubernetes story is less mature
  • Maturity: Medium — popular for prototyping, less proven at production scale
  • Fit for Kaze: Too lightweight for production multi-tenant deployments. Good for local development/testing.

Milvus

  • What: Cloud-native vector database designed for massive scale. Supports billion-scale vector datasets.
  • Source: https://github.com/milvus-io/milvus
  • License: Apache 2.0
  • Self-hostable: Yes — Kubernetes-native (distributed architecture with separate storage and compute)
  • Maturity: High — CNCF graduated project, large community
  • Key strength: Designed for billion-scale. Distributed architecture. Strong Kubernetes story.
  • Key weakness: Complex to operate — many moving parts (proxy, query node, data node, index node, etcd, MinIO, Pulsar). Overkill for early scale.
  • Fit for Kaze: Too heavy for MVP. Worth revisiting if vector search needs exceed Qdrant's single-node capacity.

Pinecone

  • What: Fully managed vector database. Zero ops.
  • License: Proprietary (managed only)
  • Self-hostable: No
  • Fit for Kaze: Fails cloud-agnostic requirement. Cannot self-host or deploy in customer VPC.

4.3 Graph Databases

Apache AGE (PostgreSQL Extension)

  • What: Adds graph database capabilities to PostgreSQL. Supports the openCypher query language for graph queries alongside standard SQL.
  • Source: https://github.com/apache/age
  • License: Apache 2.0
  • Self-hostable: Yes — it's PostgreSQL
  • K8s: Native via CloudNativePG (same as pgvector)
  • Maturity: Medium — Apache incubating project, growing community, but less mature than Neo4j
  • Key strength: Same rationale as pgvector — adds graph capabilities to our existing Postgres. No new infrastructure. Can query graph data alongside relational and vector data in the same database.
  • Key weakness: Less performant than native graph databases for complex traversals. Cypher support is partial (not full Neo4j Cypher). Smaller ecosystem and community than Neo4j. Less mature graph-specific optimizations.
  • Fit for Kaze: Best starting point for the knowledge graph. Combined with pgvector, gives us relational + vector + graph in one Postgres instance. Sufficient for early-to-mid scale knowledge graphs. Upgrade to a dedicated graph DB only if performance demands it.

FalkorDB

  • What: High-performance graph database built for GraphRAG and AI workloads. Successor to RedisGraph (which Redis Ltd. declared end-of-life in Jan 2025). Uses sparse matrix algebra for fast graph algorithms.
  • Source: https://github.com/FalkorDB/FalkorDB
  • License: Server Side Public License (SSPL) — note: this is not a traditional open source license
  • Self-hostable: Yes — Docker, Kubernetes
  • K8s: Supported
  • Maturity: Medium — growing rapidly, dedicated to AI graph workloads
  • Performance: Claims 500x faster p99 and 10x faster p50 latency than Neo4j in aggregate expansion operations
  • Key strength: Purpose-built for GraphRAG. Blazing fast graph traversals. Full Cypher support. Multi-graph architecture.
  • Key weakness: SSPL license may be problematic for some use cases. Smaller community than Neo4j. Less mature ecosystem. Redis-heritage means it's primarily in-memory (implications for very large graphs).
  • Fit for Kaze: Strong upgrade option from Apache AGE. If our knowledge graph grows to a scale where Postgres-based graph queries become a bottleneck, FalkorDB is the performance-oriented choice. The Cypher compatibility makes migration from Apache AGE manageable.

Neo4j

  • What: The most established and mature graph database. Property graph model with the Cypher query language.
  • Source: https://github.com/neo4j/neo4j (Community Edition)
  • License: Community Edition: GPLv3. Enterprise: Commercial license (expensive).
  • Self-hostable: Community Edition is free. Enterprise features (clustering, security, performance) require paid license.
  • K8s: Supported via Helm charts
  • Maturity: Very high — largest ecosystem, most documentation, most graph-trained developers
  • Key strength: Most mature, largest ecosystem, best tooling, most comprehensive Cypher implementation. Recent additions: native vector search capabilities.
  • Key weakness: Enterprise features locked behind expensive licensing. Community Edition lacks clustering and advanced security. GPLv3 for community may complicate our licensing.
  • Fit for Kaze: The obvious choice on maturity alone, but licensing is a concern. If we're deploying into customer VPCs, Neo4j Enterprise licensing costs get passed to every deployment. Community Edition's limitations (no clustering, limited security) may be blockers. Worth revisiting if/when we need enterprise-grade graph capabilities and are willing to pay for them.

ArangoDB

  • What: Multi-model database supporting documents, graphs, key-value, and search in a single engine.
  • Source: https://github.com/arangodb/arangodb
  • License: Apache 2.0
  • Self-hostable: Yes — Docker, Kubernetes
  • K8s: Native support via Kubernetes Operator
  • Maturity: High — been around since 2012, production-proven
  • Key strength: True multi-model in one engine. Graph + document + search without multiple databases. AQL query language is powerful.
  • Key weakness: Jack of all trades — doesn't outperform specialized databases in any single model. Smaller community than Neo4j or Postgres. AQL is its own language (not Cypher, not SQL).
  • Fit for Kaze: Interesting alternative to the "Postgres + extensions" approach. But introduces a non-standard query language and a less familiar database. The multi-model promise is similar to SurrealDB but more mature.

4.4 Hybrid / Multi-Model Options

Postgres + pgvector + Apache AGE (The Stack)

  • What: A single PostgreSQL instance with both extensions, providing relational + vector + graph capabilities.
  • How it works: Standard SQL for relational data, pgvector for similarity search, Apache AGE for Cypher graph queries. All data lives in one database, queryable with SQL.
  • Strengths:
    • One database to operate, backup, monitor, and scale
    • Already in our stack (we committed to Postgres)
    • Battle-tested foundation (Postgres)
    • CloudNativePG operator for Kubernetes
    • Can join across relational, vector, and graph data
  • Weaknesses:
    • Neither the best vector DB nor the best graph DB — adequate at both, optimal at neither
    • Apache AGE is less mature than Neo4j's Cypher implementation
    • pgvector performance ceiling is lower than Qdrant's
    • Complex queries that span all three models may be slow
  • Fit for Kaze: The pragmatic default. Minimum operational burden. Good enough for MVP and early scale. Clear upgrade paths when specific capabilities need more performance.

SurrealDB

  • What: A Rust-native, multi-model database unifying documents, graphs, vectors, full-text search, time-series, and relational data in one engine. Explicitly marketed as "the multi-model database for AI agents."
  • Source: https://github.com/surrealdb/surrealdb
  • License: Business Source License 1.1 (transitions to Apache 2.0 after 4 years)
  • Self-hostable: Yes — single binary, Docker, Kubernetes
  • K8s: Supported
  • Maturity: Lower — version 3.0 launched recently with $23M funding. Growing rapidly but significantly less battle-tested than Postgres.
  • Key features:
    • Single engine replaces multiple databases
    • Built-in HNSW vector indexing
    • Graph traversal + vector search + relational queries all transactional
    • Built-in access control (row-level security, RBAC)
    • Real-time change feeds
    • SurrealQL (its own query language)
  • Strengths:
    • Replaces Postgres + pgvector + Apache AGE + potentially more with one system
    • Built-in access control is relevant for our multi-tenant needs
    • Designed specifically for AI agent workloads
    • Single system to operate
  • Weaknesses:
    • Young — v3.0 just launched, significantly less battle-tested than Postgres
    • BSL license (not truly open source until 4-year transition)
    • SurrealQL is another query language to learn (not SQL, not Cypher)
    • Smaller community and ecosystem
    • Risk: if SurrealDB pivots, fails, or changes licensing, we're dependent
    • Performance benchmarks at scale are limited
  • Fit for Kaze: The bold option. If it delivers on its promises, it's exactly what we need — one database for everything, with built-in access control. But the maturity risk is real. Worth watching and potentially evaluating for Phase 2 or 3, but risky as a day-1 foundation.

5. Architecture Options for Kaze

5.1 Option A: Postgres-Centric (Conservative)

┌──────────────────────────────────────────────┐
│                PostgreSQL                      │
│                                                │
│  pgvector ─── semantic retrieval (embeddings) │
│  Apache AGE ─ knowledge graph (entities +     │
│               relationships as Cypher graphs)  │
│  Standard ─── episodic logs, agent state,     │
│  tables       audit trail, client data         │
└──────────────────────────────────────────────┘

+ Cognee or GraphRAG ── knowledge graph construction pipeline
+ Mem0 ────────────────── per-agent memory layer

  • Operational complexity: Low — one database
  • Performance ceiling: Medium — adequate for MVP, will need upgrades at scale
  • Maturity / risk: Very low — Postgres is battle-tested
  • Cloud-agnostic: Yes — Postgres runs everywhere
  • K8s native: Yes — CloudNativePG
  • Cost: Low — one system to operate
  • Upgrade path: Swap pgvector → Qdrant, swap AGE → FalkorDB when needed

Best for: MVP through early-to-mid scale. Minimize operational burden while proving the product.
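
For the graph side of Option A, Apache AGE exposes Cypher through its `cypher()` set-returning SQL function, so graph traversals stay inside the same Postgres connection as everything else. A sketch, where the graph name (`kaze_knowledge`) and the labels are hypothetical:

```typescript
// A Cypher query wrapped in SQL via Apache AGE's cypher() function.
// Graph name and node/edge labels are illustrative assumptions.
const agentGraphQuery = `
  SELECT k FROM cypher('kaze_knowledge', $$
    MATCH (a:Agent)-[:LEARNED]->(k:Knowledge)-[:APPLIES_TO]->(:Vertical {name: 'seo'})
    RETURN k
  $$) AS (k agtype);
`;
```

Because the result is an ordinary SQL rowset, it can be joined against relational tables or combined with a pgvector ranking in the same statement.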

5.2 Option B: Postgres + Dedicated Graph DB (Balanced)

┌──────────────────────────────────┐  ┌──────────────────┐
│           PostgreSQL              │  │    FalkorDB      │
│                                    │  │                  │
│  pgvector ── semantic retrieval   │  │  Knowledge graph │
│  Standard ── episodic logs,       │  │  (entities,      │
│  tables      agent state, audit   │  │   relationships, │
│                                    │  │   traversals)    │
└──────────────────────────────────┘  └──────────────────┘

+ Cognee or GraphRAG ── knowledge graph construction pipeline
+ Mem0 ────────────────── per-agent memory layer

  • Operational complexity: Medium — two databases to manage
  • Performance ceiling: High — dedicated graph DB handles complex traversals well
  • Maturity / risk: Low-medium — Postgres is solid, FalkorDB is newer but performant
  • Cloud-agnostic: Yes — both run anywhere
  • K8s native: Yes — both have K8s support
  • Cost: Medium — two systems to operate
  • Upgrade path: Swap pgvector → Qdrant if vector search needs scaling

Best for: When the knowledge graph becomes a core differentiator and needs high-performance graph queries (complex multi-hop reasoning across verticals).

5.3 Option C: SurrealDB (Bold)

┌──────────────────────────────────────────────┐
│                SurrealDB                       │
│                                                │
│  Vector index ── semantic retrieval            │
│  Graph model ─── knowledge graph               │
│  Document model ─ episodic logs, agent state   │
│  Built-in RBAC ─ access control                │
│  Change feeds ── real-time event streaming     │
└──────────────────────────────────────────────┘

+ Cognee or GraphRAG ── knowledge graph construction pipeline
+ Mem0 ────────────────── per-agent memory layer

  • Operational complexity: Low — one system
  • Performance ceiling: Unknown — insufficient benchmarks at scale
  • Maturity / risk: High — v3.0 just launched, BSL license
  • Cloud-agnostic: Yes — single binary
  • K8s native: Yes
  • Cost: Low — one system
  • Upgrade path: Harder — SurrealQL is proprietary, migration away is more work

Best for: Teams willing to bet on a newer technology for architectural elegance. Higher risk, potentially higher reward.

5.4 Option D: Build Custom Knowledge Layer (Full Control)

┌──────────────────────────────────────────────┐
│           PostgreSQL + pgvector + AGE          │
│           (storage foundation)                 │
└──────────────────────┬───────────────────────┘

┌──────────────────────┴───────────────────────┐
│      Custom Kaze Knowledge Layer (TypeScript)  │
│                                                │
│  Memory type routing (MIRIX-inspired)          │
│  Tri-factor retrieval (Generative Agents)      │
│  Graph traversal + spreading activation        │
│  Git-like versioning (Letta-inspired)          │
│  Access control (Collaborative Memory)         │
│  Provenance tracking (AriGraph-inspired)       │
│  Consolidation pipelines                       │
│  Quality gates for shared knowledge            │
└──────────────────────────────────────────────┘

  • Operational complexity: Low (infra) + High (development)
  • Performance ceiling: As high as you build it
  • Maturity / risk: Low infra risk, high development risk (building it takes time)
  • Differentiation: Highest — the knowledge system IS the product moat
  • Cost: High development investment
  • Time to MVP: Longest

Best for: When the knowledge system is a core competitive advantage worth investing heavily in. Long-term optimal but slow to start.
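
To make Option D's retrieval component concrete, the tri-factor ranking from Generative Agents combines recency, importance, and relevance into a single score. A minimal sketch; the hourly decay rate and the equal weights are assumptions for illustration, not values prescribed by the paper:

```typescript
interface MemoryRecord {
  lastAccessed: number;  // ms since epoch
  importance: number;    // 0..1, assigned at write time
  embedding: number[];   // precomputed embedding vector
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Weighted sum of recency (exponential decay), importance, and relevance.
function triFactorScore(
  m: MemoryRecord, query: number[], now: number,
  w = { recency: 1, importance: 1, relevance: 1 },
): number {
  const hours = (now - m.lastAccessed) / 3_600_000;
  const recency = Math.pow(0.995, hours); // assumed per-hour decay rate
  return w.recency * recency
       + w.importance * m.importance
       + w.relevance * cosine(m.embedding, query);
}
```

A later phase could learn the weights from usage signal; the fixed-weight form is the baseline the document proposes starting with.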

5.5 Hybrid Recommendation: Start A, Evolve to D

The pragmatic path combines the speed of Option A with the long-term vision of Option D:

Phase 1 (MVP): Option A — Postgres-centric + Mem0 + Cognee/GraphRAG

  • Get agents running with working memory fast
  • Use Mem0 for per-agent memory (don't build from scratch)
  • Use Cognee/GraphRAG to bootstrap the first vertical's knowledge graph
  • Store everything in Postgres (pgvector + Apache AGE)

Phase 2 (Product-Market Fit): Start building Option D's custom layer

  • Build the Kaze Knowledge Layer incrementally on top of Postgres
  • Implement typed memory (episodic/semantic/procedural) routing
  • Add provenance tracking and versioning
  • Add access control and tenant isolation
  • Gradually replace Mem0 with our own memory management as we understand the patterns better

Phase 3 (Scale): Optimize the storage layer

  • If graph queries need more performance → evaluate FalkorDB (Option B)
  • If vector search needs more throughput → add Qdrant
  • If we want to consolidate → evaluate SurrealDB (Option C) as a potential replacement

This approach de-risks the MVP while preserving optionality for the future.
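
Phase 2's typed-memory routing can start very small. A sketch of a MIRIX-inspired router, where the record kinds and store names are illustrative assumptions rather than MIRIX's actual taxonomy rules:

```typescript
// The four stores from the proposed taxonomy.
type MemoryType = "episodic" | "semantic" | "procedural" | "reflective";

interface IncomingRecord {
  kind: "event" | "fact" | "skill" | "insight"; // hypothetical classification
  content: string;
}

// Route an incoming record to the store that should persist it.
function routeMemory(r: IncomingRecord): MemoryType {
  switch (r.kind) {
    case "event":   return "episodic";   // logs, task history
    case "fact":    return "semantic";   // entities, relationships
    case "skill":   return "procedural"; // how-to knowledge, code
    case "insight": return "reflective"; // distilled learnings
  }
}
```

In practice the `kind` classification would come from an LLM call or heuristics on the record's origin; the router itself stays a thin, testable seam between agents and storage.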


6. Recommendation

6.1 Proposed Knowledge System Architecture

┌─────────────────────────────────────────────────────────┐
│              KAZE KNOWLEDGE SYSTEM                        │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │          KNOWLEDGE LAYER (TypeScript)                │ │
│  │                                                      │ │
│  │  ┌──────────────────────────────────────────────┐   │ │
│  │  │           Memory Type Router                  │   │ │
│  │  │     (inspired by MIRIX Meta Manager)          │   │ │
│  │  └──────┬──────┬──────────┬──────────┬──────────┘   │ │
│  │         │      │          │          │               │ │
│  │    ┌────┴──┐┌──┴────┐┌───┴────┐┌────┴─────┐        │ │
│  │    │Episod.││Semant.││Proced. ││Reflect.  │        │ │
│  │    │Memory ││Memory ││Memory  ││Memory    │        │ │
│  │    │       ││       ││        ││          │        │ │
│  │    │Events,││Facts, ││Skills, ││Insights, │        │ │
│  │    │logs,  ││graph, ││how-to, ││learnings │        │ │
│  │    │history││rels   ││code    ││          │        │ │
│  │    └───────┘└───────┘└────────┘└──────────┘        │ │
│  │                                                      │ │
│  │  ┌──────────────────────────────────────────────┐   │ │
│  │  │    Retrieval Engine                           │   │ │
│  │  │    - Tri-factor scoring (baseline)            │   │ │
│  │  │    - Graph traversal (knowledge graph)        │   │ │
│  │  │    - Spreading activation (linked notes)      │   │ │
│  │  │    - Agent-initiated search (tool calls)      │   │ │
│  │  └──────────────────────────────────────────────┘   │ │
│  │                                                      │ │
│  │  ┌──────────────────────────────────────────────┐   │ │
│  │  │    Write Pipeline                             │   │ │
│  │  │    - Provenance tagging (AriGraph-inspired)   │   │ │
│  │  │    - Version control (Letta-inspired)         │   │ │
│  │  │    - Quality gate (Voyager self-verification) │   │ │
│  │  │    - Access control (Collaborative Memory)    │   │ │
│  │  └──────────────────────────────────────────────┘   │ │
│  │                                                      │ │
│  │  ┌──────────────────────────────────────────────┐   │ │
│  │  │    Consolidation Engine                       │   │ │
│  │  │    - Episodic → Semantic distillation         │   │ │
│  │  │    - Reflection synthesis                     │   │ │
│  │  │    - Contradiction detection                  │   │ │
│  │  │    - Importance-based retention               │   │ │
│  │  └──────────────────────────────────────────────┘   │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │          STORAGE LAYER                               │ │
│  │                                                      │ │
│  │  Phase 1: PostgreSQL + pgvector + Apache AGE         │ │
│  │  Phase 2: + Qdrant (if vector scale needed)          │ │
│  │  Phase 3: + FalkorDB (if graph scale needed)         │ │
│  │           or SurrealDB (if consolidation desired)    │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │     CONSTRUCTION PIPELINE                            │ │
│  │                                                      │ │
│  │  Cognee ── incremental knowledge graph updates       │ │
│  │  GraphRAG ── bulk document → knowledge graph         │ │
│  │  Custom ── agent experience → knowledge distillation │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │     PER-AGENT MEMORY                                 │ │
│  │                                                      │ │
│  │  Mem0 ── working + episodic memory per agent         │ │
│  │  (Phase 2: evaluate building custom replacement)     │ │
│  └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
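
The Write Pipeline's git-like versioning can be sketched as append-only commits whose ids hash-chain back to their parents, carrying provenance (who, when, why) alongside content. The field names here are assumptions for illustration:

```typescript
import { createHash } from "node:crypto";

interface KnowledgeCommit {
  id: string;            // content hash, git-style
  parent: string | null; // previous version of this entry, if any
  agentId: string;       // provenance: who wrote it
  reason: string;        // provenance: why
  timestamp: number;
  content: string;
}

// Append a new version; the id covers parent, author, time, and content,
// so any tampering or reordering breaks the chain.
function commitKnowledge(
  content: string, agentId: string, reason: string,
  parent: KnowledgeCommit | null,
): KnowledgeCommit {
  const timestamp = Date.now();
  const id = createHash("sha256")
    .update(`${parent?.id ?? ""}|${agentId}|${timestamp}|${content}`)
    .digest("hex");
  return { id, parent: parent?.id ?? null, agentId, reason, timestamp, content };
}
```

Walking the `parent` chain reconstructs an entry's full history, which is what makes rollback, auditing, and "who taught the system this?" queries cheap.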

6.2 Key Design Decisions

  • Memory types: Episodic + Semantic + Procedural + Reflective — aligned with the CoALA framework and MIRIX; a proven taxonomy.
  • Storage (Phase 1): Postgres + pgvector + Apache AGE — one DB, already in our stack, good enough for MVP, clear upgrade paths.
  • Per-agent memory: Mem0 — production-proven, handles agent-level memory well, saves dev time.
  • KG construction: Cognee + GraphRAG — Cognee for incremental updates, GraphRAG for bulk. Both open source.
  • Retrieval strategy: Tri-factor + graph traversal + agent-initiated — a layered approach covering baseline ranking, structured traversal, and agent autonomy.
  • Versioning model: Git-inspired (Letta pattern) — every knowledge write is a versioned commit with provenance.
  • Access control: Private/shared tiers with ABAC — Collaborative Memory pattern; enables client isolation plus vertical sharing.
  • Quality gates: Verification before shared-knowledge entry — Voyager pattern; prevents low-quality knowledge from polluting the shared store.
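
The private/shared access-control decision can be prototyped as a tier check over simple attributes. The tier names and attributes below are assumptions for illustration, not the Collaborative Memory paper's exact model:

```typescript
// Visibility tiers: private to one agent, shared within a vertical,
// or platform-wide.
type Tier = "private" | "vertical" | "platform";

interface KnowledgeEntry { tier: Tier; ownerAgentId: string; vertical: string; }
interface AgentCtx { agentId: string; vertical: string; }

// Attribute-based read check: each tier widens the audience.
function canRead(agent: AgentCtx, entry: KnowledgeEntry): boolean {
  switch (entry.tier) {
    case "private":  return agent.agentId === entry.ownerAgentId; // owner only
    case "vertical": return agent.vertical === entry.vertical;    // same vertical
    case "platform": return true;                                 // everyone
  }
}
```

Write access would follow the same shape but route through the quality gate first, so promotion from private to shared tiers is an explicit, verified step.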

6.3 What We're Intentionally Deferring

  • Parametric memory (fine-tuning): Not for MVP. Too expensive and complex. Revisit when we have enough vertical data to warrant it.
  • Learned retrieval weights: Start with fixed tri-factor weights. Learn them from data later when we have enough usage signal.
  • Billion-scale vector search: pgvector handles early scale. Qdrant or Milvus when needed.
  • Real-time cross-cell knowledge sync: Phase 3 mesh feature. Start with single-cell knowledge.

7. References

Academic Papers

  • Generative Agents — Park et al., 2023. Memory stream + tri-factor retrieval + reflection. (arXiv)
  • MemGPT — Packer et al., 2023. OS-inspired tiered memory, virtual context management. (arXiv)
  • Voyager — Wang et al., 2023. Skill library as procedural memory. (arXiv)
  • Reflexion — Shinn et al., 2023. Verbal reinforcement learning, reflective memory. (arXiv)
  • CoALA — Sumers, Yao et al., 2023. Unifying cognitive architecture taxonomy. (arXiv)
  • AriGraph — Anokhin et al., 2024. Knowledge graph + episodic memory with provenance. (arXiv)
  • A-MEM — Xu et al., 2025. Zettelkasten-inspired agentic memory. (arXiv)
  • Collaborative Memory — Rezazadeh et al., 2025. Multi-user shared memory with access control. (arXiv)
  • MIRIX — Wang & Chen, 2025. Six-component modular memory system. (arXiv)
  • Memory in the Age of AI Agents — Liu et al., 2025. Comprehensive survey of agent memory. (arXiv)
  • Mem0 — Chadha et al., 2025. Production-ready scalable agent memory. (arXiv)

Open Source Tools

  • Mem0 — github.com/mem0ai/mem0 — Apache 2.0
  • Letta — github.com/letta-ai/letta — Apache 2.0
  • Cognee — github.com/topoteretes/cognee — Apache 2.0
  • Microsoft GraphRAG — github.com/microsoft/graphrag — MIT
  • pgvector — github.com/pgvector/pgvector — PostgreSQL License
  • Qdrant — github.com/qdrant/qdrant — Apache 2.0
  • Weaviate — github.com/weaviate/weaviate — BSD-3-Clause
  • Apache AGE — github.com/apache/age — Apache 2.0
  • FalkorDB — github.com/FalkorDB/FalkorDB — SSPL
  • Neo4j — github.com/neo4j/neo4j — GPLv3 / Commercial
  • SurrealDB — github.com/surrealdb/surrealdb — BSL 1.1

Additional Resources