
Performance Scalability Model

Research for Project Kaze


1. Scale Milestones

Four stages of growth, each with different architectural requirements.

Stage 0: MVP (Speedrun-only)

1 cell · 3 verticals · ~15 agents · 1 Postgres · Direct calls · Single K8s cluster
  • All components run in one namespace on one cluster.
  • Single Postgres instance handles everything (agent state, knowledge, observations, schedules, budgets).
  • Inter-agent messaging via direct function calls (in-process).
  • LLM calls through 1-2 provider API keys per provider.
  • Monitoring stack is minimal (local Prometheus + Grafana).
  • Bottleneck: None expected — this is well within single-node capacity.

Stage 1: Early Clients (5-10 clients)

1-3 shared cells · 3-5 verticals · ~50 agents · Shared Postgres with connection pooling · Direct calls still viable
  • Shared cells with namespace isolation for cost efficiency.
  • Connection pooling (PgBouncer) becomes necessary — 50 agents × multiple connections per agent.
  • Observation Logger write volume starts growing — implement table partitioning.
  • LLM provider rate limits become relevant — need key pooling (multiple keys per provider).
  • Vault access patterns increase — implement Vault response caching in LLM Gateway.
  • First bottlenecks: LLM provider rate limits and Postgres connection count.

Stage 2: Growth (20-50 clients)

5-15 cells (mix shared + dedicated) · 5+ verticals · ~200 agents · Read replicas · NATS migration · Multiple K8s clusters possible
  • Dedicated cells for large/sensitive clients, shared cells for small clients.
  • Postgres read replicas for knowledge queries and observation reads.
  • NATS introduced for inter-agent messaging (multi-node deployment makes direct calls impractical).
  • pgvector index size growing — evaluate Qdrant for hot-path vector queries.
  • Observation storage partitioned aggressively (by month + tenant).
  • Multiple LLM provider accounts with key rotation.
  • Monitoring scales: Prometheus federation or Thanos for cross-cell aggregation.
  • First bottlenecks: pgvector query latency, Postgres write contention on hot tables, inter-agent communication across nodes.

Stage 3: Scale (100+ clients)

30+ cells · Multi-region · ~1000+ agents · Sharded databases · NATS clusters · Customer VPC deployments
  • Multi-region deployment with cell placement based on client geography.
  • Per-component database split: the knowledge system, observations, and agent state each get their own Postgres instance.
  • NATS superclusters for cross-region messaging.
  • Qdrant as dedicated vector DB alongside Postgres (pgvector still for cold-path, Qdrant for hot-path).
  • Horizontal scaling of LLM Gateway as a standalone service with distributed rate limiting.
  • Customer VPC deployments operating independently.
  • Federated monitoring with Thanos/Mimir.
  • First bottlenecks: Operational complexity, cross-region latency, VPC deployment automation.

2. Component Bottleneck Analysis

2.1 Agent Runtime

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Concurrent agents per node | Memory — each agent holds context, loaded skill definitions, active task state | ~50-100 agents per 8GB node (est. 50-150MB per agent with loaded context) | HPA on agent runtime pods, distribute across nodes |
| Task throughput | LLM call latency — most tasks are LLM-bound, not compute-bound | Limited by LLM Gateway throughput, not the runtime itself | Parallelize independent subtasks, pipeline LLM calls |
| Subagent fan-out | Memory + LLM concurrency — each subagent is a new agent instance | Depth limit (3-4 levels) + breadth limit (5-10 per parent) enforced by capability manifest | Hard limits in runtime config, parent budget shared |
| OpenClaw subprocess | Process count — OpenClaw spawns LLM backends as child processes | OS limits (~1000 processes), memory per process | Process pooling, shared backend instances across agents |
| Skill definition loading | Disk I/O + YAML parsing on agent spawn | Negligible — YAML files are small | Cache parsed definitions in memory after first load |

Key insight: Agent Runtime is almost never the bottleneck. Agents spend most of their time waiting on LLM calls, knowledge queries, and tool responses. The runtime itself is lightweight orchestration.

2.2 LLM Gateway

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Provider rate limits | Provider-imposed (tokens/min, requests/min) | Anthropic: ~400K input tokens/min (Tier 4), OpenAI: ~10M tokens/min (Tier 5) — varies by plan | Multi-key pooling, multi-provider fallback, request queuing with priority |
| Concurrent in-flight requests | Connection pool to providers + memory for streaming responses | ~100-500 concurrent (depends on average response time of ~5-30s) | Connection pooling, streaming response relay (don't buffer the full response) |
| Budget tracking writes | FOR UPDATE SKIP LOCKED contention on budget rows | Contention at ~50+ concurrent agents for the same tenant | Pre-request estimate is a read (cached); post-request update is the write. Batch budget updates every N seconds instead of per-request |
| Key resolution latency | Vault lookup per request | ~10-50ms per Vault call | Cache Vault responses with TTL (60s); keys don't change often |
| Model hint resolution | Tenant config lookup | Negligible — config cached in memory | Reload on config change event |

Key insight: LLM provider rate limits are the hard ceiling. Everything else in the gateway can be scaled horizontally. The strategy is: maximize tokens processed per dollar per second across all available providers and keys.
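
The batched budget update pattern can be sketched in a few lines. This is a hedged illustration, not the gateway's actual implementation: `flush_fn` stands in for the real database write, and a timer would drive `flush()` on the interval.

```python
import threading
from collections import defaultdict

class BudgetBatcher:
    """Sketch of batched budget tracking: aggregate per-request token costs
    in memory and issue one UPDATE per tenant per flush, instead of taking
    a row lock on every request. `flush_fn` is a placeholder for the real
    database write; a timer would call flush() every `interval` seconds."""

    def __init__(self, flush_fn, interval=5.0):
        self.flush_fn = flush_fn
        self.interval = interval
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def record(self, tenant_id, tokens):
        # Called per request: a cheap in-memory increment, no DB contention.
        with self.lock:
            self.pending[tenant_id] += tokens

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        for tenant_id, tokens in batch.items():
            # e.g. UPDATE budgets SET used = used + %(tokens)s WHERE tenant_id = %(tid)s
            self.flush_fn(tenant_id, tokens)
```

The trade-off is bounded staleness: budget enforcement can overshoot by at most one flush interval of spend.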

Multi-key pooling:

Provider: Anthropic
  Key pool:
    speedrun-key-1 (Tier 4): 400K tokens/min
    speedrun-key-2 (Tier 2): 100K tokens/min
    client-a-key (Tier 3):   200K tokens/min (for Client A only)

  Total Anthropic capacity: 700K tokens/min

  Routing: round-robin across eligible keys per request
  (eligible = keys the requesting tenant is allowed to use)
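
A minimal in-process sketch of this pooling policy. Key names and per-key tokens/min limits are illustrative; a production gateway would track the provider's actual rolling windows rather than a fixed 60s reset.

```python
import itertools
import time

class KeyPool:
    """Round-robin routing across eligible provider keys, each with its own
    tokens/min budget. Illustrative only: real limits come from the provider."""

    def __init__(self, keys):
        # keys: {key_name: tokens_per_minute}
        self.limits = dict(keys)
        self.window_start = time.monotonic()
        self.used = {name: 0 for name in keys}
        self.rr = itertools.cycle(sorted(keys))  # deterministic rotation

    def _maybe_reset_window(self):
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start = now
            self.used = {name: 0 for name in self.used}

    def acquire(self, tokens, eligible):
        """Pick the next eligible key with budget for `tokens`, or None."""
        self._maybe_reset_window()
        for _ in range(len(self.limits)):  # try each key at most once
            name = next(self.rr)
            if name in eligible and self.used[name] + tokens <= self.limits[name]:
                self.used[name] += tokens
                return name
        return None
```

Keys the tenant may not use are simply skipped; when every eligible key's window is exhausted, the request falls through to queuing or another provider.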

2.3 Knowledge System (Mem0 + pgvector)

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| pgvector query latency | HNSW index size in memory | <50ms at ~1M vectors (1536-dim), degrades above ~5M | HNSW ef_search tuning, add Qdrant at >5M vectors |
| Concurrent vector searches | Postgres connection pool + CPU for ANN search | ~20-50 concurrent searches per Postgres instance | Read replicas for knowledge reads, connection pooling |
| Embedding generation | LLM/embedding API call per knowledge write | Serialized per write — ~50-200ms per embedding | Batch embedding generation, async write pipeline (embed in background, index when ready) |
| Mem0 instance memory | Per-agent episodic memory storage | Grows with conversation length — compact old episodes | Mem0's built-in compaction, configure max memory window |
| Version history bloat | Every knowledge write creates a version entry | Linear growth — manageable for years | Compact old versions (keep latest N per entry), archive to cold storage |
| Index rebuild time | Full HNSW rebuild when adding vectors | Minutes at 1M vectors, hours at 10M+ | Incremental index updates (pgvector supports this), or Qdrant, which handles online indexing |

Key insight: pgvector is the first knowledge bottleneck. It's excellent up to ~5M vectors but degrades beyond that. The mitigation path is clear: add Qdrant for hot-path queries (agent reasoning) while keeping pgvector for cold-path (batch analytics, quality gate evaluation).

Scaling path:

Stage 0-1: pgvector only
  ↓ trigger: p95 query latency >200ms OR index >5M vectors
Stage 2: pgvector + Qdrant
  Qdrant: hot-path reads during agent reasoning
  pgvector: writes (single source of truth), cold reads, analytics
  ↓ trigger: write throughput >1000/min sustained
Stage 3: Qdrant primary for reads, Postgres for relational + writes
  Async replication: Postgres → Qdrant via CDC

2.4 Tool Integration Framework

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| External API rate limits | Per-provider (SEMrush: 10 req/sec, GitHub: 5,000/hr, Google: varies) | Varies widely per tool provider | Per-tool rate limiter in the framework, queue with backoff, cache repeated queries |
| Concurrent outbound connections | OS socket limits, K8s network policy overhead | ~1000 concurrent per pod (OS default) | Connection pooling per external service |
| Vault credential resolution | Same as LLM Gateway — Vault lookup latency | ~10-50ms per call | Cache with TTL, pre-warm credentials on agent spawn |
| Tool response parsing | CPU for JSON/XML parsing | Negligible unless a tool returns massive payloads | Set response size limits per tool definition |

Key insight: Tool Framework scaling is dominated by external API limits, not internal capacity. The framework itself is lightweight. The main architectural concern is: don't let one agent's tool calls starve another's. Per-tenant, per-tool rate limiting is essential.
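
The per-tenant, per-tool limiting can be sketched as a token bucket keyed by (tenant, tool). Rates and the injectable clock are illustrative; a production limiter would persist state and queue rather than reject.

```python
import time

class ToolRateLimiter:
    """Token-bucket limiter keyed by (tenant, tool), so one agent's tool
    calls cannot starve another tenant's. Rates are illustrative."""

    def __init__(self, rates, clock=time.monotonic):
        # rates: {tool_name: requests_per_second}
        self.rates = rates
        self.clock = clock
        self.state = {}  # (tenant, tool) -> (tokens, last_refill_time)

    def allow(self, tenant, tool):
        rate = self.rates[tool]
        burst = max(1.0, rate)  # allow up to ~1 second of burst
        now = self.clock()
        tokens, last = self.state.get((tenant, tool), (burst, now))
        tokens = min(burst, tokens + (now - last) * rate)  # refill
        if tokens >= 1.0:
            self.state[(tenant, tool)] = (tokens - 1.0, now)
            return True
        self.state[(tenant, tool)] = (tokens, now)
        return False
```

A denied call would go to the per-tool backoff queue described above rather than failing the task.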

2.5 Task Scheduler

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Schedule density | Polling query: WHERE next_run_at <= now() | Efficient with an index — 10K+ schedules is fine | B-tree index on next_run_at, no full table scan |
| Polling contention (HA) | FOR UPDATE SKIP LOCKED across replicas | Minimal — SKIP LOCKED avoids blocking | N replicas naturally partition the work |
| Event trigger throughput | Direct callback in MVP, NATS in Phase 2 | ~1000 events/sec is fine for direct calls | Move to NATS when event volume exceeds in-process capacity |
| Missed schedule catch-up | On restart, scan for overdue schedules | Only scans a 1hr lookback — bounded | Index on next_run_at makes this fast |

Key insight: Task Scheduler is the least likely bottleneck. It's a simple cron-like system with well-understood scaling characteristics. FOR UPDATE SKIP LOCKED is the proven pattern for distributed job scheduling in Postgres.
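
The claim query can be sketched as follows. Table and column names (`schedules`, `next_run_at`, `task_ref`) are assumptions, and transaction handling is elided.

```python
# Hypothetical schema: schedules(id, task_ref, next_run_at, ...).
CLAIM_DUE_SCHEDULES = """
SELECT id, task_ref
FROM schedules
WHERE next_run_at <= now()
ORDER BY next_run_at
FOR UPDATE SKIP LOCKED
LIMIT %(batch)s
"""

def claim_due(conn, batch=10):
    """Run by every scheduler replica inside its own transaction. Rows
    already locked by another replica are skipped rather than waited on,
    so N replicas partition the due schedules without blocking each other."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_DUE_SCHEDULES, {"batch": batch})
        return cur.fetchall()
```

The ORDER BY plus LIMIT keeps each poll cheap; the B-tree index on next_run_at makes the WHERE clause an index range scan.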

2.6 Observation Logger

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Write volume | Events per second from all components | ~100-500 events/sec at Stage 1, ~5K-10K at Stage 3 | Batched writes (already designed: 100 events or 1s flush) |
| Storage growth | ~1KB per event average | ~250GB/month at a sustained 100 events/sec (≈8.6GB/day; scales linearly with event rate) | Monthly partitioning, retention policy (drop partitions >6 months old, archive to object storage) |
| Query performance | Full-text search across months of data | Degrades on unpartitioned tables | Partition by month + tenant; index by task_id, agent_id, timestamp |
| Batch flush under back-pressure | Database write latency spike → buffer fills | Buffer limit reached → drop oldest debug events first | Already designed: graceful degradation, never blocks agent execution |

Key insight: Observation Logger volume scales linearly with agent count and task frequency. The batched write design handles this well. The main concern is query performance on historical data — partitioning is the answer, not a bigger database.
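
The batch-and-shed behavior can be sketched as below. The event shape (a dict with a `level` field) and the limits are illustrative, with `flush_fn` standing in for the batched database write.

```python
import time

class ObservationBuffer:
    """Fire-and-forget event buffer: flush at `max_batch` events or `max_age`
    seconds, and under back-pressure shed the oldest debug event rather than
    block the caller. Field names and limits are illustrative."""

    def __init__(self, flush_fn, max_batch=100, max_age=1.0, max_buffer=1000):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_age = max_age
        self.max_buffer = max_buffer
        self.events = []
        self.oldest = None  # timestamp of the oldest buffered event

    def log(self, event, now=None):
        now = time.monotonic() if now is None else now
        if len(self.events) >= self.max_buffer:
            for i, e in enumerate(self.events):
                if e.get("level") == "debug":
                    del self.events[i]  # shed oldest debug event
                    break
            else:
                return  # no debug events to shed: drop the new event (simplification)
        if not self.events:
            self.oldest = now
        self.events.append(event)
        if len(self.events) >= self.max_batch or now - self.oldest >= self.max_age:
            batch, self.events = self.events, []
            self.flush_fn(batch)  # one batched INSERT in the real logger
```

Nothing in `log()` ever waits on the database, which is the invariant the design demands.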


3. LLM Provider Throughput

3.1 Provider Rate Limits (Approximate, subject to plan tier)

| Provider | Requests/min | Tokens/min (input) | Tokens/min (output) | Concurrent |
|---|---|---|---|---|
| Anthropic (Tier 4) | ~4,000 | ~400K | ~80K | — |
| OpenAI (Tier 5) | ~10,000 | ~10M | ~2M | — |
| Google Gemini (pay-as-you-go) | ~360 | ~4M | ~200K | — |
| Ollama (local) | Hardware-bound | Hardware-bound | Hardware-bound | 1 per GPU |

These are rough estimates — actual limits depend on specific models, account tier, and provider policy at the time.

3.2 What This Means for Agent Throughput

A typical agent task involves:

  • 1-3 LLM calls for reasoning (each ~2K input, ~1K output tokens)
  • 0-2 tool calls (which may trigger additional LLM calls for parsing)
  • 1 knowledge query (may trigger embedding generation)
  • Total: ~3-8 LLM calls per task, ~10-20K tokens per task

Throughput estimates per provider (single key). For Anthropic and OpenAI the binding constraint is tokens/min rather than requests/min; for Gemini the ~360 requests/min cap binds first:

| Provider | Binding limit | Tasks/min (est.) | Tasks/hour (est.) |
|---|---|---|---|
| Anthropic (Tier 4) | ~400K tokens/min | ~20-40 | ~1.2K-2.4K |
| OpenAI (Tier 5) | ~10M tokens/min | ~500-1000 | ~30K-60K |
| Google Gemini | ~360 requests/min | ~45-120 | ~3K-7K |

With multi-key pooling (3 keys per provider) and multi-provider routing, the system can handle ~5K-10K tasks/hour — well beyond Stage 2 requirements.

3.3 Rate Limit Management

Request comes in from Agent
  → LLM Gateway checks provider's token bucket
  → If tokens available: send immediately
  → If bucket empty but another provider eligible: route to alternative
  → If all buckets empty: queue with priority (high-priority tasks first)
  → If queue depth > threshold: reject lowest-priority requests

Priority levels:

  1. Supervised tasks (human waiting for output) — highest
  2. Active conversation responses — high
  3. Scheduled background tasks — normal
  4. Knowledge consolidation — low
  5. Quality evaluation — lowest
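
A sketch of the queue step: the five levels map onto a heap, and once depth exceeds the threshold the single lowest-priority request is rejected (which may be the incoming one). Class and level names are illustrative.

```python
import heapq
import itertools

PRIORITY = {"supervised": 0, "conversation": 1, "scheduled": 2,
            "consolidation": 3, "evaluation": 4}

class RequestQueue:
    """Priority queue for requests that could not be sent immediately."""

    def __init__(self, max_depth=100):
        self.max_depth = max_depth
        self.heap = []
        self.seq = itertools.count()  # FIFO tie-break within one priority level

    def enqueue(self, request, kind):
        """Queue a request; returns the rejected request if over capacity."""
        heapq.heappush(self.heap, (PRIORITY[kind], next(self.seq), request))
        if len(self.heap) > self.max_depth:
            # Evict the entry with the worst (priority, seq) pair.
            worst = max(range(len(self.heap)), key=lambda i: self.heap[i][:2])
            rejected = self.heap.pop(worst)[2]
            heapq.heapify(self.heap)
            return rejected
        return None

    def dequeue(self):
        """Next request to send when a token bucket frees up."""
        return heapq.heappop(self.heap)[2] if self.heap else None
```

Supervised tasks therefore survive saturation longest, and quality-evaluation work is shed first, matching the ordering above.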

3.4 Local Model Overflow

For non-critical tasks (knowledge consolidation, quality evaluation, embedding generation), local models via Ollama/vLLM can absorb overflow:

  • No rate limits (hardware-bound only)
  • No token costs
  • Higher latency, potentially lower quality
  • Good for: embedding generation, simple classification, data extraction
  • Not suitable for: complex reasoning, client-facing outputs

Trigger: Local model overflow activates when cloud provider queue depth exceeds threshold AND the task's model hint is fast or embed.
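
The trigger condition is a simple conjunction, sketched here as a predicate (parameter names are illustrative):

```python
def route_to_local(queue_depth, depth_threshold, model_hint):
    """Local overflow fires only when the cloud queue is saturated AND the
    task tolerates a smaller model (hints per the text: 'fast' or 'embed')."""
    return queue_depth > depth_threshold and model_hint in ("fast", "embed")
```

Tasks hinted for full reasoning models never overflow to local, regardless of queue depth.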


4. Knowledge System at Scale

4.1 pgvector Performance Characteristics

Based on published benchmarks and community data:

| Vector count | Dimensions | Index type | Query latency (p95) | Memory for index |
|---|---|---|---|---|
| 100K | 1536 | HNSW | ~5ms | ~600MB |
| 1M | 1536 | HNSW | ~20-50ms | ~6GB |
| 5M | 1536 | HNSW | ~100-200ms | ~30GB |
| 10M+ | 1536 | HNSW | ~300ms+ | ~60GB+ |

HNSW tuning parameters:

  • m (connections per node): higher = better recall, more memory. Default 16, increase to 32-64 for better recall.
  • ef_construction (build quality): higher = slower build, better index quality. Default 64, increase to 128-256.
  • ef_search (query quality): higher = better recall, slower queries. Tune dynamically based on latency budget.
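
As a sketch, the corresponding pgvector DDL can be generated like this. Table and column names are placeholders, and the operator class depends on the distance metric in use.

```python
def hnsw_index_ddl(table, column, m=16, ef_construction=64):
    """Build pgvector HNSW index DDL with the tuning knobs described above.
    Table/column names are placeholders; swap vector_cosine_ops for
    vector_l2_ops or vector_ip_ops to match the distance metric."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

# ef_search is a per-session query-time knob, not part of the index:
SET_EF_SEARCH = "SET hnsw.ef_search = 100;"
```

Raising m and ef_construction only affects newly built indexes; changing them requires a reindex, which is why ef_search is the knob to tune first.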

4.2 Growth Model

Estimated knowledge entries over time:

| Stage | Agents | Entries/agent/day | Total entries/day | After 1 year |
|---|---|---|---|---|
| Stage 0 | 15 | ~20 | ~300 | ~100K |
| Stage 1 | 50 | ~20 | ~1,000 | ~400K |
| Stage 2 | 200 | ~20 | ~4,000 | ~1.5M |
| Stage 3 | 1000 | ~20 | ~20,000 | ~7M |

At ~20 entries per agent per day (episodic events, observations, learned facts), pgvector stays performant through Stage 2. Stage 3 triggers the Qdrant addition.

4.3 Embedding Generation Throughput

Every knowledge write needs an embedding. At scale:

| Stage | Writes/min | Embedding API calls/min | Latency impact |
|---|---|---|---|
| Stage 0 | ~0.2 | ~0.2 | None |
| Stage 1 | ~0.7 | ~0.7 | None |
| Stage 2 | ~3 | ~3 | Negligible |
| Stage 3 | ~14 | ~14 | Queue if bursty |

Embedding generation is not a bottleneck for knowledge writes. The real concern is embedding for knowledge queries — each semantic search needs to embed the query text. At Stage 2-3 with many concurrent agents querying simultaneously, embedding the query becomes a latency factor.

Mitigation: Cache recent query embeddings (same query text = same embedding). Batch embedding requests where possible. Use embed model hint to route to fastest/cheapest embedding model.
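
A small LRU sketch of the query-embedding cache; `embed_fn` stands in for the real embedding call, and the size limit is illustrative. Because identical text always embeds to the same vector, entries never need invalidation.

```python
import hashlib

class EmbeddingCache:
    """LRU cache for query-text embeddings. `embed_fn` is a placeholder for
    the real embedding API call; cache size is illustrative."""

    def __init__(self, embed_fn, max_entries=10_000):
        self.embed_fn = embed_fn
        self.max_entries = max_entries
        self.cache = {}  # insertion-ordered dict used as a simple LRU

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.cache:
            self.cache[key] = self.cache.pop(key)  # mark as recently used
            return self.cache[key]
        vector = self.embed_fn(text)
        if len(self.cache) >= self.max_entries:
            self.cache.pop(next(iter(self.cache)))  # evict least recently used
        self.cache[key] = vector
        return vector
```

Hashing the text keeps keys small regardless of query length; the vectors themselves dominate the cache's memory footprint.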

4.4 Mem0 Scaling

Two options for Mem0 at scale:

Option A: Shared Mem0 instance with tenant/agent routing

  • Pro: Simpler ops, single instance to manage
  • Con: Noisy neighbor risk, single point of failure
  • Scaling: Vertical (bigger instance) → horizontal (Mem0 clustering if supported)

Option B: Mem0 instance per cell

  • Pro: Natural tenant isolation, failure isolation
  • Con: More instances to manage, higher base resource cost
  • Scaling: Cell-level scaling (each cell manages its own Mem0)

Recommendation: Option B (per-cell Mem0) aligns with the cell architecture. Each cell is self-contained. Mem0 instances are lightweight and don't need cross-cell communication.


5. Database Scaling Strategy

5.1 Connection Management

Agent count drives connection demand:

Per agent: ~3-5 connections (knowledge, observation, state, tools, scheduler)
Per component: ~2-3 shared connections for internal operations
Postgres default: 100 connections

Stage 0: 15 agents × 4 = ~60 connections → fine
Stage 1: 50 agents × 4 = ~200 connections → needs PgBouncer
Stage 2: 200 agents × 4 = ~800 connections → PgBouncer + read replicas
Stage 3: 1000 agents across cells → per-cell databases

PgBouncer config:

  • Mode: transaction (release connection after each transaction)
  • Pool size: 50-100 per cell (most agents are waiting on LLM, not DB)
  • Reserve: 5 connections for admin/monitoring
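
As a sketch, those settings map onto pgbouncer.ini roughly as follows; host, database name, and exact sizes are placeholders.

```ini
[databases]
kaze = host=postgres.cell.svc port=5432 dbname=kaze

[pgbouncer]
pool_mode = transaction       ; release server connection after each transaction
default_pool_size = 75        ; within the 50-100 per-cell range above
reserve_pool_size = 5         ; overflow headroom when the pool is saturated
max_client_conn = 1000        ; many idle agent clients multiplex onto few server conns
```

Note that PgBouncer's reserve_pool_size is overflow headroom under load; a hard admin/monitoring reserve is better enforced with Postgres's superuser_reserved_connections.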

5.2 Read Replica Strategy

| Query type | Route | When to add replica |
|---|---|---|
| Knowledge queries (vector search) | Read replica | Stage 2 (>200 agents, concurrent reads) |
| Observation queries (debugging, traces) | Read replica | Stage 1 (query volume grows with event count) |
| Budget checks (pre-request) | Read replica (slight lag acceptable) | Stage 2 |
| Budget updates (post-request) | Primary | Always primary (writes) |
| Schedule polling | Primary | Always primary (FOR UPDATE) |
| Agent state writes | Primary | Always primary |

5.3 Partitioning Strategy

| Table | Partition key | Strategy | When |
|---|---|---|---|
| observations | created_at (monthly) | Range partition by month. Drop partitions >6 months old, archive to object storage. | Stage 1 |
| memories | tenant_id | List partition by tenant. Each tenant's knowledge is physically separated. | Stage 2 |
| llm_usage_log | created_at (monthly) | Range partition. Retention: 3 months hot, archive older. | Stage 1 |
| schedule_executions | created_at (monthly) | Range partition. Low volume but grows indefinitely. | Stage 2 |

5.4 Database Split Points

Eventually, a single Postgres can't serve all components optimally:

| Trigger | Action |
|---|---|
| Observation writes impacting agent state latency | Split observation tables to a dedicated Postgres instance |
| Knowledge queries impacting other workloads | Split knowledge tables to a dedicated instance (+ Qdrant) |
| Per-cell database makes more sense than shared | Each cell gets its own Postgres (aligns with cell architecture) |

Likely split order: Observations first (highest write volume, read patterns are different from transactional queries), then knowledge (vector search is CPU-intensive and benefits from dedicated resources).


6. Messaging & Communication Scaling

6.1 Direct Calls (MVP — Stage 0-1)

Agent A calls Agent B: runtime.dispatch(agentB, task)
  → In-process function call via DirectCallTransport
  → Zero serialization, zero network hop
  → Limited to single process / single node

Advantages: Zero latency, no infrastructure, simple debugging. Ceiling: All agents must run in the same process. Can't distribute across nodes.
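
The seam that makes the later NATS swap safe is the transport interface. A hedged sketch, using the transport names from the design (the message fields shown are illustrative):

```python
from typing import Any, Callable, Dict, Protocol

AgentMessage = Dict[str, Any]  # same envelope regardless of transport

class Transport(Protocol):
    def dispatch(self, agent_id: str, message: AgentMessage) -> AgentMessage: ...

class DirectCallTransport:
    """In-process transport: handlers are registered locally and dispatch is
    a plain function call. A NatsTransport would implement the same interface
    by publishing the same envelope; agent code never sees the difference."""

    def __init__(self):
        self.handlers: Dict[str, Callable[[AgentMessage], AgentMessage]] = {}

    def register(self, agent_id: str, handler) -> None:
        self.handlers[agent_id] = handler

    def dispatch(self, agent_id: str, message: AgentMessage) -> AgentMessage:
        # Zero serialization, zero network hop; single-process only.
        return self.handlers[agent_id](message)
```

Because agents only ever hold a Transport, the Stage 2 migration is a constructor-argument change, not an agent-code change (invariant 3 in section 10).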

6.2 NATS Migration Trigger

Migrate from direct calls to NATS when:

  • Agent runtime spans multiple nodes (can't do in-process calls cross-node)
  • Need for persistent message queues (agent crashes, message survives)
  • Event-driven triggers need decoupling (scheduler → agent without direct reference)
  • Inter-cell communication begins (Phase 2+)

Expected trigger: Stage 2, when shared cells host enough agents to require multi-node deployment.

6.3 NATS Sizing

| Stage | Agents | Messages/sec (est.) | NATS cluster |
|---|---|---|---|
| Stage 2 | ~200 | ~50-200 | 3-node cluster, 10GB JetStream storage |
| Stage 3 | ~1000 | ~500-2000 | 5-node cluster per region, 100GB JetStream |

NATS is lightweight — a 3-node cluster handles millions of messages/sec. The bottleneck is never NATS throughput; it's consumer processing speed. Back-pressure management (NATS consumer acknowledgment + redelivery) handles slow consumers.

6.4 Message Patterns

| Pattern | Use case | Volume |
|---|---|---|
| Request/Reply | Agent A asks Agent B to do a task, waits for result | Low-medium (~1-5 per task) |
| Publish | Events (task completed, schedule fired, budget warning) | Medium (~10-50 per task) |
| Queue Group | Load-balanced consumption (multiple Agent Runtime instances pick up tasks) | Medium |
| JetStream | Durable event log for observation replay, audit | High (all observation events) |

7. Infrastructure Scaling Patterns

7.1 Kubernetes Scaling

Agent Runtime pods — HPA config:

  • Scale metric: agent count per pod (not CPU — agents are IO-bound, not CPU-bound)
  • Target: 30-50 agents per pod
  • Min replicas: 2 (HA)
  • Max replicas: per-cell limit based on tenant plan
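
A hedged sketch of that HPA, assuming the runtime exports the active agent count as a per-pod custom metric (the metric name active_agents and the numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-runtime
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-runtime
  minReplicas: 2               # HA floor
  maxReplicas: 20              # per-cell limit, set from the tenant plan
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_agents  # exported via a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "40"   # within the 30-50 agents-per-pod target
```

Scaling on a custom per-pod metric requires a metrics adapter (e.g. prometheus-adapter) to expose active_agents through the custom metrics API.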

Namespace resource quotas (per-tenant in shared cell):

  • CPU limit: prevents noisy neighbor
  • Memory limit: prevents OOM from runaway agent memory
  • Pod count limit: caps agent + subagent sprawl
  • PVC limit: caps storage claims

Node pool strategy:

  • General pool: platform components (Gateway, Scheduler, Logger)
  • Agent pool: Agent Runtime pods (potentially burstable/spot instances)
  • Database pool: Postgres, Qdrant (SSD-backed, consistent performance)

7.2 Monitoring Stack Scaling

| Component | Scaling concern | Mitigation |
|---|---|---|
| Prometheus | Metric cardinality explosion (per-agent, per-tenant labels) | Label discipline, recording rules for pre-aggregation, drop high-cardinality debug metrics |
| Prometheus retention | Storage at high scrape frequency | 15-day local retention, Thanos/Mimir for long-term storage |
| Loki | Log volume from agent execution | Structured logging (not free-text), log level filtering, retention policy |
| Grafana | Dashboard load with many tenants | Tenant-scoped dashboards, variable-based queries (not one dashboard per tenant) |

Stage 2+ (multi-cell): Prometheus federation or Thanos for cross-cell metric aggregation. Each cell runs its own Prometheus; a central Thanos query layer provides the global view.

7.3 Service Mesh Overhead

mTLS everywhere (zero-trust) adds latency:

  • Linkerd: ~1ms p99 overhead per hop. Lightweight sidecar (~20MB memory per pod).
  • Cilium: eBPF-based, even lower overhead (~0.5ms). No sidecar (kernel-level).

At Kaze's scale, mesh overhead is negligible compared to LLM call latency (~5-30 seconds). The 1ms per hop is lost in the noise.

Recommendation: Start with Cilium if the K8s environment supports eBPF. Otherwise Linkerd. Don't over-optimize — mesh overhead is not a scaling concern for Kaze.


8. Scaling Decision Triggers

Concrete metrics and thresholds that trigger scaling actions. Thresholds are initial estimates — calibrate from actual MVP measurements.

| Trigger metric | Threshold | Action | Stage |
|---|---|---|---|
| Postgres connection count | >80% of pool max | Deploy PgBouncer (transaction mode) | 1 |
| Concurrent agents per node | >50 | HPA scales agent runtime pods | 1 |
| pgvector query latency | p95 >200ms | Add read replica. If still >200ms: evaluate Qdrant. | 2 |
| LLM Gateway request queue depth | >100 pending for >30s | Add provider API keys to the pool. If maxed: enable local model overflow. | 1-2 |
| Observation write batch flush time | p95 >50ms | Increase batch size. Add monthly partitioning if not already in place. | 1 |
| Agent memory per instance | p90 >512MB | Review knowledge preload strategy. Reduce context window. Run Mem0 compaction. | 2 |
| Inter-agent message latency (direct calls) | p95 >100ms OR multi-node needed | Migrate from DirectCallTransport to NatsTransport. | 2 |
| Knowledge entry count (pgvector) | >5M vectors | Add Qdrant for hot-path reads. Keep pgvector as source of truth. | 2-3 |
| Observation table query time (historical) | p95 >500ms for traces | Partition by month if not already. Add indexes on task_id + timestamp. | 1 |
| Budget tracking write contention | >10% lock timeouts | Batch budget updates (aggregate per 5s window instead of per-request). | 2 |
| Postgres total database size | >500GB per instance | Split by component (observations, knowledge, agent state → separate instances). | 3 |
| LLM provider total throughput | >80% of pooled key capacity | Negotiate a higher tier. Add providers. Shift low-priority work to local models. | 2-3 |
| NATS message backlog | >10K unacknowledged per consumer | Scale consumer pods. If persistent: review the consumer processing bottleneck. | 2-3 |
| Embedding generation queue | >100 pending writes | Batch embedding calls. Add a dedicated embedding service with a local model. | 2-3 |

Monitoring these triggers:

  • All thresholds should be Prometheus alerts (warning at 70% of threshold, critical at threshold).
  • Dashboard with "scaling readiness" view: each trigger as a gauge showing current value vs threshold.
  • Weekly review in Stage 1-2 to calibrate thresholds based on actual production data.
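
A sketch of one such warning/critical alert pair in Prometheus rule syntax, using the pgvector latency trigger as the example (the metric name knowledge_query_seconds_bucket is illustrative):

```yaml
groups:
  - name: scaling-triggers
    rules:
      - alert: PgvectorLatencyWarning
        # 70% of the 200ms threshold
        expr: histogram_quantile(0.95, sum(rate(knowledge_query_seconds_bucket[5m])) by (le)) > 0.14
        for: 10m
        labels: {severity: warning}
      - alert: PgvectorLatencyCritical
        expr: histogram_quantile(0.95, sum(rate(knowledge_query_seconds_bucket[5m])) by (le)) > 0.2
        for: 10m
        labels: {severity: critical}
```

Each row in the trigger table above would get an analogous pair, with the warning expression at 70% of the listed threshold.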

9. Anti-Patterns to Avoid

| Anti-pattern | Why it's tempting | Why it's wrong | What to do instead |
|---|---|---|---|
| Pre-sharding the database | "We'll need it eventually" | Adds massive complexity before it's needed. A single Postgres handles more than you think. | Start with one Postgres; split only when metrics demand it. |
| Running NATS from day one | "We'll migrate eventually anyway" | Adds operational overhead, debugging complexity, and message ordering concerns — all for 15 agents that run in one process. | DirectCallTransport until multi-node is required. Same message interface, swap the transport. |
| Deploying Qdrant alongside pgvector at MVP | "Vector search will be slow" | pgvector is fast up to ~5M vectors. Two vector stores means two consistency models. | pgvector only, until query latency triggers the migration. |
| Per-agent Postgres connection | "Each agent needs its own connection" | 200 agents = 200 connections. Postgres doesn't scale that way. | Connection pooling (PgBouncer). Agents share pool connections. |
| Caching everything | "Caching improves performance" | Cache invalidation is hard. Knowledge consistency matters. Stale knowledge is worse than slow knowledge. | Cache only: Vault responses (TTL), model hint resolution (config change event), query embeddings (immutable). |
| Horizontal scaling before vertical | "Scale out for resilience" | Running 10 small instances instead of 2 right-sized ones adds network hops, coordination overhead, and debugging complexity. | Scale up first (bigger pods, bigger DB). Scale out when the vertical limit is reached. |

10. Key Architectural Invariants

Properties that must hold at every scale stage:

  1. Agent code doesn't change when infrastructure scales. An agent written at Stage 0 runs unmodified at Stage 3. All scaling happens below the Agent Runtime interface.
  2. Tenant isolation doesn't weaken at scale. Shared cells at Stage 1 have the same tenant isolation as dedicated cells at Stage 3. Namespace boundaries, network policies, and database scoping are non-negotiable.
  3. Message shape doesn't change when transport changes. AgentMessage envelope is the same whether sent via DirectCallTransport or NatsTransport. Agent code never knows the difference.
  4. Knowledge query interface doesn't change when storage changes. Adding Qdrant or read replicas is a backend concern. KnowledgeClient.query() signature stays the same.
  5. Observation is always fire-and-forget. At no scale should logging block agent execution. If the observer can't keep up, it drops — never blocks.