
Performance Scalability Model

Research for Project Kaze


1. Scale Milestones

Four stages of growth, each with different architectural requirements.

Stage 0: MVP (Speedrun-only)

1 cell · 3 verticals · ~15 agents · 1 Postgres · Direct calls · Single K8s cluster
  • All components run in one namespace on one cluster.
  • Single Postgres instance handles everything (agent state, knowledge, observations, schedules, budgets).
  • Inter-agent messaging via direct function calls (in-process).
  • LLM calls through 1-2 provider API keys per provider.
  • Monitoring stack is minimal (local Prometheus + Grafana).
  • Bottleneck: None expected — this is well within single-node capacity.

Stage 1: Early Clients (5-10 clients)

1-3 shared cells · 3-5 verticals · ~50 agents · Shared Postgres with connection pooling · Direct calls still viable
  • Shared cells with namespace isolation for cost efficiency.
  • Connection pooling (PgBouncer) becomes necessary — 50 agents × multiple connections per agent.
  • Observation Logger write volume starts growing — implement table partitioning.
  • LLM provider rate limits become relevant — need key pooling (multiple keys per provider).
  • Vault access patterns increase — implement Vault response caching in LLM Gateway.
  • First bottlenecks: LLM provider rate limits and Postgres connection count.

Stage 2: Growth (20-50 clients)

5-15 cells (mix shared + dedicated) · 5+ verticals · ~200 agents · Read replicas · NATS migration · Multiple K8s clusters possible
  • Dedicated cells for large/sensitive clients, shared cells for small clients.
  • Postgres read replicas for knowledge queries and observation reads.
  • NATS introduced for inter-agent messaging (multi-node deployment makes direct calls impractical).
  • pgvector index size growing — evaluate Qdrant for hot-path vector queries.
  • Observation storage partitioned aggressively (by month + tenant).
  • Multiple LLM provider accounts with key rotation.
  • Monitoring scales: Prometheus federation or Thanos for cross-cell aggregation.
  • First bottlenecks: pgvector query latency, Postgres write contention on hot tables, inter-agent communication across nodes.

Stage 3: Scale (100+ clients)

30+ cells · Multi-region · ~1000+ agents · Sharded databases · NATS clusters · Customer VPC deployments
  • Multi-region deployment with cell placement based on client geography.
  • Per-component database split: the knowledge system, observations, and agent state each get their own Postgres instance.
  • NATS superclusters for cross-region messaging.
  • Qdrant as dedicated vector DB alongside Postgres (pgvector still for cold-path, Qdrant for hot-path).
  • Horizontal scaling of LLM Gateway as a standalone service with distributed rate limiting.
  • Customer VPC deployments operating independently.
  • Federated monitoring with Thanos/Mimir.
  • First bottlenecks: Operational complexity, cross-region latency, VPC deployment automation.

2. Component Bottleneck Analysis

2.1 Agent Runtime

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Concurrent agents per node | Memory — each agent holds context, loaded skill definitions, active task state | ~50-100 agents per 8GB node (est. 50-150MB per agent with loaded context) | HPA on agent runtime pods, distribute across nodes |
| Task throughput | LLM call latency — most tasks are LLM-bound, not compute-bound | Limited by LLM Gateway throughput, not the runtime itself | Parallelize independent subtasks, pipeline LLM calls |
| Subagent fan-out | Memory + LLM concurrency — each subagent is a new agent instance | Depth limit (3-4 levels) + breadth limit (5-10 per parent) enforced by capability manifest | Hard limits in runtime config, parent budget shared |
| OpenClaw subprocess | Process count — OpenClaw spawns LLM backends as child processes | OS limits (~1000 processes), memory per process | Process pooling, shared backend instances across agents |
| Skill definition loading | Disk I/O + YAML parsing on agent spawn | Negligible — YAML files are small | Cache parsed definitions in memory after first load |

Key insight: Agent Runtime is almost never the bottleneck. Agents spend most of their time waiting on LLM calls, knowledge queries, and tool responses. The runtime itself is lightweight orchestration.

2.2 LLM Gateway

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Provider rate limits | Provider-imposed (tokens/min, requests/min) | Anthropic: ~400K input tokens/min (Tier 4), OpenAI: ~10M tokens/min (Tier 5) — varies by plan | Multi-key pooling, multi-provider fallback, request queuing with priority |
| Concurrent in-flight requests | Connection pool to providers + memory for streaming responses | ~100-500 concurrent (depends on average response time of ~5-30s) | Connection pooling, streaming response relay (don't buffer the full response) |
| Budget tracking writes | FOR UPDATE SKIP LOCKED contention on budget rows | Contention at ~50+ concurrent agents for the same tenant | Pre-request estimate is a read (cached); post-request update is the write. Batch budget updates every N seconds instead of per-request |
| Key resolution latency | Vault lookup per request | ~10-50ms per Vault call | Cache Vault responses with TTL (60s); keys don't change often |
| Model hint resolution | Tenant config lookup | Negligible — config cached in memory | Reload on config change event |

Key insight: LLM provider rate limits are the hard ceiling. Everything else in the gateway can be scaled horizontally. The strategy is: maximize tokens processed per dollar per second across all available providers and keys.
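
The batched budget update pattern can be sketched in a few lines. This is a hedged illustration, not the gateway's actual implementation: `flush_fn` stands in for the real database write, and a timer would drive `flush()` on the interval.

```python
import threading
from collections import defaultdict

class BudgetBatcher:
    """Sketch of batched budget tracking: aggregate per-request token costs
    in memory and issue one UPDATE per tenant per flush, instead of taking
    a row lock on every request. `flush_fn` is a placeholder for the real
    database write; a timer would call flush() every `interval` seconds."""

    def __init__(self, flush_fn, interval=5.0):
        self.flush_fn = flush_fn
        self.interval = interval
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def record(self, tenant_id, tokens):
        # Called per request: a cheap in-memory increment, no DB contention.
        with self.lock:
            self.pending[tenant_id] += tokens

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        for tenant_id, tokens in batch.items():
            # e.g. UPDATE budgets SET used = used + %(tokens)s WHERE tenant_id = %(tid)s
            self.flush_fn(tenant_id, tokens)
```

The trade-off is bounded staleness: budget enforcement can overshoot by at most one flush interval of spend.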

Multi-key pooling:

Provider: Anthropic
  Key pool:
    speedrun-key-1 (Tier 4): 400K tokens/min
    speedrun-key-2 (Tier 2): 100K tokens/min
    client-a-key (Tier 3):   200K tokens/min (for Client A only)

  Total Anthropic capacity: 700K tokens/min

  Routing: round-robin across eligible keys per request
  (eligible = keys the requesting tenant is allowed to use)
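
A minimal in-process sketch of this pooling policy. Key names and per-key tokens/min limits are illustrative; a production gateway would track the provider's actual rolling windows rather than a fixed 60s reset.

```python
import itertools
import time

class KeyPool:
    """Round-robin routing across eligible provider keys, each with its own
    tokens/min budget. Illustrative only: real limits come from the provider."""

    def __init__(self, keys):
        # keys: {key_name: tokens_per_minute}
        self.limits = dict(keys)
        self.window_start = time.monotonic()
        self.used = {name: 0 for name in keys}
        self.rr = itertools.cycle(sorted(keys))  # deterministic rotation

    def _maybe_reset_window(self):
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start = now
            self.used = {name: 0 for name in self.used}

    def acquire(self, tokens, eligible):
        """Pick the next eligible key with budget for `tokens`, or None."""
        self._maybe_reset_window()
        for _ in range(len(self.limits)):  # try each key at most once
            name = next(self.rr)
            if name in eligible and self.used[name] + tokens <= self.limits[name]:
                self.used[name] += tokens
                return name
        return None
```

Keys the tenant may not use are simply skipped; when every eligible key's window is exhausted, the request falls through to queuing or another provider.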

2.3 Knowledge System (Mem0 + pgvector)

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| pgvector query latency | HNSW index size in memory | <50ms at ~1M vectors (1536-dim), degrades above ~5M | HNSW ef_search tuning, add Qdrant at >5M vectors |
| Concurrent vector searches | Postgres connection pool + CPU for ANN search | ~20-50 concurrent searches per Postgres instance | Read replicas for knowledge reads, connection pooling |
| Embedding generation | LLM/embedding API call per knowledge write | Serialized per write — ~50-200ms per embedding | Batch embedding generation, async write pipeline (embed in background, index when ready) |
| Mem0 instance memory | Per-agent episodic memory storage | Grows with conversation length — compact old episodes | Mem0's built-in compaction, configure max memory window |
| Version history bloat | Every knowledge write creates a version entry | Linear growth — manageable for years | Compact old versions (keep latest N per entry), archive to cold storage |
| Index rebuild time | Full HNSW rebuild when adding vectors | Minutes at 1M vectors, hours at 10M+ | Incremental index updates (pgvector supports this), or Qdrant, which handles online indexing |

Key insight: pgvector is the first knowledge bottleneck. It's excellent up to ~5M vectors but degrades beyond that. The mitigation path is clear: add Qdrant for hot-path queries (agent reasoning) while keeping pgvector for cold-path (batch analytics, quality gate evaluation).

Scaling path:

Stage 0-1: pgvector only
  ↓ trigger: p95 query latency >200ms OR index >5M vectors
Stage 2: pgvector + Qdrant
  Qdrant: hot-path reads during agent reasoning
  pgvector: writes (single source of truth), cold reads, analytics
  ↓ trigger: write throughput >1000/min sustained
Stage 3: Qdrant primary for reads, Postgres for relational + writes
  Async replication: Postgres → Qdrant via CDC

2.4 Tool Integration Framework

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| External API rate limits | Per-provider (SEMrush: 10 req/sec, GitHub: 5,000/hr, Google: varies) | Varies widely per tool provider | Per-tool rate limiter in the framework, queue with backoff, cache repeated queries |
| Concurrent outbound connections | OS socket limits, K8s network policy overhead | ~1000 concurrent per pod (OS default) | Connection pooling per external service |
| Vault credential resolution | Same as LLM Gateway — Vault lookup latency | ~10-50ms per call | Cache with TTL, pre-warm credentials on agent spawn |
| Tool response parsing | CPU for JSON/XML parsing | Negligible unless a tool returns massive payloads | Set response size limits per tool definition |

Key insight: Tool Framework scaling is dominated by external API limits, not internal capacity. The framework itself is lightweight. The main architectural concern is: don't let one agent's tool calls starve another's. Per-tenant, per-tool rate limiting is essential.
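
The per-tenant, per-tool limiting can be sketched as a token bucket keyed by (tenant, tool). Rates and the injectable clock are illustrative; a production limiter would persist state and queue rather than reject.

```python
import time

class ToolRateLimiter:
    """Token-bucket limiter keyed by (tenant, tool), so one agent's tool
    calls cannot starve another tenant's. Rates are illustrative."""

    def __init__(self, rates, clock=time.monotonic):
        # rates: {tool_name: requests_per_second}
        self.rates = rates
        self.clock = clock
        self.state = {}  # (tenant, tool) -> (tokens, last_refill_time)

    def allow(self, tenant, tool):
        rate = self.rates[tool]
        burst = max(1.0, rate)  # allow up to ~1 second of burst
        now = self.clock()
        tokens, last = self.state.get((tenant, tool), (burst, now))
        tokens = min(burst, tokens + (now - last) * rate)  # refill
        if tokens >= 1.0:
            self.state[(tenant, tool)] = (tokens - 1.0, now)
            return True
        self.state[(tenant, tool)] = (tokens, now)
        return False
```

A denied call would go to the per-tool backoff queue described above rather than failing the task.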

2.5 Task Scheduler

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Schedule density | Polling query: WHERE next_run_at <= now() | Efficient with an index — 10K+ schedules is fine | B-tree index on next_run_at, no full table scan |
| Polling contention (HA) | FOR UPDATE SKIP LOCKED across replicas | Minimal — SKIP LOCKED avoids blocking | N replicas naturally partition the work |
| Event trigger throughput | Direct callback in MVP, NATS in Phase 2 | ~1000 events/sec is fine for direct calls | Move to NATS when event volume exceeds in-process capacity |
| Missed schedule catch-up | On restart, scan for overdue schedules | Only scans a 1hr lookback — bounded | Index on next_run_at makes this fast |

Key insight: Task Scheduler is the least likely bottleneck. It's a simple cron-like system with well-understood scaling characteristics. FOR UPDATE SKIP LOCKED is the proven pattern for distributed job scheduling in Postgres.
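
The claim query can be sketched as follows. Table and column names (`schedules`, `next_run_at`, `task_ref`) are assumptions, and transaction handling is elided.

```python
# Hypothetical schema: schedules(id, task_ref, next_run_at, ...).
CLAIM_DUE_SCHEDULES = """
SELECT id, task_ref
FROM schedules
WHERE next_run_at <= now()
ORDER BY next_run_at
FOR UPDATE SKIP LOCKED
LIMIT %(batch)s
"""

def claim_due(conn, batch=10):
    """Run by every scheduler replica inside its own transaction. Rows
    already locked by another replica are skipped rather than waited on,
    so N replicas partition the due schedules without blocking each other."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_DUE_SCHEDULES, {"batch": batch})
        return cur.fetchall()
```

The ORDER BY plus LIMIT keeps each poll cheap; the B-tree index on next_run_at makes the WHERE clause an index range scan.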

2.6 Observation Logger

| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Write volume | Events per second from all components | ~100-500 events/sec at Stage 1, ~5K-10K at Stage 3 | Batched writes (already designed: 100 events or 1s flush) |
| Storage growth | ~1KB per event average | ~250GB/month at a sustained 100 events/sec (≈8.6GB/day; scales linearly with event rate) | Monthly partitioning, retention policy (drop partitions >6 months old, archive to object storage) |
| Query performance | Full-text search across months of data | Degrades on unpartitioned tables | Partition by month + tenant; index by task_id, agent_id, timestamp |
| Batch flush under back-pressure | Database write latency spike → buffer fills | Buffer limit reached → drop oldest debug events first | Already designed: graceful degradation, never blocks agent execution |

Key insight: Observation Logger volume scales linearly with agent count and task frequency. The batched write design handles this well. The main concern is query performance on historical data — partitioning is the answer, not a bigger database.
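
The batch-and-shed behavior can be sketched as below. The event shape (a dict with a `level` field) and the limits are illustrative, with `flush_fn` standing in for the batched database write.

```python
import time

class ObservationBuffer:
    """Fire-and-forget event buffer: flush at `max_batch` events or `max_age`
    seconds, and under back-pressure shed the oldest debug event rather than
    block the caller. Field names and limits are illustrative."""

    def __init__(self, flush_fn, max_batch=100, max_age=1.0, max_buffer=1000):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_age = max_age
        self.max_buffer = max_buffer
        self.events = []
        self.oldest = None  # timestamp of the oldest buffered event

    def log(self, event, now=None):
        now = time.monotonic() if now is None else now
        if len(self.events) >= self.max_buffer:
            for i, e in enumerate(self.events):
                if e.get("level") == "debug":
                    del self.events[i]  # shed oldest debug event
                    break
            else:
                return  # no debug events to shed: drop the new event (simplification)
        if not self.events:
            self.oldest = now
        self.events.append(event)
        if len(self.events) >= self.max_batch or now - self.oldest >= self.max_age:
            batch, self.events = self.events, []
            self.flush_fn(batch)  # one batched INSERT in the real logger
```

Nothing in `log()` ever waits on the database, which is the invariant the design demands.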


3. LLM Provider Throughput

3.1 Provider Rate Limits (Approximate, subject to plan tier)

| Provider | Requests/min | Tokens/min (input) | Tokens/min (output) | Concurrent |
|---|---|---|---|---|
| Anthropic (Tier 4) | ~4,000 | ~400K | ~80K | — |
| OpenAI (Tier 5) | ~10,000 | ~10M | ~2M | — |
| Google Gemini (pay-as-you-go) | ~360 | ~4M | ~200K | — |
| Ollama (local) | Hardware-bound | Hardware-bound | Hardware-bound | 1 per GPU |

These are rough estimates — actual limits depend on specific models, account tier, and provider policy at the time.

3.2 What This Means for Agent Throughput

A typical agent task involves:

  • 1-3 LLM calls for reasoning (each ~2K input, ~1K output tokens)
  • 0-2 tool calls (which may trigger additional LLM calls for parsing)
  • 1 knowledge query (may trigger embedding generation)
  • Total: ~3-8 LLM calls per task, ~10-20K tokens per task

Throughput estimates per provider (single key). For Anthropic and OpenAI the binding constraint is tokens/min rather than requests/min; for Gemini the ~360 requests/min cap binds first:

| Provider | Binding limit | Tasks/min (est.) | Tasks/hour (est.) |
|---|---|---|---|
| Anthropic (Tier 4) | ~400K tokens/min | ~20-40 | ~1.2K-2.4K |
| OpenAI (Tier 5) | ~10M tokens/min | ~500-1000 | ~30K-60K |
| Google Gemini | ~360 requests/min | ~45-120 | ~3K-7K |

With multi-key pooling (3 keys per provider) and multi-provider routing, the system can handle ~5K-10K tasks/hour — well beyond Stage 2 requirements.

3.3 Rate Limit Management

Request comes in from Agent
  → LLM Gateway checks provider's token bucket
  → If tokens available: send immediately
  → If bucket empty but another provider eligible: route to alternative
  → If all buckets empty: queue with priority (high-priority tasks first)
  → If queue depth > threshold: reject lowest-priority requests

Priority levels:

  1. Supervised tasks (human waiting for output) — highest
  2. Active conversation responses — high
  3. Scheduled background tasks — normal
  4. Knowledge consolidation — low
  5. Quality evaluation — lowest
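
A sketch of the queue step: the five levels map onto a heap, and once depth exceeds the threshold the single lowest-priority request is rejected (which may be the incoming one). Class and level names are illustrative.

```python
import heapq
import itertools

PRIORITY = {"supervised": 0, "conversation": 1, "scheduled": 2,
            "consolidation": 3, "evaluation": 4}

class RequestQueue:
    """Priority queue for requests that could not be sent immediately."""

    def __init__(self, max_depth=100):
        self.max_depth = max_depth
        self.heap = []
        self.seq = itertools.count()  # FIFO tie-break within one priority level

    def enqueue(self, request, kind):
        """Queue a request; returns the rejected request if over capacity."""
        heapq.heappush(self.heap, (PRIORITY[kind], next(self.seq), request))
        if len(self.heap) > self.max_depth:
            # Evict the entry with the worst (priority, seq) pair.
            worst = max(range(len(self.heap)), key=lambda i: self.heap[i][:2])
            rejected = self.heap.pop(worst)[2]
            heapq.heapify(self.heap)
            return rejected
        return None

    def dequeue(self):
        """Next request to send when a token bucket frees up."""
        return heapq.heappop(self.heap)[2] if self.heap else None
```

Supervised tasks therefore survive saturation longest, and quality-evaluation work is shed first, matching the ordering above.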

3.4 Local Model Overflow

For non-critical tasks (knowledge consolidation, quality evaluation, embedding generation), local models via Ollama/vLLM can absorb overflow:

  • No rate limits (hardware-bound only)
  • No token costs
  • Higher latency, potentially lower quality
  • Good for: embedding generation, simple classification, data extraction
  • Not suitable for: complex reasoning, client-facing outputs

Trigger: Local model overflow activates when cloud provider queue depth exceeds threshold AND the task's model hint is fast or embed.
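
The trigger condition is a simple conjunction, sketched here as a predicate (parameter names are illustrative):

```python
def route_to_local(queue_depth, depth_threshold, model_hint):
    """Local overflow fires only when the cloud queue is saturated AND the
    task tolerates a smaller model (hints per the text: 'fast' or 'embed')."""
    return queue_depth > depth_threshold and model_hint in ("fast", "embed")
```

Tasks hinted for full reasoning models never overflow to local, regardless of queue depth.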


4. Knowledge System at Scale

4.1 pgvector Performance Characteristics

Based on published benchmarks and community data:

| Vector count | Dimensions | Index type | Query latency (p95) | Memory for index |
|---|---|---|---|---|
| 100K | 1536 | HNSW | ~5ms | ~600MB |
| 1M | 1536 | HNSW | ~20-50ms | ~6GB |
| 5M | 1536 | HNSW | ~100-200ms | ~30GB |
| 10M+ | 1536 | HNSW | ~300ms+ | ~60GB+ |

HNSW tuning parameters:

  • m (connections per node): higher = better recall, more memory. Default 16, increase to 32-64 for better recall.
  • ef_construction (build quality): higher = slower build, better index quality. Default 64, increase to 128-256.
  • ef_search (query quality): higher = better recall, slower queries. Tune dynamically based on latency budget.
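
As a sketch, the corresponding pgvector DDL can be generated like this. Table and column names are placeholders, and the operator class depends on the distance metric in use.

```python
def hnsw_index_ddl(table, column, m=16, ef_construction=64):
    """Build pgvector HNSW index DDL with the tuning knobs described above.
    Table/column names are placeholders; swap vector_cosine_ops for
    vector_l2_ops or vector_ip_ops to match the distance metric."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

# ef_search is a per-session query-time knob, not part of the index:
SET_EF_SEARCH = "SET hnsw.ef_search = 100;"
```

Raising m and ef_construction only affects newly built indexes; changing them requires a reindex, which is why ef_search is the knob to tune first.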

4.2 Growth Model

Estimated knowledge entries over time:

| Stage | Agents | Entries/agent/day | Total entries/day | After 1 year |
|---|---|---|---|---|
| Stage 0 | 15 | ~20 | ~300 | ~100K |
| Stage 1 | 50 | ~20 | ~1,000 | ~400K |
| Stage 2 | 200 | ~20 | ~4,000 | ~1.5M |
| Stage 3 | 1000 | ~20 | ~20,000 | ~7M |

At ~20 entries per agent per day (episodic events, observations, learned facts), pgvector stays performant through Stage 2. Stage 3 triggers the Qdrant addition.

4.3 Embedding Generation Throughput

Every knowledge write needs an embedding. At scale:

| Stage | Writes/min | Embedding API calls/min | Latency impact |
|---|---|---|---|
| Stage 0 | ~0.2 | ~0.2 | None |
| Stage 1 | ~0.7 | ~0.7 | None |
| Stage 2 | ~3 | ~3 | Negligible |
| Stage 3 | ~14 | ~14 | Queue if bursty |

Embedding generation is not a bottleneck for knowledge writes. The real concern is embedding for knowledge queries — each semantic search needs to embed the query text. At Stage 2-3 with many concurrent agents querying simultaneously, embedding the query becomes a latency factor.

Mitigation: Cache recent query embeddings (same query text = same embedding). Batch embedding requests where possible. Use embed model hint to route to fastest/cheapest embedding model.
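
A small LRU sketch of the query-embedding cache; `embed_fn` stands in for the real embedding call, and the size limit is illustrative. Because identical text always embeds to the same vector, entries never need invalidation.

```python
import hashlib

class EmbeddingCache:
    """LRU cache for query-text embeddings. `embed_fn` is a placeholder for
    the real embedding API call; cache size is illustrative."""

    def __init__(self, embed_fn, max_entries=10_000):
        self.embed_fn = embed_fn
        self.max_entries = max_entries
        self.cache = {}  # insertion-ordered dict used as a simple LRU

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.cache:
            self.cache[key] = self.cache.pop(key)  # mark as recently used
            return self.cache[key]
        vector = self.embed_fn(text)
        if len(self.cache) >= self.max_entries:
            self.cache.pop(next(iter(self.cache)))  # evict least recently used
        self.cache[key] = vector
        return vector
```

Hashing the text keeps keys small regardless of query length; the vectors themselves dominate the cache's memory footprint.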

4.4 Mem0 Scaling

Two options for Mem0 at scale:

Option A: Shared Mem0 instance with tenant/agent routing

  • Pro: Simpler ops, single instance to manage
  • Con: Noisy neighbor risk, single point of failure
  • Scaling: Vertical (bigger instance) → horizontal (Mem0 clustering if supported)

Option B: Mem0 instance per cell

  • Pro: Natural tenant isolation, failure isolation
  • Con: More instances to manage, higher base resource cost
  • Scaling: Cell-level scaling (each cell manages its own Mem0)

Recommendation: Option B (per-cell Mem0) aligns with the cell architecture. Each cell is self-contained. Mem0 instances are lightweight and don't need cross-cell communication.


5. Database Scaling Strategy

5.1 Connection Management

Agent count drives connection demand:

Per agent: ~3-5 connections (knowledge, observation, state, tools, scheduler)
Per component: ~2-3 shared connections for internal operations
Postgres default: 100 connections

Stage 0: 15 agents × 4 = ~60 connections → fine
Stage 1: 50 agents × 4 = ~200 connections → needs PgBouncer
Stage 2: 200 agents × 4 = ~800 connections → PgBouncer + read replicas
Stage 3: 1000 agents across cells → per-cell databases

PgBouncer config:

  • Mode: transaction (release connection after each transaction)
  • Pool size: 50-100 per cell (most agents are waiting on LLM, not DB)
  • Reserve: 5 connections for admin/monitoring
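
As a sketch, those settings map onto pgbouncer.ini roughly as follows; host, database name, and exact sizes are placeholders.

```ini
[databases]
kaze = host=postgres.cell.svc port=5432 dbname=kaze

[pgbouncer]
pool_mode = transaction       ; release server connection after each transaction
default_pool_size = 75        ; within the 50-100 per-cell range above
reserve_pool_size = 5         ; overflow headroom when the pool is saturated
max_client_conn = 1000        ; many idle agent clients multiplex onto few server conns
```

Note that PgBouncer's reserve_pool_size is overflow headroom under load; a hard admin/monitoring reserve is better enforced with Postgres's superuser_reserved_connections.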

5.2 Read Replica Strategy

| Query type | Route | When to add replica |
|---|---|---|
| Knowledge queries (vector search) | Read replica | Stage 2 (>200 agents, concurrent reads) |
| Observation queries (debugging, traces) | Read replica | Stage 1 (query volume grows with event count) |
| Budget checks (pre-request) | Read replica (slight lag acceptable) | Stage 2 |
| Budget updates (post-request) | Primary | Always primary (writes) |
| Schedule polling | Primary | Always primary (FOR UPDATE) |
| Agent state writes | Primary | Always primary |

5.3 Partitioning Strategy

| Table | Partition key | Strategy | When |
|---|---|---|---|
| observations | created_at (monthly) | Range partition by month. Drop partitions >6 months old, archive to object storage. | Stage 1 |
| memories | tenant_id | List partition by tenant. Each tenant's knowledge is physically separated. | Stage 2 |
| llm_usage_log | created_at (monthly) | Range partition. Retention: 3 months hot, archive older. | Stage 1 |
| schedule_executions | created_at (monthly) | Range partition. Low volume but grows indefinitely. | Stage 2 |

5.4 Database Split Points

Eventually, a single Postgres can't serve all components optimally:

| Trigger | Action |
|---|---|
| Observation writes impacting agent state latency | Split observation tables to a dedicated Postgres instance |
| Knowledge queries impacting other workloads | Split knowledge tables to a dedicated instance (+ Qdrant) |
| Per-cell database makes more sense than shared | Each cell gets its own Postgres (aligns with cell architecture) |

Likely split order: Observations first (highest write volume, read patterns are different from transactional queries), then knowledge (vector search is CPU-intensive and benefits from dedicated resources).


6. Messaging & Communication Scaling

6.1 Direct Calls (MVP — Stage 0-1)

Agent A calls Agent B: runtime.dispatch(agentB, task)
  → In-process function call via DirectCallTransport
  → Zero serialization, zero network hop
  → Limited to single process / single node

Advantages: Zero latency, no infrastructure, simple debugging. Ceiling: All agents must run in the same process. Can't distribute across nodes.
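
The seam that makes the later NATS swap safe is the transport interface. A hedged sketch, using the transport names from the design (the message fields shown are illustrative):

```python
from typing import Any, Callable, Dict, Protocol

AgentMessage = Dict[str, Any]  # same envelope regardless of transport

class Transport(Protocol):
    def dispatch(self, agent_id: str, message: AgentMessage) -> AgentMessage: ...

class DirectCallTransport:
    """In-process transport: handlers are registered locally and dispatch is
    a plain function call. A NatsTransport would implement the same interface
    by publishing the same envelope; agent code never sees the difference."""

    def __init__(self):
        self.handlers: Dict[str, Callable[[AgentMessage], AgentMessage]] = {}

    def register(self, agent_id: str, handler) -> None:
        self.handlers[agent_id] = handler

    def dispatch(self, agent_id: str, message: AgentMessage) -> AgentMessage:
        # Zero serialization, zero network hop; single-process only.
        return self.handlers[agent_id](message)
```

Because agents only ever hold a Transport, the Stage 2 migration is a constructor-argument change, not an agent-code change (invariant 3 in section 10).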

6.2 NATS Migration Trigger

Migrate from direct calls to NATS when:

  • Agent runtime spans multiple nodes (can't do in-process calls cross-node)
  • Need for persistent message queues (agent crashes, message survives)
  • Event-driven triggers need decoupling (scheduler → agent without direct reference)
  • Inter-cell communication begins (Phase 2+)

Expected trigger: Stage 2, when shared cells host enough agents to require multi-node deployment.

6.3 NATS Sizing

| Stage | Agents | Messages/sec (est.) | NATS cluster |
|---|---|---|---|
| Stage 2 | ~200 | ~50-200 | 3-node cluster, 10GB JetStream storage |
| Stage 3 | ~1000 | ~500-2000 | 5-node cluster per region, 100GB JetStream |

NATS is lightweight — a 3-node cluster handles millions of messages/sec. The bottleneck is never NATS throughput; it's consumer processing speed. Back-pressure management (NATS consumer acknowledgment + redelivery) handles slow consumers.

6.4 Message Patterns

| Pattern | Use case | Volume |
|---|---|---|
| Request/Reply | Agent A asks Agent B to do a task, waits for result | Low-medium (~1-5 per task) |
| Publish | Events (task completed, schedule fired, budget warning) | Medium (~10-50 per task) |
| Queue Group | Load-balanced consumption (multiple Agent Runtime instances pick up tasks) | Medium |
| JetStream | Durable event log for observation replay, audit | High (all observation events) |

7. Infrastructure Scaling Patterns

7.1 Kubernetes Scaling

Agent Runtime pods — HPA config:

  • Scale metric: agent count per pod (not CPU — agents are IO-bound, not CPU-bound)
  • Target: 30-50 agents per pod
  • Min replicas: 2 (HA)
  • Max replicas: per-cell limit based on tenant plan
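
A hedged sketch of that HPA, assuming the runtime exports the active agent count as a per-pod custom metric (the metric name active_agents and the numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-runtime
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-runtime
  minReplicas: 2               # HA floor
  maxReplicas: 20              # per-cell limit, set from the tenant plan
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_agents  # exported via a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "40"   # within the 30-50 agents-per-pod target
```

Scaling on a custom per-pod metric requires a metrics adapter (e.g. prometheus-adapter) to expose active_agents through the custom metrics API.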

Namespace resource quotas (per-tenant in shared cell):

  • CPU limit: prevents noisy neighbor
  • Memory limit: prevents OOM from runaway agent memory
  • Pod count limit: caps agent + subagent sprawl
  • PVC limit: caps storage claims

Node pool strategy:

  • General pool: platform components (Gateway, Scheduler, Logger)
  • Agent pool: Agent Runtime pods (potentially burstable/spot instances)
  • Database pool: Postgres, Qdrant (SSD-backed, consistent performance)

7.2 Monitoring Stack Scaling

| Component | Scaling concern | Mitigation |
|---|---|---|
| Prometheus | Metric cardinality explosion (per-agent, per-tenant labels) | Label discipline, recording rules for pre-aggregation, drop high-cardinality debug metrics |
| Prometheus retention | Storage at high scrape frequency | 15-day local retention, Thanos/Mimir for long-term storage |
| Loki | Log volume from agent execution | Structured logging (not free-text), log level filtering, retention policy |
| Grafana | Dashboard load with many tenants | Tenant-scoped dashboards, variable-based queries (not one dashboard per tenant) |

Stage 2+ (multi-cell): Prometheus federation or Thanos for cross-cell metric aggregation. Each cell runs its own Prometheus; a central Thanos query layer provides the global view.

7.3 Service Mesh Overhead

mTLS everywhere (zero-trust) adds latency:

  • Linkerd: ~1ms p99 overhead per hop. Lightweight sidecar (~20MB memory per pod).
  • Cilium: eBPF-based, even lower overhead (~0.5ms). No sidecar (kernel-level).

At Kaze's scale, mesh overhead is negligible compared to LLM call latency (~5-30 seconds). The 1ms per hop is lost in the noise.

Recommendation: Start with Cilium if the K8s environment supports eBPF. Otherwise Linkerd. Don't over-optimize — mesh overhead is not a scaling concern for Kaze.


8. Scaling Decision Triggers

Concrete metrics and thresholds that trigger scaling actions. Thresholds are initial estimates — calibrate from actual MVP measurements.

| Trigger metric | Threshold | Action | Stage |
|---|---|---|---|
| Postgres connection count | >80% of pool max | Deploy PgBouncer (transaction mode) | 1 |
| Concurrent agents per node | >50 | HPA scales agent runtime pods | 1 |
| pgvector query latency | p95 >200ms | Add read replica. If still >200ms: evaluate Qdrant. | 2 |
| LLM Gateway request queue depth | >100 pending for >30s | Add provider API keys to the pool. If maxed: enable local model overflow. | 1-2 |
| Observation write batch flush time | p95 >50ms | Increase batch size. Add monthly partitioning if not already in place. | 1 |
| Agent memory per instance | p90 >512MB | Review knowledge preload strategy. Reduce context window. Run Mem0 compaction. | 2 |
| Inter-agent message latency (direct calls) | p95 >100ms OR multi-node needed | Migrate from DirectCallTransport to NatsTransport. | 2 |
| Knowledge entry count (pgvector) | >5M vectors | Add Qdrant for hot-path reads. Keep pgvector as source of truth. | 2-3 |
| Observation table query time (historical) | p95 >500ms for traces | Partition by month if not already. Add indexes on task_id + timestamp. | 1 |
| Budget tracking write contention | >10% lock timeouts | Batch budget updates (aggregate per 5s window instead of per-request). | 2 |
| Postgres total database size | >500GB per instance | Split by component (observations, knowledge, agent state → separate instances). | 3 |
| LLM provider total throughput | >80% of pooled key capacity | Negotiate a higher tier. Add providers. Shift low-priority work to local models. | 2-3 |
| NATS message backlog | >10K unacknowledged per consumer | Scale consumer pods. If persistent: review the consumer processing bottleneck. | 2-3 |
| Embedding generation queue | >100 pending writes | Batch embedding calls. Add a dedicated embedding service with a local model. | 2-3 |

Monitoring these triggers:

  • All thresholds should be Prometheus alerts (warning at 70% of threshold, critical at threshold).
  • Dashboard with "scaling readiness" view: each trigger as a gauge showing current value vs threshold.
  • Weekly review in Stage 1-2 to calibrate thresholds based on actual production data.
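
A sketch of one such warning/critical alert pair in Prometheus rule syntax, using the pgvector latency trigger as the example (the metric name knowledge_query_seconds_bucket is illustrative):

```yaml
groups:
  - name: scaling-triggers
    rules:
      - alert: PgvectorLatencyWarning
        # 70% of the 200ms threshold
        expr: histogram_quantile(0.95, sum(rate(knowledge_query_seconds_bucket[5m])) by (le)) > 0.14
        for: 10m
        labels: {severity: warning}
      - alert: PgvectorLatencyCritical
        expr: histogram_quantile(0.95, sum(rate(knowledge_query_seconds_bucket[5m])) by (le)) > 0.2
        for: 10m
        labels: {severity: critical}
```

Each row in the trigger table above would get an analogous pair, with the warning expression at 70% of the listed threshold.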

9. Anti-Patterns to Avoid

| Anti-pattern | Why it's tempting | Why it's wrong | What to do instead |
|---|---|---|---|
| Pre-sharding the database | "We'll need it eventually" | Adds massive complexity before it's needed. A single Postgres handles more than you think. | Start with one Postgres; split only when metrics demand it. |
| Running NATS from day one | "We'll migrate eventually anyway" | Adds operational overhead, debugging complexity, and message ordering concerns — all for 15 agents that run in one process. | DirectCallTransport until multi-node is required. Same message interface, swap the transport. |
| Deploying Qdrant alongside pgvector at MVP | "Vector search will be slow" | pgvector is fast up to ~5M vectors. Two vector stores means two consistency models. | pgvector only, until query latency triggers the migration. |
| Per-agent Postgres connection | "Each agent needs its own connection" | 200 agents = 200 connections. Postgres doesn't scale that way. | Connection pooling (PgBouncer). Agents share pool connections. |
| Caching everything | "Caching improves performance" | Cache invalidation is hard. Knowledge consistency matters. Stale knowledge is worse than slow knowledge. | Cache only: Vault responses (TTL), model hint resolution (config change event), query embeddings (immutable). |
| Horizontal scaling before vertical | "Scale out for resilience" | Running 10 small instances instead of 2 right-sized ones adds network hops, coordination overhead, and debugging complexity. | Scale up first (bigger pods, bigger DB). Scale out when the vertical limit is reached. |

10. Key Architectural Invariants

Properties that must hold at every scale stage:

  1. Agent code doesn't change when infrastructure scales. An agent written at Stage 0 runs unmodified at Stage 3. All scaling happens below the Agent Runtime interface.
  2. Tenant isolation doesn't weaken at scale. Shared cells at Stage 1 have the same tenant isolation as dedicated cells at Stage 3. Namespace boundaries, network policies, and database scoping are non-negotiable.
  3. Message shape doesn't change when transport changes. AgentMessage envelope is the same whether sent via DirectCallTransport or NatsTransport. Agent code never knows the difference.
  4. Knowledge query interface doesn't change when storage changes. Adding Qdrant or read replicas is a backend concern. KnowledgeClient.query() signature stays the same.
  5. Observation is always fire-and-forget. At no scale should logging block agent execution. If the observer can't keep up, it drops — never blocks.