Performance Scalability Model
Research for Project Kaze
1. Scale Milestones
Four stages of growth, each with different architectural requirements.
Stage 0: MVP (Speedrun-only)
1 cell · 3 verticals · ~15 agents · 1 Postgres · Direct calls · Single K8s cluster
- All components run in one namespace on one cluster.
- Single Postgres instance handles everything (agent state, knowledge, observations, schedules, budgets).
- Inter-agent messaging via direct function calls (in-process).
- LLM calls through 1-2 provider API keys per provider.
- Monitoring stack is minimal (local Prometheus + Grafana).
- Bottleneck: None expected — this is well within single-node capacity.
Stage 1: Early Clients (5-10 clients)
1-3 shared cells · 3-5 verticals · ~50 agents · Shared Postgres with connection pooling · Direct calls still viable
- Shared cells with namespace isolation for cost efficiency.
- Connection pooling (PgBouncer) becomes necessary: 50 agents × several connections each approaches Postgres's default 100-connection limit.
- Observation Logger write volume starts growing — implement table partitioning.
- LLM provider rate limits become relevant — need key pooling (multiple keys per provider).
- Vault access patterns increase — implement Vault response caching in LLM Gateway.
- First bottleneck: LLM provider rate limits and Postgres connection count.
Stage 2: Growth (20-50 clients)
5-15 cells (mix shared + dedicated) · 5+ verticals · ~200 agents · Read replicas · NATS migration · Multiple K8s clusters possible
- Dedicated cells for large/sensitive clients, shared cells for small clients.
- Postgres read replicas for knowledge queries and observation reads.
- NATS introduced for inter-agent messaging (multi-node deployment makes direct calls impractical).
- pgvector index size growing — evaluate Qdrant for hot-path vector queries.
- Observation storage partitioned aggressively (by month + tenant).
- Multiple LLM provider accounts with key rotation.
- Monitoring scales: Prometheus federation or Thanos for cross-cell aggregation.
- First bottleneck: pgvector query latency, Postgres write contention on hot tables, inter-agent communication across nodes.
Stage 3: Scale (100+ clients)
30+ cells · Multi-region · ~1000+ agents · Sharded databases · NATS clusters · Customer VPC deployments
- Multi-region deployment with cell placement based on client geography.
- Per-component database split: knowledge system gets its own Postgres, observations get its own, agent state gets its own.
- NATS superclusters for cross-region messaging.
- Qdrant as dedicated vector DB alongside Postgres (pgvector still for cold-path, Qdrant for hot-path).
- Horizontal scaling of LLM Gateway as a standalone service with distributed rate limiting.
- Customer VPC deployments operating independently.
- Federated monitoring with Thanos/Mimir.
- First bottleneck: Operational complexity, cross-region latency, VPC deployment automation.
2. Component Bottleneck Analysis
2.1 Agent Runtime
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Concurrent agents per node | Memory — each agent holds context, loaded skill definitions, active task state | ~50-100 agents per 8GB node (est. 50-150MB per agent with loaded context) | HPA on agent runtime pods, distribute across nodes |
| Task throughput | LLM call latency — most tasks are LLM-bound, not compute-bound | Limited by LLM Gateway throughput, not runtime itself | Parallelize independent subtasks, pipeline LLM calls |
| Subagent fan-out | Memory + LLM concurrency — each subagent is a new agent instance | Depth limit (3-4 levels) + breadth limit (5-10 per parent) enforced by capability manifest | Hard limits in runtime config, parent budget shared |
| OpenClaw subprocess | Process count — OpenClaw spawns LLM backends as child processes | OS limits (~1000 processes), memory per process | Process pooling, shared backend instances across agents |
| Skill definition loading | Disk I/O + YAML parsing on agent spawn | Negligible — YAML files are small | Cache parsed definitions in memory after first load |
Key insight: Agent Runtime is almost never the bottleneck. Agents spend most of their time waiting on LLM calls, knowledge queries, and tool responses. The runtime itself is lightweight orchestration.
2.2 LLM Gateway
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Provider rate limits | Provider-imposed (tokens/min, requests/min) | Anthropic: ~400K input tokens/min (Tier 4), OpenAI: ~10M tokens/min (Tier 5) — varies by plan and model | Multi-key pooling, multi-provider fallback, request queuing with priority |
| Concurrent in-flight requests | Connection pool to providers + memory for streaming responses | ~100-500 concurrent (depends on average response time ~5-30s) | Connection pooling, streaming response relay (don't buffer full response) |
| Budget tracking writes | FOR UPDATE SKIP LOCKED contention on budget rows | Contention at ~50+ concurrent agents for same tenant | Pre-request estimate is a read (cached), post-request update is the write. Batch budget updates every N seconds instead of per-request |
| Key resolution latency | Vault lookup per request | ~10-50ms per Vault call | Cache Vault responses with TTL (60s). Key doesn't change often. |
| Model hint resolution | Tenant config lookup | Negligible — config cached in memory | Reload on config change event |
Key insight: LLM provider rate limits are the hard ceiling. Everything else in the gateway can be scaled horizontally. The strategy is: maximize tokens processed per dollar per second across all available providers and keys.
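The eligible-key, round-robin routing used in multi-key pooling can be sketched as follows (class and method names are illustrative, not part of the design):

```python
from itertools import cycle

class KeyPool:
    """Round-robin over provider keys, skipping keys a tenant may not use."""

    def __init__(self, keys):
        # keys: list of (key_id, tenant_allowlist); allowlist None = shared key
        self.keys = keys
        self._cycle = cycle(range(len(keys)))

    def next_key(self, tenant):
        # Scan at most one full lap of the pool for an eligible key.
        for _ in range(len(self.keys)):
            idx = next(self._cycle)
            key_id, allow = self.keys[idx]
            if allow is None or tenant in allow:
                return key_id
        raise LookupError(f"no eligible key for tenant {tenant!r}")
```

For example, a pool of two shared keys plus one client-reserved key routes other tenants only across the shared keys, while the reserved client can land on all three.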
Multi-key pooling:

```
Provider: Anthropic
Key pool:
  speedrun-key-1 (Tier 4): ~400K tokens/min
  speedrun-key-2 (Tier 2): ~100K tokens/min
  client-a-key (Tier 3): ~200K tokens/min (for Client A only)
Total Anthropic capacity: ~700K tokens/min
Routing: round-robin across eligible keys per request
(eligible = keys the requesting tenant is allowed to use)
```

2.3 Knowledge System (Mem0 + pgvector)
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| pgvector query latency | HNSW index size in memory | <50ms at ~1M vectors (1536-dim), degrades above ~5M | HNSW ef_search tuning, add Qdrant at >5M vectors |
| Concurrent vector searches | Postgres connection pool + CPU for ANN search | ~20-50 concurrent searches per Postgres instance | Read replicas for knowledge reads, connection pooling |
| Embedding generation | LLM/embedding API call per knowledge write | Serialized per write — ~50-200ms per embedding | Batch embedding generation, async write pipeline (embed in background, index when ready) |
| Mem0 instance memory | Per-agent episodic memory storage | Grows with conversation length — compact old episodes | Mem0's built-in compaction, configure max memory window |
| Version history bloat | Every knowledge write creates a version entry | Linear growth — manageable for years | Compact old versions (keep latest N per entry), archive to cold storage |
| Index rebuild time | Full HNSW rebuild when adding vectors | Minutes at 1M vectors, hours at 10M+ | Incremental index updates (pgvector supports this), or Qdrant which handles online indexing |
Key insight: pgvector is the first knowledge bottleneck. It's excellent up to ~5M vectors but degrades beyond that. The mitigation path is clear: add Qdrant for hot-path queries (agent reasoning) while keeping pgvector for cold-path (batch analytics, quality gate evaluation).
Scaling path:

```
Stage 0-1: pgvector only
  ↓ trigger: p95 query latency >200ms OR index >5M vectors
Stage 2: pgvector + Qdrant
  Qdrant: hot-path reads during agent reasoning
  pgvector: writes (single source of truth), cold reads, analytics
  ↓ trigger: write throughput >1000/min sustained
Stage 3: Qdrant primary for reads, Postgres for relational + writes
  Async replication: Postgres → Qdrant via CDC
```

2.4 Tool Integration Framework
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| External API rate limits | Per-provider (SEMrush: 10 req/sec, GitHub: 5000/hr, Google: varies) | Varies widely per tool provider | Per-tool rate limiter in framework, queue with backoff, cache repeated queries |
| Concurrent outbound connections | OS socket limits, K8s network policy overhead | ~1000 concurrent per pod (OS default) | Connection pooling per external service |
| Vault credential resolution | Same as LLM Gateway — Vault lookup latency | ~10-50ms per call | Cache with TTL, pre-warm credentials on agent spawn |
| Tool response parsing | CPU for JSON/XML parsing | Negligible unless tool returns massive payloads | Set response size limits per tool definition |
Key insight: Tool Framework scaling is dominated by external API limits, not internal capacity. The framework itself is lightweight. The main architectural concern is: don't let one agent's tool calls starve another's. Per-tenant, per-tool rate limiting is essential.
2.5 Task Scheduler
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Schedule density | Polling query: WHERE next_run_at <= now | Efficient with index — 10K+ schedules is fine | B-tree index on next_run_at, no full table scan |
| Polling contention (HA) | FOR UPDATE SKIP LOCKED across replicas | Minimal — SKIP LOCKED avoids blocking | N replicas naturally partition work |
| Event trigger throughput | Direct callback in MVP, NATS in Phase 2 | ~1000 events/sec is fine for direct calls | Move to NATS when event volume exceeds in-process capacity |
| Missed schedule catch-up | On restart, scan for overdue schedules | Only scans 1hr lookback — bounded | Index on next_run_at makes this fast |
Key insight: Task Scheduler is the least likely bottleneck. It's a simple cron-like system with well-understood scaling characteristics. FOR UPDATE SKIP LOCKED is the proven pattern for distributed job scheduling in Postgres.
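The claim pattern can be sketched in SQL. The `next_run_at` column is from the design above; the `schedules` table name and `definition` column are assumptions for illustration:

```sql
-- Each replica claims due schedules inside its own transaction.
-- SKIP LOCKED makes replicas pass over rows a peer has already locked,
-- so N replicas naturally partition the due work without blocking.
BEGIN;
SELECT id, definition
FROM schedules
WHERE next_run_at <= now()
ORDER BY next_run_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
-- dispatch the claimed rows, advance their next_run_at, then:
COMMIT;
```

The `ORDER BY next_run_at ... LIMIT` pair rides the B-tree index, so the poll stays cheap even with tens of thousands of schedules.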
2.6 Observation Logger
| Metric | What limits it | Expected ceiling | Mitigation |
|---|---|---|---|
| Write volume | Events per second from all components | ~100-500 events/sec at Stage 1, ~5K-10K at Stage 3 | Batched writes (already designed: 100 events or 1s flush) |
| Storage growth | ~1KB per event average | ~850GB/day (~25TB/month) if 10K events/sec were sustained; real duty cycles are far lower | Monthly partitioning, retention policy (drop partitions >6 months, archive to object storage) |
| Query performance | Full-text search across months of data | Degrades on unpartitioned tables | Partition by month + tenant, indexed by task_id, agent_id, timestamp |
| Batch flush under back-pressure | Database write latency spike → buffer fills | Buffer limit reached → drop oldest debug events first | Already designed: graceful degradation, never blocks agent execution |
Key insight: Observation Logger volume scales linearly with agent count and task frequency. The batched write design handles this well. The main concern is query performance on historical data — partitioning is the answer, not a bigger database.
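The batched, fire-and-forget write path can be sketched as follows. Batch size and flush interval match the design above; the flush callback, event shape, and buffer cap are assumptions (and a real logger would also flush from a background timer, not only on the next `log` call):

```python
import time
from collections import deque

class ObservationBuffer:
    """Buffer events; flush at batch_size or flush_interval; never block."""

    def __init__(self, flush_fn, batch_size=100, flush_interval=1.0, max_buffer=10_000):
        self.flush_fn = flush_fn          # e.g. a bulk INSERT
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.max_buffer = max_buffer
        self.buf = deque()
        self.last_flush = time.monotonic()

    def log(self, event):
        # Graceful degradation: on overflow, drop the oldest debug event first.
        if len(self.buf) >= self.max_buffer:
            for i, e in enumerate(self.buf):
                if e.get("level") == "debug":
                    del self.buf[i]
                    break
            else:
                self.buf.popleft()        # no debug events left: drop oldest
        self.buf.append(event)
        if (len(self.buf) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn(list(self.buf))
            self.buf.clear()
        self.last_flush = time.monotonic()
```

Note that back-pressure is absorbed by dropping, not by blocking the agent, which is the invariant in Section 10.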
3. LLM Provider Throughput
3.1 Provider Rate Limits (Approximate, subject to plan tier)
| Provider | Requests/min | Tokens/min (input) | Tokens/min (output) | Concurrent |
|---|---|---|---|---|
| Anthropic (Tier 4) | ~4,000 | ~400K | ~80K | — |
| OpenAI (Tier 5) | ~10,000 | ~10M | ~2M | — |
| Google Gemini (Pay-as-you-go) | ~360 | ~4M | ~200K | — |
| Ollama (local) | Hardware-bound | Hardware-bound | Hardware-bound | 1 per GPU |
These are rough estimates — actual limits depend on specific models, account tier, and provider policy at the time.
3.2 What This Means for Agent Throughput
A typical agent task involves:
- 1-3 LLM calls for reasoning (each ~2K input, ~1K output tokens)
- 0-2 tool calls (which may trigger additional LLM calls for parsing)
- 1 knowledge query (may trigger embedding generation)
- Total: ~3-8 LLM calls per task, ~10-20K tokens per task
Throughput estimates per provider (single key, taking the tighter of the request and token limits at ~6-16K input tokens per task):

| Provider | Tasks/min (est.) | Tasks/hour (est.) |
|---|---|---|
| Anthropic (Tier 4) | ~25-65 | ~1.5K-4K |
| OpenAI (Tier 5) | ~600-1600 | ~36K-96K |
| Google Gemini | ~45-120 | ~3K-7K |

Anthropic and OpenAI are token-bound at this task profile; Gemini is request-bound. With multi-key pooling (e.g., three keys per provider) and multi-provider routing, the system can comfortably sustain the ~5K-10K tasks/hour that Stage 2 requires.
3.3 Rate Limit Management
```
Request comes in from Agent
→ LLM Gateway checks provider's token bucket
→ If tokens available: send immediately
→ If bucket empty but another provider eligible: route to alternative
→ If all buckets empty: queue with priority (high-priority tasks first)
→ If queue depth > threshold: reject lowest-priority requests
```

Priority levels:
- Supervised tasks (human waiting for output) — highest
- Active conversation responses — high
- Scheduled background tasks — normal
- Knowledge consolidation — low
- Quality evaluation — lowest
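The bucket check and priority queueing above can be sketched as follows (provider names, the integer priority encoding, and the queue threshold are assumptions):

```python
import heapq
import time

class TokenBucket:
    """Refills at tokens_per_min; try_take is a non-blocking check."""

    def __init__(self, tokens_per_min, capacity=None):
        self.rate = tokens_per_min / 60.0
        self.capacity = capacity or tokens_per_min
        self.tokens = self.capacity
        self.last = time.monotonic()

    def try_take(self, n):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def route(request_tokens, priority, buckets, queue, max_queue=100):
    """Send via the first provider with capacity, else queue, else reject."""
    for provider, bucket in buckets.items():   # buckets in preference order
        if bucket.try_take(request_tokens):
            return ("send", provider)
    if len(queue) < max_queue:
        heapq.heappush(queue, (priority, request_tokens))  # lower int = higher priority
        return ("queued", None)
    return ("rejected", None)
```

A drain loop would pop the heap (highest priority first) whenever a bucket refills, which is exactly the "reject lowest-priority requests" behavior when the queue overflows.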
3.4 Local Model Overflow
For non-critical tasks (knowledge consolidation, quality evaluation, embedding generation), local models via Ollama/vLLM can absorb overflow:
- No rate limits (hardware-bound only)
- No token costs
- Higher latency, potentially lower quality
- Good for: embedding generation, simple classification, data extraction
- Not suitable for: complex reasoning, client-facing outputs
Trigger: Local model overflow activates when cloud provider queue depth exceeds threshold AND the task's model hint is `fast` or `embed`.
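That trigger is small enough to state directly (the threshold value here is an assumption to be calibrated):

```python
def use_local_model(queue_depth, model_hint, threshold=100):
    """Overflow to local models only for rate-limited, non-critical hints."""
    return queue_depth > threshold and model_hint in {"fast", "embed"}
```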
4. Knowledge System at Scale
4.1 pgvector Performance Characteristics
Based on published benchmarks and community data:
| Vector count | Dimensions | Index type | Query latency (p95) | Memory for index |
|---|---|---|---|---|
| 100K | 1536 | HNSW | ~5ms | ~600MB |
| 1M | 1536 | HNSW | ~20-50ms | ~6GB |
| 5M | 1536 | HNSW | ~100-200ms | ~30GB |
| 10M+ | 1536 | HNSW | ~300ms+ | ~60GB+ |
HNSW tuning parameters:
- `m` (connections per node): higher = better recall, more memory. Default 16; increase to 32-64 for better recall.
- `ef_construction` (build quality): higher = slower build, better index quality. Default 64; increase to 128-256.
- `ef_search` (query quality): higher = better recall, slower queries. Tune dynamically based on latency budget.
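As a sketch, assuming a `memories` table with a 1536-dim `embedding` column, these knobs map to pgvector DDL and a session setting:

```sql
-- Build-time parameters: m and ef_construction are fixed at index creation.
CREATE INDEX memories_embedding_hnsw ON memories
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 128);

-- Query-time recall/latency trade-off, tunable per session or transaction.
SET hnsw.ef_search = 100;
SELECT id FROM memories
ORDER BY embedding <=> $1   -- $1: the query embedding
LIMIT 10;
```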
4.2 Growth Model
Estimated knowledge entries over time:
| Stage | Agents | Entries/agent/day | Total entries/day | After 1 year |
|---|---|---|---|---|
| Stage 0 | 15 | ~20 | ~300 | ~100K |
| Stage 1 | 50 | ~20 | ~1,000 | ~400K |
| Stage 2 | 200 | ~20 | ~4,000 | ~1.5M |
| Stage 3 | 1000 | ~20 | ~20,000 | ~7M |
At ~20 entries per agent per day (episodic events, observations, learned facts), pgvector stays performant through Stage 2. Stage 3 triggers the Qdrant addition.
4.3 Embedding Generation Throughput
Every knowledge write needs an embedding. At scale:
| Stage | Writes/min | Embedding API calls/min | Latency impact |
|---|---|---|---|
| Stage 0 | ~0.2 | ~0.2 | None |
| Stage 1 | ~0.7 | ~0.7 | None |
| Stage 2 | ~3 | ~3 | Negligible |
| Stage 3 | ~14 | ~14 | Queue if burst |
Embedding generation is not a bottleneck for knowledge writes. The real concern is embedding for knowledge queries — each semantic search needs to embed the query text. At Stage 2-3 with many concurrent agents querying simultaneously, embedding the query becomes a latency factor.
Mitigation: Cache recent query embeddings (same query text = same embedding). Batch embedding requests where possible. Use the `embed` model hint to route to the fastest/cheapest embedding model.
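The query-embedding cache can be sketched as a small LRU keyed by query text; `embed_fn` stands in for the call out to the embedding model:

```python
from collections import OrderedDict

class EmbeddingCache:
    """LRU cache for query embeddings; safe because embeddings are immutable."""

    def __init__(self, embed_fn, max_entries=10_000):
        self.embed_fn = embed_fn
        self.max_entries = max_entries
        self.cache = OrderedDict()

    def embed(self, text):
        if text in self.cache:
            self.cache.move_to_end(text)       # mark as recently used
            return self.cache[text]
        vec = self.embed_fn(text)
        self.cache[text] = vec
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)     # evict least recently used
        return vec
```

Caching is safe here precisely because an embedding is a pure function of the query text, which is why this is one of the few caches the anti-patterns section permits.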
4.4 Mem0 Scaling
Two options for Mem0 at scale:
Option A: Shared Mem0 instance with tenant/agent routing
- Pro: Simpler ops, single instance to manage
- Con: Noisy neighbor risk, single point of failure
- Scaling: Vertical (bigger instance) → horizontal (Mem0 clustering if supported)
Option B: Mem0 instance per cell
- Pro: Natural tenant isolation, failure isolation
- Con: More instances to manage, higher base resource cost
- Scaling: Cell-level scaling (each cell manages its own Mem0)
Recommendation: Option B (per-cell Mem0) aligns with the cell architecture. Each cell is self-contained. Mem0 instances are lightweight and don't need cross-cell communication.
5. Database Scaling Strategy
5.1 Connection Management
Agent count drives connection demand:
```
Per agent: ~3-5 connections (knowledge, observation, state, tools, scheduler)
Per component: ~2-3 shared connections for internal operations
Postgres default: 100 connections

Stage 0: 15 agents × 4 = ~60 connections → fine
Stage 1: 50 agents × 4 = ~200 connections → needs PgBouncer
Stage 2: 200 agents × 4 = ~800 connections → PgBouncer + read replicas
Stage 3: 1000 agents across cells → per-cell databases
```

PgBouncer config:
- Mode: `transaction` (release connection after each transaction)
- Pool size: 50-100 per cell (most agents are waiting on LLM, not DB)
- Reserve: 5 connections for admin/monitoring
5.2 Read Replica Strategy
| Query type | Route | When to add replica |
|---|---|---|
| Knowledge queries (vector search) | Read replica | Stage 2 (>200 agents, concurrent reads) |
| Observation queries (debugging, traces) | Read replica | Stage 1 (query volume grows with event count) |
| Budget checks (pre-request) | Read replica (slight lag acceptable) | Stage 2 |
| Budget updates (post-request) | Primary | Always primary (writes) |
| Schedule polling | Primary | Always primary (FOR UPDATE) |
| Agent state writes | Primary | Always primary |
5.3 Partitioning Strategy
| Table | Partition key | Strategy | When |
|---|---|---|---|
| `observations` | `created_at` (monthly) | Range partition by month. Drop partitions >6 months, archive to object storage. | Stage 1 |
| `memories` | `tenant_id` | List partition by tenant. Each tenant's knowledge is physically separated. | Stage 2 |
| `llm_usage_log` | `created_at` (monthly) | Range partition. Retention: 3 months hot, archive older. | Stage 1 |
| `schedule_executions` | `created_at` (monthly) | Range partition. Low volume but grows indefinitely. | Stage 2 |
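The `observations` case can be sketched with declarative partitioning (the column list is illustrative; the real schema may differ):

```sql
-- Monthly range partitioning for the observation log.
CREATE TABLE observations (
    tenant_id  text        NOT NULL,
    task_id    text,
    agent_id   text,
    level      text,
    payload    jsonb,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE observations_2025_01 PARTITION OF observations
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Per-partition index supports trace lookups by task.
CREATE INDEX ON observations_2025_01 (task_id, created_at);

-- Retention: archive the old partition's contents, then drop it whole,
-- a metadata operation far cheaper than a bulk DELETE.
DROP TABLE observations_2024_07;
```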
5.4 Database Split Points
Eventually, a single Postgres can't serve all components optimally:
| Trigger | Action |
|---|---|
| Observation writes impacting agent state latency | Split observation tables to dedicated Postgres instance |
| Knowledge queries impacting other workloads | Split knowledge tables to dedicated instance (+ Qdrant) |
| Per-cell database makes more sense than shared | Each cell gets its own Postgres (aligns with cell architecture) |
Likely split order: Observations first (highest write volume, read patterns are different from transactional queries), then knowledge (vector search is CPU-intensive and benefits from dedicated resources).
6. Messaging & Communication Scaling
6.1 Direct Calls (MVP — Stage 0-1)
```
Agent A calls Agent B: runtime.dispatch(agentB, task)
→ In-process function call via DirectCallTransport
→ Zero serialization, zero network hop
→ Limited to single process / single node
```

Advantages: Zero latency, no infrastructure, simple debugging.
Ceiling: All agents must run in the same process. Can't distribute across nodes.
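The transport seam that makes the later NATS swap cheap can be sketched as follows. `DirectCallTransport` and `AgentMessage` follow the document; the envelope fields and registration API are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    payload: dict

class Transport(Protocol):
    def dispatch(self, msg: AgentMessage) -> None: ...

class DirectCallTransport:
    """In-process delivery: zero serialization, single node only."""

    def __init__(self):
        self.handlers: dict[str, Callable[[AgentMessage], None]] = {}

    def register(self, agent_id: str, handler: Callable[[AgentMessage], None]) -> None:
        self.handlers[agent_id] = handler

    def dispatch(self, msg: AgentMessage) -> None:
        self.handlers[msg.recipient](msg)   # plain function call, no network hop

# A NatsTransport would publish the same AgentMessage to a subject such as
# agents.<recipient>; agent code keeps calling dispatch() unchanged.
```

Because agents only ever see `Transport.dispatch`, swapping the implementation at Stage 2 is a deployment change, not a code change, which is the invariant in Section 10.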
6.2 NATS Migration Trigger
Migrate from direct calls to NATS when:
- Agent runtime spans multiple nodes (can't do in-process calls cross-node)
- Need for persistent message queues (agent crashes, message survives)
- Event-driven triggers need decoupling (scheduler → agent without direct reference)
- Inter-cell communication begins (Phase 2+)
Expected trigger: Stage 2, when shared cells host enough agents to require multi-node deployment.
6.3 NATS Sizing
| Stage | Agents | Messages/sec (est.) | NATS cluster |
|---|---|---|---|
| Stage 2 | ~200 | ~50-200 | 3-node cluster, 10GB JetStream storage |
| Stage 3 | ~1000 | ~500-2000 | 5-node cluster per region, 100GB JetStream |
NATS is lightweight — a 3-node cluster handles millions of messages/sec. The bottleneck is never NATS throughput; it's consumer processing speed. Back-pressure management (NATS consumer acknowledgment + redelivery) handles slow consumers.
6.4 Message Patterns
| Pattern | Use case | Volume |
|---|---|---|
| Request/Reply | Agent A asks Agent B to do a task, waits for result | Low-medium (~1-5 per task) |
| Publish | Events (task completed, schedule fired, budget warning) | Medium (~10-50 per task) |
| Queue Group | Load-balanced consumption (multiple Agent Runtime instances pick up tasks) | Medium |
| JetStream | Durable event log for observation replay, audit | High (all observation events) |
7. Infrastructure Scaling Patterns
7.1 Kubernetes Scaling
Agent Runtime pods — HPA config:
- Scale metric: agent count per pod (not CPU — agents are IO-bound, not CPU-bound)
- Target: 30-50 agents per pod
- Min replicas: 2 (HA)
- Max replicas: per-cell limit based on tenant plan
Namespace resource quotas (per-tenant in shared cell):
- CPU limit: prevents noisy neighbor
- Memory limit: prevents OOM from runaway agent memory
- Pod count limit: caps agent + subagent sprawl
- PVC limit: caps storage claims
Node pool strategy:
- General pool: platform components (Gateway, Scheduler, Logger)
- Agent pool: Agent Runtime pods (potentially burstable/spot instances)
- Database pool: Postgres, Qdrant (SSD-backed, consistent performance)
7.2 Monitoring Stack Scaling
| Component | Scaling concern | Mitigation |
|---|---|---|
| Prometheus | Metric cardinality explosion (per-agent, per-tenant labels) | Label discipline, recording rules for pre-aggregation, drop high-cardinality debug metrics |
| Prometheus retention | Storage at high scrape frequency | 15-day local retention, Thanos/Mimir for long-term storage |
| Loki | Log volume from agent execution | Structured logging (not free-text), log level filtering, retention policy |
| Grafana | Dashboard load with many tenants | Tenant-scoped dashboards, variable-based queries (not one dashboard per tenant) |
Stage 2+ (multi-cell): Prometheus federation or Thanos for cross-cell metric aggregation. Each cell runs its own Prometheus; a central Thanos query layer provides the global view.
7.3 Service Mesh Overhead
mTLS everywhere (zero-trust) adds latency:
- Linkerd: ~1ms p99 overhead per hop. Lightweight sidecar (~20MB memory per pod).
- Cilium: eBPF-based, even lower overhead (~0.5ms). No sidecar (kernel-level).
At Kaze's scale, mesh overhead is negligible compared to LLM call latency (~5-30 seconds). The 1ms per hop is lost in the noise.
Recommendation: Start with Cilium if the K8s environment supports eBPF. Otherwise Linkerd. Don't over-optimize — mesh overhead is not a scaling concern for Kaze.
8. Scaling Decision Triggers
Concrete metrics and thresholds that trigger scaling actions. Thresholds are initial estimates — calibrate from actual MVP measurements.
| Trigger Metric | Threshold | Action | Stage |
|---|---|---|---|
| Postgres connection count | >80% of pool max | Deploy PgBouncer (transaction mode) | 1 |
| Concurrent agents per node | >50 | HPA scales agent runtime pods | 1 |
| pgvector query latency | p95 >200ms | Add read replica. If still >200ms: evaluate Qdrant. | 2 |
| LLM Gateway request queue depth | >100 pending for >30s | Add provider API keys to pool. If maxed: enable local model overflow. | 1-2 |
| Observation write batch flush time | p95 >50ms | Increase batch size. Add monthly partitioning if not already. | 1 |
| Agent memory per instance | p90 >512MB | Review knowledge preload strategy. Reduce context window. Run Mem0 compaction. | 2 |
| Inter-agent message latency (direct calls) | p95 >100ms OR multi-node needed | Migrate from DirectCallTransport to NatsTransport. | 2 |
| Knowledge entry count (pgvector) | >5M vectors | Add Qdrant for hot-path reads. Keep pgvector as source of truth. | 2-3 |
| Observation table query time (historical) | p95 >500ms for traces | Partition by month if not already. Add indexes on task_id + timestamp. | 1 |
| Budget tracking write contention | >10% lock timeouts | Batch budget updates (aggregate per 5s window instead of per-request). | 2 |
| Postgres total database size | >500GB per instance | Split by component (observations, knowledge, agent state → separate instances). | 3 |
| LLM provider total throughput | >80% of pooled key capacity | Negotiate higher tier. Add providers. Shift low-priority to local models. | 2-3 |
| NATS message backlog | >10K unacknowledged per consumer | Scale consumer pods. If persistent: review consumer processing bottleneck. | 2-3 |
| Embedding generation queue | >100 pending writes | Batch embedding calls. Add dedicated embedding service with local model. | 2-3 |
Monitoring these triggers:
- All thresholds should be Prometheus alerts (warning at 70% of threshold, critical at threshold).
- Dashboard with "scaling readiness" view: each trigger as a gauge showing current value vs threshold.
- Weekly review in Stage 1-2 to calibrate thresholds based on actual production data.
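As an illustration, the first trigger expressed as a Prometheus alerting rule (metric names follow the common postgres_exporter conventions but are assumptions; thresholds as in the table above):

```yaml
groups:
  - name: scaling-triggers
    rules:
      - alert: PostgresConnectionsWarning
        # warning at 70% of the 80%-of-max threshold from the table
        expr: sum(pg_stat_activity_count) > 0.56 * max(pg_settings_max_connections)
        for: 5m
        labels:
          severity: warning
      - alert: PostgresConnectionsCritical
        expr: sum(pg_stat_activity_count) > 0.8 * max(pg_settings_max_connections)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near limit: deploy PgBouncer (transaction mode)"
```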
9. Anti-Patterns to Avoid
| Anti-pattern | Why it's tempting | Why it's wrong | What to do instead |
|---|---|---|---|
| Pre-sharding the database | "We'll need it eventually" | Adds massive complexity before it's needed. Single Postgres handles more than you think. | Start with one Postgres, split only when metrics demand it. |
| Running NATS from day one | "We'll migrate eventually anyway" | Adds operational overhead, debugging complexity, message ordering concerns — all for 15 agents that run in one process. | DirectCallTransport until multi-node is required. Same message interface, swap transport. |
| Deploying Qdrant alongside pgvector at MVP | "Vector search will be slow" | pgvector is fast up to ~5M vectors. Two vector stores means two consistency models. | pgvector only until query latency triggers the migration. |
| Per-agent Postgres connection | "Each agent needs its own connection" | 200 agents = 200 connections. Postgres doesn't scale that way. | Connection pooling (PgBouncer). Agents share pool connections. |
| Caching everything | "Caching improves performance" | Cache invalidation is hard. Knowledge consistency matters. Stale knowledge is worse than slow knowledge. | Cache only: Vault responses (TTL), model hint resolution (config change event), query embeddings (immutable). |
| Horizontal scaling before vertical | "Scale out for resilience" | Running 10 small instances instead of 2 right-sized ones adds network hops, coordination overhead, and debugging complexity. | Scale up first (bigger pods, bigger DB). Scale out when vertical limit is reached. |
10. Key Architectural Invariants
Properties that must hold at every scale stage:
- Agent code doesn't change when infrastructure scales. An agent written at Stage 0 runs unmodified at Stage 3. All scaling happens below the Agent Runtime interface.
- Tenant isolation doesn't weaken at scale. Shared cells at Stage 1 have the same tenant isolation as dedicated cells at Stage 3. Namespace boundaries, network policies, and database scoping are non-negotiable.
- Message shape doesn't change when transport changes. The `AgentMessage` envelope is the same whether sent via DirectCallTransport or NatsTransport; agent code never knows the difference.
- Knowledge query interface doesn't change when storage changes. Adding Qdrant or read replicas is a backend concern; the `KnowledgeClient.query()` signature stays the same.
- Observation is always fire-and-forget. At no scale should logging block agent execution. If the observer can't keep up, it drops — never blocks.