Non-Functional Assessment
Part of Project Kaze Architecture
Detailed models: cost-model.md · scalability-model.md
1. Security Posture
1.1 Zero-Secret Runtime
The runtime holds no API keys, no LLM credentials, no GitHub tokens. All secrets live in the gateway, injected via closures at tool registration time. If the runtime is compromised, the attacker cannot call LLM providers or external APIs directly.
┌─────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Runtime │────▶│ Gateway │────▶│ Providers │
│ (no keys) │ │ (holds secrets) │ │ │
│ │ │ injects via │ │ │
│ │ │ closures │ │ │
└─────────────┘ └──────────────────┘ └──────────────┘

1.2 Secrets Management
- Vault is the source of truth: K8s Secrets are derived copies via the ExternalSecrets Operator.
- Kubernetes auth: pods authenticate to Vault using service accounts; no static tokens.
- 1-minute refresh: secret rotation propagates automatically.
- Separation: each service has its own Vault path; the runtime cannot access the gateway's secrets.
1.3 Capability Enforcement
Agents declare capabilities in vertical.yaml. The runtime enforces:
- Only whitelisted tools can be invoked per vertical
- Subagents inherit at most the parent's capability set
- Supervision state is read-only to agents
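A minimal sketch of how this enforcement could look, assuming a parsed vertical.yaml shaped like `{ tools: [...] }`. The function names are illustrative, not the actual runtime API:

```javascript
// Per-vertical tool whitelisting (illustrative sketch, not the real runtime).
function makeToolGuard(vertical) {
  const allowed = new Set(vertical.tools); // whitelist from vertical.yaml
  return function invokeTool(name, input, handler) {
    if (!allowed.has(name)) {
      throw new Error(`Tool "${name}" is not whitelisted for this vertical`);
    }
    return handler(input);
  };
}

// A subagent inherits at most the parent's capability set:
function subagentCapabilities(parentTools, requestedTools) {
  return requestedTools.filter((t) => parentTools.includes(t));
}
```

The intersection in `subagentCapabilities` is what guarantees a subagent can never widen its scope beyond the parent, regardless of what it requests.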
1.4 Credential Injection Pattern
```javascript
// Gateway registers the tool with credentials bound via a closure.
// GITHUB_TOKEN is captured in closure scope at registration time;
// the agent never sees the token value.
registerTool("github_api", (input) => {
  return callGitHub(input, GITHUB_TOKEN);
});
```

1.5 Current Security Gaps
| Gap | Risk | Mitigation path |
|---|---|---|
| No network policies in K8s | Pods can reach any other pod | Add NetworkPolicy per namespace |
| No PII detection before LLM calls | Client data sent to providers | Add PII scanner in gateway |
| No egress filtering | Agents could reach arbitrary URLs | K8s egress policies + tool URL whitelist |
| Budget not enforced | Token spend not capped | Add budget tracking in gateway |
| Supervision stats in-memory | Reset on pod restart | Persist to database |
| No input sanitization | Prompt injection risk | Instruction hierarchy + sanitization layer |
2. Threat Model Summary
2.1 Trust Boundaries
TRUST BOUNDARY: Internet
┌───────────────────────────────────────────────────────────┐
│ │
│ Speedrun Infrastructure (K8s cluster) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Runtime · Gateway · Knowledge · Langfuse · Vault │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ LLM Providers (Anthropic, Google) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Data sent for inference — provider retention varies │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ External Tools (GitHub, future: SEMrush, Calendar) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Client data may flow to these services │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ OpenClaw Channels (Slack, WhatsApp, Telegram) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ User messages flow through channel providers │ │
│ └─────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

2.2 Key Threats
| # | Threat | Severity | Current status |
|---|---|---|---|
| T1 | Prompt injection — crafted input manipulates agent behavior | High | Not mitigated. Instruction hierarchy designed but not enforced. |
| T2 | Tenant isolation breach — one tenant accesses another's data | Critical | N/A — single tenant (Speedrun only) currently. Designed for namespace isolation. |
| T3 | Knowledge poisoning — false data enters knowledge store | Medium | Partially mitigated — Mem0 fact extraction filters noise. No quality gates on shared tier. |
| T4 | LLM data exposure — sensitive data sent to providers | Medium | Not mitigated. No PII detection, no data classification tags. |
| T5 | Credential theft — API keys stolen | High | Mitigated — Vault + zero-secret runtime + credential injection closures. |
| T6 | Agent privilege escalation — agent exceeds intended scope | Medium | Partially mitigated — capability whitelist in vertical.yaml. No per-task quotas. |
| T7 | Resource exhaustion — runaway agent burns resources | Medium | Partially mitigated — maxSteps limits agentic loops. No budget enforcement. |
| T8 | Supply chain attack — compromised dependency | Low-Medium | Partially mitigated — GHCR images, pinned deps. No dependency scanning in CI. |
| T9 | Data exfiltration — agent sends data to unauthorized endpoints | Medium | Not mitigated. No egress filtering. |
| T10 | Insider threat — operator abuses access | Low | Partially mitigated — Vault audit logging. No session recording, no bastion. |
2.3 MVP Security Priorities
Must have:
- [x] LLM API keys in Vault, never in agent code or logs
- [x] Zero-secret runtime (credential injection via gateway)
- [x] Agent capability manifest (tool whitelist per vertical)
- [x] Audit trail via Langfuse (all LLM calls traced)
- [ ] Task timeouts and tool call loop detection (maxSteps exists, no hard timeout)
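The unchecked timeout item could be a wall-clock cap around the task promise. A sketch, assuming the runtime exposes a promise-returning task entry point (not current behavior — maxSteps exists, this does not):

```javascript
// Hard task timeout via Promise.race (illustrative sketch).
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Task exceeded ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterward.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

This complements maxSteps: the step limit bounds loop iterations, while the wall-clock cap bounds a single slow step (e.g. a hung tool call).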
Next priority:
- Network policies per namespace
- PII detection before LLM calls
- Budget enforcement (hard stops)
- Egress filtering per tenant
- Dependency scanning in CI
Full threat model details in the original threat-model.md.
3. Cost Model
3.1 Cost Structure
┌────────────────────────────────────────────────────────┐
│ VARIABLE COSTS (scale with usage) │
│ │
│ ████████████████████████████████ LLM Tokens (60-80%) │
│ ████████ External APIs (10-15%)│
│ ████ Embedding Gen (3-5%) │
│ │
│ SEMI-FIXED COSTS (scale with tenants) │
│ │
│ ██████████████ Compute / K8s │
│ ████████ Database │
│ │
│ FIXED COSTS (exist regardless) │
│ │
│ ██████ Control plane │
│ ████ CI/CD + Registry │
│ ████ Monitoring │
│ ██ Vault │
└────────────────────────────────────────────────────────┘

Key insight: LLM token cost dominates. A 20% reduction in tokens per task saves more than halving infrastructure costs. Cost optimization should focus on LLM efficiency.
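A back-of-envelope check of that claim, using rough midpoints of the shares in the diagram (illustrative, not measured):

```javascript
// Compare two savings levers as fractions of total spend.
const llmShare = 0.70;    // LLM tokens: 60-80% of total spend (midpoint-ish)
const infraShare = 0.25;  // semi-fixed + fixed infrastructure, approx.

const tokenCutSavings = llmShare * 0.20;      // cut tokens per task by 20%
const infraHalvingSavings = infraShare * 0.5; // halve all infrastructure costs

console.log(tokenCutSavings.toFixed(3));      // prints 0.140 → ~14% of total
console.log(infraHalvingSavings.toFixed(3));  // prints 0.125 → 12.5% of total
```

At the low end of the LLM share (60%), the two levers are roughly equal; anywhere above that, token efficiency wins.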
3.2 Cost per Task
| Task Type | Fast (Haiku) | Balanced (Sonnet) | Best (Opus) | Cheapest (Gemini Flash-Lite) |
|---|---|---|---|---|
| Simple extraction | $0.004 | $0.011 | $0.018 | $0.0004 |
| Keyword research | $0.014 | $0.042 | $0.070 | $0.002 |
| Content optimization | $0.021 | $0.063 | $0.105 | $0.002 |
| Research synthesis | $0.028 | $0.084 | $0.140 | $0.003 |
| Technical audit | $0.035 | $0.105 | $0.175 | $0.004 |
Even complex tasks cost under $0.20 with the most expensive model. The V0 Internal Ops vertical uses fast (Haiku) by default — most tasks cost $0.01-0.04.
3.3 Cost Optimization Levers
| Lever | Savings | Implementation |
|---|---|---|
| Model routing | 5-10x | Use fast for simple tasks, balanced only when needed, best rarely |
| Prompt caching (Anthropic) | 90% on cached portion | Stable system prompts cached across calls |
| Batch API (Anthropic/OpenAI) | 50% | Non-urgent tasks (quality evaluation, knowledge consolidation) |
| Gemini Flash-Lite | 10-50x vs Opus | Bulk processing, classification, data extraction |
| Knowledge context pruning | 20-30% token reduction | Only inject relevant memories, not all matches |
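The model-routing lever can be sketched as a small tier lookup. The classification heuristic and model names below are assumptions for illustration, not the gateway's actual logic:

```javascript
// Route tasks to a model tier; tiers match the cost table above.
const TIERS = {
  fast: "claude-haiku",      // simple extraction, classification
  balanced: "claude-sonnet", // multi-step reasoning
  best: "claude-opus",       // rarely: hardest synthesis tasks
};

function routeModel(task) {
  if (task.complexity === "high" && task.qualityCritical) return TIERS.best;
  if (task.complexity === "high") return TIERS.balanced;
  return TIERS.fast; // default: the cheap tier
}
```

Defaulting to the cheap tier and escalating only on explicit signals is what makes the 5-10x savings realistic: most tasks never leave the fast path.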
3.4 Infrastructure Costs (Current Stage)
| Component | Monthly est. | Notes |
|---|---|---|
| K8s cluster (3 nodes) | ~$150-300 | Depends on provider/instance type |
| PostgreSQL + pgvector | ~$50-100 | Small instance, single node |
| Langfuse | ~$0-50 | Self-hosted or free tier |
| Vault | ~$0 | Runs as pod in cluster |
| Container registry (GHCR) | ~$0 | Free for public/org repos |
| Total infra | ~$200-450/mo | |
Variable costs (LLM) at current usage (V0 Internal Ops, ~50-100 tasks/day):
- Using fast (Haiku): ~$15-60/mo
- Using balanced (Sonnet): ~$50-200/mo
4. Scalability Assessment
4.1 Scale Stages
| Stage | Scale | Architecture | First bottleneck |
|---|---|---|---|
| 0: MVP (current) | 1 cell, ~15 agents, 1 Postgres | All in one namespace, direct HTTP calls | None expected |
| 1: Early clients | 5-10 clients, ~50 agents | Shared cells, namespace isolation | LLM rate limits, Postgres connections |
| 2: Growth | 20-50 clients, ~200 agents | Mixed shared/dedicated cells, read replicas, NATS | pgvector latency, write contention |
| 3: Scale | 100+ clients, ~1000+ agents | Multi-region, sharded DBs, NATS clusters | Operational complexity |
4.2 Component Bottleneck Analysis
Agent Runtime:
- ~50-100 agents per 8GB node (est. 50-150MB per agent with loaded context)
- Almost never the bottleneck — agents spend most time waiting on LLM calls
- Scales horizontally with HPA
LLM Gateway:
- Hard ceiling: LLM provider rate limits (Anthropic ~4M tokens/min Tier 4)
- Mitigation: multi-key pooling, multi-provider fallback, request queuing
- Gateway itself is stateless, scales horizontally
Knowledge Service:
- Bottleneck: pgvector query latency as index grows
- At Stage 2 (~200 agents, millions of vectors): evaluate Qdrant for hot-path queries
- Embedding generation is batched (100/batch) — throughput adequate for current scale
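The 100-per-batch flow reduces to a chunking step like this sketch (the provider call itself is omitted):

```javascript
// Split items into batches of 100 for embedding generation.
function chunkForEmbedding(items, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```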
PostgreSQL:
- Stage 0-1: Single instance sufficient
- Stage 2: Read replicas for knowledge queries and observation reads
- Stage 3: Per-component database split (knowledge, observations, agent state)
4.3 Horizontal vs Vertical Scaling
| Component | Scales horizontally | Scales vertically | Notes |
|---|---|---|---|
| Runtime | Yes (stateless pods) | N/A | HPA on CPU/memory |
| Gateway | Yes (stateless pods) | N/A | HPA on request count |
| Knowledge | Yes (stateless pods) | Database grows | DB is the bottleneck, not the service |
| PostgreSQL | Read replicas | Bigger instance | Write master is vertical until sharding |
| pgvector | N/A | Index optimization | Consider Qdrant at scale |
4.4 Decision Triggers
| Trigger | When | Action |
|---|---|---|
| Postgres connections > 100 | Stage 1 (~50 agents) | Add PgBouncer |
| LLM rate limit errors > 1% | Stage 1 (high throughput) | Multi-key pooling |
| Observation table > 100M rows | Stage 1-2 | Partition by month + tenant |
| pgvector query p99 > 500ms | Stage 2 (~1M vectors) | Evaluate Qdrant |
| Inter-agent calls cross nodes | Stage 2 | Introduce NATS |
| Write contention on hot tables | Stage 2 | Read replicas, batch writes |
5. Reliability
5.1 Current State
| Aspect | Status | Notes |
|---|---|---|
| Redundancy | Single instance per service | No HA — acceptable for MVP |
| Backups | Manual | PostgreSQL backup not automated |
| Health checks | Basic HTTP liveness | No readiness probes beyond startup |
| Recovery | K8s restart policy | Pod restart on crash, no data recovery automation |
| Monitoring | Langfuse (LLM only) | No infrastructure monitoring (Prometheus/Grafana) |
5.2 Error Handling (Implemented)
| Component | Failure mode | Recovery |
|---|---|---|
| Runtime | Agent crash during task | Mark task failed, log error. Three consecutive failures demote the agent's supervision level. |
| Runtime | Gateway unreachable | Task fails with connection error. No retry. |
| Gateway | LLM provider error | Returns error to runtime. No automatic fallback between providers yet. |
| Gateway | Tool execution error | Returns error payload. Agent can reason about it and retry. |
| Knowledge | Database unavailable | Service returns 500. Runtime proceeds without memory context. |
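The knowledge-service row describes graceful degradation: on failure, the runtime proceeds with an empty memory context. A sketch, where `fetchMemories` is a hypothetical client call:

```javascript
// Degrade gracefully when the knowledge service is unavailable.
async function getMemoryContext(fetchMemories, query) {
  try {
    return await fetchMemories(query);
  } catch {
    return []; // knowledge service down → run the task without memories
  }
}
```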
5.3 What's Needed for Production
- [ ] Health/readiness probes on all services
- [ ] Automated PostgreSQL backups (daily minimum)
- [ ] Multi-replica deployments (at least 2 per service)
- [ ] Infrastructure monitoring (Prometheus + Grafana)
- [ ] Alerting (PagerDuty/Slack integration)
- [ ] Provider fallback in gateway (Anthropic → Google → etc.)
- [ ] Graceful degradation when knowledge service is down
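The provider-fallback item could take the shape of an ordered try-each loop. A sketch, with each entry a hypothetical async call into a provider client (not yet implemented in the gateway):

```javascript
// Try providers in priority order; surface the last error if all fail.
async function completeWithFallback(providers, request) {
  let lastError;
  for (const call of providers) {
    try {
      return await call(request);
    } catch (err) {
      lastError = err; // fall through to the next provider
    }
  }
  throw lastError;
}
```

A production version would also distinguish retryable errors (rate limits, 5xx) from non-retryable ones (invalid request), so a malformed prompt does not cascade across every provider.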
6. Performance Characteristics
6.1 Latency Profile
| Operation | Typical latency | Bottleneck |
|---|---|---|
| Task dispatch (sync, simple) | 2-10s | LLM response time |
| Task dispatch (agentic, 3-6 steps) | 10-60s | Multiple LLM calls + tool execution |
| Knowledge search | 50-200ms | pgvector similarity search |
| Knowledge add | 1-3s | LLM fact extraction + embedding |
| Knowledge add-raw | 200-500ms | Embedding only |
| Knowledge add-raw-batch (100 items) | 5-15s | Batch embedding + insert |
| Tool execution (GitHub API) | 200-500ms | GitHub API response time |
| Tool execution (Docling) | 5-30s | Document conversion complexity |
6.2 Throughput
At current scale, throughput is not a concern. The system is designed for quality of agent reasoning, not high-throughput processing. Key constraints:
- LLM concurrency: Limited by provider rate limits, not system architecture
- Knowledge writes: Batch operations handle bulk ingestion efficiently
- Tool calls: Rate-limited by external API quotas, not internal capacity
7. Compliance & Governance Readiness
7.1 Current State
| Requirement | Status |
|---|---|
| Audit trail (who did what, when) | Partial — Langfuse traces LLM calls. No structured agent action log. |
| Data residency controls | Not applicable — single deployment. Architecture supports VPC mode. |
| Access control (RBAC/ABAC) | Not implemented. Single-tenant, single-user. |
| Data retention policies | Not implemented. Knowledge persists indefinitely. |
| Incident response | Not documented. |
7.2 Path to Compliance
| Milestone | Effort | When needed |
|---|---|---|
| SOC 2 Type I readiness | High | Before enterprise clients |
| GDPR data subject rights | Medium | Before EU clients |
| ISO 27001 certification | High | Market differentiation |
| Formal incident response playbook | Low | Before multi-tenant production |