Non-Functional Assessment
Part of Project Kaze Architecture
Detailed models: cost-model.md · scalability-model.md
1. Security Posture
1.1 Zero-Secret Runtime
The runtime holds no API keys, no LLM credentials, no GitHub tokens. All secrets live in the gateway, injected via closures at tool registration time. If the runtime is compromised, the attacker cannot call LLM providers or external APIs directly.
┌─────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Runtime │────▶│ Gateway │────▶│ Providers │
│ (no keys) │ │ (holds secrets) │ │ │
│ │ │ injects via │ │ │
│ │ │ closures │ │ │
└─────────────┘ └──────────────────┘ └──────────────┘

1.2 Secrets Management
- Vault is the source of truth: K8s Secrets are derived copies via the ExternalSecrets Operator.
- Kubernetes auth: pods authenticate to Vault using service accounts; no static tokens.
- 1-minute refresh: secret rotation propagates automatically.
- Separation: each service has its own Vault path; the runtime cannot access the gateway's secrets.
1.3 Capability Enforcement
Agents declare capabilities in vertical.yaml. The runtime enforces:
- Only whitelisted tools can be invoked per vertical
- Subagents inherit at most the parent's capability set
- Supervision state is read-only to agents
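A minimal sketch of how this enforcement could look, assuming a parsed vertical.yaml shaped like `{ tools: [...] }`. The function names are illustrative, not the actual runtime API:

```javascript
// Per-vertical tool whitelisting (illustrative sketch, not the real runtime).
function makeToolGuard(vertical) {
  const allowed = new Set(vertical.tools); // whitelist from vertical.yaml
  return function invokeTool(name, input, handler) {
    if (!allowed.has(name)) {
      throw new Error(`Tool "${name}" is not whitelisted for this vertical`);
    }
    return handler(input);
  };
}

// A subagent inherits at most the parent's capability set:
function subagentCapabilities(parentTools, requestedTools) {
  return requestedTools.filter((t) => parentTools.includes(t));
}
```

The intersection in `subagentCapabilities` is what guarantees a subagent can never widen its scope beyond the parent, regardless of what it requests.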
1.4 Credential Injection Pattern
```javascript
// Gateway registers the tool with credentials bound via a closure.
// GITHUB_TOKEN is captured in closure scope at registration time;
// the agent never sees the token value.
registerTool("github_api", (input) => {
  return callGitHub(input, GITHUB_TOKEN);
});
```

1.5 Current Security Gaps
| Gap | Risk | Mitigation path |
|---|---|---|
| No network policies in K8s | Pods can reach any other pod | Add NetworkPolicy per namespace |
| No PII detection before LLM calls | Client data sent to providers | Add PII scanner in gateway |
| No egress filtering | Agents could reach arbitrary URLs | K8s egress policies + tool URL whitelist |
| Budget not enforced | Token spend not capped | Add budget tracking in gateway |
| Supervision stats in-memory | Reset on pod restart | Persist to database |
| No input sanitization | Prompt injection risk | Instruction hierarchy + sanitization layer |
2. Threat Model Summary
2.1 Trust Boundaries
TRUST BOUNDARY: Internet
┌───────────────────────────────────────────────────────────┐
│ │
│ Speedrun Infrastructure (K8s cluster) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Runtime · Gateway · Knowledge · Langfuse · Vault │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ LLM Providers (Anthropic, Google) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Data sent for inference — provider retention varies │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ External Tools (GitHub, future: SEMrush, Calendar) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Client data may flow to these services │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ OpenClaw Channels (Slack, WhatsApp, Telegram) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ User messages flow through channel providers │ │
│ └─────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

2.2 Key Threats
| # | Threat | Severity | Current status |
|---|---|---|---|
| T1 | Prompt injection — crafted input manipulates agent behavior | High | Not mitigated. Instruction hierarchy designed but not enforced. |
| T2 | Tenant isolation breach — one tenant accesses another's data | Critical | N/A — single tenant (Speedrun only) currently. Designed for namespace isolation. |
| T3 | Knowledge poisoning — false data enters knowledge store | Medium | Partially mitigated — Mem0 fact extraction filters noise. No quality gates on shared tier. |
| T4 | LLM data exposure — sensitive data sent to providers | Medium | Not mitigated. No PII detection, no data classification tags. |
| T5 | Credential theft — API keys stolen | High | Mitigated — Vault + zero-secret runtime + credential injection closures. |
| T6 | Agent privilege escalation — agent exceeds intended scope | Medium | Partially mitigated — capability whitelist in vertical.yaml. No per-task quotas. |
| T7 | Resource exhaustion — runaway agent burns resources | Medium | Partially mitigated — maxSteps limits agentic loops. No budget enforcement. |
| T8 | Supply chain attack — compromised dependency | Low-Medium | Partially mitigated — GHCR images, pinned deps. No dependency scanning in CI. |
| T9 | Data exfiltration — agent sends data to unauthorized endpoints | Medium | Not mitigated. No egress filtering. |
| T10 | Insider threat — operator abuses access | Low | Partially mitigated — Vault audit logging. No session recording, no bastion. |
2.3 MVP Security Priorities
Must have:
- [x] LLM API keys in Vault, never in agent code or logs
- [x] Zero-secret runtime (credential injection via gateway)
- [x] Agent capability manifest (tool whitelist per vertical)
- [x] Audit trail via Langfuse (all LLM calls traced)
- [ ] Task timeouts and tool call loop detection (maxSteps exists, no hard timeout)
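The unchecked timeout item could be a wall-clock cap around the task promise. A sketch, assuming the runtime exposes a promise-returning task entry point (not current behavior — maxSteps exists, this does not):

```javascript
// Hard task timeout via Promise.race (illustrative sketch).
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Task exceeded ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterward.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

This complements maxSteps: the step limit bounds loop iterations, while the wall-clock cap bounds a single slow step (e.g. a hung tool call).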
Next priority:
- Network policies per namespace
- PII detection before LLM calls
- Budget enforcement (hard stops)
- Egress filtering per tenant
- Dependency scanning in CI
Full threat model details in the original threat-model.md.
3. Cost Model
3.1 Cost Structure
┌────────────────────────────────────────────────────────┐
│ VARIABLE COSTS (scale with usage) │
│ │
│ ████████████████████████████████ LLM Tokens (60-80%) │
│ ████████ External APIs (10-15%)│
│ ████ Embedding Gen (3-5%) │
│ │
│ SEMI-FIXED COSTS (scale with tenants) │
│ │
│ ██████████████ Compute / K8s │
│ ████████ Database │
│ │
│ FIXED COSTS (exist regardless) │
│ │
│ ██████ Control plane │
│ ████ CI/CD + Registry │
│ ████ Monitoring │
│ ██ Vault │
└────────────────────────────────────────────────────────┘

Key insight: LLM token cost dominates. A 20% reduction in tokens per task saves more than halving infrastructure costs. Cost optimization should focus on LLM efficiency.
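A back-of-envelope check of that claim, using rough midpoints of the shares in the diagram (illustrative, not measured):

```javascript
// Compare two savings levers as fractions of total spend.
const llmShare = 0.70;    // LLM tokens: 60-80% of total spend (midpoint-ish)
const infraShare = 0.25;  // semi-fixed + fixed infrastructure, approx.

const tokenCutSavings = llmShare * 0.20;      // cut tokens per task by 20%
const infraHalvingSavings = infraShare * 0.5; // halve all infrastructure costs

console.log(tokenCutSavings.toFixed(3));      // prints 0.140 → ~14% of total
console.log(infraHalvingSavings.toFixed(3));  // prints 0.125 → 12.5% of total
```

At the low end of the LLM share (60%), the two levers are roughly equal; anywhere above that, token efficiency wins.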
3.2 Cost per Task
| Task Type | Fast (Haiku) | Balanced (Sonnet) | Best (Opus) | Cheapest (Gemini Flash-Lite) |
|---|---|---|---|---|
| Simple extraction | $0.004 | $0.011 | $0.018 | $0.0004 |
| Keyword research | $0.014 | $0.042 | $0.070 | $0.002 |
| Content optimization | $0.021 | $0.063 | $0.105 | $0.002 |
| Research synthesis | $0.028 | $0.084 | $0.140 | $0.003 |
| Technical audit | $0.035 | $0.105 | $0.175 | $0.004 |
Even complex tasks cost under $0.20 with the most expensive model. The V0 Internal Ops vertical uses fast (Haiku) by default — most tasks cost $0.01-0.04.
3.3 Cost Optimization Levers
| Lever | Savings | Implementation |
|---|---|---|
| Model routing | 5-10x | Use fast for simple tasks, balanced only when needed, best rarely |
| Prompt caching (Anthropic) | 90% on cached portion | Stable system prompts cached across calls |
| Batch API (Anthropic/OpenAI) | 50% | Non-urgent tasks (quality evaluation, knowledge consolidation) |
| Gemini Flash-Lite | 10-50x vs Opus | Bulk processing, classification, data extraction |
| Knowledge context pruning | 20-30% token reduction | Only inject relevant memories, not all matches |
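The model-routing lever can be sketched as a small tier lookup. The classification heuristic and model names below are assumptions for illustration, not the gateway's actual logic:

```javascript
// Route tasks to a model tier; tiers match the cost table above.
const TIERS = {
  fast: "claude-haiku",      // simple extraction, classification
  balanced: "claude-sonnet", // multi-step reasoning
  best: "claude-opus",       // rarely: hardest synthesis tasks
};

function routeModel(task) {
  if (task.complexity === "high" && task.qualityCritical) return TIERS.best;
  if (task.complexity === "high") return TIERS.balanced;
  return TIERS.fast; // default: the cheap tier
}
```

Defaulting to the cheap tier and escalating only on explicit signals is what makes the 5-10x savings realistic: most tasks never leave the fast path.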
3.4 Infrastructure Costs (Current Stage)
| Component | Monthly est. | Notes |
|---|---|---|
| K8s cluster (3 nodes) | ~$150-300 | Depends on provider/instance type |
| PostgreSQL + pgvector | ~$50-100 | Small instance, single node |
| Langfuse | ~$0-50 | Self-hosted or free tier |
| Vault | ~$0 | Runs as pod in cluster |
| Container registry (GHCR) | ~$0 | Free for public/org repos |
| Total infra | ~$200-450/mo | |
Variable costs (LLM) at current usage (V0 Internal Ops, ~50-100 tasks/day):
- Using fast (Haiku): ~$15-60/mo
- Using balanced (Sonnet): ~$50-200/mo
4. Scalability Assessment
4.1 Scale Stages
| Stage | Scale | Architecture | First bottleneck |
|---|---|---|---|
| 0: MVP (current) | 1 cell, ~15 agents, 1 Postgres | All in one namespace, direct HTTP calls | None expected |
| 1: Early clients | 5-10 clients, ~50 agents | Shared cells, namespace isolation | LLM rate limits, Postgres connections |
| 2: Growth | 20-50 clients, ~200 agents | Mixed shared/dedicated cells, read replicas, NATS | pgvector latency, write contention |
| 3: Scale | 100+ clients, ~1000+ agents | Multi-region, sharded DBs, NATS clusters | Operational complexity |
4.2 Component Bottleneck Analysis
Agent Runtime:
- ~50-100 agents per 8GB node (est. 50-150MB per agent with loaded context)
- Almost never the bottleneck — agents spend most time waiting on LLM calls
- Scales horizontally with HPA
LLM Gateway:
- Hard ceiling: LLM provider rate limits (Anthropic ~4M tokens/min Tier 4)
- Mitigation: multi-key pooling, multi-provider fallback, request queuing
- Gateway itself is stateless, scales horizontally
Knowledge Service:
- Bottleneck: pgvector query latency as index grows
- At Stage 2 (~200 agents, millions of vectors): evaluate Qdrant for hot-path queries
- Embedding generation is batched (100/batch) — throughput adequate for current scale
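The 100-per-batch flow reduces to a chunking step like this sketch (the provider call itself is omitted):

```javascript
// Split items into batches of 100 for embedding generation.
function chunkForEmbedding(items, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```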
PostgreSQL:
- Stage 0-1: Single instance sufficient
- Stage 2: Read replicas for knowledge queries and observation reads
- Stage 3: Per-component database split (knowledge, observations, agent state)
4.3 Horizontal vs Vertical Scaling
| Component | Scales horizontally | Scales vertically | Notes |
|---|---|---|---|
| Runtime | Yes (stateless pods) | N/A | HPA on CPU/memory |
| Gateway | Yes (stateless pods) | N/A | HPA on request count |
| Knowledge | Yes (stateless pods) | Database grows | DB is the bottleneck, not the service |
| PostgreSQL | Read replicas | Bigger instance | Write master is vertical until sharding |
| pgvector | N/A | Index optimization | Consider Qdrant at scale |
4.4 Decision Triggers
| Trigger | When | Action |
|---|---|---|
| Postgres connections > 100 | Stage 1 (~50 agents) | Add PgBouncer |
| LLM rate limit errors > 1% | Stage 1 (high throughput) | Multi-key pooling |
| Observation table > 100M rows | Stage 1-2 | Partition by month + tenant |
| pgvector query p99 > 500ms | Stage 2 (~1M vectors) | Evaluate Qdrant |
| Inter-agent calls cross nodes | Stage 2 | Introduce NATS |
| Write contention on hot tables | Stage 2 | Read replicas, batch writes |
5. Reliability
5.1 Current State
| Aspect | Status | Notes |
|---|---|---|
| Redundancy | Single instance per service | No HA — acceptable for MVP |
| Backups | Manual | PostgreSQL backup not automated |
| Health checks | Basic HTTP liveness | No readiness probes beyond startup |
| Recovery | K8s restart policy | Pod restart on crash, no data recovery automation |
| Monitoring | Langfuse (LLM only) | No infrastructure monitoring (Prometheus/Grafana) |
5.2 Error Handling (Implemented)
| Component | Failure mode | Recovery |
|---|---|---|
| Runtime | Agent crash during task | Mark task failed, log error. Three consecutive failures demote the agent's supervision level. |
| Runtime | Gateway unreachable | Task fails with connection error. No retry. |
| Gateway | LLM provider error | Returns error to runtime. No automatic fallback between providers yet. |
| Gateway | Tool execution error | Returns error payload. Agent can reason about it and retry. |
| Knowledge | Database unavailable | Service returns 500. Runtime proceeds without memory context. |
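The knowledge-service row describes graceful degradation: on failure, the runtime proceeds with an empty memory context. A sketch, where `fetchMemories` is a hypothetical client call:

```javascript
// Degrade gracefully when the knowledge service is unavailable.
async function getMemoryContext(fetchMemories, query) {
  try {
    return await fetchMemories(query);
  } catch {
    return []; // knowledge service down → run the task without memories
  }
}
```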
5.3 What's Needed for Production
- [ ] Health/readiness probes on all services
- [ ] Automated PostgreSQL backups (daily minimum)
- [ ] Multi-replica deployments (at least 2 per service)
- [ ] Infrastructure monitoring (Prometheus + Grafana)
- [ ] Alerting (PagerDuty/Slack integration)
- [ ] Provider fallback in gateway (Anthropic → Google → etc.)
- [ ] Graceful degradation when knowledge service is down
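The provider-fallback item could take the shape of an ordered try-each loop. A sketch, with each entry a hypothetical async call into a provider client (not yet implemented in the gateway):

```javascript
// Try providers in priority order; surface the last error if all fail.
async function completeWithFallback(providers, request) {
  let lastError;
  for (const call of providers) {
    try {
      return await call(request);
    } catch (err) {
      lastError = err; // fall through to the next provider
    }
  }
  throw lastError;
}
```

A production version would also distinguish retryable errors (rate limits, 5xx) from non-retryable ones (invalid request), so a malformed prompt does not cascade across every provider.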
6. Performance Characteristics
6.1 Latency Profile
| Operation | Typical latency | Bottleneck |
|---|---|---|
| Task dispatch (sync, simple) | 2-10s | LLM response time |
| Task dispatch (agentic, 3-6 steps) | 10-60s | Multiple LLM calls + tool execution |
| Knowledge search | 50-200ms | pgvector similarity search |
| Knowledge add | 1-3s | LLM fact extraction + embedding |
| Knowledge add-raw | 200-500ms | Embedding only |
| Knowledge add-raw-batch (100 items) | 5-15s | Batch embedding + insert |
| Tool execution (GitHub API) | 200-500ms | GitHub API response time |
| Tool execution (Docling) | 5-30s | Document conversion complexity |
6.2 Throughput
At current scale, throughput is not a concern. The system is designed for quality of agent reasoning, not high-throughput processing. Key constraints:
- LLM concurrency: Limited by provider rate limits, not system architecture
- Knowledge writes: Batch operations handle bulk ingestion efficiently
- Tool calls: Rate-limited by external API quotas, not internal capacity
7. Compliance & Governance Readiness
7.1 Current State
| Requirement | Status |
|---|---|
| Audit trail (who did what, when) | Partial — Langfuse traces LLM calls. No structured agent action log. |
| Data residency controls | Not applicable — single deployment. Architecture supports VPC mode. |
| Access control (RBAC/ABAC) | Not implemented. Single-tenant, single-user. |
| Data retention policies | Not implemented. Knowledge persists indefinitely. |
| Incident response | Not documented. |
7.2 Path to Compliance
| Milestone | Effort | When needed |
|---|---|---|
| SOC 2 Type I readiness | High | Before enterprise clients |
| GDPR data subject rights | Medium | Before EU clients |
| ISO 27001 certification | High | Market differentiation |
| Formal incident response playbook | Low | Before multi-tenant production |