Security & Observability

Part of Project Kaze Architecture

LLM Provider & Key Management

Dual-key model:

Speedrun keys — Speedrun's own API keys across multiple LLM providers, centrally managed and monitored.
Client keys — Clients bring their own keys (e.g., Azure OpenAI credits, Anthropic volume discount, Google Cloud credits).

Key routing logic:

Agent X for Client A → use Client A's Anthropic key
Agent Y for Client A → use Speedrun's OpenAI key (client has no OpenAI credits)
Agent Z for Client B → use Client B's Azure OpenAI endpoint
Fallback: if Client A's key hits a rate limit → fall back to Speedrun's key (if policy allows)

Routing is configured per tenant + agent + provider, not hardcoded.
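
The routing rules above can be sketched as a lookup keyed by tenant, agent, and provider. This is a minimal illustration, not the actual Kaze router: the `Route` type, the `ROUTES` and `SPEEDRUN_KEYS` tables, and the `resolve_key` function are hypothetical names, and the Vault paths are borrowed from the layout shown in the key storage section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    key_owner: str                       # "client" or "speedrun"
    vault_path: str                      # where the key lives in Vault
    allow_speedrun_fallback: bool = False


# Hypothetical routing table keyed by (tenant, agent, provider).
ROUTES = {
    ("client-a", "agent-x", "anthropic"): Route(
        "client", "clients/client-a/anthropic-key", allow_speedrun_fallback=True
    ),
    ("client-a", "agent-y", "openai"): Route("speedrun", "speedrun/openai-key-1"),
}

# Hypothetical map from provider to the Speedrun key used as a fallback.
SPEEDRUN_KEYS = {
    "anthropic": "speedrun/anthropic-key-1",
    "openai": "speedrun/openai-key-1",
}


def resolve_key(tenant: str, agent: str, provider: str, rate_limited: bool = False) -> str:
    """Return the Vault path of the key to use for this call."""
    route = ROUTES.get((tenant, agent, provider))
    if route is None:
        raise LookupError(f"no route configured for {tenant}/{agent}/{provider}")
    if not rate_limited:
        return route.vault_path
    # Fall back to a Speedrun key only when the tenant's policy allows it.
    if route.key_owner == "client" and route.allow_speedrun_fallback:
        return SPEEDRUN_KEYS[provider]
    raise RuntimeError("key is rate limited and no fallback is permitted")
```

For example, `resolve_key("client-a", "agent-x", "anthropic", rate_limited=True)` returns the Speedrun Anthropic path because that route opts into fallback, while a rate-limited Speedrun route raises instead of silently picking another key.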

Key storage & security:

Vault paths:
  speedrun/
    ├── anthropic-key-1
    ├── openai-key-1
    └── google-key-1
  clients/
    ├── client-a/
    │   ├── anthropic-key
    │   └── azure-openai
    └── client-b/
        └── anthropic-key

Security rules:

Client keys are encrypted at rest and access-scoped: only agents running for a given client can access that client's keys.
In customer VPC mode, client keys never leave the client's VPC.
In agency mode, client keys are stored in Speedrun's Vault with strict tenant-scoped access policies.
Every key usage is logged with full attribution, so clients can see exactly which agent used their key, when, and how many tokens it consumed.
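
As a rough sketch of the access-scoping and attribution rules, the check below derives the owning tenant from the Vault path and refuses to emit a usage record for a mismatched tenant. The function names and the record fields (`tenant`, `agent`, `key`, `token_count`, `used_at`) are assumptions for illustration; in practice the scoping would be enforced by Vault policies rather than application code alone.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def client_key_access_allowed(agent_tenant: str, vault_path: str) -> bool:
    """Allow access only to paths of the form clients/<agent_tenant>/<key>."""
    parts = vault_path.split("/")
    return len(parts) == 3 and parts[0] == "clients" and parts[1] == agent_tenant


def usage_record(tenant: str, agent: str, vault_path: str, token_count: int,
                 when: Optional[datetime] = None) -> str:
    """Build one JSON audit line attributing a key usage to an agent."""
    if not client_key_access_allowed(tenant, vault_path):
        raise PermissionError(f"{tenant} may not use {vault_path}")
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "tenant": tenant,
            "agent": agent,
            "key": vault_path,          # the path only, never the secret value
            "token_count": token_count,
            "used_at": when.isoformat(),
        },
        sort_keys=True,
    )
```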

Security Architecture

Network Security

Zero-trust networking between all components. mTLS everywhere, even inside the cluster.
In agency mode: strict Kubernetes network policies — Tenant A's agents can never reach Tenant B's resources.
In customer VPC mode: clearly defined ingress/egress rules.

Secrets Management

No reliance on a single secrets provider. Vault is primary, with the ability to integrate with cloud-native secret managers where needed.
Agent credentials (API keys to client systems) never leave the deployment boundary.
In customer VPC mode, Speedrun operators have no access to client secrets.

Audit & Compliance

Every agent action is logged with full attribution.
Immutable audit logs that the client can export and own.
These audit capabilities are required for SMEs in regulated industries (finance, healthcare).

Supply Chain Security

Signed container images.
SBOM (Software Bill of Materials) for customer VPC deployments.
Reproducible builds so customers can verify what's running in their VPC.

Identity & Trust

Agent-to-agent authentication via capability-based tokens.
Cross-cell communication secured via mTLS with signed agent manifests.
Compromised node containment — a single cell breach cannot propagate to others.
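
The document does not specify the token format, so the following is only one possible shape for a capability-based token: a payload listing the capabilities an agent holds, signed with an HMAC. The shared-secret scheme, the claim names (`agent`, `caps`, `exp`), and both function names are assumptions; a real deployment might well use asymmetric signatures instead.

```python
import base64
import hashlib
import hmac
import json


def issue_token(secret: bytes, agent: str, capabilities: list, expires_at: float) -> str:
    """Encode the agent's capabilities and expiry, then sign them with HMAC-SHA256."""
    payload = json.dumps(
        {"agent": agent, "caps": sorted(capabilities), "exp": expires_at},
        sort_keys=True,
    ).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    signature = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def allows(secret: bytes, token: str, capability: str, now: float) -> bool:
    """Accept only an untampered, unexpired token that names the capability."""
    body, _, signature = token.rpartition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return capability in claims["caps"] and now < claims["exp"]
```

The point of the capability model is that the receiving agent checks what the token permits rather than who the caller is, so a token scoped to `crm.read` cannot be used for a write even if the caller is otherwise trusted.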

Observability in Customer VPC

The full monitoring stack deploys inside every customer VPC as part of the Kaze stack:

Customer VPC                          Speedrun Central
┌──────────────────┐                  ┌──────────────┐
│ Kaze Stack       │                  │              │
│ Monitoring Stack │                  │              │
│  - Prometheus    │                  │              │
│  - Grafana       │  health beacon   │ PagerDuty /  │
│  - Loki          │─────────────────▶│ Slack / Ops  │
│  - Alertmanager  │  (minimal, no    │              │
│                  │   PII)           │              │
│ WireGuard VPN    │                  │              │
│  endpoint        │◀─ ── ── ── ── ──│ Ops team     │
└──────────────────┘  VPN for deep    │ VPN access   │
                      investigation   └──────────────┘

Health beacon (outbound, minimal): Alertmanager sends alert name + severity to Speedrun ops. No PII, no sensitive data. This enables proactive incident detection without requiring VPN access.
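
A minimal sketch of the beacon's filtering step, assuming the input follows the Alertmanager webhook layout in which each entry of `alerts` carries a `labels` map with `alertname`. The `severity` label is a common convention rather than a guaranteed field, and the `to_beacon` function and its output keys are hypothetical.

```python
def to_beacon(webhook_payload: dict) -> list:
    """Reduce an alert payload to alert name + severity, dropping everything else."""
    beacon = []
    for alert in webhook_payload.get("alerts", []):
        labels = alert.get("labels", {})
        beacon.append(
            {
                "alert": labels.get("alertname", "unknown"),
                "severity": labels.get("severity", "unknown"),
            }
        )
    return beacon
```

Because the function builds each outbound entry from two named labels instead of deleting known-sensitive fields, annotations, instance names, and any label added later stay inside the VPC by default.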

VPN (inbound, on-demand): Speedrun ops team VPNs into customer monitoring dashboards for investigation and deep dives. WireGuard-based, deployed as part of the stack, with SSO authentication and short-lived sessions.

Data classification:

Data	Stays in VPC	Can flow out
Agent logs (may contain client data)	Yes	No
LLM request/response content	Yes	No
Metrics (CPU, memory, latency, error rates)	Yes	Aggregated health score only
Token usage counts	Yes	Aggregated totals (for billing)
Alert triggers	Yes	Alert name + severity only
Traces (OpenTelemetry)	Yes	No

This is a configurable policy per client. Some clients may allow anonymized metrics export; others want nothing out.
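
One way to express such a per-client policy is a default table mirroring the data classification above, with client overrides layered on top. The data-class keys, the outbound-form strings, and the function names are illustrative assumptions; `None` means the data never leaves the VPC.

```python
# Default outbound form per data class; None means "stays in the VPC".
DEFAULT_POLICY = {
    "agent_logs": None,
    "llm_content": None,
    "metrics": "aggregated_health_score",
    "token_usage": "aggregated_totals",
    "alerts": "name_and_severity",
    "traces": None,
}


def effective_policy(overrides=None) -> dict:
    """Apply a client's overrides on top of the defaults, rejecting unknown classes."""
    policy = dict(DEFAULT_POLICY)
    for data_class, outbound_form in (overrides or {}).items():
        if data_class not in policy:
            raise KeyError(f"unknown data class: {data_class}")
        policy[data_class] = outbound_form
    return policy


def may_leave_vpc(policy: dict, data_class: str) -> bool:
    """True if the policy defines any outbound form for this data class."""
    return policy[data_class] is not None
```

A client that wants nothing exported would override `metrics`, `token_usage`, and `alerts` to `None`; a more permissive client might set `metrics` to an anonymized form.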

Upgrade path: GitOps (ArgoCD/Flux) pointing at Speedrun's release channel enables rolling out new versions across customer VPC deployments without logging into each one. Clients approve and apply updates through the GitOps workflow.

Security Controls

The threat model and full attack surface analysis are documented in the Threat Model document.

Tenant Isolation Enforcement

Database layer: Every query includes a tenant_id filter enforced by a query wrapper at the data access layer — not just application logic. No query can execute without tenant scoping.
Shared runtime: LLM Gateway and Agent Runtime flush all state between requests from different tenants. No context bleed.
Network: K8s network policies per namespace verified and tested. Tenant A's pods cannot reach Tenant B's resources.
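
The database-layer rule can be illustrated with a small wrapper that always injects the tenant filter and exposes no way to run unscoped SQL. This sketch uses SQLite for self-containment; the `TenantScopedDB` class, the `tasks` table, and its columns are hypothetical and say nothing about the actual Kaze schema or database engine.

```python
import sqlite3


class TenantScopedDB:
    # Hypothetical allow-list: table name -> columns callers may filter on.
    TABLES = {"tasks": {"status"}}

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def select(self, table: str, **filters):
        """Run a SELECT that is always scoped to this wrapper's tenant."""
        if table not in self.TABLES:
            raise ValueError(f"unknown table: {table}")
        unknown = set(filters) - self.TABLES[table]
        if unknown:
            raise ValueError(f"unknown filter columns: {sorted(unknown)}")
        # The tenant clause is added unconditionally; callers cannot omit it.
        clauses = ["tenant_id = ?"] + [f"{col} = ?" for col in filters]
        sql = f"SELECT * FROM {table} WHERE {' AND '.join(clauses)}"
        return self._conn.execute(sql, (self._tenant_id, *filters.values())).fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory database with one row per tenant, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (tenant_id TEXT, name TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?)",
        [("client-a", "summarize", "done"), ("client-b", "classify", "done")],
    )
    return conn
```

Table and column names come from an allow-list and values are bound as parameters, so the only caller-controlled input that reaches the SQL string is a name the wrapper has already validated.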

Egress Filtering

Agents can only reach whitelisted external endpoints, configured per tenant and per vertical.
K8s network policies enforce egress restrictions — agents cannot open arbitrary outbound connections.
Tool Framework validates target URLs against the whitelist before executing external API calls.
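
A sketch of the URL check the Tool Framework might perform before an outbound call. The `ALLOWED_HOSTS` table, its example hosts, and the HTTPS-only requirement are assumptions added for illustration; the per-vertical dimension is omitted for brevity.

```python
from urllib.parse import urlsplit

# Hypothetical per-tenant allow-list of exact hostnames.
ALLOWED_HOSTS = {
    "client-a": {"api.anthropic.com", "crm.client-a.example"},
}


def is_allowed(tenant: str, url: str) -> bool:
    """Permit only HTTPS URLs whose exact hostname is on the tenant's list."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Compare the parsed hostname, not a substring of the raw URL, so that
    # "allowed.com.evil.test" and "allowed.com@evil.test" are both rejected.
    host = (parts.hostname or "").lower()
    return host in ALLOWED_HOSTS.get(tenant, set())
```

This application-level check complements the network policies rather than replacing them: the network layer blocks connections the code never intended, while the URL check catches a tool call that was steered toward an unlisted endpoint.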

Credential Lifecycle

Rotation: Automated Vault key rotation on schedule. Immediate rotation on suspected compromise.
Short-lived tokens preferred over long-lived API keys where providers support it (OAuth2 token refresh).
Anomaly detection: Usage spike on a key (e.g., 10x normal) triggers alert + auto-freeze pending review.
Blast radius: One compromised client key affects only that client's agents. Speedrun keys are separate.
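
The anomaly rule can be sketched as a comparison of current usage against a baseline. Here the baseline is assumed to be the mean of recent usage samples and the threshold defaults to the 10x figure given as an example above; the function names and the in-memory `frozen_keys` set are hypothetical stand-ins for whatever state store the real system uses.

```python
from statistics import mean


def is_usage_spike(history: list, current: float, factor: float = 10.0) -> bool:
    """True when current usage is at least `factor` times the historical mean."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return baseline > 0 and current >= factor * baseline


def review_key(frozen_keys: set, key_path: str, history: list, current: float) -> bool:
    """Freeze the key pending review if its usage spiked; return whether it was frozen."""
    if is_usage_spike(history, current):
        frozen_keys.add(key_path)
        return True
    return False
```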

Operator Access Controls

VPN sessions: Time-limited (4-hour maximum), require ticket justification, logged with operator identity.
Vault audit: Every secret read logged — who, when, which secret, from where.
Database access: Via bastion host with session recording. No direct DB access from developer machines.
Separation of duties: No single person can both deploy code and access production secrets.
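
The VPN session rules could be enforced at session creation roughly as follows. The `VpnSession` type and both functions are hypothetical; only the 4-hour cap, the ticket requirement, and the recorded operator identity come from the rules above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_SESSION = timedelta(hours=4)


@dataclass(frozen=True)
class VpnSession:
    operator: str        # recorded for the audit log
    ticket: str          # justification for the access
    started_at: datetime
    expires_at: datetime


def open_session(operator: str, ticket: str, requested: timedelta, now: datetime) -> VpnSession:
    """Create a session only if it names an operator, cites a ticket, and fits the cap."""
    if not operator.strip():
        raise ValueError("operator identity is required")
    if not ticket.strip():
        raise ValueError("a ticket justification is required")
    if requested <= timedelta(0) or requested > MAX_SESSION:
        raise ValueError("requested duration must be between 0 and 4 hours")
    return VpnSession(operator, ticket, now, now + requested)


def is_active(session: VpnSession, now: datetime) -> bool:
    """A session is usable only inside its [started_at, expires_at) window."""
    return session.started_at <= now < session.expires_at
```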
