Security & Observability

Part of Project Kaze Architecture

LLM Provider & Key Management

Dual-key model:

  • Speedrun keys — Speedrun's own API keys across multiple LLM providers, centrally managed and monitored.
  • Client keys — Clients bring their own keys (e.g., Azure OpenAI credits, Anthropic volume discount, Google Cloud credits).

Key routing logic:

  • Agent X for Client A → use Client A's Anthropic key
  • Agent Y for Client A → use Speedrun's OpenAI key (client has no OpenAI credits)
  • Agent Z for Client B → use Client B's Azure OpenAI endpoint
  • Fallback: if Client A's key hits a rate limit → fall back to Speedrun's key for the same provider (if policy allows)

Routing is configured per tenant + agent + provider, not hardcoded.
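
The routing rules above can be sketched as a lookup table keyed by tenant, agent, and provider. This is a minimal illustration, not the Kaze implementation: the `Route` type, the `ROUTES` table, the agent identifiers, and `resolve_key_path` are hypothetical names, and the sketch assumes that a missing fallback path is how a "no fallback" policy is expressed.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Route:
    key_path: str                        # Vault path of the primary key
    fallback_path: Optional[str] = None  # Speedrun key used on rate limit; None means no fallback


# Hypothetical routing table keyed by (tenant, agent, provider).
ROUTES: Dict[Tuple[str, str, str], Route] = {
    ("client-a", "agent-x", "anthropic"): Route(
        key_path="clients/client-a/anthropic-key",
        fallback_path="speedrun/anthropic-key-1",
    ),
    ("client-a", "agent-y", "openai"): Route(key_path="speedrun/openai-key-1"),
}


def resolve_key_path(tenant: str, agent: str, provider: str,
                     rate_limited: bool = False) -> str:
    """Return the Vault path to use for this call, applying the fallback policy."""
    route = ROUTES.get((tenant, agent, provider))
    if route is None:
        raise LookupError(f"no route configured for {tenant}/{agent}/{provider}")
    if not rate_limited:
        return route.key_path
    if route.fallback_path is None:
        raise RuntimeError("primary key is rate limited and policy forbids fallback")
    return route.fallback_path
```

Because the table is data rather than code, adding a tenant or changing a provider is a configuration change, which is the point of not hardcoding routes.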

Key storage & security:

Vault paths:
  speedrun/
    ├── anthropic-key-1
    ├── openai-key-1
    └── google-key-1
  clients/
    ├── client-a/
    │   ├── anthropic-key
    │   └── azure-openai
    └── client-b/
        └── anthropic-key

Security rules:

  • Client keys are encrypted at rest and access-scoped — only agents running for that client can access their keys
  • In customer VPC mode, client keys never leave their VPC
  • In agency mode, client keys are stored in Speedrun's Vault with strict tenant-scoped access policies
  • Every key usage is logged with full attribution: clients can see exactly which agent used their key, when, and how many tokens the call consumed
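
As an illustration of tenant-scoped access and usage attribution, the sketch below checks that a requested Vault path sits under the caller's own `clients/<tenant>/` prefix before recording a usage entry. In a real deployment the scoping would be enforced by Vault policies themselves; `tenant_may_read`, `KeyUsageRecord`, and `record_usage` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


def tenant_may_read(tenant_id: str, vault_path: str) -> bool:
    """True only when the path is a key under clients/<tenant_id>/."""
    parts = vault_path.strip("/").split("/")
    if "" in parts or ".." in parts or "." in parts:
        return False  # reject traversal tricks and empty segments
    return len(parts) >= 3 and parts[0] == "clients" and parts[1] == tenant_id


@dataclass(frozen=True)
class KeyUsageRecord:
    tenant_id: str
    agent_id: str
    vault_path: str
    tokens: int
    used_at: datetime


def record_usage(log: List[KeyUsageRecord], tenant_id: str, agent_id: str,
                 vault_path: str, tokens: int) -> KeyUsageRecord:
    """Append an attributed usage record, refusing cross-tenant key access."""
    if not tenant_may_read(tenant_id, vault_path):
        raise PermissionError(f"{tenant_id} may not read {vault_path}")
    record = KeyUsageRecord(tenant_id, agent_id, vault_path, tokens,
                            datetime.now(timezone.utc))
    log.append(record)
    return record
```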

Security Architecture

Network Security

  • Zero-trust networking between all components. mTLS everywhere, even inside the cluster.
  • In agency mode: strict Kubernetes network policies — Tenant A's agents can never reach Tenant B's resources.
  • In customer VPC mode: clearly defined ingress/egress rules.

Secrets Management

  • No reliance on a single secrets provider. Vault is primary, with the ability to integrate with cloud-native secret managers where needed.
  • Agent credentials (API keys to client systems) never leave the deployment boundary.
  • In customer VPC mode, Speedrun operators have no access to client secrets.

Audit & Compliance

  • Every agent action is logged with full attribution.
  • Immutable audit logs that the client can export and own.
  • Attributed, exportable audit logs are required for SMEs in regulated industries (finance, healthcare).

Supply Chain Security

  • Signed container images.
  • SBOM (Software Bill of Materials) for customer VPC deployments.
  • Reproducible builds so customers can verify what's running in their VPC.

Identity & Trust

  • Agent-to-agent authentication via capability-based tokens.
  • Cross-cell communication secured via mTLS with signed agent manifests.
  • Compromised node containment — a single cell breach cannot propagate to others.

Observability in Customer VPC

The full monitoring stack deploys inside every customer VPC as part of the Kaze stack:

Customer VPC                          Speedrun Central
┌──────────────────┐                  ┌──────────────┐
│ Kaze Stack       │                  │              │
│ Monitoring Stack │                  │              │
│  - Prometheus    │                  │              │
│  - Grafana       │  health beacon   │ PagerDuty /  │
│  - Loki          │─────────────────▶│ Slack / Ops  │
│  - Alertmanager  │  (minimal, no    │              │
│                  │   PII)           │              │
│ WireGuard VPN    │                  │              │
│  endpoint        │◀─ ── ── ── ── ──│ Ops team     │
└──────────────────┘  VPN for deep    │ VPN access   │
                      investigation   └──────────────┘

Health beacon (outbound, minimal): Alertmanager sends alert name + severity to Speedrun ops. No PII, no sensitive data. This enables proactive incident detection without requiring VPN access.
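
A minimal sketch of the beacon reduction, assuming the input follows the Alertmanager webhook payload shape (an `alerts` list whose entries carry `labels` and `annotations`). The function name `to_beacon` and the output field names are hypothetical; the key property is that only the alert name and severity are copied, so annotations and other labels never leave the VPC.

```python
from typing import Any, Dict, List


def to_beacon(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Reduce an Alertmanager-style webhook payload to alert name + severity."""
    beacon = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        beacon.append({
            "alert": str(labels.get("alertname", "unknown")),
            "severity": str(labels.get("severity", "unknown")),
        })
    return beacon
```

Building the outbound message from an explicit list of allowed fields, rather than deleting known-sensitive ones, means a new label or annotation added later is dropped by default.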

VPN (inbound, on-demand): The Speedrun ops team connects over VPN to the customer's monitoring dashboards for investigation and deep dives. The VPN is WireGuard-based, deployed as part of the stack, and uses SSO authentication with short-lived sessions.

Data classification:

Data                                          Stays in VPC   Can flow out
Agent logs (may contain client data)          Yes            No
LLM request/response content                  Yes            No
Metrics (CPU, memory, latency, error rates)   Yes            Aggregated health score only
Token usage counts                            Yes            Aggregated totals (for billing)
Alert triggers                                Yes            Alert name + severity only
Traces (OpenTelemetry)                        Yes            No

This policy is configurable per client. Some clients may allow anonymized metrics export; others want nothing to leave the VPC.
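
One way to represent the classification is a default export policy with per-client overrides. The sketch below is illustrative only: the data-class keys, `DEFAULT_EXPORT_POLICY`, `effective_policy`, and `may_export` are hypothetical names, and `None` is assumed to mean "never leaves the VPC".

```python
from typing import Dict, Optional

# Default policy mirroring the data classification: None means nothing flows out.
DEFAULT_EXPORT_POLICY: Dict[str, Optional[str]] = {
    "agent_logs": None,
    "llm_content": None,
    "metrics": "aggregated health score only",
    "token_usage": "aggregated totals (for billing)",
    "alerts": "alert name + severity only",
    "traces": None,
}


def effective_policy(overrides: Optional[Dict[str, Optional[str]]] = None
                     ) -> Dict[str, Optional[str]]:
    """Merge per-client overrides into the default policy."""
    policy = dict(DEFAULT_EXPORT_POLICY)
    for data_class, rule in (overrides or {}).items():
        if data_class not in policy:
            raise KeyError(f"unknown data class: {data_class}")
        policy[data_class] = rule
    return policy


def may_export(data_class: str, policy: Dict[str, Optional[str]]) -> bool:
    """True when the policy allows some form of this data class to leave the VPC."""
    return policy[data_class] is not None
```

A client that wants nothing exported would override `metrics`, `token_usage`, and `alerts` to `None`; a client that permits anonymized metrics would set `metrics` to a rule describing that export.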

Upgrade path: GitOps (ArgoCD/Flux) pointing at Speedrun's release channel enables rolling out new versions across customer VPC deployments without logging into each one. Clients approve and apply updates through the GitOps workflow.

Security Controls

The threat model and full attack surface analysis are documented in Threat Model.

Tenant Isolation Enforcement

  • Database layer: Every query includes a tenant_id filter enforced by a query wrapper at the data access layer — not just application logic. No query can execute without tenant scoping.
  • Shared runtime: LLM Gateway and Agent Runtime flush all state between requests from different tenants. No context bleed.
  • Network: K8s network policies are defined per namespace, then verified and tested. Tenant A's pods cannot reach Tenant B's resources.
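
The database-layer rule can be illustrated with a small query wrapper that always injects the tenant filter. This is a sketch using SQLite for self-containment, not the actual data access layer; `TenantScopedDB`, `ALLOWED_COLUMNS`, and the `documents` table are hypothetical. Callers never write the `tenant_id` predicate themselves, and table and column names are validated because they cannot be bound as SQL parameters.

```python
import sqlite3
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Hypothetical schema: table name -> columns callers may select or filter on.
ALLOWED_COLUMNS = {"documents": {"id", "title"}}


class TenantScopedDB:
    """Data access wrapper that adds a tenant_id filter to every query."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def select(self, table: str, columns: Sequence[str],
               filters: Optional[Dict[str, Any]] = None) -> List[Tuple]:
        allowed = ALLOWED_COLUMNS.get(table)
        if allowed is None:
            raise ValueError(f"unknown table: {table}")
        filters = filters or {}
        unknown = (set(columns) | set(filters)) - allowed
        if unknown or not columns:
            raise ValueError(f"invalid columns: {sorted(unknown)}")
        # The tenant predicate is always first and cannot be removed by callers.
        sql = f"SELECT {', '.join(columns)} FROM {table} WHERE tenant_id = ?"
        params: List[Any] = [self._tenant_id]
        for column, value in filters.items():
            sql += f" AND {column} = ?"
            params.append(value)
        return self._conn.execute(sql, params).fetchall()
```

Filters are restricted to equality on known columns so that no caller-supplied fragment can widen the `WHERE` clause past the tenant predicate.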

Egress Filtering

  • Agents can only reach whitelisted external endpoints, configured per tenant and per vertical.
  • K8s network policies enforce egress restrictions — agents cannot open arbitrary outbound connections.
  • Tool Framework validates target URLs against the whitelist before executing external API calls.
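
The URL check performed before an external call might look like the following sketch. `is_allowed` and the example hosts are hypothetical, and the HTTPS-only rule is an added assumption rather than something the policy above states. The check compares the parsed hostname exactly, which avoids being fooled by look-alike suffixes or userinfo tricks.

```python
from typing import Set
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_hosts: Set[str]) -> bool:
    """True when the URL targets a whitelisted host over HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # assumption: only HTTPS targets are permitted
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

For example, with `{"api.example.com"}` as the whitelist, `https://api.example.com/v1/items` passes while `https://api.example.com.evil.test/` and `https://api.example.com@evil.test/` are rejected, because the parsed hostname in both cases is not the whitelisted one.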

Credential Lifecycle

  • Rotation: Automated Vault key rotation on schedule. Immediate rotation on suspected compromise.
  • Short-lived tokens are preferred over long-lived API keys where providers support them (e.g., OAuth2 token refresh).
  • Anomaly detection: Usage spike on a key (e.g., 10x normal) triggers alert + auto-freeze pending review.
  • Blast radius: One compromised client key affects only that client's agents. Speedrun keys are separate.
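
The spike rule can be sketched as a comparison of the current usage window against the mean of earlier windows. The window granularity, the use of a simple mean as the baseline, and the names `KeyUsage` and `record_window` are assumptions made for illustration; only the 10x factor comes from the example above.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class KeyUsage:
    history: List[int] = field(default_factory=list)  # tokens per past window
    frozen: bool = False


def record_window(usage: KeyUsage, tokens: int, factor: float = 10.0) -> bool:
    """Record one usage window; freeze the key and return True on a spike."""
    if usage.history:
        baseline = mean(usage.history)
        if baseline > 0 and tokens >= factor * baseline:
            usage.frozen = True  # stays frozen until a reviewer clears it
            return True
    # Only normal windows feed the baseline, so a spike cannot raise it.
    usage.history.append(tokens)
    return False
```

A key with no history is never flagged in this sketch, since there is no baseline to compare against; a production rule would likely need a separate cap for new keys.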

Operator Access Controls

  • VPN sessions: Time-limited (4hr max), require ticket justification, logged with operator identity.
  • Vault audit: Every secret read logged — who, when, which secret, from where.
  • Database access: Via bastion host with session recording. No direct DB access from developer machines.
  • Separation of duties: No single person can both deploy code and access production secrets.
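
The VPN session rules lend themselves to a small validation step when a session is opened. In this sketch, `VpnSession`, `open_session`, and `is_active` are hypothetical names, and capping an over-long request at four hours (rather than rejecting it) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SESSION = timedelta(hours=4)


@dataclass(frozen=True)
class VpnSession:
    operator: str
    ticket_id: str
    started_at: datetime
    expires_at: datetime


def open_session(operator: str, ticket_id: str, requested: timedelta,
                 now: Optional[datetime] = None) -> VpnSession:
    """Open a time-limited session tied to an operator and a ticket."""
    if not operator.strip():
        raise ValueError("operator identity is required")
    if not ticket_id.strip():
        raise ValueError("a ticket justification is required")
    if requested <= timedelta(0):
        raise ValueError("requested duration must be positive")
    start = now or datetime.now(timezone.utc)
    # Assumption: longer requests are capped at the maximum, not rejected.
    return VpnSession(operator, ticket_id, start, start + min(requested, MAX_SESSION))


def is_active(session: VpnSession, now: datetime) -> bool:
    """True while the session is inside its allowed time window."""
    return session.started_at <= now < session.expires_at
```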