System Design
Part of Project Kaze Architecture
Architectural Pattern: Agent-Oriented Architecture
Kaze follows an Agent-Oriented Architecture — a hybrid pattern that borrows from several traditional styles but is fundamentally shaped by the fact that its primary units of computation are intelligent, autonomous agents, not passive services.
Borrowed patterns and their roles in Kaze:
| Pattern | What Kaze borrows | Applied where |
|---|---|---|
| Actor Model | Autonomous entities with private state, message-passing, supervision trees | Agent runtime — each agent is an actor |
| Event-Driven Architecture | Loose coupling via async events, event sourcing for audit | Inter-agent communication via NATS |
| Microservices | Independent deployment, own-your-data, clean API boundaries | Platform services (LLM Gateway, Knowledge Graph, etc.) |
| Cell-Based Architecture | Self-contained isolated deployment units | Each tenant/VPC is a cell |
New to Kaze (no traditional equivalent):
- Components that learn and self-modify their behavior over time
- A governance hierarchy where AI agents supervise other AI agents
- Shared knowledge across agents while maintaining runtime isolation
- A supervision ramp (supervised → sampling → autonomous) as a trust model
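The supervision ramp can be sketched as a small state machine. This is a minimal sketch in Python; the enum names and the one-way promotion step are illustrative, not Kaze's actual implementation:

```python
from enum import Enum


class SupervisionLevel(Enum):
    """Trust levels an agent moves through as it earns autonomy."""
    SUPERVISED = 1   # every output reviewed by a human
    SAMPLING = 2     # a random fraction of outputs reviewed
    AUTONOMOUS = 3   # no routine review; watched by supervisor agents


def promote(level: SupervisionLevel) -> SupervisionLevel:
    """Advance one step along the ramp; autonomous is terminal."""
    order = [SupervisionLevel.SUPERVISED,
             SupervisionLevel.SAMPLING,
             SupervisionLevel.AUTONOMOUS]
    i = order.index(level)
    return order[min(i + 1, len(order) - 1)]
```

In practice promotion would be gated on quality metrics (see Layer 3), not called unconditionally.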
Layer Architecture
Kaze is organized into five layers, from infrastructure at the bottom to governance at the top (an interaction layer sits at 0.5, between infrastructure and execution):
┌─────────────────────────────────────────────────────────────┐
│ KAZE PLATFORM │
│ │
│ Layer 3: GOVERNANCE & SELF-IMPROVEMENT │
│ ┌────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Supervisor │ │ Quality │ │ Improvement Agent │ │
│ │ Agents │ │ Monitor │ │ (prompt/skill/workflow │ │
│ │ │ │ Agent │ │ optimization) │ │
│ └────────────┘ └──────────────┘ └────────────────────────┘ │
│ │
│ Layer 2: ORCHESTRATION & KNOWLEDGE │
│ ┌──────────────┐ ┌───────────────────────────────────────┐ │
│ │ Orchestrator │ │ Shared Knowledge Graph │ │
│ │ Agents │ │ ├── Vertical knowledge (SEO, CRM..) │ │
│ │ (dynamic │ │ ├── Cross-vertical patterns │ │
│ │ planning) │ │ └── Client-specific context │ │
│ └──────────────┘ └───────────────────────────────────────┘ │
│ │
│ Layer 1: EXECUTION │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Agent Skills (composable, reusable per vertical) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐ ││
│ │ │ Keyword │ │ Content │ │ Lead │ │ Report │ ││
│ │ │ Research │ │ Optimize │ │ Scoring │ │ Generator │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └────────────┘ ││
│ └─────────────────────────────────────────────────────────┘│
│ │
│ Layer 0.5: INTERACTION │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Conversation Manager │ │
│ │ ┌───────┐ ┌───────┐ ┌──────────┐ ┌────────┐ │ │
│ │ │ Slack │ │ Email │ │ WhatsApp │ │Telegram│ ... │ │
│ │ └───────┘ └───────┘ └──────────┘ └────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Layer 0: PLATFORM INFRASTRUCTURE │
│ ┌──────────┐ ┌──────┐ ┌─────────┐ ┌──────┐ ┌───────────┐ │
│ │ K8s │ │ NATS │ │Postgres │ │Vault │ │ OTel/Prom │ │
│ │ │ │ │ │ │ │ │ │ Grafana │ │
│ └──────────┘ └──────┘ └─────────┘ └──────┘ └───────────┘ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ LLM Gateway (multi-provider, dual-key, budget mgmt) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ALL CONTAINERIZED · ALL IaC · ANY CLOUD · ANY VPC │
└─────────────────────────────────────────────────────────────┘
Layer Descriptions
Layer 0: Platform Infrastructure
Non-AI infrastructure that provides the runtime foundation. This is traditional software — deterministic, well-understood, battle-tested.
Components:
- Kubernetes — Universal compute runtime. Provides scheduling, scaling, networking, and the deployment abstraction across any cloud.
- NATS — Lightweight message bus for all inter-agent communication. Supports pub/sub, request/reply, and persistent streaming (JetStream). Chosen for portability and minimal operational overhead.
- PostgreSQL — Primary relational datastore. Managed via CloudNativePG operator for Kubernetes-native operation.
- HashiCorp Vault — Secrets management. Stores LLM API keys (both Speedrun-owned and client-provided), agent credentials, and encryption keys.
- OpenTelemetry + Prometheus + Grafana + Loki — Full observability stack. OTel for distributed tracing, Prometheus for metrics, Grafana for visualization, Loki for log aggregation.
- LLM Gateway — Abstraction layer between agents and LLM providers (see Agent Model).
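Subject-based pub/sub is what lets agents stay loosely coupled over NATS. The following is an in-process stand-in that shows the publish/subscribe shape only; it is not the nats-py client, and the subject name is an illustrative assumption:

```python
from collections import defaultdict
from typing import Callable


class SubjectBus:
    """In-process stand-in for NATS subject-based pub/sub."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Callable[[bytes], None]) -> None:
        # Register a handler for a subject; NATS also supports wildcards,
        # request/reply, and persistence (JetStream), omitted here.
        self._subs[subject].append(handler)

    def publish(self, subject: str, payload: bytes) -> None:
        # Deliver the payload to every subscriber of the subject.
        for handler in self._subs[subject]:
            handler(payload)


bus = SubjectBus()
received: list[bytes] = []
bus.subscribe("agent.seo.tasks", received.append)   # hypothetical subject name
bus.publish("agent.seo.tasks", b'{"task": "keyword-research"}')
```

A real deployment would replace `SubjectBus` with a connection to a NATS server; the agent-facing shape (subscribe to a subject, publish bytes) stays the same.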
Layer 0.5: Interaction Layer
The multi-channel communication layer that enables humans to interact with agents naturally through their existing tools.
Components:
- Conversation Manager — Maintains unified conversation threads across channels. An agent doesn't "think in Slack" or "think in email" — it thinks in tasks and conversations. The channel is a delivery mechanism.
- Channel Adapters — Slack bot, Email agent, WhatsApp agent, Telegram bot, and future integrations. Each adapter translates between the channel's protocol and the Conversation Manager's unified format.
- Approval Flow Engine — Routes approval requests to the appropriate channel based on context, urgency, and client preference.
- Context Persistence — Maintains conversation history across channels so that context flows naturally (e.g., a question asked on WhatsApp can reference a document sent via email).
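The channel-as-delivery-mechanism idea above can be sketched with a unified message type plus one adapter. The field names and the Slack event shape here are assumptions for illustration, not Kaze's actual schema:

```python
from dataclasses import dataclass


@dataclass
class UnifiedMessage:
    """Channel-agnostic format the Conversation Manager works with."""
    thread_id: str   # unified conversation thread, stable across channels
    channel: str     # delivery mechanism only: "slack", "email", ...
    sender: str
    text: str


def from_slack(event: dict, thread_id: str) -> UnifiedMessage:
    """Translate a Slack-style event into the unified format.

    Each channel adapter owns one such translation in each direction.
    """
    return UnifiedMessage(thread_id=thread_id, channel="slack",
                          sender=event["user"], text=event["text"])


msg = from_slack({"user": "U123", "text": "approve the draft"}, thread_id="t-42")
```

Because every adapter normalizes to `UnifiedMessage`, a WhatsApp reply and an email reply land in the same thread and the agent never needs channel-specific logic.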
Layer 1: Execution
Agents that perform actual work. These are composed from reusable skills and operate within a specific vertical.
Key concepts:
- Skills — The atomic reusable unit of agent capability (see Agent Model).
- Agents — Compositions of skills + a role + context. An agent is instantiated from a template, bound to a client, and assigned a supervision level.
- Agent Runtime — The execution environment that hosts agents. Based on the actor model — each agent has private state, a message inbox, and processes one task at a time.
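The actor model described above (private state, a message inbox, one task at a time) can be sketched as a toy using a thread and a queue; this is an illustration of the pattern, not the production runtime:

```python
import queue
import threading


class AgentActor:
    """Minimal actor: private state, a message inbox, serial processing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._state = {"handled": 0}   # private; never shared directly
        self._inbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, msg: dict) -> None:
        # The only way to interact with an actor: put a message in its inbox.
        self._inbox.put(msg)

    def _run(self) -> None:
        while True:
            msg = self._inbox.get()    # blocks; one message at a time
            if msg is None:            # sentinel: shut down
                break
            self._state["handled"] += 1

    def stop(self) -> None:
        self._inbox.put(None)
        self._thread.join()


agent = AgentActor("keyword-research")
agent.send({"task": "analyze"})
agent.send({"task": "report"})
agent.stop()
```

Serial processing per actor is what makes private state safe without locks; concurrency comes from running many actors, not from threads inside one.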
Layer 2: Orchestration & Knowledge
Agents that plan, decompose, and coordinate work across execution agents. Also hosts the shared knowledge graph.
Key concepts:
- Orchestrator Agents — Receive goals, decompose them into subtasks, assign to worker agents. Unlike static DAG workflows, orchestrators reason dynamically about the best approach and can re-plan at runtime if steps fail.
- Router Agents — Direct incoming requests to the appropriate agent or workflow based on intent classification.
- Shared Knowledge Graph — The persistent knowledge layer that agents read from and contribute to (see AI-Native).
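Dynamic planning with runtime re-planning might look like the following sketch. The goal, the subtask names, and the hard-coded plan are stand-ins for what would actually be LLM-driven decomposition:

```python
def decompose(goal: str) -> list[str]:
    """Toy decomposition; in Kaze this step is LLM-driven planning."""
    plans = {
        "improve organic traffic": [
            "keyword-research", "content-optimize", "report-generate"],
    }
    return plans.get(goal, [goal])


def run_plan(goal: str, execute) -> list[str]:
    """Run subtasks, re-planning around failures instead of aborting.

    A static DAG would fail the whole run here; an orchestrator agent
    instead routes the failed step elsewhere and continues.
    """
    done = []
    for task in decompose(goal):
        try:
            execute(task)
            done.append(task)
        except RuntimeError:
            done.append(f"replanned:{task}")   # reassign / retry elsewhere
    return done


def flaky(task: str) -> None:
    # Simulate a worker agent being unavailable for one subtask.
    if task == "content-optimize":
        raise RuntimeError("worker unavailable")


result = run_plan("improve organic traffic", flaky)
```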
Layer 3: Governance & Self-Improvement
Meta-agents that monitor, evaluate, and improve the entire system. This layer is the last to become autonomous and carries the most conservative guardrails.
Key concepts:
- Supervisor Agents — Watch agent fleet health, detect failures, take corrective action. Unlike traditional monitoring that follows static rules, supervisors reason about novel failure modes.
- Quality Monitor Agents — Evaluate agent outputs for quality, catch hallucinations or drift, score task completion. Feed results into the supervision ramp.
- Improvement Agents — Analyze execution patterns, propose prompt refinements, skill updates, and workflow optimizations. All changes go through canary deployment before full rollout.
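Feeding quality scores into the supervision ramp could work roughly like this. The threshold, the window size, and the level names are assumptions for illustration, not Kaze's actual policy:

```python
def next_level(level: str, recent_scores: list[float],
               threshold: float = 0.9, window: int = 20) -> str:
    """Advance the supervision ramp only on sustained high quality.

    Quality Monitor agents produce the scores; this function is the
    (illustrative) policy that moves an agent up or down the ramp.
    """
    ramp = ["supervised", "sampling", "autonomous"]
    if len(recent_scores) < window:
        return level                     # not enough evidence yet
    avg = sum(recent_scores[-window:]) / window
    i = ramp.index(level)
    if avg >= threshold and i < len(ramp) - 1:
        return ramp[i + 1]               # earned more autonomy
    if avg < threshold and i > 0:
        return ramp[i - 1]               # quality drift: step back down
    return level
```

Note that demotion is as important as promotion: an autonomous agent whose quality drifts gets pulled back under sampling or full supervision.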
Agent Hierarchy Summary
| Layer | Role | What lives here | Autonomy level |
|---|---|---|---|
| Layer 3 | Governance | Supervisor, Quality Monitor, Improvement agents | Last to become autonomous — human oversight longest |
| Layer 2 | Orchestration | Orchestrator agents, Router agents, Knowledge Graph | Second to become autonomous |
| Layer 1 | Execution | Worker agents composed of skills | First to go autonomous (per skill, per vertical) |
| Layer 0.5 | Interaction | Conversation Manager, Channel adapters | N/A (infrastructure) |
| Layer 0 | Infrastructure | K8s, NATS, Postgres, Vault, LLM Gateway | N/A (traditional software) |