Data Rights & Knowledge Sharing — Legal Risk Assessment
Research for Project Kaze
1. The Problem
Kaze's vertical flywheel depends on knowledge compounding across clients: agents learn from Client A's engagement, distill insights into shared vertical knowledge, and Client B's agents benefit. This is the core moat — accumulated vertical knowledge graphs that get richer with every client engagement.
The problem: this is potentially trade secret misappropriation and a GDPR violation.
When an SEO agent learns "keyword clustering with semantic grouping improved Client A's organic traffic by 40%", and that insight enters the shared vertical knowledge graph, Client B's agents benefit from Client A's competitive intelligence. In a litigation-rich environment, Client A could argue their business strategies are being used to benefit competitors.
2. Legal Landscape
2.1 EU — GDPR + Trade Secrets Directive
GDPR (Regulation 2016/679):
- Article 5(1)(b) — Purpose limitation: Data collected for "providing services to Client A" cannot be repurposed for "enriching a knowledge system that benefits all clients" without a new legal basis.
- Article 6 — Lawful basis: Legitimate interest (Art. 6(1)(f)) is unlikely to cover this — the client's interest in confidentiality outweighs Speedrun's interest in knowledge enrichment. Consent (Art. 6(1)(a)) is the safest basis, but must be freely given, specific, and revocable.
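- Article 22 (automated decision-making): Art. 22 protects natural persons against solely automated decisions that have legal or similarly significant effects on them. Agent decisions about a client's business strategy are unlikely to trigger it unless they significantly affect identifiable individuals, but the point should be checked with counsel.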
- Note: GDPR primarily covers personal data. Business data (strategies, processes) may not be "personal data" unless it relates to identifiable individuals. But the boundary is fuzzy.
Trade Secrets Directive (2016/943):
- Protects confidential business information with commercial value.
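- Article 4(2)(b): Acquisition of a trade secret is unlawful if obtained through "any other conduct which, under the circumstances, is considered contrary to honest commercial practices."
- Article 4(3): Use or disclosure of a trade secret is unlawful when it breaches a confidentiality agreement or a contractual duty to limit the use of the trade secret. Using one client's strategic data to benefit another, even in anonymized form, could be argued to breach such a duty or to be contrary to honest commercial practices.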
- Article 3(1)(b): Lawful acquisition includes information obtained "through observation, study, disassembly or testing of a product or object that has been made available to the public." Client-internal strategies don't qualify.
Key risk: Even if data is anonymized, with small vertical populations (e.g., 5 SaaS companies doing SEO through Kaze in the DACH region), patterns may be trivially re-identifiable. The EU threshold for anonymization is high: under GDPR Recital 26, data counts as anonymous only if re-identification is not reasonably likely given the means available.
2.2 US — DTSA + State Laws
Defend Trade Secrets Act (2016) + Uniform Trade Secrets Act (48 states):
- Trade secret: information that derives independent economic value from not being generally known, and is subject to reasonable efforts to maintain secrecy.
- Misappropriation: Disclosure or use of a trade secret by someone who acquired it through improper means or under a duty to maintain its secrecy.
- A service provider using confidential client data to benefit other clients is a classic misappropriation claim.
- Even if the specific data isn't directly copied, the "inevitable disclosure" doctrine (recognized in some states, and developed in the context of departing employees) could be argued by analogy: if Speedrun's agents learned from Client A's strategies, they inevitably carry that knowledge into Client B's work.
Contractual risk:
- Most B2B service agreements include confidentiality clauses.
- Standard NDA language typically prohibits using confidential information for any purpose other than the agreed-upon service.
- Using client data to enrich a shared knowledge base almost certainly exceeds standard NDA scope unless specifically carved out.
Litigation environment:
- Trade secret claims are the fastest-growing category of IP litigation in the US.
- Even a weak claim is expensive to defend: $2-5M average cost through trial.
- The mere existence of a shared knowledge system enriched by client data is discoverable and would be a smoking gun in any trade secret case.
2.3 Other Jurisdictions
- UK (post-Brexit): Similar to the EU. The Trade Secrets (Enforcement, etc.) Regulations 2018 implemented the EU directive and remain in force.
- Singapore/ASEAN: Common law trade secret protection. Less regulatory, but contractual obligations apply.
- Japan/Korea: Strong trade secret protections under Unfair Competition Prevention Acts.
3. How Knowledge Flows Today (The Risk Path)
Current architecture (per ai-native.md Section 4 and technical-design.md):
Client A engagement
→ Agent executes tasks using Client A's data
→ Agent observes patterns, learns strategies
→ Agent writes to knowledge system:
├── Client-private tier: client-specific facts (isolated, no risk)
└── Shared vertical tier: distilled insights (THE RISK)
→ Quality gate reviews shared write
→ Insight enters shared vertical knowledge
→ Client B's agent retrieves this insight during task execution
→ Client B benefits from Client A's trade secrets

The quality gate (LLM-as-judge + cross-reference) checks for accuracy and relevance, not for data provenance or legal compliance. There is no mechanism today that distinguishes "knowledge derived from public sources" from "knowledge derived from client engagement."
4. Architectural Options
Option A: Strict Isolation (No Cross-Pollination)
Client-derived learnings never enter shared vertical knowledge. Full stop.
How it works:
- Agents can only write to the client-private tier from client engagements.
- Shared vertical knowledge is built exclusively from: public sources, published research, Speedrun's internal experiments (V0), and manually curated domain expertise.
- The knowledge system enforces this at the write pipeline — any write originating from a client-context agent is automatically scoped to client-private.
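The write-pipeline rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming each write request carries the client context it originated from. `Tier`, `WriteRequest`, and `scope_write` are hypothetical names, not part of the Kaze codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    CLIENT_PRIVATE = "client_private"
    SHARED_VERTICAL = "shared_vertical"


@dataclass(frozen=True)
class WriteRequest:
    content: str
    requested_tier: Tier
    # Hypothetical field: None means the write has no client context
    # (for example, public research or Speedrun's own V0 operations).
    client_id: Optional[str]


def scope_write(request: WriteRequest) -> Tier:
    """Return the tier a write actually lands in under strict isolation."""
    if request.client_id is not None:
        # Any write from a client-context agent is forced to client-private,
        # regardless of what the agent asked for.
        return Tier.CLIENT_PRIVATE
    return request.requested_tier


if __name__ == "__main__":
    attempt = WriteRequest("clustering worked well", Tier.SHARED_VERTICAL, "client-a")
    print(scope_write(attempt).value)
```

The design point is that the requested tier is only honored when no client context exists, so a misbehaving agent cannot promote client-derived content by asking for the shared tier.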
Tradeoffs:
| Pro | Con |
|---|---|
| Zero legal risk from cross-pollination | Kills the compounding flywheel |
| Simple to explain to clients and regulators | Vertical knowledge is static without manual curation |
| No anonymization complexity | No moat from accumulated client work |
| No consent management | Speedrun competes on execution speed only, not accumulated intelligence |
When to choose: If legal counsel says the risk is unacceptable at any level, or if operating exclusively in strict jurisdictions (EU financial services, US healthcare).
Option B: Explicit Consent + Anonymization Pipeline
Clients opt-in via contract clause. An anonymization pipeline transforms client-specific learnings into generic patterns.
How it works:
- Service contract includes a specific "Knowledge Contribution" clause with:
  - Clear description of what data flows and where
  - Explicit consent (GDPR Art. 6(1)(a) basis)
  - Right to withdraw at any time (with prospective effect)
  - Guarantee that contributed data is anonymized/abstracted
- Anonymization pipeline runs before shared knowledge entry:
  - Strips client identifiers, specific numbers, dates, product names
  - Abstracts to category level ("e-commerce SaaS" not "Client A's product")
  - Generalizes metrics to ranges ("30-50% improvement" not "42.3%")
- Quality gate expanded to include anonymization verification.
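Two of the anonymization steps above, abstracting identifiers to category level and generalizing metrics to ranges, can be sketched as follows. This is a rough illustration only: the bucket boundaries in `BUCKET_EDGES`, the function names, and the caller-supplied identifier map are all assumptions, and a real pipeline would also need to handle dates, product names, and identifiers it has not been told about.

```python
import bisect
import re

# Hypothetical range boundaries; a real pipeline would tune these per metric.
BUCKET_EDGES = [0, 10, 30, 50, 100]
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def generalize_percentage(value: float) -> str:
    """Map an exact percentage to a coarse range label."""
    if value < BUCKET_EDGES[0]:
        return f"<{BUCKET_EDGES[0]}%"
    if value >= BUCKET_EDGES[-1]:
        return f"{BUCKET_EDGES[-1]}%+"
    i = bisect.bisect_right(BUCKET_EDGES, value) - 1
    return f"{BUCKET_EDGES[i]}-{BUCKET_EDGES[i + 1]}%"


def abstract_insight(text: str, replacements: dict[str, str]) -> str:
    """Replace known identifiers with category labels, then bucket percentages."""
    for identifier, category in replacements.items():
        text = re.sub(re.escape(identifier), category, text, flags=re.IGNORECASE)
    return PERCENT_RE.sub(lambda m: generalize_percentage(float(m.group(1))), text)


if __name__ == "__main__":
    raw = "Acme Shop saw 42.3% more organic traffic"
    print(abstract_insight(raw, {"Acme Shop": "an e-commerce SaaS company"}))
```

Note that string substitution of this kind only removes what it is told to remove, which is one reason the expanded quality gate still needs an independent anonymization check.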
Tradeoffs:
| Pro | Con |
|---|---|
| Preserves flywheel for consenting clients | Anonymization is genuinely hard — re-identification risk with small populations |
| Contractual consent is strong legal basis | Requires legal review per jurisdiction |
| Clients who contribute get richer knowledge | Pipeline adds complexity and potential failure mode |
| Aligns incentives (contribute to benefit) | Consent withdrawal requires retroactive scrubbing (complex) |
Key risk: With 3 clients in a vertical in one region, "an e-commerce SaaS company in DACH saw 40% improvement with keyword clustering" is effectively identified. Minimum viable anonymity requires a critical mass of clients per vertical segment.
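The critical-mass requirement could be enforced as a k-anonymity-style gate on each vertical segment. The sketch below assumes a segment is a (vertical, region) pair and that contributors are known per segment. The threshold `k` is a placeholder, since the legally defensible value is still an open question (DR1), and `segments_safe_to_share` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Segment = Tuple[str, str]  # (vertical, region); an assumed segment definition


def segments_safe_to_share(
    contributors: Iterable[Tuple[str, Segment]], k: int
) -> Set[Segment]:
    """Return the segments that have at least k distinct contributing clients."""
    # Deduplicate (client, segment) pairs so one client is counted once per segment.
    counts = Counter(segment for _, segment in set(contributors))
    return {segment for segment, n in counts.items() if n >= k}


if __name__ == "__main__":
    dach = ("seo", "dach")
    us = ("seo", "us")
    pairs = [(f"client-{i}", dach) for i in range(3)]
    pairs += [(f"client-{i}", us) for i in range(10, 16)]
    # With k=5, the three-client DACH segment is withheld from the shared pool.
    print(segments_safe_to_share(pairs, k=5))
```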
When to choose: If you're confident in the anonymization pipeline and have enough clients per vertical to prevent re-identification.
Option C: Aggregate-Only Learning (Statistical)
Only numerical/statistical patterns enter shared knowledge. Never specific strategies, approaches, or qualitative insights.
How it works:
- The knowledge system tracks aggregate metrics across client engagements:
  - Task success rates by skill type
  - Tool effectiveness scores
  - Workflow pattern efficiency comparisons
  - Model performance by task category
- These are purely statistical: "keyword gap analysis before content planning: 72% success rate (n=234 tasks)" — never "what keywords" or "what content."
- No qualitative knowledge ("this strategy works") — only quantitative ("tasks using this pattern succeed N% more often").
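A sketch of aggregate-only learning, assuming each task outcome records the client, the workflow pattern used, and whether it succeeded. The `min_tasks` and `min_clients` suppression thresholds are an added assumption rather than something this document specifies: they keep a statistic out of the shared pool when it rests on too few tasks or too few clients to be truly aggregate. All names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskOutcome:
    client_id: str
    pattern: str
    succeeded: bool


def aggregate_success_rates(
    outcomes: Iterable[TaskOutcome], min_tasks: int = 30, min_clients: int = 5
) -> Dict[str, dict]:
    """Return per-pattern success rates, suppressing thinly supported patterns."""
    # Per pattern: [task count, success count, set of contributing clients]
    totals = defaultdict(lambda: [0, 0, set()])
    for outcome in outcomes:
        entry = totals[outcome.pattern]
        entry[0] += 1
        entry[1] += int(outcome.succeeded)
        entry[2].add(outcome.client_id)

    report = {}
    for pattern, (n, wins, clients) in totals.items():
        # Only the count and the rate leave this function; client IDs never do.
        if n >= min_tasks and len(clients) >= min_clients:
            report[pattern] = {"n": n, "success_rate": round(wins / n, 2)}
    return report
```

Only the pattern name, the sample size, and the rate are emitted, matching the "n=234 tasks" style of entry described above.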
Tradeoffs:
| Pro | Con |
|---|---|
| Strong legal defensibility: aggregate statistics are unlikely to expose any single client's trade secrets | Much less useful than qualitative knowledge |
| GDPR-friendly (no personal or business-specific data) | Can't share "what works" in rich detail |
| No anonymization needed — data is inherently abstract | Vertical knowledge is shallow |
| No consent complexity | The moat is weaker — competitors can replicate stats |
When to choose: If you want a legally safe middle ground that still provides some flywheel effect, even if the knowledge is shallow.
Option D: Tiered Consent Model (Chosen)
Default is strict isolation. Opt-in tiers exist for clients who want to contribute and benefit. Speedrun-sourced knowledge is always shared.
How it works:
Three knowledge source tiers coexist:
┌─────────────────────────────────────────────────────────────┐
│                  VERTICAL KNOWLEDGE GRAPH                   │
│                                                             │
│  ┌─────────────────────┐  Always available to all clients   │
│  │ Speedrun-Sourced    │  • Public domain knowledge         │
│  │ (no client data)    │  • Speedrun research & benchmarks  │
│  │                     │  • V0 internal ops learnings       │
│  └─────────────────────┘                                    │
│                                                             │
│  ┌─────────────────────┐  Available to contributor-tier     │
│  │ Client-Contributed  │  clients only                      │
│  │ (anonymized,        │  • Anonymized strategy patterns    │
│  │  consented)         │  • Abstracted workflow insights    │
│  │                     │  • Aggregate + qualitative mix     │
│  └─────────────────────┘                                    │
│                                                             │
│  ┌─────────────────────┐  Visible only to the owning client │
│  │ Client-Private      │  • Brand voice, preferences        │
│  │ (isolated)          │  • Client-specific strategies      │
│  │                     │  • Full-fidelity learnings         │
│  └─────────────────────┘                                    │
└─────────────────────────────────────────────────────────────┘

Client tiers:
| Tier | What they see | What they contribute | Contractual basis |
|---|---|---|---|
| Standard (default) | Speedrun-sourced only | Nothing to shared pool | Standard service agreement |
| Contributor (opt-in) | Speedrun-sourced + contributed pool | Anonymized learnings (with pipeline) | Knowledge Contribution Addendum |
| VPC / Enterprise | Configurable (may opt into contributed pool or not) | Configurable per policy | Custom agreement |
Pricing incentive: Contributor-tier clients could receive a discount or enhanced service level, aligning economic incentives with knowledge sharing.
Knowledge provenance classification — every entry tagged:
| Source Class | Description | Legal basis |
|---|---|---|
| public | Public domain sources (docs, articles, APIs) | No restriction |
| speedrun_internal | From Speedrun's own operations (V0) | Speedrun's IP |
| speedrun_research | Funded research and benchmarking | Speedrun's IP |
| client_contributed | Anonymized from client engagement, with consent | Contractual consent (GDPR Art. 6(1)(a)) |
| client_private | Client-specific, never shared | Confidential |
ABAC enforcement:
- Knowledge query resolves source class visibility based on: tenant consent tier + deployment mode + jurisdiction config.
- A standard-tier client's agent query will never return client_contributed entries.
- A contributor-tier client's agent sees public + speedrun_internal + speedrun_research + client_contributed.
- Write pipeline enforces source class tagging: agents cannot write to client_contributed without the consent flag.
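The read-side visibility rules above can be sketched as a single predicate. This is a simplified illustration: it models only the standard and contributor tiers and leaves out deployment mode, jurisdiction config, and the configurable VPC / Enterprise tier. The class and function names are hypothetical, and only the source class values come from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceClass(str, Enum):
    PUBLIC = "public"
    SPEEDRUN_INTERNAL = "speedrun_internal"
    SPEEDRUN_RESEARCH = "speedrun_research"
    CLIENT_CONTRIBUTED = "client_contributed"
    CLIENT_PRIVATE = "client_private"


class ConsentTier(str, Enum):
    STANDARD = "standard"
    CONTRIBUTOR = "contributor"


SPEEDRUN_SOURCED = frozenset(
    {SourceClass.PUBLIC, SourceClass.SPEEDRUN_INTERNAL, SourceClass.SPEEDRUN_RESEARCH}
)


@dataclass(frozen=True)
class Entry:
    source_class: SourceClass
    owner_tenant: Optional[str] = None  # set only for client_private entries


def is_visible(entry: Entry, tenant_id: str, tier: ConsentTier) -> bool:
    """Decide whether a tenant's agent query may return this entry."""
    if entry.source_class in SPEEDRUN_SOURCED:
        return True  # always available to all clients
    if entry.source_class is SourceClass.CLIENT_CONTRIBUTED:
        return tier is ConsentTier.CONTRIBUTOR
    # client_private: visible only to the owning client
    return entry.owner_tenant == tenant_id
```

A query layer would apply `is_visible` as a filter before ranking, so entries outside a tenant's tier never reach the agent.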
Consent withdrawal:
- Client revokes consent → prospective effect (no new contributions).
- Historical contributions remain (they're anonymized and abstracted). Contract clause specifies this.
- If client demands retroactive removal: knowledge entries traceable via provenance chain → can be tombstoned. This is operationally expensive but architecturally possible.
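The two withdrawal paths above, prospective by default and retroactive on demand, might look like this in code. The sketch assumes each contributed entry records a single contributing tenant in its provenance. Entries distilled from several clients would need a richer provenance chain. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Tenant:
    tenant_id: str
    contribution_consent: bool


@dataclass
class KnowledgeEntry:
    entry_id: str
    source_class: str
    contributor_tenant: Optional[str]  # simplified provenance link
    tombstoned: bool = False


def withdraw_consent(
    tenant: Tenant, entries: Iterable[KnowledgeEntry], retroactive: bool = False
) -> List[str]:
    """Revoke consent and return the IDs of any entries tombstoned as a result."""
    # Prospective effect: the tenant stops contributing from now on.
    tenant.contribution_consent = False
    tombstoned = []
    if retroactive:
        for entry in entries:
            if (
                entry.source_class == "client_contributed"
                and entry.contributor_tenant == tenant.tenant_id
                and not entry.tombstoned
            ):
                # Tombstone rather than delete, so the action stays auditable.
                entry.tombstoned = True
                tombstoned.append(entry.entry_id)
    return tombstoned
```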
Tradeoffs:
| Pro | Con |
|---|---|
| Legal risk only for consenting clients | More complex knowledge access layer |
| Non-consenting clients fully isolated | Contributor pool may be small initially |
| Flywheel works (slower, subset contributes) | Anonymization pipeline still required |
| Incentive alignment via pricing | Per-entry provenance adds storage/query overhead |
| Speedrun's own V0 always feeds vertical knowledge | Consent withdrawal handling is complex |
| Architecturally supports all options (tier defaults can change later) | |
5. Recommendation
Option D (Tiered Consent Model) is the chosen approach for Kaze.
Why:
- Default-safe: Out of the box, every client is strictly isolated. No client-derived knowledge crosses client boundaries until a client explicitly opts in.
- Preserves the flywheel: Even a small contributor pool generates compounding knowledge. Speedrun's own V0 ops always contribute.
- Legally grounded: Consent-based, per GDPR Art. 6(1)(a). Trade secret claims are countered by explicit contractual permission.
- Architecturally future-proof: The provenance classification system supports changing the model later — if legal landscape shifts, tighten to Option A or C without rebuilding.
- Business model alignment: Contributor tier can be a pricing lever, not just a legal mechanism.
Critical dependencies:
- Legal counsel must draft the Knowledge Contribution Addendum before any client opts into contributor tier.
- Anonymization pipeline quality must be validated before going live (minimum viable anonymity threshold defined per vertical).
- Provenance classification is an MVP requirement for the knowledge system — it must be built from day one, even if the contributor tier launches later.
6. Impact on Architecture
Knowledge System Changes
- Every knowledge entry gains a source_class field (from the 5 classes above).
- ABAC rules expanded: source class + consent tier → visibility.
- Write pipeline gains provenance tagging step (mandatory, not optional).
- Anonymization pipeline component added to the write pipeline (for client_contributed class).
Product Strategy Changes
- The flywheel description must clarify that knowledge compounding is tiered, not universal.
- Contributor tier becomes a product/pricing decision, not just a technical one.
Threat Model Addition
- New attack surface: consent bypass (agent writes to contributed tier without valid consent).
- Mitigation: consent status is a platform-level flag on the tenant, not an agent-level parameter. Agents don't choose — the platform enforces.
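A minimal sketch of the mitigation, assuming the platform holds a consent registry keyed by tenant and that the tenant ID comes from the authenticated session rather than from the agent. `Platform`, `ConsentError`, and `authorize_write` are hypothetical names.

```python
from typing import Dict


class ConsentError(PermissionError):
    """Raised when a write targets the contributed tier without tenant consent."""


class Platform:
    def __init__(self, consent_by_tenant: Dict[str, bool]) -> None:
        # Platform-owned registry; agents hold no reference that can mutate it.
        self._consent = dict(consent_by_tenant)

    def authorize_write(self, tenant_id: str, source_class: str) -> str:
        """Return the source class if the write is allowed, otherwise raise."""
        if source_class == "client_contributed" and not self._consent.get(tenant_id, False):
            raise ConsentError(f"tenant {tenant_id!r} has not opted into contribution")
        return source_class


if __name__ == "__main__":
    platform = Platform({"client-a": True, "client-b": False})
    print(platform.authorize_write("client-a", "client_contributed"))
    try:
        platform.authorize_write("client-b", "client_contributed")
    except ConsentError as err:
        print(f"blocked: {err}")
```

Because `authorize_write` takes no consent argument, there is nothing for an agent to forge: the only input it controls is the requested source class.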
7. Open Questions
| # | Question | Impact |
|---|---|---|
| DR1 | What is the minimum number of clients per vertical segment for anonymization to be legally defensible? | High — determines when contributor tier can launch |
| DR2 | Can the Knowledge Contribution Addendum be a standard clause or does it need per-jurisdiction variants? | Medium — affects go-to-market speed |
| DR3 | Should contributor-tier clients see each other's individual contributions, or only the aggregated pool? | Medium — affects knowledge system design |
| DR4 | Does Speedrun need a Data Protection Officer (DPO) under GDPR if processing client data for knowledge enrichment? | Medium — regulatory compliance |
| DR5 | How do we handle a client who contributes, then becomes a direct competitor to another contributor? | High — conflict of interest, potential litigation |