Data Rights & Knowledge Sharing — Legal Risk Assessment

Research for Project Kaze

1. The Problem

Kaze's vertical flywheel depends on knowledge compounding across clients: agents learn from Client A's engagement, distill insights into shared vertical knowledge, and Client B's agents benefit. This is the core moat — accumulated vertical knowledge graphs that get richer with every client engagement.

The problem: this is potentially trade secret misappropriation and a GDPR violation.

When an SEO agent learns "keyword clustering with semantic grouping improved Client A's organic traffic by 40%", and that insight enters the shared vertical knowledge graph, Client B's agents benefit from Client A's competitive intelligence. In a litigious environment, Client A could argue that its business strategies are being used to benefit competitors.

2. Legal Landscape

2.1 EU — GDPR + Trade Secrets Directive

GDPR (Regulation 2016/679):

Article 5(1)(b) — Purpose limitation: Data collected for "providing services to Client A" cannot be repurposed for "enriching a knowledge system that benefits all clients" without a new legal basis.
Article 6 — Lawful basis: Legitimate interest (Art. 6(1)(f)) is unlikely to cover this, because the client's interest in confidentiality is likely to outweigh Speedrun's interest in knowledge enrichment. Consent (Art. 6(1)(a)) is the safest basis, but it must be freely given, specific, informed, unambiguous, and withdrawable at any time.
Article 22 — Automated decision-making: If AI agents make decisions about client strategies based on other clients' data, this may trigger Art. 22 protections, which apply to solely automated decisions that produce legal or similarly significant effects for an individual.
Note: GDPR primarily covers personal data. Business data (strategies, processes) may not be "personal data" unless it relates to identifiable individuals. But the boundary is fuzzy.

Trade Secrets Directive (2016/943):

Protects confidential business information with commercial value.
Article 4(2)(b): Acquisition of a trade secret is unlawful if obtained through "any other conduct which, under the circumstances, is considered contrary to honest commercial practices."
Using one client's strategic data to benefit another, even in anonymized form, could be argued as contrary to honest commercial practices.
Article 3(1)(b): Lawful acquisition includes information obtained "through observation, study, disassembly or testing of a product or object that has been made available to the public." Client-internal strategies don't qualify.

Key risk: Even if data is anonymized, with small vertical populations (e.g., 5 SaaS companies doing SEO through Kaze in DACH region), patterns may be trivially re-identifiable. EU courts have held that the threshold for anonymization is high.

2.2 US — DTSA + State Laws

Defend Trade Secrets Act (2016) + Uniform Trade Secrets Act (48 states):

Trade secret: information that derives independent economic value from not being generally known, and is subject to reasonable efforts to maintain secrecy.
Misappropriation: Disclosure or use of a trade secret by someone who acquired it through improper means or under a duty to maintain its secrecy.
A service provider using confidential client data to benefit other clients is a classic misappropriation claim.
Even if the specific data isn't directly copied, "inevitable disclosure" doctrine (some states) could apply — if Speedrun's agents learned from Client A's strategies, they inevitably carry that knowledge into Client B's work.

Contractual risk:

Most B2B service agreements include confidentiality clauses.
Standard NDA language typically prohibits using confidential information for any purpose other than the agreed-upon service.
Using client data to enrich a shared knowledge base almost certainly exceeds standard NDA scope unless specifically carved out.

Litigation environment:

Trade secret claims are the fastest-growing category of IP litigation in the US.
Even a weak claim is expensive to defend: $2-5M average cost through trial.
The mere existence of a shared knowledge system enriched by client data is discoverable and would be a smoking gun in any trade secret case.

2.3 Other Jurisdictions

UK (post-Brexit): Similar to the EU. The Trade Secrets (Enforcement, etc.) Regulations 2018 mirror the EU directive.
Singapore/ASEAN: Common law trade secret protection. Less regulatory, but contractual obligations apply.
Japan/Korea: Strong trade secret protections under Unfair Competition Prevention Acts.

3. How Knowledge Flows Today (The Risk Path)

Current architecture (per ai-native.md Section 4 and technical-design.md):

Client A engagement
  → Agent executes tasks using Client A's data
  → Agent observes patterns, learns strategies
  → Agent writes to knowledge system:
      ├── Client-private tier: client-specific facts (isolated, no risk)
      └── Shared vertical tier: distilled insights (THE RISK)
  → Quality gate reviews shared write
  → Insight enters shared vertical knowledge
  → Client B's agent retrieves this insight during task execution
  → Client B benefits from Client A's trade secrets

The quality gate (LLM-as-judge + cross-reference) checks for accuracy and relevance, not for data provenance or legal compliance. There is no mechanism today that distinguishes "knowledge derived from public sources" from "knowledge derived from client engagement."

4. Architectural Options

Option A: Strict Isolation (No Cross-Pollination)

Client-derived learnings never enter shared vertical knowledge. Full stop.

How it works:

Agents can only write to the client-private tier from client engagements.
Shared vertical knowledge is built exclusively from: public sources, published research, Speedrun's internal experiments (V0), and manually curated domain expertise.
The knowledge system enforces this at the write pipeline — any write originating from a client-context agent is automatically scoped to client-private.
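The write-pipeline rule above can be sketched as follows. This is a minimal illustration, not the actual Kaze pipeline: `Tier`, `WriteRequest`, and `scope_write` are hypothetical names, and the assumption is that the platform (not the agent) attaches `client_id` to every write made in a client context.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    CLIENT_PRIVATE = "client_private"
    SHARED_VERTICAL = "shared_vertical"


@dataclass(frozen=True)
class WriteRequest:
    content: str
    requested_tier: Tier
    client_id: Optional[str]  # assumption: None means no client context


def scope_write(request: WriteRequest) -> WriteRequest:
    """Force any write that carries a client context into the client-private tier."""
    if request.client_id is not None:
        # Whatever tier the agent asked for is ignored under strict isolation.
        return WriteRequest(request.content, Tier.CLIENT_PRIVATE, request.client_id)
    return request
```

Under this sketch, a write from Speedrun's own research context (no `client_id`) keeps its requested tier, while a client-context write is rescoped regardless of what the agent asked for.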

Tradeoffs:

Pro	Con
Zero legal risk from cross-pollination	Kills the compounding flywheel
Simple to explain to clients and regulators	Vertical knowledge is static without manual curation
No anonymization complexity	No moat from accumulated client work
No consent management	Speedrun competes on execution speed only, not accumulated intelligence

When to choose: If legal counsel says the risk is unacceptable at any level, or if operating exclusively in strict jurisdictions (EU financial services, US healthcare).

Option B: Explicit Consent + Anonymization Pipeline

Clients opt in via a contract clause. An anonymization pipeline transforms client-specific learnings into generic patterns.

How it works:

Service contract includes a specific "Knowledge Contribution" clause with:
- Clear description of what data flows and where
- Explicit consent (GDPR Art. 6(1)(a) basis)
- Right to withdraw at any time (with prospective effect)
- Guarantee that contributed data is anonymized/abstracted
Anonymization pipeline runs before shared knowledge entry:
- Strips client identifiers, specific numbers, dates, product names
- Abstracts to category level ("e-commerce SaaS" not "Client A's product")
- Generalizes metrics to ranges ("30-50% improvement" not "42.3%")
Quality gate expanded to include anonymization verification.
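Two of the pipeline steps listed above (replacing identifiers with a category label and generalizing metrics to ranges) might look like the sketch below. The function names, the 20-point bucket width, and the idea that identifiers are supplied as a known list are all assumptions for illustration. String redaction alone does not make data anonymous in the legal sense, and date handling is left out.

```python
import re
from typing import Iterable


def generalize_percentages(text: str, bucket: int = 20) -> str:
    """Replace each exact percentage with the bucket-wide range that contains it."""
    def to_range(match: re.Match) -> str:
        value = float(match.group(1))
        low = int(value // bucket) * bucket
        return f"{low}-{low + bucket}%"

    return re.sub(r"(\d+(?:\.\d+)?)\s*%", to_range, text)


def redact_terms(text: str, terms: Iterable[str], placeholder: str) -> str:
    """Replace known client identifiers or product names with a category label."""
    # Longest terms first, so "Acme Shop Pro" is replaced before "Acme Shop".
    for term in sorted(terms, key=len, reverse=True):
        text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
    return text


def anonymize(text: str, identifiers: Iterable[str], category: str) -> str:
    """Apply redaction first, then metric generalization."""
    return generalize_percentages(redact_terms(text, identifiers, category))
```

For example, `anonymize("Acme Shop saw 42.3% more organic traffic", ["Acme Shop"], "an e-commerce SaaS company")` yields `"an e-commerce SaaS company saw 40-60% more organic traffic"`. The expanded quality gate would still need to verify the result, since free-text insights can leak identity in ways a term list does not catch.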

Tradeoffs:

Pro	Con
Preserves flywheel for consenting clients	Anonymization is genuinely hard — re-identification risk with small populations
Contractual consent is strong legal basis	Requires legal review per jurisdiction
Clients who contribute get richer knowledge	Pipeline adds complexity and potential failure mode
Aligns incentives (contribute to benefit)	Consent withdrawal may require retroactive scrubbing (complex)

Key risk: With 3 clients in a vertical in one region, "an e-commerce SaaS company in DACH saw 40% improvement with keyword clustering" is effectively identified. Minimum viable anonymity requires a critical mass of clients per vertical segment.
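One way to make "critical mass" operational is a k-anonymity-style count threshold: a segment only feeds the shared pool once at least k distinct clients contribute to it. The source does not name this technique or a value for k, so both are assumptions here, as are the `Segment` shape and the function name. Whether any particular k is legally defensible is an open question for counsel.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical segment key: (vertical, region), e.g. ("ecommerce_saas", "DACH").
Segment = Tuple[str, str]


def segments_safe_to_share(contributors: Iterable[Segment], k: int) -> Set[Segment]:
    """Return the segments that have at least k contributing clients.

    `contributors` holds one entry per contributing client, giving that
    client's segment. Segments below the threshold stay client-private.
    """
    counts = Counter(contributors)
    return {segment for segment, n in counts.items() if n >= k}
```

With three DACH clients and twelve US clients in the same vertical and `k=10`, only the US segment would qualify, which matches the concern above that three clients in one region are effectively identified.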

When to choose: If you're confident in the anonymization pipeline and have enough clients per vertical to prevent re-identification.

Option C: Aggregate-Only Learning (Statistical)

Only numerical/statistical patterns enter shared knowledge. Never specific strategies, approaches, or qualitative insights.

How it works:

The knowledge system tracks aggregate metrics across client engagements:
- Task success rates by skill type
- Tool effectiveness scores
- Workflow pattern efficiency comparisons
- Model performance by task category
These are purely statistical: "keyword gap analysis before content planning: 72% success rate (n=234 tasks)" — never "what keywords" or "what content."
No qualitative knowledge ("this strategy works") — only quantitative ("tasks using this pattern succeed N% more often").
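A sketch of the aggregate-only idea: only a pattern label and a success flag leave the engagement, and a pattern is published only once enough tasks back it. `TaskOutcome`, `aggregate_success_rates`, and the `min_n` floor are hypothetical. The source gives an example with n=234 but does not specify a minimum sample size.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TaskOutcome:
    pattern: str      # workflow pattern label, never keywords or content
    succeeded: bool


def aggregate_success_rates(
    outcomes: Iterable[TaskOutcome], min_n: int = 30
) -> Dict[str, Tuple[float, int]]:
    """Return {pattern: (success_rate, n)} for patterns with at least min_n tasks."""
    totals = defaultdict(lambda: [0, 0])  # pattern -> [successes, tasks]
    for outcome in outcomes:
        totals[outcome.pattern][0] += int(outcome.succeeded)
        totals[outcome.pattern][1] += 1
    return {
        pattern: (successes / n, n)
        for pattern, (successes, n) in totals.items()
        if n >= min_n
    }
```

Dropping low-count patterns keeps a statistic from being traceable to one or two engagements, at the cost of an even shallower shared pool.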

Tradeoffs:

Pro	Con
Strong legal defensibility — statistics aren't trade secrets	Much less useful than qualitative knowledge
GDPR-friendly (no personal or business-specific data)	Can't share "what works" in rich detail
No anonymization needed — data is inherently abstract	Vertical knowledge is shallow
No consent complexity	The moat is weaker — competitors can replicate stats

When to choose: If you want a legally safe middle ground that still provides some flywheel effect, even if the knowledge is shallow.

Option D: Tiered Consent Model (Chosen)

Default is strict isolation, with opt-in tiers for clients who want to contribute and benefit. Speedrun-sourced knowledge is always shared.

How it works:

Three knowledge source tiers coexist:

┌─────────────────────────────────────────────────────────────┐
│                VERTICAL KNOWLEDGE GRAPH                       │
│                                                               │
│  ┌─────────────────────┐  Always available to all clients    │
│  │  Speedrun-Sourced    │  • Public domain knowledge          │
│  │  (no client data)    │  • Speedrun research & benchmarks   │
│  │                      │  • V0 internal ops learnings        │
│  └─────────────────────┘                                      │
│                                                               │
│  ┌─────────────────────┐  Available to contributor-tier       │
│  │  Client-Contributed  │  clients only                       │
│  │  (anonymized,        │  • Anonymized strategy patterns     │
│  │   consented)         │  • Abstracted workflow insights     │
│  │                      │  • Aggregate + qualitative mix      │
│  └─────────────────────┘                                      │
│                                                               │
│  ┌─────────────────────┐  Visible only to the owning client  │
│  │  Client-Private      │  • Brand voice, preferences         │
│  │  (isolated)          │  • Client-specific strategies       │
│  │                      │  • Full-fidelity learnings          │
│  └─────────────────────┘                                      │
└─────────────────────────────────────────────────────────────┘

Client tiers:

Tier	What they see	What they contribute	Contractual basis
Standard (default)	Speedrun-sourced only	Nothing to shared pool	Standard service agreement
Contributor (opt-in)	Speedrun-sourced + contributed pool	Anonymized learnings (with pipeline)	Knowledge Contribution Addendum
VPC / Enterprise	Configurable (may opt into contributed pool or not)	Configurable per policy	Custom agreement

Pricing incentive: Contributor-tier clients could receive a discount or enhanced service level, aligning economic incentives with knowledge sharing.

Knowledge provenance classification — every entry tagged:

Source Class	Description	Legal basis
`public`	Public domain sources (docs, articles, APIs)	No restriction
`speedrun_internal`	From Speedrun's own operations (V0)	Speedrun's IP
`speedrun_research`	Funded research and benchmarking	Speedrun's IP
`client_contributed`	Anonymized from client engagement, with consent	Contractual consent (GDPR Art. 6(1)(a))
`client_private`	Client-specific, never shared	Confidential

ABAC enforcement:

Knowledge query resolves source class visibility based on: tenant consent tier + deployment mode + jurisdiction config.
A standard-tier client's agent query will never return `client_contributed` entries.
A contributor-tier client's agent sees `public` + `speedrun_internal` + `speedrun_research` + `client_contributed`.
Write pipeline enforces source class tagging: agents cannot write to `client_contributed` without the consent flag.
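The read-side rules above can be illustrated with a small visibility filter. This is a simplified sketch: it resolves visibility from the consent tier only and leaves out deployment mode and jurisdiction config. `Entry`, `ConsentTier`, `owner_tenant`, and `visible_entries` are hypothetical names. The source class values are the ones from the provenance table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class SourceClass(Enum):
    PUBLIC = "public"
    SPEEDRUN_INTERNAL = "speedrun_internal"
    SPEEDRUN_RESEARCH = "speedrun_research"
    CLIENT_CONTRIBUTED = "client_contributed"
    CLIENT_PRIVATE = "client_private"


class ConsentTier(Enum):
    STANDARD = "standard"
    CONTRIBUTOR = "contributor"


# Classes every tenant may read, regardless of consent tier.
SPEEDRUN_SOURCED = frozenset({
    SourceClass.PUBLIC,
    SourceClass.SPEEDRUN_INTERNAL,
    SourceClass.SPEEDRUN_RESEARCH,
})


@dataclass(frozen=True)
class Entry:
    text: str
    source_class: SourceClass
    owner_tenant: Optional[str] = None  # assumption: set only for client_private


def visible_entries(
    entries: Iterable[Entry], tenant_id: str, tier: ConsentTier
) -> List[Entry]:
    """Filter entries down to what this tenant's agents are allowed to retrieve."""
    allowed = set(SPEEDRUN_SOURCED)
    if tier is ConsentTier.CONTRIBUTOR:
        allowed.add(SourceClass.CLIENT_CONTRIBUTED)
    visible = []
    for entry in entries:
        if entry.source_class is SourceClass.CLIENT_PRIVATE:
            # Private entries are visible only to the tenant that owns them.
            if entry.owner_tenant == tenant_id:
                visible.append(entry)
        elif entry.source_class in allowed:
            visible.append(entry)
    return visible
```

In a real deployment this filter would more likely be pushed into the query itself, so that disallowed entries never leave the store.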

Consent withdrawal:

Client revokes consent → prospective effect (no new contributions).
Historical contributions remain (they're anonymized and abstracted). Contract clause specifies this.
If client demands retroactive removal: knowledge entries traceable via provenance chain → can be tombstoned. This is operationally expensive but architecturally possible.
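Retroactive removal via the provenance chain could be sketched as below. How the chain is stored is not specified in the source. Here it is assumed to be a tuple of originating tenant IDs kept alongside each shared entry, and `SharedEntry`, `origin_tenants`, and `tombstone_contributions` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SharedEntry:
    entry_id: str
    text: str
    origin_tenants: Tuple[str, ...]  # assumption: provenance chain as tenant IDs
    tombstoned: bool = False


def tombstone_contributions(store: Dict[str, SharedEntry], tenant_id: str) -> List[str]:
    """Mark every live entry derived from tenant_id as tombstoned.

    Entries are flagged rather than deleted, so the provenance record
    survives for audit while retrieval can skip tombstoned entries.
    Returns the IDs of the entries that were newly tombstoned.
    """
    affected = []
    for entry in store.values():
        if tenant_id in entry.origin_tenants and not entry.tombstoned:
            entry.tombstoned = True
            affected.append(entry.entry_id)
    return affected
```

The operational cost mentioned above comes from what this sketch omits: entries distilled from several clients, downstream entries derived from a tombstoned one, and caches or embeddings that already include the removed text.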

Tradeoffs:

Pro	Con
Legal risk only for consenting clients	More complex knowledge access layer
Non-consenting clients fully isolated	Contributor pool may be small initially
Flywheel works (slower, subset contributes)	Anonymization pipeline still required
Incentive alignment via pricing	Per-entry provenance adds storage/query overhead
Speedrun's own V0 always feeds vertical knowledge	Consent withdrawal handling is complex
Architecturally supports all options — can change tier defaults later

5. Recommendation

Option D (Tiered Consent Model) is the chosen approach for Kaze.

Why:

Default-safe: Out of the box, every client is strictly isolated. Cross-pollination risk is zero until a client explicitly opts in.
Preserves the flywheel: Even a small contributor pool generates compounding knowledge. Speedrun's own V0 ops always contribute.
Legally grounded: Consent-based, per GDPR Art. 6(1)(a). Trade secret claims are countered by explicit contractual permission.
Architecturally future-proof: The provenance classification system supports changing the model later — if legal landscape shifts, tighten to Option A or C without rebuilding.
Business model alignment: Contributor tier can be a pricing lever, not just a legal mechanism.

Critical dependencies:

Legal counsel must draft the Knowledge Contribution Addendum before any client opts into contributor tier.
Anonymization pipeline quality must be validated before going live (minimum viable anonymity threshold defined per vertical).
Provenance classification is an MVP requirement for the knowledge system — it must be built from day one, even if the contributor tier launches later.

6. Impact on Architecture

Knowledge System Changes

Every knowledge entry gains a `source_class` field (from the 5 classes above).
ABAC rules expanded: source class + consent tier → visibility.
Write pipeline gains provenance tagging step (mandatory, not optional).
Anonymization pipeline component added to the write pipeline (for `client_contributed` class).

Product Strategy Changes

The flywheel description must clarify that knowledge compounding is tiered, not universal.
Contributor tier becomes a product/pricing decision, not just a technical one.

Threat Model Addition

New attack surface: consent bypass (agent writes to contributed tier without valid consent).
Mitigation: consent status is a platform-level flag on the tenant, not an agent-level parameter. Agents don't choose — the platform enforces.
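The mitigation can be illustrated with a write-side check in which the agent never supplies a consent value. The pipeline looks it up from a tenant registry the platform controls. `Tenant`, `WritePipeline`, `ConsentError`, and `resolve_source_class` are hypothetical names for this sketch, and the registry is assumed to be a plain in-memory mapping.

```python
from dataclasses import dataclass
from typing import Dict


class ConsentError(PermissionError):
    """Raised when a shared write is attempted for a tenant without consent."""


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    contributor_consent: bool  # platform-level flag, not settable by agents


class WritePipeline:
    def __init__(self, tenants: Dict[str, Tenant]) -> None:
        self._tenants = tenants

    def resolve_source_class(self, tenant_id: str, wants_shared: bool) -> str:
        """Return the source class for a write, enforcing tenant consent."""
        tenant = self._tenants[tenant_id]
        if wants_shared and not tenant.contributor_consent:
            raise ConsentError(f"tenant {tenant_id} has not opted in")
        return "client_contributed" if wants_shared else "client_private"
```

Because the method signature has no consent parameter, a compromised or misbehaving agent has nothing to forge. The remaining attack surface is the tenant registry itself and whatever binds `tenant_id` to the agent's session.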

7. Open Questions

#	Question	Impact
DR1	What is the minimum number of clients per vertical segment for anonymization to be legally defensible?	High — determines when contributor tier can launch
DR2	Can the Knowledge Contribution Addendum be a standard clause or does it need per-jurisdiction variants?	Medium — affects go-to-market speed
DR3	Should contributor-tier clients see each other's individual contributions, or only the aggregated pool?	Medium — affects knowledge system design
DR4	Does Speedrun need a Data Protection Officer (DPO) under GDPR if processing client data for knowledge enrichment?	Medium — regulatory compliance
DR5	How do we handle a client who contributes, then becomes a direct competitor to another contributor?	High — conflict of interest, potential litigation
