AI x Crypto

How do I prove a model was trained on consented data without revealing the dataset?

Opportunity

Decentralized AI networks let anyone contribute compute or data to train a shared model, but a downstream user or regulator has no way to verify that the training corpus excluded poisoned, stolen, or unconsented data unless the network reveals what it trained on. Data provenance today is either a signed manifest that contributors self-attest or a centralized audit that defeats the purpose of decentralization. A February 2025 paper on activation inversion attacks showed that training data can be partially reconstructed from gradient signals exchanged during federated training, so any provenance scheme that requires sharing gradients also leaks data. The 2025 OWASP Top 10 for LLM Applications explicitly lists supply-chain data poisoning as a category with no standardized mitigation for open, decentralized training runs.
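To make the gap concrete, here is a minimal sketch of one building block, not a solution: a salted Merkle commitment over training records. A contributor publishes only the root, and can later show an auditor that one specific record was in the committed set without revealing the others. Everything here is an assumption for illustration (the function names, the SHA-256 choice, the per-record salt, the odd-level padding rule). It does not prove that a record was consented, and it does not prove that the model was actually trained on the committed set; those are exactly the missing pieces.

```python
import hashlib
import os


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(record: bytes, salt: bytes) -> bytes:
    # 0x00 prefix separates leaves from inner nodes; the salt keeps
    # low-entropy records from being guessed from their hash.
    return h(b"\x00" + salt + record)


def node_hash(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)


def build_levels(leaves):
    """Return every tree level, from the leaves up to the single root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            # Simplification: pad odd levels by repeating the last node.
            # A production design would use a padding rule without this
            # duplication, which is known to allow ambiguous trees.
            cur = cur + [cur[-1]]
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels, index):
    """Sibling path for the leaf at `index`: (hash, sibling_is_left) pairs."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(root, leaf, proof):
    """Recompute the root from one leaf and its sibling path."""
    acc = leaf
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


if __name__ == "__main__":
    # Hypothetical records; in practice these would be data items or consent receipts.
    records = [b"record-a", b"record-b", b"record-c"]
    salts = [os.urandom(16) for _ in records]
    leaves = [leaf_hash(r, s) for r, s in zip(records, salts)]
    levels = build_levels(leaves)
    root = levels[-1][0]  # the only value published up front
    # Later: disclose record 2, its salt, and its path; nothing else.
    print(verify(root, leaf_hash(records[2], salts[2]), prove(levels, 2)))
```

The auditor learns one record and a handful of hashes, which is better than a self-attested manifest only in that the contributor cannot change the committed set after the fact. Linking the root to the weights that were actually produced is the part that still has no standard answer.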

Why it matters

Without verifiable data provenance, every model trained on a public decentralized network is a liability for any downstream application facing regulatory or copyright scrutiny.

How I score the opportunity

The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.

Severity: 8/10

How much pain it causes when it shows up.

Frequency: 7/10

How often people actually run into it.

Whitespace: 9/10

How little good tooling exists for it today.
