How do I prove a model was trained on consented data without revealing the dataset?
Opportunity
Decentralized AI networks let anyone contribute compute or data to train a shared model. What they lack is a mechanism by which a downstream user or regulator can verify that the training corpus excluded poisoned, stolen, or unconsented data without the network revealing what it trained on. Data provenance today is either a signed manifest in which contributors self-attest, or a centralized audit that defeats the purpose of decentralization. A February 2025 paper on activation inversion attacks showed that training data can be partially reconstructed from gradient signals exchanged during federated training, which means any provenance scheme that requires sharing gradients risks leaking the data it is meant to protect. The OWASP Top 10 for LLM Applications (2025) lists supply chain vulnerabilities (LLM03) and data and model poisoning (LLM04) as separate risk categories, but offers no standardized mitigation for open, decentralized training runs.
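To make the "signed manifest" option concrete, here is a minimal sketch of one building block a contributor could use: a salted Merkle commitment over their records. It is my own illustration, not a scheme from any particular network, and every name in it (`leaf_hash`, `build_levels`, `prove`, `verify`) is hypothetical. It only shows how a contributor can publish a single root hash instead of the data, and later open one record to an auditor. It does not prove that consent exists, and it does not prove the model was trained only on committed records, which is the hard part of the problem.

```python
import hashlib
import os


def leaf_hash(record: bytes, salt: bytes) -> bytes:
    # Salted so that short or guessable records cannot be brute-forced
    # from the published hashes. The 0x00 prefix separates leaves from nodes.
    return hashlib.sha256(b"\x00" + salt + record).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves):
    """Return every tree level, from the leaves up to the single root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            # Toy padding rule (assumption): duplicate the last node.
            cur = cur + [cur[-1]]
        levels.append(
            [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def prove(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(root, leaf, proof):
    """Recompute the root from a leaf and its proof, then compare."""
    acc = leaf
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


if __name__ == "__main__":
    records = [b"record-a", b"record-b", b"record-c"]  # placeholder data
    salts = [os.urandom(16) for _ in records]
    leaves = [leaf_hash(r, s) for r, s in zip(records, salts)]
    levels = build_levels(leaves)
    root = levels[-1][0]  # the only value the contributor publishes
    print(verify(root, leaves[2], prove(levels, 2)))
```

The contributor would sign and publish `root`, keep the records and salts private, and reveal a single record with its salt and proof only when an auditor asks about it. The commitment is still self-attested, so on its own it inherits the weakness described above.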
Why it matters
Without verifiable data provenance, every model trained on a public decentralized network is a liability for any downstream application facing regulatory or copyright scrutiny.
How I score the opportunity
The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.
How much pain it causes when it shows up.
How often people actually run into it.
How little good tooling exists for it today.
More problems worth solving
What does an AI agent's bank account actually look like?
AI x Crypto: Can an on-chain organization run by agents avoid becoming a scam machine?
AI x Crypto: How do you prove a photo or a voice is real without a platform vouching for it?
AI x Crypto: Why is on-chain identity either nothing or your entire life?
AI x Crypto: How do I audit which agent acted under my identity across a delegation chain?
AI x Crypto: How do I verify that an AI agent holding my funds is actually solvent?