Why do we test models on benchmarks but ship them on vibes?
Opportunity
Teams pick a model off a leaderboard, then run it in production with almost no continuous, cheap, task-specific evaluation. When quality drifts, nobody notices until a user complains. Most builders have no tooling that tells them whether their AI feature is still good.
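To make "continuous, cheap, task-specific evaluation" concrete, here is a minimal sketch of what such a check could look like. Everything in it is an assumption for illustration: the `EvalCase` type, the substring pass criterion, the 0.05 tolerance, and the `fake_model` stand-in are hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # hypothetical pass criterion: substring expected in the output


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        1 for c in cases if c.must_contain.lower() in model(c.prompt).lower()
    )
    return passed / len(cases)


def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when the pass rate fell more than `tolerance` below the baseline."""
    return current < baseline - tolerance


# Stand-in for a real model call; in practice this would wrap your own feature.
def fake_model(prompt: str) -> str:
    canned = {
        "Refund policy?": "Refunds are available within 30 days.",
        "Support hours?": "We are open 9 to 5 on weekdays.",
    }
    return canned.get(prompt, "I am not sure.")


# A handful of cases taken from the task the feature actually serves.
CASES = [
    EvalCase("Refund policy?", "30 days"),
    EvalCase("Support hours?", "weekdays"),
    EvalCase("Shipping cost?", "free"),
]

if __name__ == "__main__":
    rate = pass_rate(fake_model, CASES)
    # The baseline would be the pass rate recorded when the feature shipped.
    if has_drifted(rate, baseline=0.9):
        print(f"quality drift: pass rate {rate:.2f} is below baseline 0.90")
```

Run on a schedule against a few dozen cases drawn from real usage, a check like this is cheap enough to catch drift before a user does. A substring match is the crudest possible grader and is used here only to keep the sketch self-contained.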
Why it matters
You cannot operate what you cannot measure, and right now most AI features are unmeasured.
How I score the opportunity
The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.
How much pain it causes when it shows up.
How often people actually run into it.
How little good tooling exists for it today.
More problems worth solving
Why does every AI app forget me the moment I close the tab?
Why is learning a new field still gated by knowing what to ask?
Why can a non-expert not verify what an AI just told them?
Why do AI agents have no memory of their own mistakes?
Why can't I audit what a model was actually trained on?
Why can a poisoned document silently exfiltrate everything my assistant knows about me?