Why can I not trust a model's confidence score when it matters most?
Opportunity
Modern language models routinely output high-confidence tokens on wrong answers and low-confidence tokens on correct ones. The gap between stated probability and actual accuracy, called calibration error, has been documented across frontier models in a 2025 survey covering entropy, logit, and perturbation based methods. Production agents that use these scores to decide when to defer or abstain inherit the miscalibration directly, so they either hallucinate forward with false certainty or refuse correct answers unnecessarily. No off-the-shelf primitive gives a calibrated, actionable uncertainty signal cheap enough to run at inference time on every output token in a streaming response.
Why it matters
Calibration is the trust primitive under every agentic decision, and without it every downstream safety threshold rests on sand.
How I score the opportunity
The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.
How much pain it causes when it shows up.
How often people actually run into it.
How little good tooling exists for it today.
More problems worth solving
Why does every AI app forget me the moment I close the tab?
AIWhy is learning a new field still gated by knowing what to ask?
AIWhy can a non-expert not verify what an AI just told them?
AIWhy do we test models on benchmarks but ship them on vibes?
AIWhy do AI agents have no memory of their own mistakes?
AIWhy can't I audit what a model was actually trained on?