AI
Why do we evaluate models with benchmarks, yet deploy them on gut feeling?
81
Opportunity
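Teams pick models from leaderboards, then run them in production with almost no continuous, low-cost, task-specific evaluation. When quality drifts, nobody notices until users complain. For most developers, tooling that actually measures whether an AI feature still works simply does not exist.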
Why it matters
You cannot effectively operate what you cannot measure, and most AI features today lack effective measurement.
How I score opportunities
The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.
Severity 7/10
How much pain it causes when it shows up.
Frequency 8/10
How often people actually run into it.
White space 8/10
How little good tooling exists for it today.