Why do we evaluate models with benchmarks, yet ship them on vibes?

Opportunity

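Teams pick models off leaderboards, then run them in production with almost no continuous, low-cost, task-specific evaluation. When quality drifts, nobody notices until users complain. For most developers, tooling that measures whether an AI feature still works simply does not exist.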

Why it matters

You cannot operate what you cannot measure, and most AI features today go effectively unmeasured.

How I score the opportunity

The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.

Severity: 7/10

How much pain it causes when it shows up.

Frequency: 8/10

How often people actually run into it.

White space: 8/10

How little good tooling exists for it today.
