AI

Why do we evaluate models with benchmarks, then ship them on vibes?

81

Opportunity

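Teams pick models off leaderboards and then run them in production with almost no continuous, low-cost, task-specific evaluation. When quality drifts, nobody notices until users complain. For most developers, tooling that actually measures whether an AI feature still works simply does not exist.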
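To make "continuous, low-cost, task-specific evaluation" concrete, here is a minimal sketch of what such a check could look like. Everything in it is an assumption for illustration: the `GOLDEN_SET` cases, the `pass_rate` and `has_drifted` helpers, the 0.05 tolerance, and the two stand-in models are hypothetical, not an existing tool or anyone's production setup.

```python
import re
from statistics import mean
from typing import Callable

# Hypothetical golden set for one narrow task: (prompt, expected substring).
GOLDEN_SET = [
    ("Extract the year: 'Founded in 1998 in Menlo Park.'", "1998"),
    ("Extract the year: 'The 2012 release added sync.'", "2012"),
    ("Extract the year: 'Shipped in late 2020.'", "2020"),
]

def pass_rate(model: Callable[[str], str], cases=GOLDEN_SET) -> float:
    """Fraction of cases whose output contains the expected answer."""
    return mean(1.0 if expected in model(prompt) else 0.0
                for prompt, expected in cases)

def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when the pass rate fell more than `tolerance` below the baseline."""
    return current < baseline - tolerance

# Stand-ins for a real model call (assumption: a model is a str -> str function).
def good_model(prompt: str) -> str:
    match = re.search(r"\d{4}", prompt)
    return match.group(0) if match else ""

def degraded_model(prompt: str) -> str:
    return "unknown"

baseline = pass_rate(good_model)      # recorded once, when the feature ships
current = pass_rate(degraded_model)   # re-run on a schedule or on each deploy
if has_drifted(current, baseline):
    print(f"Quality drift: pass rate {current:.2f} vs baseline {baseline:.2f}")
```

In practice the golden set would come from real task inputs and the check would run on a schedule, so a drop shows up before users complain.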

Why it matters

What you cannot measure, you cannot operate effectively, and most AI features today lack any meaningful measurement.

How I score the opportunity

The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.

Severity: 7/10

How much pain it causes when it shows up.

Frequency: 8/10

How often people actually run into it.

White space: 8/10

How little good tooling exists for it today.
