Machine Learning Yearning, Chapter 11

When to change dev/test sets and metrics 

When starting out on a new project, I try to quickly choose dev/test sets, since this gives the team a well-defined target to aim for.

I typically ask my teams to come up with an initial dev/test set and an initial metric in less than one week—almost never longer. It is better to come up with something imperfect and get going quickly, rather than overthink this. But this one-week timeline does not apply to mature applications. For example, anti-spam is a mature deep learning application. I have seen teams working on already-mature systems spend months to acquire even better dev/test sets.

If you later realize that your initial dev/test set or metric missed the mark, by all means change them quickly. For example, if your dev set + metric ranks classifier A above classifier B, but your team thinks that classifier B is actually superior for your product, then this might be a sign that you need to change your dev/test sets or your evaluation metric.

There are three main possible causes of the dev set/metric incorrectly rating classifier A higher:

1. The actual distribution you need to do well on is different from the dev/test sets.

Suppose your initial dev/test set had mainly pictures of adult cats. You ship your cat app, and find that users are uploading a lot more kitten images than expected. So, the dev/test set distribution is not representative of the actual distribution you need to do well on. In this case, update your dev/test sets to be more representative.
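
To make this concrete, here is a minimal sketch of rebuilding a dev set so its category mix matches what you observe in production. The `examples` pool, the category names ("kitten", "adult_cat"), and the target fractions are hypothetical placeholders, not anything prescribed by the text.

```python
import random

def build_dev_set(examples, target_fractions, size, seed=0):
    """Subsample labeled examples so category proportions roughly match
    `target_fractions` (e.g. estimated from recent user uploads).

    examples: list of (image_path, category) pairs -- hypothetical format
    target_fractions: e.g. {"kitten": 0.6, "adult_cat": 0.4}
    size: desired number of dev set examples
    """
    rng = random.Random(seed)

    # Group the candidate pool by category.
    by_category = {}
    for path, category in examples:
        by_category.setdefault(category, []).append((path, category))

    dev_set = []
    for category, fraction in target_fractions.items():
        pool = by_category.get(category, [])
        n = min(int(round(fraction * size)), len(pool))
        dev_set.extend(rng.sample(pool, n))

    rng.shuffle(dev_set)
    return dev_set
```

With `target_fractions` estimated from, say, the last month of uploads, the resulting dev set rewards the classifiers that do well on the images your users actually send.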

2. You have overfit to the dev set.

The process of repeatedly evaluating ideas on the dev set causes your algorithm to gradually “overfit” to the dev set. When you are done developing, you will evaluate your system on the test set. If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfit to the dev set. In this case, get a fresh dev set.
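
As a rough illustration, the warning sign described above can be checked directly. The sketch assumes a `model.predict(x)` interface and labeled `(x, y)` pairs, and the 2% gap threshold is an arbitrary example, not a rule from the text.

```python
def accuracy(model, examples):
    """Fraction of labeled (x, y) examples the model classifies correctly."""
    correct = sum(1 for x, y in examples if model.predict(x) == y)
    return correct / len(examples)

def check_dev_overfitting(model, dev_set, test_set, max_gap=0.02):
    """Warn if dev performance beats test performance by more than `max_gap`."""
    dev_acc = accuracy(model, dev_set)
    test_acc = accuracy(model, test_set)
    if dev_acc - test_acc > max_gap:
        print(f"dev accuracy {dev_acc:.3f} vs test accuracy {test_acc:.3f}: "
              "possible overfitting to the dev set; consider a fresh dev set.")
    return dev_acc, test_acc
```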

If you need to track your team’s progress, you can also evaluate your system regularly—say once per week or once per month—on the test set. But do not use the test set to make any decisions regarding the algorithm, including whether to roll back to the previous week’s system. If you do so, you will start to overfit to the test set, and can no longer count on it to give a completely unbiased estimate of your system’s performance (which you would need if you’re publishing research papers, or perhaps using this metric to make important business decisions).

3. The metric is measuring something other than what the project needs to optimize.

Suppose that for your cat application, your metric is classification accuracy. This metric currently ranks classifier A as superior to classifier B. But suppose you try out both algorithms, and find that classifier A is allowing occasional pornographic images to slip through. Even though classifier A is more accurate, the bad impression left by the occasional pornographic image means its performance is unacceptable. What do you do?

Here, the metric is failing to identify the fact that classifier B is in fact better than classifier A for your product. So, you can no longer trust the metric to pick the best algorithm. It is time to change evaluation metrics. For example, you can change the metric to heavily penalize letting through pornographic images. I would strongly recommend picking a new metric and using the new metric to explicitly define a new goal for the team, rather than proceeding for too long without a trusted metric and reverting to manually choosing among classifiers.
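
One common way to encode such a penalty is a weighted error, where a pornographic image counts far more than an ordinary mistake. The sketch below is only an illustration of that idea: the `is_porn` annotation, the label names, and the weight of 100 are assumptions for the example, not values given in the text.

```python
def weighted_error(examples, predictions, porn_weight=100.0):
    """Weighted misclassification error for the cat app.

    examples: list of (label, is_porn) pairs, where label is "cat" or
        "not_cat" and is_porn flags pornographic images (hypothetical
        annotations added to the dev/test sets)
    predictions: predicted labels, aligned with `examples`
    porn_weight: how much more a pornographic image counts than a normal
        one -- an illustrative value, not a recommendation
    """
    total_weight = 0.0
    weighted_mistakes = 0.0
    for (label, is_porn), pred in zip(examples, predictions):
        w = porn_weight if is_porn else 1.0
        total_weight += w
        if pred != label:
            weighted_mistakes += w
    return weighted_mistakes / total_weight
```

Under a metric like this, a classifier that lets even a few pornographic images through can rank below a slightly less accurate classifier that never does, which matches the judgment your team is already making by hand.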

It is quite common to change dev/test sets or evaluation metrics during a project. Having an initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets or metric are no longer pointing your team in the right direction, it’s not a big deal! Just change them and make sure your team knows about the new direction.