Machine Learning Yearning, Chapter 6

Your dev and test sets should come from the same distribution

Suppose you have your cat app image data segmented into four regions, based on your largest markets: (i) US, (ii) China, (iii) India, and (iv) Other. To come up with a dev set and a test set, can we randomly assign two of these segments to the dev set and the other two to the test set? Say, US and India in the dev set; China and Other in the test set. The answer is an emphatic no.

Once you define the dev and test sets, your team will focus on improving dev set performance. Thus, the dev set should reflect the task you most want to improve on: doing well on all four geographies, not only two.
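One way to get dev and test sets that both cover all four geographies is to split each region separately and then pool the pieces. The sketch below is only an illustration under stated assumptions: the function name `split_dev_test`, the 50/50 split, and the toy file names are hypothetical and not part of the original text.

```python
import random
from collections import defaultdict


def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Split (region, item) pairs so dev and test share the same region mix."""
    rng = random.Random(seed)

    # Group the examples by region so each region is split on its own.
    by_region = defaultdict(list)
    for region, item in examples:
        by_region[region].append((region, item))

    dev, test = [], []
    for region in sorted(by_region):
        group = by_region[region]
        rng.shuffle(group)  # random assignment within the region
        cut = int(len(group) * dev_fraction)
        dev.extend(group[:cut])
        test.extend(group[cut:])
    return dev, test


# Hypothetical toy data: image counts per region are made up for illustration.
examples = [
    (region, f"{region}_{i}.jpg")
    for region, n in [("US", 40), ("China", 30), ("India", 20), ("Other", 10)]
    for i in range(n)
]
dev, test = split_dev_test(examples)
```

Because every region contributes the same fraction of its images to each set, a result measured on the dev set speaks to the same mix of markets as the test set.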

There is a second problem with having different dev and test set distributions: there is a chance that your team will build something that works well on the dev set, only to find that it does poorly on the test set. I’ve seen this result in much frustration and wasted effort. Avoid letting this happen to you.

As an example, suppose your team develops a system that works well on the dev set but not the test set. If your dev and test sets had come from the same distribution, then you would have a very clear diagnosis of what went wrong: you have overfit the dev set. The obvious cure is to get more dev set data.
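When the two sets do share a distribution, a rough way to check for this kind of overfitting is to compare the dev and test error rates against the sampling noise you would expect from sets of that size. The sketch below is an assumption-laden illustration, not a method from the original text: the function name `dev_test_gap`, the error counts, and the two-standard-error threshold are all hypothetical choices, and the formula treats the two error rates as independent binomial estimates, which is only approximately true after tuning on the dev set.

```python
import math


def dev_test_gap(dev_errors, dev_size, test_errors, test_size):
    """Return (test error rate - dev error rate, standard error of that gap)."""
    dev_rate = dev_errors / dev_size
    test_rate = test_errors / test_size
    gap = test_rate - dev_rate
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(
        dev_rate * (1 - dev_rate) / dev_size
        + test_rate * (1 - test_rate) / test_size
    )
    return gap, se


# Hypothetical counts: 50 errors on 1,000 dev images, 90 on 1,000 test images.
gap, se = dev_test_gap(50, 1000, 90, 1000)

# Assumed rule of thumb: a gap above two standard errors is unlikely to be noise.
likely_overfit_dev = gap > 2 * se
```

If the gap is well within the noise, the dev set is probably still a trustworthy guide. If it is not, getting more dev set data is the cure suggested above.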

But if the dev and test sets come from different distributions, then your options are less clear. Several things could have gone wrong:

  1. You have overfit to the dev set.
  2. The test set is harder than the dev set. So your algorithm might be doing as well as could be expected, and no further significant improvement is possible.
  3. The test set is not necessarily harder, but just different, from the dev set. So what works well on the dev set just does not work well on the test set. In this case, a lot of your work to improve dev set performance might be wasted effort.

Working on machine learning applications is hard enough. Mismatched dev and test sets introduce additional uncertainty about whether improving on the dev set distribution also improves test set performance. They also make it harder to figure out what is and isn’t working, and thus harder to prioritize what to work on.

If you are working on a 3rd-party benchmark problem, its creator might have specified dev and test sets that come from different distributions. Luck, rather than skill, will have a greater impact on your performance on such benchmarks than it would if the dev and test sets came from the same distribution. It is an important research problem to develop learning algorithms that are trained on one distribution and generalize well to another. But if your goal is to make progress on a specific machine learning application rather than make research progress, I recommend trying to choose dev and test sets that are drawn from the same distribution. This will make your team more efficient.
