How large do the dev/test sets need to be?

The dev set should be large enough to detect differences between the algorithms you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples cannot detect this 0.1% difference, because a single example already accounts for 1% of the set. Compared to other machine learning problems I’ve seen, a 100-example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%.[1]
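The arithmetic behind this can be sketched in a few lines. This is only an illustration of the resolution argument, not a statistical test; the function names `smallest_accuracy_step` and `net_examples_for_gain` are made up for this example.

```python
def smallest_accuracy_step(dev_size: int) -> float:
    """Smallest nonzero change in accuracy a dev set can show (one example)."""
    return 1.0 / dev_size


def net_examples_for_gain(dev_size: int, accuracy_gain: float) -> float:
    """Net number of extra correct examples that a given accuracy gain implies."""
    return dev_size * accuracy_gain


if __name__ == "__main__":
    gain = 0.001  # the 0.1% gap between classifier A and classifier B
    for dev_size in (100, 1_000, 10_000):
        step = smallest_accuracy_step(dev_size)
        net = net_examples_for_gain(dev_size, gain)
        print(f"dev_size={dev_size}: one example = {step:.4%}, "
              f"a 0.1% gain = {net:.1f} examples")
```

With 100 examples, a 0.1% gain corresponds to a tenth of one example, so it cannot show up in the measured accuracy at all. With 10,000 examples it corresponds to a net of about ten examples.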

For mature and important applications (for example, advertising, web search, and product recommendations), I have also seen teams that are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company’s profits. In this case, the dev set could be much larger than 10,000 examples, in order to detect even smaller improvements.

How about the size of the test set? It should be large enough to give high confidence in the overall performance of your system. One popular heuristic has been to use 30% of your data for the test set. This works well when you have a modest number of examples, say 100 to 10,000. But in the era of big data, where machine learning problems sometimes have more than a billion examples, the fraction of data allocated to dev/test sets has been shrinking, even as the absolute number of examples in the dev/test sets has been growing. There is no need for dev/test sets that are larger than what is needed to evaluate the performance of your algorithms.
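One rough way to see why the test set does not need to keep growing is to look at how the uncertainty of a measured accuracy shrinks with test-set size. The sketch below is an illustration, not something taken from the original text. It assumes the test examples are independent draws from the target distribution and uses the normal approximation to the binomial; `accuracy_ci_half_width` is a hypothetical helper name.

```python
import math


def accuracy_ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval (z = 1.96)
    for an accuracy measured on n independent test examples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    measured = 0.90  # assumed accuracy, for illustration only
    for n in (1_000, 10_000, 100_000, 1_000_000):
        half_width = accuracy_ci_half_width(measured, n)
        print(f"n={n:>9,}: {measured:.1%} +/- {half_width:.3%}")
```

Under these assumptions the interval narrows only with the square root of the test-set size: 100 times more examples buys a 10 times narrower interval. Once the interval is tight enough for the decisions you need to make, extra test examples add little. This estimate concerns the absolute accuracy of a single system; it is not a comparison of two models scored on the same examples.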

My ability is limited; if you find any errors, corrections and advice are welcome.

Translator: wexin_42141390. Email: [email protected]

[1] In theory, one could also test whether a change to an algorithm makes a statistically significant difference on the dev set. In practice, most teams don’t bother with this (unless they are publishing academic research papers), and I usually do not find statistical significance tests useful for measuring interim progress.
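For readers who do want such a test, one common choice for comparing two classifiers on the same dev set is an exact McNemar test, which looks only at the examples where the two classifiers disagree. The text above does not name a specific test, so this is an assumption of the sketch; the function name and the counts in the usage example are hypothetical.

```python
from math import comb


def mcnemar_exact_p(only_a_correct: int, only_b_correct: int) -> float:
    """Two-sided exact McNemar p-value.

    only_a_correct: dev examples classifier A gets right and B gets wrong.
    only_b_correct: dev examples classifier B gets right and A gets wrong.
    Under the null hypothesis, each disagreement favors A or B with
    probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence
    k = min(only_a_correct, only_b_correct)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


if __name__ == "__main__":
    # Hypothetical counts: B wins 50 disagreements and A wins 40, a net of
    # 10 examples, which is a 0.1% accuracy gain on a 10,000-example dev set.
    print(mcnemar_exact_p(only_a_correct=40, only_b_correct=50))
```

The result depends on how often the two classifiers disagree, not only on the size of the dev set, which is one reason a single accuracy gap does not map to a fixed p-value.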

Machine Learning Yearning, Chapter 7
