Your development and test sets

開發與測試集

讓我們會到喵喵圖片的案例：你運營者一個手機app，用戶向你的app上上傳各種圖片，但是你要讓機器自動找到喵咪的圖片。

Lets return to our earlier cat pictures example: You run a mobile app, and users are uploading pictures of many different things to your app. You want to automatically find the cat pictures.

你的團隊從不同網頁中下載了大量的圖片作爲訓練集，包括有喵咪的圖片（陽性樣本）和沒有喵咪的圖片（陰性樣本），並且其中的70%用於當訓練集和30%用來測試。使用這些數據，你的隊友建立了一個在測試集上表現良好，算得上經得住”測試“的分類器。但當你將這個分類器部署到你的手機app上，你嚇了一跳，因爲它在實際工作中的表現實在太差了！

Your team gets a large training set by downloading pictures of cats (positive examples) and non-cats (negative examples) off different websites. They split the dataset 70%/30% into training and test sets. Using this data, they build a cat detector that works well on the training and test sets.

But when you deploy this classifier into the mobile app, you find that the performance is really poor!

那麼...發生了什麼呢？

What happened?

你發現用戶上傳的圖片跟你在網上下載的並作爲訓練集的圖片有所差別：用戶上傳的圖片通常是用手機拍攝的，所以分辨率較低、相對模糊並且缺少理想照明。因爲你的訓練集大多是容網上搜來的，所以你的算法並不能很好的解決你實際關心的問題，即識別手機圖片。

You figure out that the pictures users are uploading have a different look than the website images that make up your training set: Users are uploading pictures taken with mobile phones, which tend to be lower resolution, blurrier, and have less ideal lighting. Since your training/test sets were made of website images, your algorithm did not generalize well to the actual distribution you care about of smartphone pictures.

在大數據之前，機器學習隨機地按70%和30%的比例分配訓練集和測試集是業內的一個通用法則。雖然從網上收集的這些圖片能發揮效用，但是在越來越多的應用中，這樣的做法，即選取訓練集與實際情景有偏頗的做法（例如上例中的網頁圖片），最終可能導致無法“命中”目標（上例中的手機圖片）。因此是不可取的。

Before the modern era of big data, it was a common rule in machine learning to use a random 70%/30% split to form your training and test sets. This practice can work, but is a bad idea in more and more applications where the training distribution (website images in our example above) is different from the distribution you ultimately care about (mobile phone images).

我們通常定義[1]：

訓練集——用來訓練你的學習算法
開發集——用來對訓練集訓練出來的模型進行測試，通過測試結果來不斷地優化模型。也叫交叉訓練集
測試集——在訓練結束後，對訓練出的模型進行一次最終的評估所用的數據集。

We usually define:

Training set — Which you run your learning algorithm on.
Dev (development) set — Which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the holdout cross validation set.
Test set — which you use to evaluate the performance of the algorithm, but not to make any decisions about regarding what learning algorithm or parameters to use.

假設你確定開發集和測試集，你的團隊有許多點子，例如學習算法的參數該怎麼設置。爲了測試那種設置結果最優，你可以使用你的開發集和測試集來考察，並且能夠讓你的團隊知道你當前的學習算法怎麼樣。

One you define a dev set (development set) and test set, your team will try a lot of ideas, such as different learning algorithm parameters, to see what works best. The dev and test sets allow your team to quickly see how well your algorithm is doing.

換句話說，開發集和測試集能夠指引你們的團隊，在機器學習系統上，做出重大決策。

In other words, the purpose of the dev and test sets are to direct your team toward the most important changes to make to the machine learning system.

所以，現在你應該：

選擇一個適合的開發集和數據集，以回饋你未來所期望得到的數據和你想做得更好的方面的數據。

So, you should do the following:

Choose dev and test sets to reflect data you expect to get in the future and want to do well on.

換而言之，你的測試集不應該僅僅是30%的可用數據，開發/測試集應該與訓練集來自不同的一份，特別是當你期望的未來的數據（手機app圖片）與訓練集中的數據（網絡圖片）具有本質差別的時候。

In order words, your test set should not simply be 30% of the available data, especially if you expect your future data (mobile app images) to be different in nature from your training set (website images).

假如你還沒發佈你的手機軟件，你還沒用戶呢，因此沒辦法得到能夠真實反映你模型效率的數據。但是你仍舊可以嘗試模擬它，比如你可以叫你的朋友用手機拍一些圖片然後發送給你，你再用這些數據測試你的app也未嘗不可。一旦你的app發佈之後，你就可以用你的用戶數據更新你的開發集或測試集。

If you have not yet launched your mobile app, you might not have any users yet, and thus might not be able to get data that accurately reflects what you have to do well on in the future. But you might still try to approximate this. For example, ask your friends to take mobile phone pictures and send them to you. Once your app is launched, you can update your dev/test sets using actual user data.

如果你真的毫無辦法模擬實際數據，那或許你可以使用網絡圖片，但是你應該注意，這可能會導致系統功能不高的後果。

If you really don’t have any way of getting data that approximates what you expect to get in the future, perhaps you can start by using website images. But you should be aware of the risk of this leading to a system that doesn’t generalize well.

花多少時間在開發開發集和測試集上，需要你根據實際情況做出判斷。但千萬不要讓你的訓練集和測試集是在同一份數據上挑出來的。試着去找一些，能夠真正模擬在實際應用中你想要處理好數據的測試樣本，而不是隨便在用來訓練的數據上上挑幾個馬虎了事。

It requires judgment to decide how much to invest in developing great dev and test sets. But don’t assume your training distribution is the same as your test distribution. Try to pick test examples that reflect what you ultimately want to perform well on, rather than whatever data you happen to have for training.

1.一般需要將樣本分成獨立的三部分訓練集(train set)，驗證集(validation set)和測試集(test set)。其中訓練集用來估計模型，驗證集用來確定網絡結構或者控制模型複雜程度的參數，而測試集則檢驗最終選擇最優的模型的性能如何。

人能力有限，如有錯誤歡迎改正，希望不吝賜教。

——譯者：wexin_42141390 郵箱：[email protected]

machine learning yearning 第五章

Your development and test sets

開發與測試集

.Net 8.0 下的新RPC，IceRPC之試試的新玩法"打洞"

完美替代postman的軟件

Vue mockjs mock.js

關於遊戲付費的一點想法

我通過CKA和CKS啦！

《最新出爐》系列入門篇-Python+Playwright自動化測試-42-強大的可視化追蹤利器Trace Viewer

大數據怎麼學？對大數據開發領域及崗位的詳細解讀，完整理解大數據開發領域技術體系

雙變量的t檢驗

方差分析與單因素方差分析

配對變量t檢驗

參數檢驗之t檢驗

Dijkstra解決TSP問題

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結