Practical Machine Learning Notes 8: Feature Engineering

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Preface:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These are my personal study notes from working through Mu Li's Practical Machine Learning course (Stanford, Fall 2021, Chinese simulcast) on Bilibili. I found his explanations excellent.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Why feature engineering is needed:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, what is feature engineering? It is a technique for extracting features from a dataset so that a machine learning model trained on the processed data learns faster and reaches higher accuracy.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now, why do feature engineering at all? Before deep learning took off, traditional machine learning models were the norm, so before training, the data had to be transformed into a form the model handles well (","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"a step that was usually designed by hand","attrs":{}},{"type":"text","text":"), because machine learning algorithms tend to “prefer” fixed-length inputs and outputs. This is a crucial technique. In computer vision, for example, an image was often converted into a vector that was then used to train an SVM. Once deep learning matured, people began using neural networks to extract features (","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"which makes feature engineering simpler","attrs":{}},{"type":"text","text":"), but this did ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"not change the feature engineering process or its importance","attrs":{}},{"type":"text","text":". A neural network can also keep adjusting its parameters to extract better features; the ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"drawback","attrs":{}},{"type":"text","text":" is that it needs large amounts of data and compute.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0e/0eb667b960f49173c45d6cdacf4076f9.png","alt":null,"title":"How feature extraction has changed","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Features for tabular data:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"int/float data","attrs":{}},{"type":"text","text":": use the raw values directly. Alternatively, take the column's minimum and maximum and split that range into n equal-width intervals. Every value falls into exactly one of the n intervals, so a real number expands into a length-n vector with a 1 at the position of its interval. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"For example:","attrs":{}},{"type":"text","text":" in the housing dataset from earlier, a house priced at 1,000,000 and one priced at 1,010,000 are not very different, but if the raw values are fed in directly, the model sees two different numbers and pays attention to the gap. Binning the values tells the model to ignore that 10,000 difference, as long as both prices fall into the same interval.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Categorical data","attrs":{}},{"type":"text","text":": usually one-hot encoded. An example:","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/86/8666f7a7523e6e5071aa7fae7808df1c.png","alt":null,"title":"One-hot encoding\n","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First there is a dictionary listing the categories. If a sample is a cat, only the position for cat in its one-hot feature vector is 1 and all other positions are 0. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Also:","attrs":{}},{"type":"text","text":" taking the housing dataset again, there are normally only a dozen or so house types, yet the dataset contains hundreds or even thousands of distinct type values. In practice only the ten most frequent types matter; the rest are likely noise or unimportant and can be ignored, so during preprocessing every type outside the top ten can be set to unknown.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Datetime data","attrs":{}},{"type":"text","text":": a plain year-month-day value does not tell a weekday from a weekend, yet people do different things on weekends and weekdays. Expanding the date into the feature list below makes these patterns easier for a machine learning algorithm to learn.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":6,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[year, month, day, day_of_year, week_of_year, day_of_week]","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Feature combination","attrs":{}},{"type":"text","text":": lets a machine learning algorithm learn the relationship between pairs of features.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[cat, dog] * [male, female] ---->","attrs":{}}]},{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[(cat, male), (cat, female), (dog, male), (dog, female)] (again one-hot encoded)","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Text features:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Token features:","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Bag of words (BoW) model:","attrs":{}},{"type":"text","text":" this requires a complete dictionary. During preprocessing the text has already been split into individual words (tokens); each word is one-hot encoded against the dictionary, and the one-hot vectors of all words in the sentence are then summed. An example:","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/be/be10025708e0bc3ad61da23bde99c600.png","alt":null,"title":"BoW model\n","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Drawbacks: the dictionary needs careful design, neither too large nor too small, and the word-order information of the sentence is lost.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Word embedding: ","attrs":{}},{"type":"text","text":"first train a word embedding model","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":",","attrs":{}},{"type":"text","text":" commonly word2vec, which represents each word as a vector that carries semantic information. The larger the inner product of two word vectors, the closer the two vectors are, meaning the two words are semantically similar. This is ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"because","attrs":{}},{"type":"text","text":" word2vec is trained on the context in which each word appears. So how do we get ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"the embedding of a whole sentence","attrs":{}},{"type":"text","text":"? Look up each word in word2vec to get its vector, then sum or average the vectors.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Pretrained language models (BERT, GPT-3):","attrs":{}},{"type":"text","text":" both models have a very large number of parameters and are built on the currently most popular architecture, the Transformer. They are trained with self-supervised learning on large amounts of unlabeled data and extract very good features. The drawback is cost. They serve the same purpose as word2vec: producing a vector for each word.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Image and video features:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Traditional methods:","attrs":{}},{"type":"text","text":" features are typically hand-crafted, for example SIFT.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Current neural network methods:","attrs":{}},{"type":"text","text":" since deep learning became popular, a pretrained deep neural network is used to extract features. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"For example:","attrs":{}},{"type":"text","text":" a ResNet model trained beforehand on the ImageNet dataset. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"How do we get the extracted features?","attrs":{}},{"type":"text","text":" Feed the image into the trained model and take the output of the layer just before the classification layer (the second-to-last layer, counting from the output). That output is the extracted feature vector, and it can be reused in other tasks. Illustrated below:","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d0/d06d7f7884847db4cb0c59cd1c6dfa4a.png","alt":null,"title":"Feature extraction with a pretrained model","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
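The two tabular-data encodings described above (equal-width binning of a numeric column, and one-hot encoding that keeps only the most frequent categories) can be sketched in plain Python as follows. The function names, the price range of 0 to 500, and the sample house types are illustrative assumptions, not taken from the course.

```python
from collections import Counter


def bin_one_hot(value, lo, hi, n_bins):
    """Map a real value to a length-n_bins one-hot vector using
    equal-width bins over [lo, hi]. (Hypothetical helper.)"""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / n_bins
    idx = int((value - lo) / width)
    # Clamp so that value == hi, or anything outside the range,
    # lands in an edge bin instead of raising an IndexError.
    idx = max(0, min(n_bins - 1, idx))
    vec = [0] * n_bins
    vec[idx] = 1
    return vec


def top_k_one_hot(values, k):
    """One-hot encode categories, keeping the k most frequent ones and
    mapping every other category to a shared 'unknown' slot."""
    counts = Counter(values)
    vocab = [cat for cat, _ in counts.most_common(k)] + ["unknown"]
    index = {cat: i for i, cat in enumerate(vocab)}
    encoded = []
    for v in values:
        vec = [0] * len(vocab)
        vec[index.get(v, index["unknown"])] = 1
        encoded.append(vec)
    return vocab, encoded


if __name__ == "__main__":
    # Prices in units of 10,000: 100 and 101 fall into the same bin.
    print(bin_one_hot(100, 0, 500, 10))
    print(bin_one_hot(101, 0, 500, 10))
    types = ["house", "house", "condo", "condo", "condo", "townhouse", "barn"]
    vocab, encoded = top_k_one_hot(types, k=2)
    print(vocab, encoded[-1])
```

With bins this wide, the two prices from the example get identical vectors, which is exactly the "ignore the 10,000 gap" effect. The trade-off is that two values on either side of a bin edge look maximally different, so the number of bins is a tuning choice.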
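As a minimal sketch of the datetime feature list and the feature combination shown above, the following uses only the Python standard library. Using the ISO week number for `week_of_year` and Monday = 0 for `day_of_week` are assumptions; the notes do not say which conventions the course uses.

```python
from datetime import date
from itertools import product


def date_features(d):
    """Expand a date into
    [year, month, day, day_of_year, week_of_year, day_of_week]."""
    _, iso_week, _ = d.isocalendar()  # ISO 8601 week number (assumed convention)
    return [
        d.year,
        d.month,
        d.day,
        d.timetuple().tm_yday,  # 1..366
        iso_week,
        d.weekday(),  # Monday = 0 ... Sunday = 6
    ]


def cross_one_hot(a, b, a_vocab, b_vocab):
    """One-hot encode the pair (a, b) over the Cartesian product
    of two category vocabularies."""
    pairs = list(product(a_vocab, b_vocab))
    vec = [0] * len(pairs)
    vec[pairs.index((a, b))] = 1
    return pairs, vec


if __name__ == "__main__":
    print(date_features(date(2021, 10, 16)))  # a Saturday
    pairs, vec = cross_one_hot("dog", "male", ["cat", "dog"], ["male", "female"])
    print(pairs)
    print(vec)
```

The `day_of_week` entry is what lets a model separate weekends (5 and 6 under this convention) from weekdays. The crossed vector grows with the product of the two vocabulary sizes, so crossing high-cardinality columns gets expensive quickly.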
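The bag-of-words description above (one-hot encode each token against a dictionary, then sum the vectors) reduces to counting tokens per dictionary entry. A small sketch, with a made-up four-word dictionary and whitespace tokenization as simplifying assumptions:

```python
def bag_of_words(tokens, vocab):
    """Sum of per-token one-hot vectors: entry i counts how often
    vocab[i] occurs in tokens. Out-of-dictionary tokens are skipped."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec


if __name__ == "__main__":
    vocab = ["cat", "dog", "likes", "the"]  # hypothetical dictionary
    a = bag_of_words("the dog likes the cat".split(), vocab)
    b = bag_of_words("the cat likes the dog".split(), vocab)
    print(a)
    print(a == b)  # word order is not represented
```

The two sentences mean different things but produce the same vector, which illustrates the drawback noted above: BoW discards word order.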
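To illustrate the word-embedding step above (look up a vector per word, then average them into a sentence vector), here is a sketch that uses a tiny hand-written embedding table. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration and are not real word2vec output; a trained model would supply the table in practice. Cosine similarity, the inner product of length-normalized vectors, is used for the comparison.

```python
import math

# Hypothetical 3-d embeddings, hand-picked so that "cat" and "dog"
# point in similar directions and "car" does not.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no token has an embedding")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(u, v):
    """Inner product of u and v after normalizing both to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


if __name__ == "__main__":
    emb = TOY_EMBEDDINGS
    print(cosine_similarity(emb["cat"], emb["dog"]))  # close to 1
    print(cosine_similarity(emb["cat"], emb["car"]))  # close to 0
    print(sentence_vector(["cat", "dog"], emb))
```

Averaging keeps the sentence vector the same length regardless of sentence length, which gives downstream models the fixed-length input they expect, at the cost of discarding word order just as BoW does.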