Common Problems in Machine Learning: Feature Selection

Original

huangqihao723

2020-06-13 13:59

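The three main types of feature selection methods [1]:

1. Filter methods: select features using statistical measures, independently of any model.

2. Wrapper methods: work together with a model. Each time a feature is added or removed, check whether the model's accuracy improves; if it does, keep that change. Models therefore have to be built repeatedly to decide whether a feature should be included.

3. Embedded methods: also work together with a model, but selection happens as part of training, so the selected features are available as soon as the model has been trained.

As a result, wrapper methods have to build models over and over, which makes them comparatively expensive in compute.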
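To make the cost of a wrapper method concrete, here is a minimal sketch of greedy forward selection on the iris data. The choice of `LogisticRegression`, 5-fold cross-validation, and the helper name `cv_accuracy` are assumptions made for illustration; the references do not prescribe them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
n_evals = 0  # number of cross-validated model evaluations performed


def cv_accuracy(cols):
    """Mean 5-fold accuracy of a model trained on the given feature columns."""
    global n_evals
    n_evals += 1
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()


selected, best_score = [], 0.0
while len(selected) < n_features:
    candidates = [c for c in range(n_features) if c not in selected]
    # Try adding each remaining feature and score the resulting model.
    scores = {c: cv_accuracy(selected + [c]) for c in candidates}
    best_c = max(scores, key=scores.get)
    if scores[best_c] <= best_score:
        break  # no candidate improves accuracy, so stop
    selected.append(best_c)
    best_score = scores[best_c]

print("selected column indices:", selected)
print("cross-validated accuracy:", round(best_score, 3))
print("model evaluations needed:", n_evals)
```

Even with only four features, the loop evaluates several models, and each evaluation trains five of them; this is the overhead that filter methods avoid.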

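Some filter methods [2]:

information gain
chi-square test (chi-square statistic)
Fisher score [4]
correlation coefficient [1]: usually the Pearson, Spearman, or Kendall coefficient. Be sure to check the assumptions behind each one, i.e. the conditions under which it applies.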

Visualization:

from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

iris=load_iris()
df=pd.DataFrame(iris.data,columns=["x1","x2","x3","x4"])
# print(df.info()) # no missing values
plt.figure(figsize=(8,6))
df_corr=df.corr(method="pearson") # the method can be set here: "pearson", "spearman" or "kendall"
sb.heatmap(df_corr,annot=True)
plt.show()

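Suppose x1 is taken as y. The correlation coefficient between x2 and x1 is then -0.12, between x3 and x1 it is 0.87, and between x4 and x1 it is 0.82. x2 is only weakly correlated with x1, so it can be dropped. A common practice is to set a threshold on the absolute correlation and keep only the features above it.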

# Absolute correlation of every feature with x1 (x1 itself shows up with 1.0)
x1_corr=abs(df_corr["x1"])
print(x1_corr[x1_corr>0.5])
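The other filter scores listed above can be sketched the same way. The example below is an illustration under assumptions: `mutual_info_classif` stands in for information gain, and `fisher_score` is a hypothetical helper that implements the common ratio of between-class to within-class scatter per feature, which may differ in detail from the formulation in [4].

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target
names = ["x1", "x2", "x3", "x4"]

# Chi-square statistic between each (non-negative) feature and the class label.
chi2_scores, _ = chi2(X, y)

# Mutual information estimate between each feature and the class label.
mi_scores = mutual_info_classif(X, y, random_state=0)


def fisher_score(X, y):
    """Between-class scatter divided by within-class scatter, per feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_k = X[y == label]
        between += len(X_k) * (X_k.mean(axis=0) - overall_mean) ** 2
        within += len(X_k) * X_k.var(axis=0)
    return between / within


scores = pd.DataFrame(
    {"chi2": chi2_scores, "mutual_info": mi_scores, "fisher": fisher_score(X, y)},
    index=names,
)
print(scores.round(2))

# The two highest-scoring features under each criterion.
top2 = {col: set(scores[col].nlargest(2).index) for col in scores.columns}
print(top2)
```

All three scores are computed once per feature against the label, without training a predictive model, which is what makes them filter methods.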

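Variance threshold [3]: removes low-variance features, i.e. those whose variance does not exceed a given threshold. The default threshold is 0, which removes only features that take the same value in every sample.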

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector=VarianceThreshold()  # the default threshold is 0
X=selector.fit_transform(X)
print(selector.variances_)
print()
print(X)

# Output

[0.         0.22222222 2.88888889 0.        ]

[[2 0]
 [1 4]
 [1 1]]
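As a small follow-up sketch, the same data can be used with a non-zero threshold, and the reported variances can be checked against NumPy's population variance (`ddof=0`). The threshold value 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])

# Population variance of each column (NumPy's default, ddof=0).
manual_var = X.var(axis=0)

# Keep only the columns whose variance is greater than 0.5.
selector = VarianceThreshold(threshold=0.5)
X_reduced = selector.fit_transform(X)
kept = selector.get_support(indices=True)

print(manual_var)
print(kept)       # indices of the columns that were kept
print(X_reduced)
```

With this threshold the second column (variance about 0.22) is dropped as well, leaving only the third column.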

Some wrapper methods [2]:

recursive feature elimination
sequential feature selection algorithms
genetic algorithms
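The first two wrapper methods have ready-made implementations in scikit-learn. The sketch below applies `RFE` and `SequentialFeatureSelector` to the iris data; the estimator (`LogisticRegression`) and the target of two features are assumptions chosen for illustration. Genetic algorithms are not covered here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
names = ["x1", "x2", "x3", "x4"]
estimator = LogisticRegression(max_iter=1000)

# Recursive feature elimination: fit, drop the weakest feature, repeat.
rfe = RFE(estimator, n_features_to_select=2).fit(X, y)
rfe_selected = [n for n, keep in zip(names, rfe.support_) if keep]

# Sequential forward selection: add the feature that most improves the CV score.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)
sfs_selected = [n for n, keep in zip(names, sfs.get_support()) if keep]

print("RFE ranking (1 = selected):", dict(zip(names, rfe.ranking_)))
print("RFE selected:", rfe_selected)
print("SFS selected:", sfs_selected)
```

Both selectors refit the estimator many times internally, which is the cost that wrapper methods pay for taking the model into account.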

Some embedded methods [2]:

L1 (LASSO) regularization
decision tree
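A minimal sketch of both embedded approaches on the iris data follows. For the L1 case it treats x1 as a regression target, as in the correlation example earlier; `alpha=0.1` and the feature standardization are assumptions made for illustration. For the tree it uses the class labels and reads the feature importances that result from training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
names = ["x1", "x2", "x3", "x4"]

# L1 (LASSO): predict x1 from x2..x4; features with non-zero weight are kept.
X_reg = StandardScaler().fit_transform(iris.data[:, 1:])
y_reg = iris.data[:, 0]
lasso = Lasso(alpha=0.1).fit(X_reg, y_reg)
lasso_selected = [n for n, c in zip(names[1:], lasso.coef_) if c != 0]
print("Lasso coefficients:", dict(zip(names[1:], lasso.coef_.round(3))))
print("Lasso selected:", lasso_selected)

# Decision tree: importances are a by-product of fitting the classifier.
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
importances = dict(zip(names, tree.feature_importances_))
print("Tree importances:", {k: round(v, 3) for k, v in importances.items()})
```

In both cases a single training run produces the selection, with no separate search over feature subsets.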

References:

[1] https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b

[2] https://sebastianraschka.com/faq/docs/feature_sele_categories.html#:~:text=The%20third%20class%2C%20embedded%20methods,metric%20is%20used%20during%20learning.

[3] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

[4] https://arxiv.org/pdf/1202.3725.pdf

[5] https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall#:~:text=In%20the%20normal%20case%2C%20Kendall,small%20samples%20or%20some%20outliers.
