Feature Selection

2019-10-26 03:26

1. Removing low-variance features

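VarianceThreshold removes every feature whose variance does not exceed a given threshold. By default it removes all zero-variance features, i.e. features that have the same value in every sample.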

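# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example needs scikit-learn < 1.2.
from sklearn.datasets import load_boston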
from sklearn.feature_selection import VarianceThreshold

boston = load_boston()

x = boston.data
y = boston.target

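# Drop every feature whose variance is not above the threshold. Here the
# threshold is 0.8 * (1 - 0.8) = 0.16; with the default threshold=0.0 only
# zero-variance features would be dropped.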
select_feature = VarianceThreshold(threshold=(0.8*(1-0.8)))

x_n = select_feature.fit_transform(x)

print(x.shape, x_n.shape)  # (506, 13) (506, 11)
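To make the default behaviour concrete, here is a minimal sketch on a tiny hand-made array. The array `X_toy` is an assumption for illustration and is not part of the original example; two of its columns are constant, so they should be the ones removed when no threshold is passed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (an assumption for illustration): columns 0 and 3 are constant.
X_toy = np.array([
    [0, 2, 0, 3],
    [0, 1, 4, 3],
    [0, 1, 1, 3],
])

selector = VarianceThreshold()  # default threshold=0.0
X_toy_reduced = selector.fit_transform(X_toy)

print(selector.variances_)     # per-feature variance learned during fit
print(selector.get_support())  # boolean mask of the features that were kept
print(X_toy_reduced.shape)     # only the two non-constant columns remain
```

get_support() is handy in practice because it tells you which of the original columns survived, something the transformed array alone does not show.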

2. Univariate feature selection

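Univariate feature selection scores each feature on its own with a univariate statistical test and keeps the best-scoring ones.

SelectKBest keeps only the K highest-scoring features.

SelectPercentile keeps only the user-specified percentage of highest-scoring features.

Commonly used univariate statistical tests:

Classification:

chi2: chi-squared test (requires non-negative feature values)

f_classif: ANOVA F-test

mutual_info_classif: mutual information

Regression:

f_regression: F-test derived from the correlation between each feature and the target

mutual_info_regression: mutual information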

# Classification with chi2
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

x = iris.data
y = iris.target

x_n = SelectKBest(chi2, k=2).fit_transform(x, y)

print(x.shape, x_n.shape)  # (150, 4) (150, 2)

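# Regression with f_regression: it computes the correlation between each
# feature and the target and converts it into an F score and a p-value.
# Note: load_boston was removed in scikit-learn 1.2 (requires scikit-learn < 1.2).
from sklearn.datasets import load_boston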
from sklearn.feature_selection import SelectKBest, f_regression

boston = load_boston()

x = boston.data
y = boston.target

x_n = SelectKBest(f_regression, k=10).fit_transform(x, y)

print(x.shape, x_n.shape)  # (506, 13) (506, 10)
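Since load_boston is no longer shipped with recent scikit-learn releases, the same regression workflow can be tried on synthetic data. The sketch below is not the original example: the dataset from make_regression, its sizes, and the names `X_syn`, `y_syn`, and `mi_score` are all assumptions for illustration. It also shows mutual_info_regression, which the list above mentions without an example.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# Synthetic stand-in for the Boston data (an assumption for illustration):
# 200 samples, 10 features, of which 3 actually drive the target.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# Scores from the F-test
f_selector = SelectKBest(f_regression, k=3).fit(X_syn, y_syn)

# Scores from mutual information; the estimate is randomized, so fix the seed
mi_score = partial(mutual_info_regression, random_state=0)
mi_selector = SelectKBest(mi_score, k=3).fit(X_syn, y_syn)

print(f_selector.transform(X_syn).shape)      # 3 columns are kept
print(f_selector.get_support(indices=True))   # columns chosen by the F-test
print(mi_selector.get_support(indices=True))  # columns chosen by mutual information
```

The F-test only captures linear dependence, while mutual information can also pick up non-linear relationships, so the two selectors need not agree on real data.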

# Classification with chi2, keeping the top 50% of the features
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

iris = load_iris()

x = iris.data
y = iris.target

x_n = SelectPercentile(chi2, percentile=50).fit_transform(x, y)

print(x.shape, x_n.shape)  # (150, 4) (150, 2)
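The shapes printed above show how many features were kept, but not which ones or why. As a rough sketch, the fitted selector can be inspected through its scores_ and pvalues_ attributes and through get_support(). The variable names here (`selector`, `kept`) are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# One chi-squared score and one p-value per input feature
for name, score, p_value in zip(iris.feature_names, selector.scores_, selector.pvalues_):
    print(f"{name}: score={score:.2f}, p-value={p_value:.3g}")

# Column indices of the features that were kept, mapped back to their names
kept = selector.get_support(indices=True)
print([iris.feature_names[i] for i in kept])
```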

3. Recursive feature elimination

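Recursive feature elimination (RFE) relies on an external model that assigns a weight to each feature (for example, the coefficients of a linear model) and selects features by recursively considering smaller and smaller feature sets. First, the model is trained on the full feature set and each feature receives a weight. Then the feature with the smallest absolute weight is removed. The procedure is repeated on the remaining features until the desired number of features is reached.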

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

iris = load_iris()

x = iris.data
y = iris.target

svc = SVC(kernel='linear', C=1)

rfe = RFE(estimator=svc, n_features_to_select=3).fit(x, y)

x_n = rfe.transform(x)

print(x.shape, x_n.shape)  # (150, 4) (150, 3)
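The elimination loop described above can be written out by hand to see what happens at each step. This is a simplified sketch of the idea, not scikit-learn's actual implementation. The helper `manual_rfe` is hypothetical, and combining the rows of coef_ by a sum of squares is one reasonable choice for a multi-class linear model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.svm import SVC


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature one at a time."""
    remaining = list(range(X.shape[1]))  # indices of features still in play
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the model on the surviving features only
        model = clone(estimator).fit(X[:, remaining], y)
        # coef_ has one row per pairwise classifier here; combine the rows
        # into one importance per feature via the sum of squared weights.
        importance = np.square(np.atleast_2d(model.coef_)).sum(axis=0)
        weakest = int(np.argmin(importance))
        del remaining[weakest]
    return remaining


iris = load_iris()
kept = manual_rfe(SVC(kernel="linear", C=1), iris.data, iris.target, 3)
print(kept)  # column indices of the surviving features
```

With the real RFE class, the same information is available after fitting through the support_ and ranking_ attributes.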

4. Feature selection with SelectFromModel

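SelectFromModel is a meta-transformer that works with any model exposing a coef_ or feature_importances_ attribute after fitting. A feature is considered unimportant, and is removed, if its coef_ or feature_importances_ value is below the threshold.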

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()

x = digits.data
y = digits.target

rf = RandomForestClassifier(n_estimators=100)
select = SelectFromModel(rf, threshold='median')

select.fit(x, y)

x_n = select.transform(x)

print(x.shape, x_n.shape)  # (1797, 64) (1797, 32)
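To see how the 'median' threshold relates to the model's importances, the fitted selector can be inspected as in the sketch below. Fixing random_state is an assumption made only so that the run is repeatable (the example above leaves it unset), and the names `forest`, `importances`, and `mask` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

digits = load_digits()
X, y = digits.data, digits.target

# random_state is fixed only to make the run repeatable (an assumption;
# the original example leaves it unset).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
select = SelectFromModel(forest, threshold="median").fit(X, y)

importances = select.estimator_.feature_importances_  # from the fitted forest
mask = select.get_support()                            # True for kept features

print(select.threshold_)  # numeric threshold actually used
print(np.isclose(select.threshold_, np.median(importances)))
print(importances[mask].min() >= select.threshold_)  # kept features reach the threshold
```

Besides 'median', the threshold parameter also accepts 'mean', a plain float, or a scaled string such as '1.25*mean'.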
