Introduction
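ChiSqSelector stands for chi-squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the chi-squared test of independence to decide which features the class label depends on most, which amounts to keeping the features with the most predictive power. It supports three selection methods:
1. numTopFeatures: selects a fixed number (num) of top features according to the chi-squared test, that is, the features with the most predictive power;
2. percentile: similar to numTopFeatures, but selects a fraction of all features instead of a fixed number;
3. fpr: selects all features whose p-values are below a threshold, which controls the false positive rate of the selection.
By default the selection method is numTopFeatures, with the default number of top features set to 50. Use setSelectorType() to choose a different selection method.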
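As a minimal sketch of switching between these methods, the snippet below configures two selectors through setSelectorType(). The threshold values 0.5 and 0.05 and the names byPercentile and byFpr are illustrative assumptions, not recommendations from the original article; it assumes the Spark ML library is on the classpath (for example, in spark-shell).

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep a fraction of the features instead of a fixed count.
// The value 0.5 is an arbitrary illustrative choice.
val byPercentile = new ChiSqSelector()
  .setSelectorType("percentile")
  .setPercentile(0.5)

// Keep every feature whose p-value falls below the threshold.
// The value 0.05 is an arbitrary illustrative choice.
val byFpr = new ChiSqSelector()
  .setSelectorType("fpr")
  .setFpr(0.05)
```

Either selector still needs setFeaturesCol, setLabelCol, and setOutputCol before fitting, exactly as in the numTopFeatures case.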
Example: assume we have a DataFrame with the columns id, features, and clicked, where clicked is the target to be predicted:
id | features | clicked |
---|---|---|
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 |
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 |
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 |
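If we use ChiSqSelector with numTopFeatures set to 1, a single feature is kept according to its relation to the label clicked. In the output of the code below, that is the third column of features:
id | features | clicked | selectedFeatures |
---|---|---|---|
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [18.0] |
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [12.0] |
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [15.0] |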
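To make the ranking on this three-row example more concrete, here is a rough sketch that computes the Pearson chi-squared statistic of each feature column against clicked in plain Scala, treating every distinct feature value as a category. The helper name chiSquared is hypothetical, and the sketch does not reproduce Spark's own ranking or tie-breaking logic; it only shows the statistic that a test of independence is built on.

```scala
// Pearson chi-squared statistic for one categorical feature against a label.
// Plain Scala with no Spark dependency; `chiSquared` is an illustrative helper.
def chiSquared(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
  val n = feature.size.toDouble
  val featureCounts = feature.groupBy(identity).map { case (v, xs) => v -> xs.size.toDouble }
  val labelCounts = label.groupBy(identity).map { case (v, xs) => v -> xs.size.toDouble }
  val observed = feature.zip(label).groupBy(identity).map { case (k, xs) => k -> xs.size.toDouble }

  // Sum (observed - expected)^2 / expected over every cell of the contingency table.
  val statistic = (for {
    (f, fc) <- featureCounts.toSeq
    (l, lc) <- labelCounts.toSeq
  } yield {
    val expected = fc * lc / n
    val obs = observed.getOrElse((f, l), 0.0)
    math.pow(obs - expected, 2) / expected
  }).sum

  val degreesOfFreedom = (featureCounts.size - 1) * (labelCounts.size - 1)
  (statistic, degreesOfFreedom)
}

val clicked = Seq(1.0, 0.0, 0.0)
val columns = Seq(
  Seq(0.0, 0.0, 1.0),    // feature 0
  Seq(0.0, 1.0, 0.0),    // feature 1
  Seq(18.0, 12.0, 15.0), // feature 2
  Seq(1.0, 0.0, 0.1)     // feature 3
)
val stats = columns.map(chiSquared(_, clicked))
stats.zipWithIndex.foreach { case ((stat, df), i) =>
  println(f"feature $i: statistic = $stat%.2f, degrees of freedom = $df")
}
```

On this data, features 0 and 1 both score 0.75 with 1 degree of freedom, while features 2 and 3 both score 3.00 with 2 degrees of freedom. The last two columns therefore tie, so with so few rows the single feature that ends up selected depends on how ties are broken.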
Code
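import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Assumes `spark` is an existing SparkSession (for example, in spark-shell).
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
// Alternative (needs `import spark.implicits._`):
// val df = spark.createDataset(data).toDF("id", "features", "clicked")
val df = spark.createDataFrame(data).toDF("id", "features", "clicked")

// Keep only the single top-ranked feature with respect to `clicked`.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

// Fit the selector on df, then append the selectedFeatures column.
val result = selector.fit(df).transform(df)
println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()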
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+
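The table above shows the kept values but not which input position they came from. As a small sketch, the fitted ChiSqSelectorModel exposes the chosen indices through selectedFeatures. The local SparkSession settings and the application name are assumptions made so that the snippet stands alone; in spark-shell the existing spark session can be used instead.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ChiSqSelectorIndices")
  .getOrCreate()

val df = spark.createDataFrame(Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("id", "features", "clicked")

val model = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
  .fit(df)

// Zero-based indices of the kept features within the input vector.
println(model.selectedFeatures.mkString("selected indices: ", ", ", ""))
```

Since the kept values in the table are 18.0, 12.0, and 15.0, the printed index would be expected to be 2, though that is an inference from the table rather than something the original article states.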
This article is from 行者小朱's CSDN blog. The full text is available at: https://blog.csdn.net/u012050154/article/details/60766387?utm_source=copy