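Using PCA in Spark 2.0 is fairly simple. You only need to set:
.setInputCol("features") // the input column must hold feature vectors
.setOutputCol("pcaFeatures") // output column
.setK(3) // number of principal components
Note: always normalize (standardize) the feature vectors before running PCA!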
// Spark 2.0 PCA (principal component analysis)
// Note: standardize the raw data (feature vectors) before reducing dimensionality with PCA
package my.spark.ml.practice;

import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel; // from spark.ml, not spark.mllib
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class myPCA {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("myLR")
                .master("local[4]")
                .getOrCreate();
        Dataset<Row> rawDataFrame = spark.read().format("libsvm")
                .load("/home/hadoop/spark/spark-2.0.0-bin-hadoop2.6" +
                        "/data/mllib/sample_libsvm_data.txt");
        // First, standardize the feature vectors
        Dataset<Row> scaledDataFrame = new StandardScaler()
                .setInputCol("features")
                .setOutputCol("scaledFeatures")
                .setWithMean(false) // for sparse data (such as this data set), do not center with the mean; centering produces dense vectors
                .setWithStd(true)
                .fit(rawDataFrame)
                .transform(rawDataFrame);
        // PCA Model
        PCAModel pcaModel = new PCA()
                .setInputCol("scaledFeatures")
                .setOutputCol("pcaFeatures")
                .setK(3) // number of principal components
                .fit(scaledDataFrame);
        // Apply the PCA dimensionality reduction
        pcaModel.transform(scaledDataFrame).select("label", "pcaFeatures").show(100, false);
    }
}
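/**
* Without standardizing the feature vectors and running PCA directly: the principal component values vary too much and differ by orders of magnitude.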
+-----+------------------------------------------------------------+
|label|pcaFeatures |
+-----+------------------------------------------------------------+
|0.0 |[-1730.496937303442,6.811910953794295,2.8044962135250024] |
|1.0 |[290.7950975587044,21.14756134360174,0.7002807351637692] |
|1.0 |[149.4029441007031,-13.733854376555671,9.844080682283838] |
|1.0 |[200.47507801105797,18.739201694569232,22.061802015132024] |
|1.0 |[236.57576401934855,36.32142445435475,56.49778957910826] |
|0.0 |[-1720.2537550195714,25.318146742090196,2.8289957152580136] |
|1.0 |[285.94940382351075,-6.729431266185428,-33.69780131162192] |
|1.0 |[-323.70613777909136,2.72250162998038,-0.528081577573507] |
|0.0 |[-1150.8358810584655,5.438673892459839,3.3725913786301804] |
*/
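/**
* After standardizing the feature vectors, the principal component values are roughly on the same scale, and the result is more reasonable.
+-----+-------------------------------------------------------------+
|label|pcaFeatures |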
+-----+-------------------------------------------------------------+
|0.0 |[-14.998868464839624,-10.137788261664621,-3.042873539670117] |
|1.0 |[2.1965800525589754,-4.139257418439533,-11.386135042845101] |
|1.0 |[1.0254645688925883,-0.8905813756164163,7.168759904518129] |
|1.0 |[1.5069317554093433,-0.7289177578028571,5.23152743564543] |
|1.0 |[1.6938250375084654,-0.4350617717494331,4.770263568537382] |
|0.0 |[-15.870371979062549,-9.999445137658528,-6.521920373215663] |
|1.0 |[3.023279951602481,-4.102323190311296,-9.451729897327345] |
|1.0 |[3.500670997961283,-4.1791886802435805,-9.306353932746568] |
|0.0 |[-15.323114679599747,-16.83241059234951,2.0282183995400374] |
*/
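The contrast between the two outputs above can be reproduced on a toy problem. The following is a minimal pure-Python sketch, not the Spark implementation: it assumes a hypothetical two-feature data set in which one feature is about 1000 times larger than the other, and computes the share of variance captured by the first principal component from the closed-form eigenvalues of the 2x2 covariance matrix. The data and the helper names `first_component_share` and `standardize` are made up for illustration.

```python
import math


def first_component_share(xs, ys):
    """Fraction of total variance explained by the first principal component of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return largest / (a + c)


def standardize(values):
    """Scale to unit sample standard deviation without centering (like setWithMean(false))."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [v / std for v in values]


# Hypothetical data: feature 1 is in the thousands, feature 2 in single digits
f1 = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
f2 = [2.0, 1.0, 4.0, 3.0, 5.0]

raw_share = first_component_share(f1, f2)
scaled_share = first_component_share(standardize(f1), standardize(f2))
print("raw:", raw_share)
print("standardized:", scaled_share)
```

On the raw data the first component is almost entirely the large-scale feature, so its share is very close to 1. After standardization both features contribute on equal terms, and the share reflects only how correlated they are.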
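How to choose k?
// PCA Model
PCAModel pcaModel = new PCA()
        .setInputCol("scaledFeatures")
        .setOutputCol("pcaFeatures")
        .setK(100) // request 100 components so their explained variance can be inspected
        .fit(scaledDataFrame);
int i = 1;
for (double x : pcaModel.explainedVariance().toArray()) {
    System.out.println(i + "\t" + x + " ");
    i++;
}
This prints 100 explainedVariance values in descending order. Each value is the proportion of variance explained by that component, which corresponds to explained_variance_ratio_ in scikit-learn's PCA: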
1 0.25934799275530857
2 0.12355355301486977
3 0.07447670060988294
4 0.0554545717486928
5 0.04207050513264405
6 0.03715986573644129
7 0.031350566055423544
8 0.027797304129489515
9 0.023825873477496748
10 0.02268054946233242
11 0.021320060154167115
12 0.019764029918116235
13 0.016789082901450734
14 0.015502412597350008
15 0.01378190652256973
16 0.013539546429755526
17 0.013283518226716669
18 0.01110412833334044
...
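Roughly 20 principal components are enough here: the first 18 values listed above already add up to about 0.82 of the total variance, and every later component contributes at most about 0.011.
A quick plot of these values makes the choice easier (see the scikit-learn example for details):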
http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html
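One way to turn the printed values into a choice of k is to accumulate them and pick the first k that reaches a target fraction of the variance. The sketch below does this in plain Python for the 18 values shown above (rounded to six decimals). The helper name `smallest_k` and the 75% and 80% thresholds are assumptions chosen for illustration, not something the Spark API provides.

```python
# First 18 explained-variance ratios from the Spark output above (rounded to 6 decimals)
ratios = [
    0.259348, 0.123554, 0.074477, 0.055455, 0.042071, 0.037160,
    0.031351, 0.027797, 0.023826, 0.022681, 0.021320, 0.019764,
    0.016789, 0.015502, 0.013782, 0.013540, 0.013284, 0.011104,
]


def smallest_k(ratios, threshold):
    """Return the smallest k whose cumulative ratio reaches threshold, or None if never reached."""
    total = 0.0
    for k, r in enumerate(ratios, start=1):
        total += r
        if total >= threshold:
            return k
    return None


# Running total of explained variance after each component
cumulative = []
running = 0.0
for r in ratios:
    running += r
    cumulative.append(running)

for k, c in enumerate(cumulative, start=1):
    print(k, round(c, 4))

print("k for 75%:", smallest_k(ratios, 0.75))
print("k for 80%:", smallest_k(ratios, 0.80))
```

With these values the 75% target is reached at k=13 and the 80% target at k=17. A 90% target is not reached within the 18 values shown, so `smallest_k` returns None for it unless the full list of 100 values is supplied.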
Using PCA in scikit-learn
Reference: http://blog.csdn.net/u012162613/article/details/42192293
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)
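Parameters:
n_components:
Meaning: the number n of principal components to keep, i.e. the number of features retained after the reduction.
Type: int or string; defaults to None, in which case all components are kept.
As an int, e.g. n_components=1, the original data is reduced to one dimension.
As a string, e.g. n_components='mle', the number of components is chosen automatically using Minka's MLE. To keep just enough components to reach a required fraction of the variance, pass a float between 0 and 1 instead.
copy:
Type: bool, True or False; defaults to True.
Meaning: whether to copy the original training data before running the algorithm. If True, the original training data is unchanged after PCA, because the computation runs on a copy; if False, the original training data may be overwritten, because the computation runs on the data in place.
whiten:
Type: bool; defaults to False.
Meaning: whitening, which rescales each projected component to unit variance. For more on whitening, see the UFLDL tutorial.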
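The parameter choices above can be compared side by side. This is a small sketch on the iris data set, assuming a scikit-learn version recent enough to accept the `svd_solver` argument (it is not part of the older signature quoted above); the variable names are illustrative.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data  # 150 samples, 4 features

# int: keep exactly this many components
pca_int = PCA(n_components=2).fit(X)

# float in (0, 1): keep enough components to explain at least that fraction of the variance
pca_frac = PCA(n_components=0.95, svd_solver="full").fit(X)

# 'mle': let Minka's MLE estimate the number of components
pca_mle = PCA(n_components="mle", svd_solver="full").fit(X)

# whiten=True: each projected component is rescaled to unit variance
X_white = PCA(n_components=2, whiten=True).fit_transform(X)

print(pca_int.n_components_, pca_frac.n_components_, pca_mle.n_components_)
print(X_white.var(axis=0, ddof=1))
```

The fitted attribute `n_components_` reports how many components were actually kept, which is the value to inspect when `n_components` is given as a float or as 'mle'.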
A simple example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Project the 4 iris features onto the first 3 principal components
pca = PCA(n_components=3)
X_r = pca.fit(X).transform(X)

print("X_r")
print(X_r)
print("X")
print(X)
print("pca.explained_variance_ratio_")
print(pca.explained_variance_ratio_)
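To mirror the Spark workflow (standardize first, then PCA) in scikit-learn, the two steps can be chained in a pipeline. This is a sketch under stated assumptions rather than a direct port: scikit-learn's `StandardScaler` centers the data by default, unlike the Spark example, which used setWithMean(false) because its input was sparse. The variable names are illustrative.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

# Standardize each feature to zero mean and unit variance, then project onto 3 components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
X_scaled_r = pipeline.fit_transform(X)

# make_pipeline names each step after its lowercased class name
ratio = pipeline.named_steps["pca"].explained_variance_ratio_
print(X_scaled_r[:5])
print(ratio)
```

Comparing `ratio` with the `explained_variance_ratio_` printed by the unscaled example shows how much of the first component's dominance came from feature scale rather than from structure in the data.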