Spark MLlib Statistics

1. Spark MLlib Statistics

[Figure: code structure of the Spark MLlib statistics module]

1.1 Column summary statistics

Compute each column's maximum, minimum, mean, variance, L1 norm, and L2 norm.

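    // Imports needed by the snippets in this article
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Read the tab-separated data and convert it to RDD[Vector]
    // (sc is the SparkContext provided by spark-shell)
    val data_path = "/home/jb-huangmeiling/sample_stat.txt"
    val data = sc.textFile(data_path).map(_.split("\t")).map(_.map(_.toDouble))
    val data1 = data.map(f => Vectors.dense(f))

    // Compute each column's max, min, mean, variance, L1 norm, and L2 norm
    val stat1 = Statistics.colStats(data1)
    stat1.max
    stat1.min
    stat1.mean
    stat1.variance
    stat1.normL1
    stat1.normL2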

Result:

Input data (sample_stat.txt, four rows of five tab-separated values):

    1    2    3    4    5
    6    7    1    5    9
    3    5    6    3    1
    3    1    1    5    6

 

scala> data1.collect

res19: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.0,2.0,3.0,4.0,5.0], [6.0,7.0,1.0,5.0,9.0], [3.0,5.0,6.0,3.0,1.0], [3.0,1.0,1.0,5.0,6.0])

 

scala>     stat1.max

res20: org.apache.spark.mllib.linalg.Vector = [6.0,7.0,6.0,5.0,9.0]

 

scala>     stat1.min

res21: org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,3.0,1.0]

 

scala>     stat1.mean

res22: org.apache.spark.mllib.linalg.Vector = [3.25,3.75,2.75,4.25,5.25]

 

scala>     stat1.variance

res23: org.apache.spark.mllib.linalg.Vector = [4.25,7.583333333333333,5.583333333333333,0.9166666666666666,10.916666666666666]

 

scala>     stat1.normL1

res24: org.apache.spark.mllib.linalg.Vector = [13.0,15.0,11.0,17.0,21.0]

 

scala>     stat1.normL2

res25: org.apache.spark.mllib.linalg.Vector = [7.416198487095663,8.888194417315589,6.855654600401044,8.660254037844387,11.958260743101398]
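
To make the meaning of each field concrete, the following is a minimal plain-Scala sketch that recomputes the same six statistics for the sample data without Spark. The object name ColStatsSketch and its field names are made up for illustration; this is not how MLlib computes the values internally, only a check of what they mean. Note that the variance shown above matches the unbiased sample variance (division by n - 1).

```scala
object ColStatsSketch {
  // The four rows of the sample data, one Array per row.
  val rows: Seq[Array[Double]] = Seq(
    Array(1.0, 2.0, 3.0, 4.0, 5.0),
    Array(6.0, 7.0, 1.0, 5.0, 9.0),
    Array(3.0, 5.0, 6.0, 3.0, 1.0),
    Array(3.0, 1.0, 1.0, 5.0, 6.0)
  )

  // Turn rows into columns so each statistic is a simple pass over one column.
  val cols: Seq[Seq[Double]] = rows.map(_.toSeq).transpose

  val n: Int = rows.length
  val colMax: Seq[Double] = cols.map(_.max)
  val colMin: Seq[Double] = cols.map(_.min)
  val mean: Seq[Double] = cols.map(_.sum / n)

  // Unbiased sample variance: squared deviations from the mean, divided by n - 1.
  val variance: Seq[Double] = cols.zip(mean).map { case (c, m) =>
    c.map(v => (v - m) * (v - m)).sum / (n - 1)
  }

  // L1 norm: sum of absolute values. L2 norm: square root of the sum of squares.
  val normL1: Seq[Double] = cols.map(c => c.map(v => math.abs(v)).sum)
  val normL2: Seq[Double] = cols.map(c => math.sqrt(c.map(v => v * v).sum))

  def main(args: Array[String]): Unit = {
    println(s"max      = $colMax")
    println(s"min      = $colMin")
    println(s"mean     = $mean")
    println(s"variance = $variance")
    println(s"normL1   = $normL1")
    println(s"normL2   = $normL2")
  }
}
```

For the first column (1, 6, 3, 3) this gives a mean of 3.25 and a variance of 12.75 / 3 = 4.25, in line with the spark-shell output above.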

1.2 Correlation coefficients

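The Pearson correlation coefficient measures the linear correlation between two numeric variables and is generally suited to normally distributed data. It takes values in [-1, 1]: 0 indicates no linear correlation, values in [-1, 0) indicate negative correlation, and values in (0, 1] indicate positive correlation.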
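
For reference, the sample Pearson coefficient for n paired observations can be written as follows. The symbols x_i, y_i, and the means are notation introduced here for illustration; they do not appear in the original text.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```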

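The Spearman correlation coefficient also measures the association between two variables, but it places weaker requirements on their distributions than the Pearson coefficient does, and it is better suited to measuring rank-order (monotonic) relationships. It is the Pearson correlation coefficient computed on the ranks of the values.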
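
A standard way to write the Spearman coefficient is shown below, assuming R(x_i) and R(y_i) denote the ranks of the observations, with tied values receiving the average of their ranks. The notation is introduced here for illustration. The shorter second form is valid only when there are no ties.

```latex
\rho = \frac{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)\bigl(R(y_i) - \bar{R}_y\bigr)}
            {\sqrt{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)^2}\;
             \sqrt{\sum_{i=1}^{n} \bigl(R(y_i) - \bar{R}_y\bigr)^2}}

% Without ties this simplifies to:
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)},
\qquad d_i = R(x_i) - R(y_i)
```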

    // Compute the Pearson and Spearman correlation matrices over the columns of data1
    val corr1 = Statistics.corr(data1, "pearson")
    val corr2 = Statistics.corr(data1, "spearman")

    // Pearson correlation between two RDD[Double] series
    val x1 = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
    val y1 = sc.parallelize(Array(5.0, 6.0, 6.0, 6.0))
    val corr3 = Statistics.corr(x1, y1, "pearson")

scala> corr1

res6: org.apache.spark.mllib.linalg.Matrix =

1.0                   0.7779829610026362    -0.39346431156047523  ... (5 total)

0.7779829610026362    1.0                   0.14087521363240252   ...

-0.39346431156047523  0.14087521363240252   1.0                   ...

0.4644203640128242    -0.09482093118615205  -0.9945577827230707   ...

0.5750122832421579    0.19233705001984078   -0.9286374704669208   ...

 

scala> corr2

res7: org.apache.spark.mllib.linalg.Matrix =

1.0                  0.632455532033675     -0.5000000000000001  ... (5 total)

0.632455532033675    1.0                   0.10540925533894883  ...

-0.5000000000000001  0.10540925533894883   1.0                  ...

0.5000000000000001   -0.10540925533894883  -1.0000000000000002  ...

0.6324555320336723   0.20000000000000429   -0.9486832980505085  ...

 

scala> corr3

res8: Double = 0.7745966692414775
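
The values above can be checked by hand. The following is a plain-Scala sketch, independent of Spark, that computes Pearson directly and Spearman as Pearson over average ranks. The object CorrelationSketch and its helper names are hypothetical and chosen for illustration; the code is a small-data check, not MLlib's distributed implementation.

```scala
object CorrelationSketch {
  // Pearson correlation of two equally long, non-empty sequences.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "inputs must be non-empty and equally long")
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // 1-based ranks; tied values share the average of the positions they occupy.
  def ranks(v: Seq[Double]): Seq[Double] = {
    val sorted = v.sorted
    v.map { a =>
      val first = sorted.indexOf(a) + 1
      val last = sorted.lastIndexOf(a) + 1
      (first + last) / 2.0
    }
  }

  // Spearman correlation: Pearson applied to the ranks.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x1 = Seq(1.0, 2.0, 3.0, 4.0)
    val y1 = Seq(5.0, 6.0, 6.0, 6.0)
    println(pearson(x1, y1)) // approximately 0.7746, in line with corr3

    // First two columns of the sample data
    val col1 = Seq(1.0, 6.0, 3.0, 3.0)
    val col2 = Seq(2.0, 7.0, 5.0, 1.0)
    println(pearson(col1, col2))  // approximately 0.778, entry (0, 1) of corr1
    println(spearman(col1, col2)) // approximately 0.632, entry (0, 1) of corr2
  }
}
```

The first column contains a tie (the value 3 appears twice), so its ranks are 1, 4, 2.5, 2.5. With those average ranks, the sketch reproduces the 0.632 entry of corr2.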

1.3 Hypothesis testing

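MLlib currently supports Pearson's chi-squared (χ2) tests for goodness of fit and for independence. The input type determines which test is run: the goodness-of-fit test takes a Vector, and the independence test takes a Matrix.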
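
Both tests use the same statistic and differ only in how the expected counts are obtained. In the notation below, introduced here for illustration, O denotes observed counts and E expected counts. For the independence test on a table with r rows and c columns, the expected count of a cell is derived from the row and column totals.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}

% Independence test on an r x c contingency table with grand total N:
E_{ij} = \frac{\bigl(\sum_{k} O_{ik}\bigr)\bigl(\sum_{k} O_{kj}\bigr)}{N},
\qquad \text{degrees of freedom} = (r - 1)(c - 1)

% Goodness-of-fit test over k categories:
\text{degrees of freedom} = k - 1
```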

    // Chi-squared goodness-of-fit test: v1 holds the observed counts, v2 the expected counts
    val v1 = Vectors.dense(43.0, 9.0)
    val v2 = Vectors.dense(44.0, 4.0)
    val c1 = Statistics.chiSqTest(v1, v2)

Result:

c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =

Chi squared test summary:

method: pearson

degrees of freedom = 1

statistic = 5.482517482517483

pValue = 0.01920757707591003

Strong presumption against null hypothesis: observed follows the same distribution as expected..

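The result reports the method (pearson), the degrees of freedom (1), the test statistic (5.48), and the p-value (0.019). A p-value this small counts as strong evidence against the null hypothesis that the observed counts follow the expected distribution.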
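
The statistic can be reproduced by hand with the plain-Scala sketch below, which does not use Spark. The object ChiSquaredSketch and its method names are hypothetical. Because the two vectors sum to different totals (52 and 48), the sketch rescales the expected counts to the observed total; with that assumption it arrives at the 5.48 reported above, whereas the unscaled counts would not. The independence variant takes the same four numbers as a 2 x 2 table purely as an illustration and computes only the statistic, not the p-value.

```scala
object ChiSquaredSketch {
  // Pearson's statistic: sum of (observed - expected)^2 / expected.
  private def statistic(observed: Seq[Double], expected: Seq[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

  // Goodness of fit. Expected counts are rescaled so that they sum to the
  // observed total (an assumption that matches the result shown above).
  def goodnessOfFit(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.length == expected.length, "vectors must have the same length")
    val scale = observed.sum / expected.sum
    statistic(observed, expected.map(_ * scale))
  }

  // Independence test on a contingency table given as rows of counts.
  // Expected cell count = row total * column total / grand total.
  def independence(table: Seq[Seq[Double]]): Double = {
    val rowTotals = table.map(_.sum)
    val colTotals = table.transpose.map(_.sum)
    val total = rowTotals.sum
    // Row-major order, matching table.flatten below.
    val expected = for (r <- rowTotals; c <- colTotals) yield r * c / total
    statistic(table.flatten, expected)
  }

  def main(args: Array[String]): Unit = {
    // Same observed and expected counts as v1 and v2
    println(goodnessOfFit(Seq(43.0, 9.0), Seq(44.0, 4.0))) // approximately 5.4825

    // Illustrative 2 x 2 contingency table built from the same numbers
    println(independence(Seq(Seq(43.0, 44.0), Seq(9.0, 4.0))))
  }
}
```

In the goodness-of-fit case the rescaled expected counts are 143/3 and 13/3, giving 196/429 + 196/39, or about 5.4825.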
