Spark MLlib Statistics

Original

sunbow0

2020-02-23 22:33

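1. Spark MLlib Statistics

[Figure: code structure of the Spark MLlib statistics module]

1.1 Column Summary Statistics

Statistics.colStats computes, for each column, the maximum, minimum, mean, variance (sample variance, dividing by n - 1), L1 norm, and L2 norm.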

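// Run in spark-shell, where sc (the SparkContext) is predefined
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Read the tab-separated data and convert it to an RDD[Vector]
val data_path = "/home/jb-huangmeiling/sample_stat.txt"
val data = sc.textFile(data_path).map(_.split("\t")).map(fields => fields.map(_.toDouble))
val data1 = data.map(f => Vectors.dense(f))

// Compute the per-column max, min, mean, variance, L1 norm, and L2 norm
val stat1 = Statistics.colStats(data1)
stat1.max
stat1.min
stat1.mean
stat1.variance
stat1.normL1
stat1.normL2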

Results:

Input data (sample_stat.txt, tab-separated):

1	2	3	4	5
6	7	1	5	9
3	5	6	3	1
3	1	1	5	6

scala> data1.collect

res19: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.0,2.0,3.0,4.0,5.0], [6.0,7.0,1.0,5.0,9.0], [3.0,5.0,6.0,3.0,1.0], [3.0,1.0,1.0,5.0,6.0])

scala> stat1.max

res20: org.apache.spark.mllib.linalg.Vector = [6.0,7.0,6.0,5.0,9.0]

scala> stat1.min

res21: org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,3.0,1.0]

scala> stat1.mean

res22: org.apache.spark.mllib.linalg.Vector = [3.25,3.75,2.75,4.25,5.25]

scala> stat1.variance

res23: org.apache.spark.mllib.linalg.Vector = [4.25,7.583333333333333,5.583333333333333,0.9166666666666666,10.916666666666666]

scala> stat1.normL1

res24: org.apache.spark.mllib.linalg.Vector = [13.0,15.0,11.0,17.0,21.0]

scala> stat1.normL2

res25: org.apache.spark.mllib.linalg.Vector = [7.416198487095663,8.888194417315589,6.855654600401044,8.660254037844387,11.958260743101398]
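The figures above can be cross-checked without Spark. The following is a minimal sketch in plain Scala, not MLlib's implementation; the object name ColStatsCheck and its helper methods are hypothetical names introduced only for this illustration. It recomputes the column statistics from the four sample rows and assumes the sample variance (dividing by n - 1), which is what the printed variance values correspond to.

```scala
object ColStatsCheck {
  // The four sample rows from sample_stat.txt
  val rows: Seq[Seq[Double]] = Seq(
    Seq(1.0, 2.0, 3.0, 4.0, 5.0),
    Seq(6.0, 7.0, 1.0, 5.0, 9.0),
    Seq(3.0, 5.0, 6.0, 3.0, 1.0),
    Seq(3.0, 1.0, 1.0, 5.0, 6.0)
  )

  // Transpose so that each inner sequence holds one column
  val cols: Seq[Seq[Double]] = rows.transpose

  def max(c: Seq[Double]): Double = c.reduce(_ max _)
  def min(c: Seq[Double]): Double = c.reduce(_ min _)
  def mean(c: Seq[Double]): Double = c.sum / c.size

  // Sample variance: sum of squared deviations divided by n - 1
  def variance(c: Seq[Double]): Double = {
    val m = mean(c)
    c.map(v => (v - m) * (v - m)).sum / (c.size - 1)
  }

  // L1 norm: sum of absolute values; L2 norm: square root of the sum of squares
  def normL1(c: Seq[Double]): Double = c.map(v => math.abs(v)).sum
  def normL2(c: Seq[Double]): Double = math.sqrt(c.map(v => v * v).sum)

  def main(args: Array[String]): Unit = {
    println("max      = " + cols.map(max))
    println("min      = " + cols.map(min))
    println("mean     = " + cols.map(mean))
    println("variance = " + cols.map(variance))
    println("normL1   = " + cols.map(normL1))
    println("normL2   = " + cols.map(normL2))
  }
}
```

For the first column (1, 6, 3, 3) this gives a mean of 3.25 and a variance of 12.75 / 3 = 4.25, in line with stat1.mean and stat1.variance.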

1.2 Correlation Coefficients

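The Pearson correlation coefficient measures the linear correlation between two numeric variables and is generally used for normally distributed data. Its value lies in [-1, 1]: 0 means no linear correlation, a value in [-1, 0) means negative correlation, and a value in (0, 1] means positive correlation.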

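The Spearman correlation coefficient also measures the association between two variables, but its requirements on the distribution of the variables are less strict than Pearson's, and it is better suited to measuring rank-order relationships. It is the Pearson coefficient computed on the ranks of the values.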
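For reference, the two coefficients can be written as follows. These are the standard textbook definitions, stated here as an assumption rather than taken from the MLlib source; R(x_i) denotes the rank of x_i (tied values share the average rank), and the shortcut form on the last line holds only when there are no ties.

```latex
% Pearson correlation of x and y
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Spearman correlation: Pearson correlation of the ranks
\rho_{xy} = r_{R(x)\,R(y)}

% Shortcut when all ranks are distinct, with d_i = R(x_i) - R(y_i)
\rho_{xy} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}
```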

// Compute the Pearson and Spearman correlation matrices of the columns
val corr1 = Statistics.corr(data1, "pearson")
val corr2 = Statistics.corr(data1, "spearman")

// Compute the Pearson correlation between two RDD[Double] series
val x1 = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y1 = sc.parallelize(Array(5.0, 6.0, 6.0, 6.0))
val corr3 = Statistics.corr(x1, y1, "pearson")

scala> corr1

res6: org.apache.spark.mllib.linalg.Matrix =

1.0 0.7779829610026362 -0.39346431156047523 ... (5 total)

0.7779829610026362 1.0 0.14087521363240252 ...

-0.39346431156047523 0.14087521363240252 1.0 ...

0.4644203640128242 -0.09482093118615205 -0.9945577827230707 ...

0.5750122832421579 0.19233705001984078 -0.9286374704669208 ...

scala> corr2

res7: org.apache.spark.mllib.linalg.Matrix =

1.0 0.632455532033675 -0.5000000000000001 ... (5 total)

0.632455532033675 1.0 0.10540925533894883 ...

-0.5000000000000001 0.10540925533894883 1.0 ...

0.5000000000000001 -0.10540925533894883 -1.0000000000000002 ...

0.6324555320336723 0.20000000000000429 -0.9486832980505085 ...

scala> corr3

res8: Double = 0.7745966692414775
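The correlation values above can be reproduced by hand. The following is a minimal sketch in plain Scala, not MLlib's implementation; CorrCheck, pearson, ranks, and spearman are hypothetical names introduced for this illustration. It assumes that Spearman's coefficient is the Pearson coefficient of the ranks, with tied values sharing their average rank, which is consistent with the corr2 output shown above.

```scala
object CorrCheck {
  // Pearson correlation of two equally long series
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty, "series must be non-empty and of equal length")
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // 1-based ranks; tied values share the average of their sorted positions
  def ranks(x: Seq[Double]): Seq[Double] = {
    val sorted = x.sortWith(_ < _)
    x.map { v =>
      val first = sorted.indexOf(v) + 1
      val last = sorted.lastIndexOf(v) + 1
      (first + last) / 2.0
    }
  }

  // Spearman correlation: Pearson correlation applied to the ranks
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    // Same series as x1 and y1
    println(pearson(Seq(1.0, 2.0, 3.0, 4.0), Seq(5.0, 6.0, 6.0, 6.0)))
    // Columns 1 and 3 of the sample data
    println(spearman(Seq(1.0, 6.0, 3.0, 3.0), Seq(3.0, 1.0, 6.0, 1.0)))
  }
}
```

For x1 and y1 the Pearson value is 1.5 / sqrt(5 * 0.75), about 0.7746, matching corr3. For columns 1 and 3 of the sample data, the ranks are (1, 4, 2.5, 2.5) and (3, 1.5, 4, 1.5), and their Pearson correlation is -0.5, matching the corresponding entry of corr2 up to floating-point rounding.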

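1.3 Hypothesis Testing

MLlib currently supports Pearson's chi-squared (χ2) test for goodness of fit and for independence. The input type determines which test is run: the goodness-of-fit test takes a Vector as input, while the independence test takes a Matrix.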

// Chi-squared goodness-of-fit test: v1 is the observed vector, v2 the expected vector
val v1 = Vectors.dense(43.0, 9.0)
val v2 = Vectors.dense(44.0, 4.0)
val c1 = Statistics.chiSqTest(v1, v2)

Result:

c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =

Chi squared test summary:

method: pearson

degrees of freedom = 1

statistic = 5.482517482517483

pValue = 0.01920757707591003

Strong presumption against null hypothesis: observed follows the same distribution as expected..

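The result reports the method (pearson), the degrees of freedom (1), the test statistic (5.48), and the p-value (0.019). Because the p-value is below 0.05, the null hypothesis that the observed counts follow the expected distribution is rejected at the 5% level.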
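The test statistic can be recomputed by hand. The following is a minimal sketch in plain Scala, not MLlib's implementation; ChiSqCheck and its methods are hypothetical names introduced for this illustration. It assumes that the expected counts are rescaled to the observed total before comparison, because the two vectors here sum to 52 and 48 respectively; with that rescaling the arithmetic reproduces the statistic printed above. The p-value is not computed here.

```scala
object ChiSqCheck {
  // Pearson chi-squared statistic for a goodness-of-fit test.
  // Expected counts are rescaled so that they sum to the observed total.
  def statistic(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.size == expected.size, "vectors must have the same length")
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) =>
      val es = e * scale // rescaled expected count
      (o - es) * (o - es) / es
    }.sum
  }

  // Degrees of freedom for a goodness-of-fit test: number of categories minus one
  def degreesOfFreedom(observed: Seq[Double]): Int = observed.size - 1

  def main(args: Array[String]): Unit = {
    val observed = Seq(43.0, 9.0) // same values as v1
    val expected = Seq(44.0, 4.0) // same values as v2
    println("statistic = " + statistic(observed, expected))
    println("degrees of freedom = " + degreesOfFreedom(observed))
  }
}
```

With the expected counts rescaled to 143/3 and 13/3, the two terms are 196/429 and 196/39, which sum to about 5.4825 with one degree of freedom.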
