Spark MLlib KMeans Clustering Algorithm

1.1 The KMeans Clustering Algorithm

1.1.1 Basic Theory

The basic idea of KMeans is as follows: start from K randomly chosen cluster centers and assign each sample point to its nearest center. Then recompute the centroid of each cluster as the mean of its points, which gives the new cluster centers. This is iterated until the centers move less than a given threshold.

The K-Means algorithm consists of three main steps:

(1) Choose initial cluster centers for the points to be clustered;

(2) Compute the distance from every point to each cluster center and assign each point to its nearest cluster;

(3) Compute the coordinate mean of all points in each cluster and take this mean as the new cluster center;

Steps (2) and (3) are repeated until the cluster centers no longer move significantly or the maximum number of iterations is reached, as the sketch below shows.
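To make the three steps concrete, here is a minimal, self-contained sketch in plain Scala. This is not the MLlib implementation; the object name NaiveKMeans and the choice of the first k points as initial centers are illustrative simplifications:

    // A naive K-Means, following steps (1)-(3) above.
    object NaiveKMeans {
      type Point = Array[Double]

      def sqDist(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def cluster(points: Seq[Point], k: Int, maxIterations: Int = 20,
                  epsilon: Double = 1e-4): Array[Point] = {
        // Step (1): pick initial centers (first k points here; random in practice).
        var centers = points.take(k).toArray
        var moved = true
        var iteration = 0
        while (moved && iteration < maxIterations) {
          // Step (2): assign each point to its nearest center.
          val clusters = points.groupBy(p => centers.indices.minBy(j => sqDist(p, centers(j))))
          // Step (3): recompute each center as the mean of its assigned points.
          val newCenters = centers.indices.map { j =>
            clusters.get(j)
              .map(ps => ps.transpose.map(_.sum / ps.size).toArray)
              .getOrElse(centers(j))
          }.toArray
          // Stop when no center moved by more than epsilon (squared comparison).
          moved = centers.zip(newCenters).exists { case (c, n) => sqDist(c, n) > epsilon * epsilon }
          centers = newCenters
          iteration += 1
        }
        centers
      }
    }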

1.1.2 Process Illustration

The following steps illustrate K-means clustering of n sample points, with k = 2:

(a) the initial, unclustered point set;

(b) two points are chosen at random as cluster centers;

(c) the distance from each point to both centers is computed, and each point is assigned to its nearest cluster;

(d) the coordinate mean of all points in each cluster is computed and taken as the new cluster center;

(e) step (c) is repeated: distances to the centers are recomputed and points are reassigned to their nearest cluster;

(f) step (d) is repeated: the coordinate means are recomputed and taken as the new cluster centers.

See also the following article:

http://blog.sina.com.cn/s/blog_62186b46010145ne.html

1.2 Spark MLlib KMeans Source Code Analysis

class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double,
    private var seed: Long) extends Serializable with Logging {

// KMeans class parameters:

k: number of clusters (default 2)
maxIterations: maximum number of iterations (default 20)
runs: number of parallel runs; the best result is kept (default 1)
initializationMode: algorithm used to pick the initial centers (default "k-means||")
initializationSteps: number of steps in the k-means|| initialization (default 5)
epsilon: distance threshold below which centers are considered converged (default 1e-4)
seed: random seed

  /**

   * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1,

   * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}.

   */

  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong())

// Parameter setters

/** Set the number of clusters to create (k). Default: 2. */

  def setK(k: Int): this.type = {

    this.k = k

    this

  }

(The setter code for the remaining parameters is omitted; each follows the same pattern and returns this.type, so calls can be chained, as the sketch below shows.)
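A minimal usage sketch of the chained setters (assuming an RDD[Vector] named data is already defined and cached; the parameter values are arbitrary):

    import org.apache.spark.mllib.clustering.KMeans

    // Configure and run KMeans via the chained setters.
    val model = new KMeans()
      .setK(3)                                        // number of clusters
      .setMaxIterations(50)                           // cap on Lloyd iterations
      .setInitializationMode(KMeans.K_MEANS_PARALLEL) // i.e. "k-means||"
      .setEpsilon(1e-4)                               // convergence threshold
      .setSeed(42L)                                   // reproducible initialization
      .run(data)                                      // data: RDD[Vector]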

// The run method, KMeans' main entry point

  /**

   * Train a K-means model on the given set of points; `data` should be cached for high

   * performance, because this is an iterative algorithm.

   */

  def run(data: RDD[Vector]): KMeansModel = {

 

    if (data.getStorageLevel == StorageLevel.NONE) {

      logWarning("The input data is not directly cached, which may hurt performance if its"

        + " parent RDDs are also uncached.")

    }

 

// Compute squared norms and cache them.

// Compute the L2 norm of each row and cache it. Data transformation:
// data[Vector] => data[(Vector, norm)], where norm is the vector's L2 norm.

    val norms = data.map(Vectors.norm(_, 2.0))

    norms.persist()

    val zippedData = data.zip(norms).map { case (v, norm) =>

      new VectorWithNorm(v, norm)

    }

    val model = runAlgorithm(zippedData)

    norms.unpersist()

 

    // Warn at the end of the run as well, for increased visibility.

    if (data.getStorageLevel == StorageLevel.NONE) {

      logWarning("The input data was not directly cached, which may hurt performance if its"

        + " parent RDDs are also uncached.")

    }

    model

  }
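The norm precomputation in run can be reproduced with the public API. A small sketch of the data-to-(vector, norm) transformation (VectorWithNorm itself is package-private in MLlib, so a plain tuple stands in for it here):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Pair each vector with its precomputed L2 norm, as run() does internally.
    val vectors = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0, 0.0),
      Vectors.dense(9.0, 9.0, 9.0)))
    val norms = vectors.map(v => Vectors.norm(v, 2.0))
    norms.persist()
    val withNorms = vectors.zip(norms)  // RDD[(Vector, Double)]
    withNorms.collect().foreach { case (v, n) => println(s"$v has L2 norm $n") }

The zip is safe because norms is derived from vectors by a map, so both RDDs have identical partitioning.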

// The runAlgorithm method: the actual KMeans implementation.

  /**

   * Implementation of K-Means algorithm.

   */

  private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {

 

    val sc = data.sparkContext

 

    val initStartTime = System.nanoTime()

 

    val centers = if (initializationMode == KMeans.RANDOM) {

      initRandom(data)

    } else {

      initKMeansParallel(data)

    }

 

    val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9

    logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +

      " seconds.")

 

    val active = Array.fill(runs)(true)

    val costs = Array.fill(runs)(0.0)

 

    var activeRuns = new ArrayBuffer[Int] ++ (0 until runs)

    var iteration = 0

 

    val iterationStartTime = System.nanoTime()

// KMeans iterates as follows: determine which center each sample belongs to; accumulate each center's sample sum and count; recompute each center from its samples; and compare the new center with the old one to decide whether that run has converged. runs is the number of parallel runs.

    // Execute iterations of Lloyd's algorithm until all runs have converged

    while (iteration < maxIterations && !activeRuns.isEmpty) {

      type WeightedPoint = (Vector, Long)

      def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {

        axpy(1.0, x._1, y._1)

        (y._1, x._2 + y._2)

      }

 

      val activeCenters = activeRuns.map(r => centers(r)).toArray

      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))

 

      val bcActiveCenters = sc.broadcast(activeCenters)

 

      // Find the sum and count of points mapping to each center

// Assign each sample to a center and accumulate per-center sums:
// runs is the number of parallel runs and k the number of centers; sums holds the
// accumulated sample vectors per center and counts the per-center sample counts;
// contribs has the shape ((run i, center j), (sum of center j's samples, count of center j's samples));
// the findClosest method returns the closest of all cluster centers for a point.

      val totalContribs = data.mapPartitions { points =>

        val thisActiveCenters = bcActiveCenters.value

        val runs = thisActiveCenters.length

        val k = thisActiveCenters(0).length

        val dims = thisActiveCenters(0)(0).vector.size

 

        val sums = Array.fill(runs, k)(Vectors.zeros(dims))

        val counts = Array.fill(runs, k)(0L)

 

        points.foreach { point =>

          (0 until runs).foreach { i =>

           val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)

           costAccums(i) += cost

           val sum = sums(i)(bestCenter)

           axpy(1.0, point.vector, sum)

           counts(i)(bestCenter) += 1

          }

        }

 

        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {

          ((i, j), (sums(i)(j), counts(i)(j)))

        }

        contribs.iterator

      }.reduceByKey(mergeContribs).collectAsMap()

// Update each center: newCenter = sum / count, then check whether the squared distance
// between newCenter and the previous centers(run)(j) is > epsilon * epsilon;
// if no center moved by more than that, the run is marked converged.

      // Update the cluster centers and costs for each active run

      for ((run, i) <- activeRuns.zipWithIndex) {

        var changed = false

        var j = 0

        while (j < k) {

          val (sum, count) = totalContribs((i, j))

          if (count != 0) {

           scal(1.0 / count, sum)

           val newCenter = new VectorWithNorm(sum)

           if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) {

             changed = true

           }

           centers(run)(j) = newCenter

          }

          j += 1

        }

        if (!changed) {

          active(run) = false

          logInfo("Run " + run +" finished in " + (iteration +1) + " iterations")

        }

        costs(run) = costAccums(i).value

      }

 

      activeRuns = activeRuns.filter(active(_))

      iteration += 1

    }

 

    val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9

    logInfo(s"Iterations took " + "%.3f".format(iterationTimeInSeconds) + " seconds.")

 

    if (iteration == maxIterations) {

      logInfo(s"KMeans reached the max number of iterations: $maxIterations.")

    } else {

      logInfo(s"KMeans converged in $iteration iterations.")

    }

 

    val (minCost, bestRun) = costs.zipWithIndex.min

 

    logInfo(s"The cost for the best run is $minCost.")

 

    new KMeansModel(centers(bestRun).map(_.vector))

  }
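The heart of the loop above is the ((run, center), (sum, count)) aggregation. The same pattern can be seen in isolation with the public Spark API; a toy sketch of one Lloyd iteration (single run, two fixed centers, all names illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Key each point by its closest center, reduce to (sum, count) per
    // center (like mergeContribs), then divide to get the new centers.
    val pts = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.2, 0.2),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 9.2)))
    val oldCenters = Array(Vectors.dense(1.0, 1.0), Vectors.dense(8.0, 8.0))

    val newCenters = pts
      .map { p =>
        val best = oldCenters.indices.minBy(j => Vectors.sqdist(p, oldCenters(j)))
        (best, (p.toArray, 1L))
      }
      .reduceByKey { case ((s1, c1), (s2, c2)) =>
        (s1.zip(s2).map { case (a, b) => a + b }, c1 + c2)
      }
      .mapValues { case (sum, count) => Vectors.dense(sum.map(_ / count)) }
      .collectAsMap()
    // newCenters: Map(0 -> [0.1,0.1], 1 -> [9.1,9.1])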

// The findClosest method: finds the closest of all cluster centers for a given point

/**

   * Returns the index of the closest center to the given point, as well as the squared distance.

   */

  private[mllib] def findClosest(

      centers: TraversableOnce[VectorWithNorm],

      point: VectorWithNorm): (Int, Double) = {

    var bestDistance = Double.PositiveInfinity

    var bestIndex = 0

    var i = 0

    centers.foreach { center =>

      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary

      // distance computation.

      var lowerBoundOfSqDist = center.norm - point.norm

      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist

      if (lowerBoundOfSqDist < bestDistance) {

        val distance: Double = fastSquaredDistance(center, point)

        if (distance < bestDistance) {

          bestDistance = distance

          bestIndex = i

        }

      }

      i += 1

    }

    (bestIndex, bestDistance)

  }

In findClosest, note the two lines:

var lowerBoundOfSqDist = center.norm - point.norm

lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist

If the center is (a1, b1) and the point to classify is (a2, b2), then lowerBoundOfSqDist is:

lowerBoundOfSqDist = (sqrt(a1^2 + b1^2) - sqrt(a2^2 + b2^2))^2

while the true squared Euclidean distance (the Euclidean distance formula with the final square root dropped) is:

\|center - point\|^2 = (a1 - a2)^2 + (b1 - b2)^2

(When searching for the closest center there is no need to take the square root: comparing the squared values is sufficient, and that is exactly what MLlib does.)

It is easy to show that the first expression is always less than or equal to the second; this is the reverse triangle inequality, |\|a\| - \|b\|| <= \|a - b\|. So when comparing distances, the cheap lowerBoundOfSqDist is computed first; if it is already no smaller than the best distance found so far (bestDistance), the true Euclidean distance cannot be smaller than bestDistance either, and the expensive distance computation can be skipped, saving a great deal of work.

Only when lowerBoundOfSqDist is smaller than bestDistance is the actual distance computed, by calling fastSquaredDistance. That method delegates to the fastSquaredDistance method in MLUtils.scala, which computes the true (squared) Euclidean distance. The code is as follows:
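A quick numeric check of the bound, with arbitrarily chosen values:

    // Check |‖a‖ - ‖b‖|^2 <= ‖a - b‖^2 for center (1, 1) and point (4, 5).
    val centerNorm = math.sqrt(1.0 * 1.0 + 1.0 * 1.0)   // √2  ≈ 1.414
    val pointNorm  = math.sqrt(4.0 * 4.0 + 5.0 * 5.0)   // √41 ≈ 6.403
    var lowerBoundOfSqDist = centerNorm - pointNorm
    lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist          // ≈ 24.89
    val sqDist = (1.0 - 4.0) * (1.0 - 4.0) + (1.0 - 5.0) * (1.0 - 5.0)   // = 25.0
    assert(lowerBoundOfSqDist <= sqDist)

Here the lower bound (about 24.89) sits just below the true squared distance (25.0), so with a bestDistance of, say, 20.0 the full distance computation would be skipped.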

/**

   * Returns the squared Euclidean distance between two vectors. The following formula will be used

   * if it does not introduce too much numerical error:

   * <pre>

   *   \|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2 a^T b.

   * </pre>

   * When both vector norms are given, this is faster than computing the squared distance directly,

   * especially when one of the vectors is a sparse vector.

   *

   * @param v1 the first vector

   * @param norm1 the norm of the first vector, non-negative

   * @param v2 the second vector

   * @param norm2 the norm of the second vector, non-negative

   * @param precision desired relative precision for the squared distance

   * @return squared distance between v1 and v2 within the specified precision

   */

  private[mllib] def fastSquaredDistance(

      v1: Vector,

      norm1: Double,

      v2: Vector,

      norm2: Double,

      precision: Double = 1e-6): Double = {

    val n = v1.size

    require(v2.size == n)

    require(norm1 >= 0.0 && norm2 >= 0.0)

    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2

    val normDiff = norm1 - norm2

    var sqDist = 0.0

    /*

     * The relative error is

     * <pre>

     * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),

     * </pre>

     * which is bounded by

     * <pre>

     * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).

     * </pre>

     * The bound doesn't need the inner product, so we can use it as a sufficient condition to

     * check quickly whether the inner product approach is accurate.

     */

    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)

    if (precisionBound1 < precision) {

      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)

    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {

      val dotValue = dot(v1, v2)

      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)

      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /

        (sqDist + EPSILON)

      if (precisionBound2 > precision) {

        sqDist = Vectors.sqdist(v1, v2)

      }

    } else {

      sqDist = Vectors.sqdist(v1, v2)

    }

    sqDist

  }

The fastSquaredDistance method first computes a precision bound, val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON). If the precision requirement is met, the squared Euclidean distance is computed as sqDist = sumSquaredNorm - 2.0 * dot(v1, v2): here sumSquaredNorm is \|v1\|^2 + \|v2\|^2 and 2.0 * dot(v1, v2) is 2 v1^T v2, i.e. the expansion \|v1 - v2\|^2 = \|v1\|^2 + \|v2\|^2 - 2 v1^T v2. This is where precomputing the norms earlier pays off. If the precision requirement is not met, the original distance formula is used instead, by calling Vectors.sqdist(v1, v2).
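The identity behind the fast path is easy to verify numerically. A small sketch with arbitrary vectors:

    import org.apache.spark.mllib.linalg.Vectors

    // Verify \|v1 - v2\|^2 = \|v1\|^2 + \|v2\|^2 - 2 v1^T v2.
    val v1 = Vectors.dense(1.0, 2.0, 3.0)
    val v2 = Vectors.dense(4.0, 5.0, 6.0)

    val norm1 = Vectors.norm(v1, 2.0)
    val norm2 = Vectors.norm(v2, 2.0)
    val dotValue = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum

    val fast   = norm1 * norm1 + norm2 * norm2 - 2.0 * dotValue
    val direct = Vectors.sqdist(v1, v2)
    assert(math.abs(fast - direct) < 1e-9)  // both equal 27.0 here

The dot product is computed by hand above because MLlib's BLAS helper is package-private.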

1.3 MLlib KMeans Example

1. Data

The data format is: feature1 feature2 feature3

0.0 0.0 0.0

0.1 0.1 0.1

0.2 0.2 0.2

9.0 9.0 9.0

9.1 9.1 9.1

9.2 9.2 9.2

2. Code

  // 1. Read the sample data
  val data_path = "/home/jb-huangmeiling/kmeans_data.txt"
  val data = sc.textFile(data_path)
  val examples = data.map { line =>
    Vectors.dense(line.split(' ').map(_.toDouble))
  }.cache()
  val numExamples = examples.count()
  println(s"numExamples = $numExamples.")

  // 2. Train the model
  val k = 2
  val maxIterations = 20
  val runs = 2
  val initializationMode = "k-means||"
  val model = KMeans.train(examples, k, maxIterations, runs, initializationMode)

  // 3. Compute the cost on the training set
  val cost = model.computeCost(examples)
  println(s"Total cost = $cost.")
