Implementing Machine Learning Algorithms from Scratch (11): KMeans

Contents

1. Introduction to KMeans

2. The KMeans Model

2.1 KMeans

2.2 Bisecting KMeans

2.3 KMeans++

3. Summary and Analysis


1. Introduction to KMeans

KMeans is a simple clustering method that uses the distance from each sample to the cluster centers to decide cluster membership, where K, the number of clusters, is specified by the user. Initially, K points are chosen at random as the cluster centers (centroids), and the centroids are then updated repeatedly until the clustering is optimal. Because each centroid is computed as the mean of the samples in its cluster, the method is also called K-means.

2. The KMeans Model

2.1 KMeans

The KMeans algorithm is fairly simple. Denote the K cluster centers by \mu_{1},\mu_{2},\ldots,\mu_{K} and the number of samples in each cluster by N_{1},N_{2},\ldots,N_{K} . KMeans uses the sum of squared errors as its objective function:

J\left(\mu_{1},\mu_{2},\ldots,\mu_{K}\right)=\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{N_{j}}\left\|x_{i}-\mu_{j}\right\|^{2}

Taking the partial derivative of the loss with respect to \mu_{j} gives

\frac{\partial J}{\partial\mu_{j}}=-\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)

Setting the derivative to zero and solving yields

\mu_{j}=\frac{1}{N_{j}}\sum_{i=1}^{N_{j}}x_{i}

That is, each centroid is the mean of the samples in its cluster.

The KMeans code is as follows:

    def kmeans(self, train_data, k):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                      # (cluster index, distance) per sample
        centers = self.createCenter(train_data, k)                 # k initial centroids
        centers, distances = self.adjustCluster(centers, distances, train_data, k)
        return centers, distances
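The snippet relies on two helpers, createCenter() and calculateDistance(), that are defined elsewhere in the repository. A minimal sketch of what they might look like, assuming Euclidean distance and random initialization inside the bounding box of the data, with createCenter() taking the number of centroids as a parameter so that biKmeans() below can request a 2-way split (the repository versions may differ):

    def calculateDistance(self, data, center):
        # Euclidean distance from every row of data to a single center
        return np.sqrt(np.sum(np.power(data - center, 2), axis=1))

    def createCenter(self, train_data, k):
        # draw k random points inside the bounding box of the data
        feature_dim = train_data.shape[1]
        centers = np.zeros([k, feature_dim])
        for j in range(feature_dim):
            min_j = np.min(train_data[:, j])
            range_j = np.max(train_data[:, j]) - min_j
            centers[:, j] = min_j + range_j * np.random.rand(k)
        return centers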

The adjustCluster() function performs the refinement step once the initial centroids have been chosen; its job is to minimize the loss function J . Its code is:

    def adjustCluster(self, centers, distances, train_data, k):
        sample_num = len(train_data)
        flag = True  # if True, the cluster centers still need updating
        while flag:
            flag = False
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # assign each sample to its nearest cluster center
            old_label = distances[:, 0].copy()
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)
            if np.any(old_label != distances[:, 0]):               # at least one assignment changed
                flag = True
                # update each cluster center as the mean of its cluster
                for j in range(k):
                    current_cluster = train_data[distances[:, 0] == j]  # samples assigned to the j-th center
                    if len(current_cluster) != 0:
                        centers[j, :] = np.mean(current_cluster, axis=0)
        return centers, distances
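A minimal usage sketch, assuming the methods above and the helpers sketched earlier are collected into a wrapper class (here hypothetically named KMeans, storing k in its constructor):

    import numpy as np

    np.random.seed(0)
    # three well-separated 2-D blobs, 50 points each
    data = np.vstack([np.random.randn(50, 2) + offset
                      for offset in ([0, 0], [5, 5], [0, 5])])

    model = KMeans(k=3)                   # hypothetical wrapper class
    centers, distances = model.kmeans(data, 3)
    print(centers)                        # one centroid per row
    print(distances[:5])                  # (cluster index, distance) per sample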

2.2 Bisecting KMeans

Because KMeans may converge to a local minimum, bisecting KMeans is introduced to address this problem. The idea is to first treat all samples as one large cluster and split it in two, then repeatedly pick one of the existing clusters and split it, until the number of clusters reaches the specified K . How do we choose which cluster to split? The sum of squared errors (SSE) is used as the criterion. Suppose there are currently n clusters, denoted

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

The cluster to split is chosen as follows: divide a cluster c_{i} in C into two parts c_{i1},c_{i2} (using ordinary KMeans); the SSE after this split is

SSE_{i}=SSE\left(c_{i1},c_{i2}\right)+SSE\left(C-c_{i}\right)

The cluster that is best to split is then

index=\arg\min_{i}SSE_{i}

This is repeated until the number of centroids equals the specified K .
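A quick worked example of the criterion, with hypothetical numbers: suppose n=2 with SSE\left(c_{1}\right)=10 and SSE\left(c_{2}\right)=4 . If splitting c_{1} reduces its contribution to SSE\left(c_{11},c_{12}\right)=3 , then SSE_{1}=3+4=7 ; if splitting c_{2} gives SSE\left(c_{21},c_{22}\right)=1.5 , then SSE_{2}=10+1.5=11.5 . Since SSE_{1}<SSE_{2} , it is c_{1} that gets split.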

The code is as follows:

    def biKmeans(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance) per sample
        initial_center = np.mean(train_data, axis=0)                           # initial centroid, shape (feature_dim,)
        centers = [initial_center]                                             # list of cluster centers

        # distances under the single initial center; column 1 stores the raw
        # (unsquared) distance, consistent with kmeans() and adjustCluster()
        distances[:, 1] = self.calculateDistance(train_data, initial_center)

        # generate cluster centers
        while len(centers) < self.k:
            min_SSE = np.inf
            best_index = None
            best_centers = None
            best_distances = None

            # find the best cluster to split
            for j in range(len(centers)):
                centerj_data = train_data[distances[:, 0] == j]                # samples assigned to the j-th center
                if len(centerj_data) == 0:                                     # skip empty clusters
                    continue
                split_centers, split_distances = self.kmeans(centerj_data, 2)
                split_SSE = np.sum(np.power(split_distances[:, 1], 2))         # SSE of the j-th cluster after splitting
                other_distances = distances[distances[:, 0] != j]              # samples not in the j-th cluster
                other_SSE = np.sum(np.power(other_distances[:, 1], 2))         # SSE of all the other clusters

                # save the best split result
                if (split_SSE + other_SSE) < min_SSE:
                    best_index = j
                    best_centers = split_centers
                    best_distances = split_distances
                    min_SSE = split_SSE + other_SSE

            # relabel the split data: sub-cluster 1 becomes a brand-new cluster,
            # sub-cluster 0 keeps the index of the cluster that was split
            best_distances[best_distances[:, 0] == 1, 0] = len(centers)
            best_distances[best_distances[:, 0] == 0, 0] = best_index

            centers[best_index] = best_centers[0, :]
            centers.append(best_centers[1, :])
            distances[distances[:, 0] == best_index, :] = best_distances
        centers = np.array(centers)   # transform from list to array
        return centers, distances

2.3 KMeans++

Since the choice of initial centroids has a large influence on KMeans, the KMeans++ algorithm is introduced. Its idea is as follows: suppose there are currently n cluster centers,

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

When selecting the (n+1)-th cluster center, points that are farther away from the current n centers have a higher probability of being chosen. This also matches intuition: cluster centers should of course be as far away from each other as possible. First, compute the shortest distance from each sample to the existing cluster centers,

D\left(x_{i}\right)=\min_{c_{j}\in C}\left\|x_{i}-c_{j}\right\|

then compute the probability of each sample being chosen as the next cluster center,

p_{i}=\frac{D\left(x_{i}\right)^{2}}{\sum_{x\in X}D\left(x\right)^{2}}

and finally use roulette wheel selection to pick the next cluster center. After all K centroids have been chosen, adjustCluster() is run to refine them. The KMeans++ code is as follows:

    def kmeansplusplus(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance) per sample

        # randomly select a sample as the initial cluster center
        # (np.random.randint's upper bound is exclusive)
        initial_center = train_data[np.random.randint(0, sample_num)]
        centers = [initial_center]

        while len(centers) < self.k:
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # find the shortest distance from each sample to the existing centers
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)

            # roulette wheel selection: probability proportional to squared distance
            prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
            index = self.rouletteWheelSelection(prob, sample_num)
            new_center = train_data[index, :]
            centers.append(new_center)

        # refine the chosen centroids
        centers = np.array(centers)   # transform from list to array
        centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
        return centers, distances
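The rouletteWheelSelection() helper is not shown in the excerpt. A minimal sketch of what it might do, assuming it returns one index drawn with probability proportional to prob:

    def rouletteWheelSelection(self, prob, sample_num):
        # pick the first index whose cumulative probability
        # exceeds a uniform random draw in [0, 1)
        r = np.random.uniform(0, 1)
        cumulative = np.cumsum(prob)
        for i in range(sample_num):
            if r <= cumulative[i]:
                return i
        return sample_num - 1  # guard against floating-point round-off

The explicit loop makes the roulette wheel visible; np.random.choice(sample_num, p=prob) would do the same job in one call.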

3. Summary and Analysis

After the clustering algorithm converges, some methods can be used to further adjust the centroids. There are also methods for choosing the value of K , such as the silhouette coefficient (Silhouette Coefficient), sketched below. Finally, let us look at the results of the three clustering methods.
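A minimal sketch of silhouette-based selection of K , using scikit-learn's silhouette_score (an external dependency not used elsewhere in this article; cluster_fn is a hypothetical handle to any of the implementations above):

    from sklearn.metrics import silhouette_score

    def choose_k(train_data, k_range, cluster_fn):
        # cluster_fn(train_data, k) -> (centers, distances), as in kmeans() above;
        # k_range should start at 2, since the silhouette needs at least two clusters
        best_k, best_score = None, -1.0
        for k in k_range:
            _, distances = cluster_fn(train_data, k)
            labels = distances[:, 0].astype(int)
            score = silhouette_score(train_data, labels)   # in [-1, 1], higher is better
            if score > best_score:
                best_k, best_score = k, score
        return best_k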

[Result figures for KMeans, bisecting KMeans, and KMeans++ omitted.]
It can be seen that KMeans++ produces the best clustering results, while the running times of the three methods are roughly the same.

 

Code and dataset for this article: https://github.com/Ryuk17/MachineLearning

 

