Implementing Machine Learning Algorithms from Scratch (11): KMeans

Contents

1. Introduction to KMeans

2. The KMeans Model

2.1 KMeans

2.2 Bisecting KMeans

2.3 KMeans++

3. Summary and Analysis


1. Introduction to KMeans

KMeans is a simple clustering method that uses the distance from each sample to the cluster centers to decide cluster membership, where K, the number of clusters, is specified by the user. Initially, K points are chosen at random as the cluster centers (centroids), and the centroids are then updated repeatedly until the clustering is optimal. Because each centroid is computed as the mean of the samples in its cluster, the method is also called K-means.

2. The KMeans Model

2.1 KMeans

The KMeans algorithm is fairly simple. Denote the K cluster centers by \mu_{1},\mu_{2},\ldots,\mu_{K} and the number of samples in each cluster by N_{1},N_{2},\ldots,N_{K} . KMeans uses the sum of squared errors as its objective function:

J\left(\mu_{1},\mu_{2},\ldots,\mu_{K}\right)=\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{N_{j}}\left\|x_{i}-\mu_{j}\right\|^{2}

Taking the partial derivative of the loss with respect to \mu_{j} gives

\frac{\partial J}{\partial\mu_{j}}=-\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)

Setting the derivative to zero and solving yields

\mu_{j}=\frac{1}{N_{j}}\sum_{i=1}^{N_{j}}x_{i}

That is, each centroid is the mean of the samples in its cluster.

The KMeans code is as follows:

    def kmeans(self, train_data, k):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                      # (cluster index, distance) per sample
        centers = self.createCenter(train_data, k)                 # k initial centroids
        centers, distances = self.adjustCluster(centers, distances, train_data, k)
        return centers, distances
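The snippet relies on two helpers, createCenter() and calculateDistance(), that are defined elsewhere in the repository. A minimal sketch of what they might look like, assuming Euclidean distance and random initialization inside the bounding box of the data, with createCenter() taking the number of centroids as a parameter so that biKmeans() below can request a 2-way split (the repository versions may differ):

    def calculateDistance(self, data, center):
        # Euclidean distance from every row of data to a single center
        return np.sqrt(np.sum(np.power(data - center, 2), axis=1))

    def createCenter(self, train_data, k):
        # draw k random points inside the bounding box of the data
        feature_dim = train_data.shape[1]
        centers = np.zeros([k, feature_dim])
        for j in range(feature_dim):
            min_j = np.min(train_data[:, j])
            range_j = np.max(train_data[:, j]) - min_j
            centers[:, j] = min_j + range_j * np.random.rand(k)
        return centers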

The adjustCluster() function performs the refinement step once the initial centroids have been chosen; its job is to minimize the loss function J . Its code is:

    def adjustCluster(self, centers, distances, train_data, k):
        sample_num = len(train_data)
        flag = True  # if True, the cluster centers still need updating
        while flag:
            flag = False
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # assign each sample to its nearest cluster center
            old_label = distances[:, 0].copy()
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)
            if np.any(old_label != distances[:, 0]):               # at least one assignment changed
                flag = True
                # update each cluster center as the mean of its cluster
                for j in range(k):
                    current_cluster = train_data[distances[:, 0] == j]  # samples assigned to the j-th center
                    if len(current_cluster) != 0:
                        centers[j, :] = np.mean(current_cluster, axis=0)
        return centers, distances
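A minimal usage sketch, assuming the methods above and the helpers sketched earlier are collected into a wrapper class (here hypothetically named KMeans, storing k in its constructor):

    import numpy as np

    np.random.seed(0)
    # three well-separated 2-D blobs, 50 points each
    data = np.vstack([np.random.randn(50, 2) + offset
                      for offset in ([0, 0], [5, 5], [0, 5])])

    model = KMeans(k=3)                   # hypothetical wrapper class
    centers, distances = model.kmeans(data, 3)
    print(centers)                        # one centroid per row
    print(distances[:5])                  # (cluster index, distance) per sample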

2.2 Bisecting KMeans

Because KMeans may converge to a local minimum, bisecting KMeans is introduced to address this problem. The idea is to first treat all samples as one large cluster and split it in two, then repeatedly pick one of the existing clusters and split it, until the number of clusters reaches the specified K . How do we choose which cluster to split? The sum of squared errors (SSE) is used as the criterion. Suppose there are currently n clusters, denoted

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

The cluster to split is chosen as follows: divide a cluster c_{i} in C into two parts c_{i1},c_{i2} (using ordinary KMeans); the SSE after this split is

SSE_{i}=SSE\left(c_{i1},c_{i2}\right)+SSE\left(C-c_{i}\right)

The cluster that is best to split is then

index=\arg\min_{i}SSE_{i}

This is repeated until the number of centroids equals the specified K .
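A quick worked example of the criterion, with hypothetical numbers: suppose n=2 with SSE\left(c_{1}\right)=10 and SSE\left(c_{2}\right)=4 . If splitting c_{1} reduces its contribution to SSE\left(c_{11},c_{12}\right)=3 , then SSE_{1}=3+4=7 ; if splitting c_{2} gives SSE\left(c_{21},c_{22}\right)=1.5 , then SSE_{2}=10+1.5=11.5 . Since SSE_{1}<SSE_{2} , it is c_{1} that gets split.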

The code is as follows:

    def biKmeans(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance) per sample
        initial_center = np.mean(train_data, axis=0)                           # initial centroid, shape (feature_dim,)
        centers = [initial_center]                                             # list of cluster centers

        # distances under the single initial center; column 1 stores the raw
        # (unsquared) distance, consistent with kmeans() and adjustCluster()
        distances[:, 1] = self.calculateDistance(train_data, initial_center)

        # generate cluster centers
        while len(centers) < self.k:
            min_SSE = np.inf
            best_index = None
            best_centers = None
            best_distances = None

            # find the best cluster to split
            for j in range(len(centers)):
                centerj_data = train_data[distances[:, 0] == j]                # samples assigned to the j-th center
                if len(centerj_data) == 0:                                     # skip empty clusters
                    continue
                split_centers, split_distances = self.kmeans(centerj_data, 2)
                split_SSE = np.sum(np.power(split_distances[:, 1], 2))         # SSE of the j-th cluster after splitting
                other_distances = distances[distances[:, 0] != j]              # samples not in the j-th cluster
                other_SSE = np.sum(np.power(other_distances[:, 1], 2))         # SSE of all the other clusters

                # save the best split result
                if (split_SSE + other_SSE) < min_SSE:
                    best_index = j
                    best_centers = split_centers
                    best_distances = split_distances
                    min_SSE = split_SSE + other_SSE

            # relabel the split data: sub-cluster 1 becomes a brand-new cluster,
            # sub-cluster 0 keeps the index of the cluster that was split
            best_distances[best_distances[:, 0] == 1, 0] = len(centers)
            best_distances[best_distances[:, 0] == 0, 0] = best_index

            centers[best_index] = best_centers[0, :]
            centers.append(best_centers[1, :])
            distances[distances[:, 0] == best_index, :] = best_distances
        centers = np.array(centers)   # transform from list to array
        return centers, distances

2.3 KMeans++

Since the choice of initial centroids has a large influence on KMeans, the KMeans++ algorithm is introduced. Its idea is as follows: suppose there are currently n cluster centers,

C=\left\{c_{1},c_{2},\ldots,c_{n}\right\},\quad n<K

When selecting the (n+1)-th cluster center, points that are farther away from the current n centers have a higher probability of being chosen. This also matches intuition: cluster centers should of course be as far away from each other as possible. First, compute the shortest distance from each sample to the existing cluster centers,

D\left(x_{i}\right)=\min_{c_{j}\in C}\left\|x_{i}-c_{j}\right\|

then compute the probability of each sample being chosen as the next cluster center,

p_{i}=\frac{D\left(x_{i}\right)^{2}}{\sum_{x\in X}D\left(x\right)^{2}}

and finally use roulette wheel selection to pick the next cluster center. After all K centroids have been chosen, adjustCluster() is run to refine them. The KMeans++ code is as follows:

    def kmeansplusplus(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                                  # (cluster index, distance) per sample

        # randomly select a sample as the initial cluster center
        # (np.random.randint's upper bound is exclusive)
        initial_center = train_data[np.random.randint(0, sample_num)]
        centers = [initial_center]

        while len(centers) < self.k:
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # find the shortest distance from each sample to the existing centers
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)

            # roulette wheel selection: probability proportional to squared distance
            prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
            index = self.rouletteWheelSelection(prob, sample_num)
            new_center = train_data[index, :]
            centers.append(new_center)

        # refine the chosen centroids
        centers = np.array(centers)   # transform from list to array
        centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
        return centers, distances
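The rouletteWheelSelection() helper is not shown in the excerpt. A minimal sketch of what it might do, assuming it returns one index drawn with probability proportional to prob:

    def rouletteWheelSelection(self, prob, sample_num):
        # pick the first index whose cumulative probability
        # exceeds a uniform random draw in [0, 1)
        r = np.random.uniform(0, 1)
        cumulative = np.cumsum(prob)
        for i in range(sample_num):
            if r <= cumulative[i]:
                return i
        return sample_num - 1  # guard against floating-point round-off

The explicit loop makes the roulette wheel visible; np.random.choice(sample_num, p=prob) would do the same job in one call.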

3. Summary and Analysis

After the clustering algorithm converges, some methods can be used to further adjust the centroids. There are also methods for choosing the value of K , such as the silhouette coefficient (Silhouette Coefficient), sketched below. Finally, let us look at the results of the three clustering methods.
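A minimal sketch of silhouette-based selection of K , using scikit-learn's silhouette_score (an external dependency not used elsewhere in this article; cluster_fn is a hypothetical handle to any of the implementations above):

    from sklearn.metrics import silhouette_score

    def choose_k(train_data, k_range, cluster_fn):
        # cluster_fn(train_data, k) -> (centers, distances), as in kmeans() above;
        # k_range should start at 2, since the silhouette needs at least two clusters
        best_k, best_score = None, -1.0
        for k in k_range:
            _, distances = cluster_fn(train_data, k)
            labels = distances[:, 0].astype(int)
            score = silhouette_score(train_data, labels)   # in [-1, 1], higher is better
            if score > best_score:
                best_k, best_score = k, score
        return best_k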

[Result figures for KMeans, bisecting KMeans, and KMeans++ omitted.]
It can be seen that KMeans++ produces the best clustering results, while the running times of the three methods are roughly the same.

 

Code and dataset for this article: https://github.com/Ryuk17/MachineLearning

 

