1. Introduction to KMeans
KMeans is a simple clustering method that assigns each sample to a cluster based on its distance to the cluster centers, where the value $k$ is the user-specified number of clusters. Initially, $k$ points are chosen at random as cluster centers (centroids), and the centers are then updated iteratively until the result converges. Because each centroid is computed as the mean of the samples in its cluster, the method is also known as k-means.
2. The KMeans Model
2.1 KMeans
The KMeans algorithm is fairly simple. Denote the $K$ cluster centers as $\mu_1, \mu_2, \dots, \mu_K$ and the number of samples in the $j$-th cluster as $N_j$. KMeans uses the sum of squared errors as its objective function:

$$J(\mu_1, \mu_2, \dots, \mu_K) = \frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{N_j}(x_i - \mu_j)^2$$

Taking the partial derivative of the loss function with respect to $\mu_j$ gives

$$\frac{\partial J}{\partial \mu_j} = -\sum_{i=1}^{N_j}(x_i - \mu_j)$$

Setting the derivative to zero and solving yields

$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N_j}x_i$$

That is, the mean of each cluster's samples is taken as its centroid.
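The derivation above can be checked numerically. Here is a small sketch (using numpy and a made-up toy cluster) verifying that the cluster mean achieves an SSE no larger than nearby perturbed centers:

```python
import numpy as np

# Toy cluster: verify that the mean minimizes the sum of squared errors.
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
centroid = cluster.mean(axis=0)  # the closed-form solution above

def sse(points, center):
    # sum of squared errors of the points around a candidate center
    return np.sum((points - center) ** 2)

# SSE at the mean is no larger than at any shifted center.
for shift in [0.1, -0.5, 1.0]:
    assert sse(cluster, centroid) <= sse(cluster, centroid + shift)
```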
The KMeans code is as follows:
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (index, distance)
    centers = self.createCenter(train_data)
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
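The helpers createCenter() and calculateDistance() are not shown above. Below is a minimal standalone sketch of what they might look like, assuming random initialization inside the data's bounding box and Euclidean distance; the bodies here are assumptions, not the original implementations (the sketch also takes k explicitly instead of reading self.k):

```python
import numpy as np

def createCenter(train_data, k):
    """Pick k initial centroids uniformly at random within the data's bounds."""
    feature_dim = train_data.shape[1]
    low = train_data.min(axis=0)
    high = train_data.max(axis=0)
    return low + np.random.rand(k, feature_dim) * (high - low)

def calculateDistance(train_data, center):
    """Euclidean distance from every sample to a single center."""
    return np.sqrt(np.sum((train_data - center) ** 2, axis=1))
```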
The adjustCluster() function performs the refinement after the initial centroids have been chosen; its job is to minimize the loss function $J$. Its code is:
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True  # If True, keep updating the cluster centers
    while flag:
        flag = False
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest cluster center
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        if np.any(old_label != distances[:, 0]):  # any assignment changed
            flag = True
        # update each cluster center as the mean of its samples
        for j in range(k):
            current_cluster = train_data[distances[:, 0] == j]  # samples assigned to the j-th center
            if len(current_cluster) != 0:
                centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
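As a quick sanity check, the assign-then-update loop above can be exercised standalone on a toy dataset with two obvious groups. This sketch inlines the logic as free functions rather than class methods (the helper names are assumptions):

```python
import numpy as np

def calculate_distance(data, center):
    # Euclidean distance from every sample to one center
    return np.sqrt(np.sum((data - center) ** 2, axis=1))

def adjust_cluster(centers, data, k):
    sample_num = len(data)
    distances = np.zeros([sample_num, 2])  # (cluster index, distance)
    changed = True
    while changed:
        d = np.column_stack([calculate_distance(data, c) for c in centers])
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        changed = bool(np.any(old_label != distances[:, 0]))
        for j in range(k):  # recompute each centroid as its cluster mean
            cluster = data[distances[:, 0] == j]
            if len(cluster):
                centers[j] = cluster.mean(axis=0)
    return centers, distances

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
centers, distances = adjust_cluster(centers, data, 2)
labels = distances[:, 0]
# the two nearby pairs end up in different clusters
assert labels[0] == labels[1] and labels[2] == labels[3] and labels[0] != labels[2]
```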
2.2 Bisecting KMeans
Because KMeans can converge to a local minimum, bisecting KMeans is introduced to address this problem. Its idea is to first treat all samples as one big cluster and split it in two; then one of the resulting clusters is chosen and split again, and so on, until the number of clusters reaches the specified $k$. How is the cluster to be split chosen? The sum of squared errors (SSE) is used as the criterion. Suppose there are currently $m$ clusters, denoted

$$C = \{c_1, c_2, \dots, c_m\}$$

The selection procedure is as follows: split one cluster $c_i$ in $C$ into two parts $c_{i1}$ and $c_{i2}$ (using ordinary KMeans); the SSE after this split is

$$SSE_i = SSE(c_{i1}) + SSE(c_{i2}) + \sum_{j \neq i} SSE(c_j)$$

The cluster best suited for splitting is then

$$i^* = \arg\min_i SSE_i$$

This is repeated until the number of centroids equals the specified $k$.
The code is as follows:
def biKmeans(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (index, distance)
    initial_center = np.mean(train_data, axis=0)  # initial cluster center, shape (feature_dim,)
    centers = [initial_center]  # cluster center list
    # clustering with the initial cluster center
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    # generate cluster centers
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        # find the best split
        for j in range(len(centers)):
            centerj_data = train_data[distances[:, 0] == j]  # samples assigned to the j-th center
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            split_SSE = np.sum(np.power(split_distances[:, 1], 2))  # SSE of the split cluster
            other_distances = distances[distances[:, 0] != j]  # samples not assigned to the j-th center
            other_SSE = np.sum(np.power(other_distances[:, 1], 2))  # SSE of the remaining clusters
            # save the best split result
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # save the split result: remap local labels 0/1 to global cluster ids
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)  # transform from list to array
    return centers, distances
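The relabelling step near the end is easy to get wrong, so here is a tiny standalone sketch of what it does: the 2-way split produces local labels 0/1, which must be mapped back to the global cluster ids best_index and len(centers). The order of the two assignments matters, because len(centers) is always a brand-new id while best_index may itself be 1:

```python
import numpy as np

best_index, num_centers = 1, 3  # hypothetical: splitting cluster 1 while 3 clusters exist
split_labels = np.array([0.0, 1.0, 1.0, 0.0])  # local labels from the 2-way split
split_labels[split_labels == 1] = num_centers   # local 1 -> new global id 3 (must come first)
split_labels[split_labels == 0] = best_index    # local 0 -> old global id 1
assert list(split_labels) == [1.0, 3.0, 3.0, 1.0]
```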
2.3 KMeans++
Because the choice of initial centroids strongly affects the KMeans result, the KMeans++ algorithm is introduced. Its idea is as follows: suppose $n$ cluster centers have already been chosen,

$$\{\mu_1, \mu_2, \dots, \mu_n\}$$

Then, when selecting the $(n+1)$-th cluster center, points farther from the current $n$ centers have a higher probability of being chosen. This matches intuition: cluster centers should be as far apart from each other as possible. First compute each sample's shortest distance to the existing cluster centers,

$$D(x) = \min_{1 \le i \le n} \lVert x - \mu_i \rVert$$

then compute the probability that each sample is chosen as the next cluster center,

$$p(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

and select the next center by roulette wheel selection. Once all $k$ centroids have been chosen, adjustCluster() is run to refine them. The KMeans++ code is as follows:
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])  # (index, distance)
    # randomly select a sample as the initial cluster center
    initial_center = train_data[np.random.randint(0, sample_num)]
    centers = [initial_center]
    while len(centers) < self.k:
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # find the minimum distance between each sample and the cluster centers
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # roulette wheel selection: probability proportional to squared distance
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    # adjust the clusters with the chosen centers
    centers = np.array(centers)  # transform from list to array
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
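The rouletteWheelSelection() helper is not shown in the original. A plausible implementation (an assumption, not the original code) draws an index with probability proportional to prob by inverting the cumulative distribution:

```python
import numpy as np

def rouletteWheelSelection(prob, sample_num):
    """Return an index in [0, sample_num) drawn with probability prob[i]."""
    cumulative = np.cumsum(prob)
    r = np.random.rand()  # uniform in [0, 1)
    # side='right' so that zero-probability entries can never be chosen
    return int(np.searchsorted(cumulative, r, side='right'))
```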
3. Summary and Analysis
After the clustering algorithm converges, further methods can be applied to refine the centroids. There are also methods for choosing the value of $k$, such as the silhouette coefficient. Finally, let us look at the results of the three clustering methods.
KMeans++ produces the best clustering results, while the running times of the three methods are roughly the same.
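As a sketch of the silhouette coefficient mentioned above, here is a plain-numpy version suitable only for small datasets (an assumed helper, not part of the original code): for each sample, $a$ is the mean distance to its own cluster, $b$ the smallest mean distance to another cluster, and $s = (b - a)/\max(a, b)$.

```python
import numpy as np

def silhouette_score_simple(X, labels):
    """Mean silhouette over all samples: s(i) = (b - a) / max(a, b)."""
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        d = np.sqrt(np.sum((X - X[i]) ** 2, axis=1))  # distances from sample i
        same = labels == labels[i]
        if same.sum() <= 1:  # singleton cluster: silhouette defined as 0
            continue
        a = d[same & (np.arange(n) != i)].mean()  # mean intra-cluster distance
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two well-separated clusters should score close to 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
assert silhouette_score_simple(X, labels) > 0.9
```

In practice one computes this score for several candidate values of $k$ and picks the $k$ with the highest mean silhouette.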
Code and datasets for this article: https://github.com/Ryuk17/MachineLearning