Machine Learning: KMeans Clustering (Elbow Method and Silhouette Coefficient)

Section I: Brief Introduction to KMeans Clustering

The K-Means algorithm belongs to the category of prototype-based clustering. Prototype-based clustering means that each cluster is represented by a prototype, which can be either the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features. While K-Means is very good at identifying clusters with a spherical shape, one drawback of this clustering algorithm is that the number of clusters, k, must be specified in advance. An inappropriate choice of k can result in poor clustering performance, so two measures of model quality, the elbow method and the silhouette coefficient, are useful for evaluating a clustering and determining the optimal number of clusters.
The K-Means algorithm can be summarized by the following four steps:

  • Step 1: Randomly pick k centroids from the sample points as initial cluster centers
  • Step 2: Assign each sample to the nearest centroid according to its distance
  • Step 3: Move each centroid to the center of the samples that were assigned to it
  • Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change, or until a user-defined tolerance or maximum number of iterations is reached
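
The steps listed above can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the names nearest_centroid, kmeans_lloyd and X_demo are hypothetical, and stopping once the total centroid shift falls below tol is just one simple choice of stopping rule.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for each sample, the index of its closest centroid."""
    # Squared Euclidean distance from every sample to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_lloyd(X, k, max_iter=300, tol=1e-4, seed=0):
    """Illustrative k-means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each sample to its nearest centroid
        labels = nearest_centroid(X, centroids)
        # Move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids barely move, otherwise repeat
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment so that labels match the returned centroids
    return nearest_centroid(X, centroids), centroids

# Toy data (assumed for illustration): three groups of 50 points each
rng = np.random.default_rng(42)
demo_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_demo = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in demo_centers])
labels_demo, centroids_demo = kmeans_lloyd(X_demo, k=3)
print(centroids_demo)
```

Because the starting centroids are random, a single run like this can end in a poor local optimum, which is why the scikit-learn code in this post uses n_init=10 and keeps the run with the lowest SSE.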

FROM
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

Part 1: Basic KMeans Clustering Algorithm
Code

from sklearn import datasets

import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'weight': 'light'}
plt.rc("font", **font)

#Section 1: Load Blobs from datasets and visualize it
X,y=datasets.make_blobs(n_samples=150,
                        n_features=2,
                        centers=3,
                        cluster_std=0.5,
                        shuffle=True,
                        random_state=0)
plt.scatter(X[:,0],X[:,1],c='white',marker='o',edgecolors='black',s=50)
plt.grid()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.savefig('./fig1.png')
plt.show()

#Section 2: Use KMeans algorithm to visualize data points and centroids
from sklearn.cluster import KMeans

#Set n_init=10 to run k-means clustering algorithm 10 times independently
#with different centroids to choose the final model with the lowest SSE
km=KMeans(n_clusters=3,
          init='random',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)

y_km=km.fit_predict(X)
plt.scatter(X[y_km==0,0],
            X[y_km==0,1],
            s=50,
            c='lightgreen',
            marker='s',
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km==1,0],
            X[y_km==1,1],
            s=50,
            c='orange',
            marker='o',
            edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km==2,0],
            X[y_km==2,1],
            s=50,
            c='lightblue',
            marker='v',
            edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],
            s=250,
            marker='*',
            c='red',
            edgecolor='black',
            label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(loc='best')
plt.grid()
plt.savefig('./fig2.png')
plt.show()

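Results

Distribution of the raw data points:
[Figure: fig1.png, scatter plot of the unlabeled samples]
Cluster assignments and centroids after KMeans clustering:
[Figure: fig2.png, scatter plot of the three clusters with their centroids]
Clearly, the KMeans algorithm separates the blobs into three clusters well.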

Part 2: Evaluating KMeans Performance

Metric 1: Elbow

One of the main challenges in unsupervised learning is that we do not know the definitive answer. Thus, to quantify the quality of a clustering, we need intrinsic metrics, such as the within-cluster SSE (distortion), to compare the performance of different k-means clusterings.
Intuitively, if k increases, the distortion will decrease, because the samples will be closer to the centroids they are assigned to. The idea behind the elbow method is to identify the value of k beyond which the distortion stops decreasing rapidly (the "elbow" of the curve), which becomes clearer when the distortion is plotted for different values of k.

FROM
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

Code

#Section 3: Elbow metric to evaluate model performance
print("Distortion: %.2f" % km.inertia_)

distortions=[]
for i in range(1,11):
    km=KMeans(n_clusters=i,
              init='k-means++',
              n_init=10,
              max_iter=300,
              random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1,11),distortions,marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Distortion")
plt.grid()
plt.savefig("./fig3.png")
plt.show()

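Note that "k-means++" is used here to avoid the centroid initialization problem of classic k-means, where poorly placed random starting centroids can lead to bad clusterings or slow convergence. It selects the initial centroids one at a time: the first is drawn uniformly at random from the samples, and each subsequent centroid is drawn from a probability distribution that favors samples far from the centroids already chosen (probability proportional to the squared distance to the nearest chosen centroid).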
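
The seeding idea described above can be sketched as below. This is a minimal sketch of the basic k-means++ selection rule, assuming the data contain at least k distinct points; the names kmeans_pp_init and X_pp are hypothetical, and scikit-learn's own initializer may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Basic k-means++ seeding (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: one sample drawn uniformly at random
    centroids = [X[rng.integers(n)]]
    # Squared distance from each sample to its nearest chosen centroid
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Samples far from every chosen centroid are more likely to be drawn
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        # Update the nearest-centroid distances with the new centroid
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centroids)

# Toy data (assumed for illustration)
X_pp = np.random.default_rng(1).standard_normal((100, 2))
init_centroids = kmeans_pp_init(X_pp, k=3)
print(init_centroids)
```

An already chosen sample has distance 0 to itself, so its selection probability is 0 and the same point is not picked twice.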

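Results
[Figure: fig3.png, distortion versus number of clusters]
The distortion is the sum, over all points, of the squared distance between each point and its cluster centroid after clustering. In other words, it is the within-cluster SSE summed over all clusters. The value printed for the 3-cluster model fitted in Section 2 is: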

Distortion: 72.48
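
As a cross-check of that definition, the distortion can be recomputed by hand from the fitted labels and centroids and compared with inertia_. This is a small sketch that regenerates the same blobs as in Section 1; the variable names X_chk, km_chk and sse_manual are introduced here only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in Section 1
X_chk, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                      cluster_std=0.5, shuffle=True, random_state=0)

km_chk = KMeans(n_clusters=3, init='random', n_init=10,
                max_iter=300, tol=1e-4, random_state=0).fit(X_chk)

# Centroid assigned to each sample, then the sum of squared distances
assigned_centroids = km_chk.cluster_centers_[km_chk.labels_]
sse_manual = ((X_chk - assigned_centroids) ** 2).sum()

print("Manual SSE:  %.2f" % sse_manual)
print("km.inertia_: %.2f" % km_chk.inertia_)
```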

Metric 2: Silhouette

Another intrinsic metric to evaluate the quality of a clustering is silhouette analysis, which can also be applied to clustering algorithms other than k-means. Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are. The silhouette coefficient of a single sample in our dataset can be calculated in the following three steps:

  • Step 1: Calculate the cluster cohesion as the average distance between a sample and all other points in the same cluster
  • Step 2: Calculate the cluster separation from the next closest cluster as the average distance between the sample and all samples in the nearest cluster
  • Step 3: Calculate the silhouette as the difference between cluster separation and cohesion, divided by the greater of the two

Silhouette score = (separation score - cohesion score) / max(separation score, cohesion score)
The silhouette score is bounded in the range -1 to 1. It is 0 when the separation and cohesion scores are equal, and it approaches the ideal value of 1 when the separation score is much larger than the cohesion score, because the separation score quantifies how dissimilar a sample is to other clusters, while the cohesion score tells how similar it is to the other samples in its own cluster.
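
The three steps and the formula above can be worked through by hand on a tiny dataset and compared with scikit-learn's silhouette_samples. This is a minimal sketch: the helper silhouette_single and the toy arrays X_toy and labels_toy are hypothetical names used only for illustration, and a sample that is alone in its cluster is given a score of 0 by convention.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_single(X, labels, i):
    """Silhouette coefficient of sample i (illustrative helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distances to sample i
    own = labels == labels[i]
    n_own = own.sum()
    if n_own == 1:
        return 0.0  # convention for a cluster with a single sample
    # Step 1: cohesion = mean distance to the other samples in the same cluster
    # (the distance of the sample to itself is 0, so divide by n_own - 1)
    cohesion = dists[own].sum() / (n_own - 1)
    # Step 2: separation = mean distance to the samples of the nearest other cluster
    separation = min(dists[labels == c].mean()
                     for c in np.unique(labels) if c != labels[i])
    # Step 3: difference divided by the greater of the two
    return (separation - cohesion) / max(separation, cohesion)

# Toy data (assumed for illustration): two small, well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

manual = np.array([silhouette_single(X_toy, labels_toy, i)
                   for i in range(len(X_toy))])
reference = silhouette_samples(X_toy, labels_toy, metric='euclidean')
print(manual)
print(reference)
```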

Comparison 1: Number of Clusters Set to 3

Code

#Section 4: Silhouette metric to evaluate model performance
#Section 4.1: Set cluster=3
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_score,silhouette_samples

km=KMeans(n_clusters=3,
          init='k-means++',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)
y_km=km.fit_predict(X)

cluster_labels=np.unique(y_km)
n_clusters=cluster_labels.shape[0]
silhouette_score_cluster_3=silhouette_score(X,km.labels_)
print("Silhouette Score When Cluster Number Set to 3: %.3f" % silhouette_score_cluster_3)
silhouette_vals=silhouette_samples(X,y_km,metric='euclidean')
y_ax_lower,y_ax_upper=0,0
yticks=[]
for i,c in enumerate(cluster_labels):
    c_silhouette_vals=silhouette_vals[y_km==c]
    c_silhouette_vals.sort()
    y_ax_upper+=len(c_silhouette_vals)
    color=cm.jet(float(i)/n_clusters)
    plt.barh(range(y_ax_lower,y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower+y_ax_upper)/2.0)
    y_ax_lower+=len(c_silhouette_vals)

silhouette_avg=np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks,cluster_labels+1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients")
plt.savefig('./fig4.png')
plt.show()

Results
[Figure: fig4.png, silhouette plot for 3 clusters]
The silhouette score is as follows:

Silhouette Score When Cluster Number Set to 3: 0.714

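From the result above, the average silhouette coefficient is about 0.71, which means that the differences between clusters are large while the differences within each cluster are small. This indicates that setting the number of clusters to 3 is appropriate, and it agrees with the elbow plot, where the elbow appears at k=3.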

Comparison 2: Number of Clusters Set to 2

Code

#Section 4.2: Set cluster=2
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_score,silhouette_samples

km=KMeans(n_clusters=2,
          init='k-means++',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)
y_km=km.fit_predict(X)

cluster_labels=np.unique(y_km)
n_clusters=cluster_labels.shape[0]
silhouette_score_cluster_2=silhouette_score(X,km.labels_)
print("Silhouette Score When Cluster Number Set to 2: %.3f" % silhouette_score_cluster_2)
silhouette_vals=silhouette_samples(X,y_km,metric='euclidean')
y_ax_lower,y_ax_upper=0,0
yticks=[]
for i,c in enumerate(cluster_labels):
    c_silhouette_vals=silhouette_vals[y_km==c]
    c_silhouette_vals.sort()
    y_ax_upper+=len(c_silhouette_vals)
    color=cm.jet(float(i)/n_clusters)
    plt.barh(range(y_ax_lower,y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower+y_ax_upper)/2.0)
    y_ax_lower+=len(c_silhouette_vals)

silhouette_avg=np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks,cluster_labels+1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients")
plt.savefig('./fig5.png')
plt.show()

Results
[Figure: fig5.png, silhouette plot for 2 clusters]
The silhouette score is as follows:

Silhouette Score When Cluster Number Set to 2: 0.585

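Compared with the 3-cluster setting, the silhouette coefficient with 2 clusters is lower, about 0.585. That is, the gap between between-cluster and within-cluster differences is smaller than in the 3-cluster result, which indicates that setting the number of clusters to 2 gives a poorer clustering.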
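
The two comparisons above can be generalized by scanning several values of k and reporting the mean silhouette score for each. This is a short sketch under the same data and KMeans settings as the rest of the post; the range 2 to 6 and the names X_cmp, scores and best_k are arbitrary choices made here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in Section 1
X_cmp, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                      cluster_std=0.5, shuffle=True, random_state=0)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, init='k-means++', n_init=10,
                      max_iter=300, tol=1e-4, random_state=0).fit_predict(X_cmp)
    scores[k] = silhouette_score(X_cmp, labels_k)
    print("k=%d  mean silhouette=%.3f" % (k, scores[k]))

# Candidate k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print("Best k by mean silhouette:", best_k)
```

The mean score is only a summary; the per-sample silhouette plots shown earlier remain useful for spotting clusters of very different thickness or samples with low coefficients.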

FROM
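Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

References
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.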

Appendix

from sklearn import datasets
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'weight': 'light'}
plt.rc("font", **font)

#Section 1: Load Blobs from datasets and visualize it
X,y=datasets.make_blobs(n_samples=150,
                        n_features=2,
                        centers=3,
                        cluster_std=0.5,
                        shuffle=True,
                        random_state=0)
plt.scatter(X[:,0],X[:,1],c='white',marker='o',edgecolors='black',s=50)
plt.grid()
plt.savefig('./fig1.png')
plt.show()

#Section 2: Use KMeans algorithm to visualize data points and centroids
from sklearn.cluster import KMeans

#Set n_init=10 to run k-means clustering algorithm 10 times independently
#with different centroids to choose the final model with the lowest SSE
km=KMeans(n_clusters=3,
          init='random',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)

y_km=km.fit_predict(X)
plt.scatter(X[y_km==0,0],
            X[y_km==0,1],
            s=50,
            c='lightgreen',
            marker='s',
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km==1,0],
            X[y_km==1,1],
            s=50,
            c='orange',
            marker='o',
            edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km==2,0],
            X[y_km==2,1],
            s=50,
            c='lightblue',
            marker='v',
            edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],
            s=250,
            marker='*',
            c='red',
            edgecolor='black',
            label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(loc='best')
plt.grid()
plt.savefig('./fig2.png')
plt.show()

#Section 3: Elbow metric to evaluate model performance
print("Distortion: %.2f" % km.inertia_)

distortions=[]
for i in range(1,11):
    km=KMeans(n_clusters=i,
              init='k-means++',
              n_init=10,
              max_iter=300,
              random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1,11),distortions,marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Distortion")
plt.grid()
plt.savefig("./fig3.png")
plt.show()

#Section 4: Silhouette metric to evaluate model performance
#Section 4.1: Set cluster=3
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_score,silhouette_samples

km=KMeans(n_clusters=3,
          init='k-means++',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)
y_km=km.fit_predict(X)

cluster_labels=np.unique(y_km)
n_clusters=cluster_labels.shape[0]
silhouette_score_cluster_3=silhouette_score(X,km.labels_)
print("Silhouette Score When Cluster Number Set to 3: %.3f" % silhouette_score_cluster_3)
silhouette_vals=silhouette_samples(X,y_km,metric='euclidean')
y_ax_lower,y_ax_upper=0,0
yticks=[]
for i,c in enumerate(cluster_labels):
    c_silhouette_vals=silhouette_vals[y_km==c]
    c_silhouette_vals.sort()
    y_ax_upper+=len(c_silhouette_vals)
    color=cm.jet(float(i)/n_clusters)
    plt.barh(range(y_ax_lower,y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower+y_ax_upper)/2.0)
    y_ax_lower+=len(c_silhouette_vals)

silhouette_avg=np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks,cluster_labels+1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients")
plt.savefig('./fig4.png')
plt.show()

#Section 4.2: Set cluster=2
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_score,silhouette_samples

km=KMeans(n_clusters=2,
          init='k-means++',
          n_init=10,
          max_iter=300,
          tol=1e-4,
          random_state=0)
y_km=km.fit_predict(X)

cluster_labels=np.unique(y_km)
n_clusters=cluster_labels.shape[0]
silhouette_score_cluster_2=silhouette_score(X,km.labels_)
print("Silhouette Score When Cluster Number Set to 2: %.3f" % silhouette_score_cluster_2)
silhouette_vals=silhouette_samples(X,y_km,metric='euclidean')
y_ax_lower,y_ax_upper=0,0
yticks=[]
for i,c in enumerate(cluster_labels):
    c_silhouette_vals=silhouette_vals[y_km==c]
    c_silhouette_vals.sort()
    y_ax_upper+=len(c_silhouette_vals)
    color=cm.jet(float(i)/n_clusters)
    plt.barh(range(y_ax_lower,y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower+y_ax_upper)/2.0)
    y_ax_lower+=len(c_silhouette_vals)

silhouette_avg=np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color='red',
            linestyle='--')
plt.yticks(yticks,cluster_labels+1)
plt.ylabel("Cluster")
plt.xlabel("Silhouette Coefficients")
plt.savefig('./fig5.png')
plt.show()