Implementing Unsupervised Clustering with an AutoEncoder

Original article: How to do Unsupervised Clustering with Keras

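Thanks to its strong capacity for non-linear representation, deep learning is widely used to learn a mapping from inputs to outputs given a labeled dataset. Image classification is a typical case, and it requires a dataset with human-annotated labels. However, whether the task is annotating X-ray images or tagging the topics of news reports, labeling depends on human effort, and for large-scale datasets in particular the work is heavy, slow, and laborious.

Cluster analysis, or clustering, is an unsupervised machine learning technique. It does not need a labeled dataset; it groups the data purely by the similarity between samples.

Clustering is a machine learning technique worth paying attention to, for the reasons below.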

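1. Applications of clustering

[1] - Recommender systems

By learning from users' purchase histories, a clustering model can group users by similarity, which helps find users with similar interests or products that a user may be interested in.

[2] - Sequence clustering in biology

Sequence clustering algorithms group biological sequences by similarity; for example, proteins are clustered by their amino acid content.

[3] - Image or video clustering

Images or videos are grouped by clustering them according to their similarity.

[4] - Medical databases

In a medical dataset, each patient may have a different set of specific tests (for example glucose or cholesterol). Clustering the patients first helps identify which features are valuable, which reduces feature sparsity and improves accuracy on classification tasks such as survival prediction for cancer patients.

[5] - General use

Clustering yields a more compact summary of the data, which can be used for classification, pattern discovery, and hypothesis generation and testing.

Clustering is therefore a very valuable tool for data scientists.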

2. What makes a good clustering

A good clustering method should produce high-quality clusters with the following characteristics:

[1] - High intra-class similarity: cohesive within clusters

[2] - Low inter-class similarity: distinctive between clusters
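
The two properties can be checked numerically. The following is a minimal NumPy sketch that is not part of the original article; the helper name `cohesion_and_separation` and the toy data are assumptions made for illustration. It reports the mean distance of each point to its own centroid (lower means more cohesive) and the mean distance between centroids (higher means more distinctive).

```python
import numpy as np


def cohesion_and_separation(x, labels):
    """Return (mean point-to-own-centroid distance, mean centroid-to-centroid distance)."""
    ids = np.unique(labels)
    centroids = np.stack([x[labels == k].mean(axis=0) for k in ids])
    # Cohesion: how far each point lies from the centroid of its own cluster.
    own = centroids[np.searchsorted(ids, labels)]
    intra = np.linalg.norm(x - own, axis=1).mean()
    # Separation: average pairwise distance between distinct centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    inter = dist[np.triu_indices(len(ids), k=1)].mean()
    return intra, inter


# Toy data: two tight, well-separated groups in 2-D (illustrative values only).
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])   # groups nearby points together
bad = np.array([0, 1, 0, 1])    # mixes the two groups

good_intra, good_inter = cohesion_and_separation(x, good)   # 0.5 and 10.0
bad_intra, bad_inter = cohesion_and_separation(x, bad)      # 5.0 and 1.0
```

The sensible grouping has a small intra-cluster distance and a large inter-cluster distance; the mixed-up grouping shows the opposite pattern.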

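3. Setting a baseline with K-Means

The traditional K-Means algorithm is fast and has been applied to a wide range of problems. However, its distance metric is limited to the original data space, so it tends to perform poorly when the input dimensionality is high, as with image data.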
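
To make "distance in the original data space" concrete, here is a rough NumPy sketch of the plain Lloyd iteration behind K-Means. It is an illustration under assumptions, not the scikit-learn implementation used later: the function name `kmeans_lloyd`, the random initialization, and the synthetic blobs are all hypothetical.

```python
import numpy as np


def kmeans_lloyd(x, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd iterations: assign to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from distinct data points chosen at random.
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance in the raw input space, shape (n_samples, n_clusters).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) > 0:          # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers


# Two synthetic blobs in 2-D (illustrative data, not MNIST).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
labels, centers = kmeans_lloyd(np.vstack([blob_a, blob_b]), n_clusters=2)
```

Every decision here depends on Euclidean distances between raw input vectors. For flattened images those vectors have 784 dimensions, which is where the limitation described above comes from. The broadcasting in this sketch is also memory hungry, so it is only meant for small toy data.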

Taking the MNIST handwritten digit dataset as an example, train a K-Means model to cluster the images into 10 groups:

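import numpy as np
from sklearn.cluster import KMeans
from keras.datasets import mnist
# Assumed: `metrics` is the helper module that ships with the original notebook;
# its acc() is the clustering accuracy explained in section 6.2.
import metrics

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Merge the train and test splits; labels are used only for evaluation.
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))
# Flatten each 28x28 image to a 784-vector and scale pixels to [0, 1].
x = x.reshape((x.shape[0], -1))
x = np.divide(x, 255.)
# 10 clusters
n_clusters = len(np.unique(y))
# The n_jobs argument was removed from KMeans in scikit-learn 1.0, so it is omitted here.
kmeans = KMeans(n_clusters=n_clusters, n_init=20)
# Train K-Means.
y_pred_kmeans = kmeans.fit_predict(x)
# Evaluate the K-Means clustering accuracy.
metrics.acc(y, y_pred_kmeans)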

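K-Means reaches a clustering accuracy of 53.2%. This figure is compared with the deep embedding clustering model later on.

The deep embedding clustering model consists of:

[1] - An autoencoder, pre-trained to learn an initial compressed representation of the unlabeled dataset.

[2] - A clustering layer stacked on the encoder, which assigns the encoder output to a cluster. The weights of the clustering layer are initialized with the cluster centers obtained from K-Means.

[3] - Training of the clustering model, which refines the clustering layer and the encoder jointly.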

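4. Pre-training the autoencoder

An autoencoder is a data compression algorithm made of two parts: an encoder and a decoder.

The encoder compresses the input into lower-dimensional features. For example, a 28x28 MNIST image has 784 pixels in total, and the encoder can compress it into an array of 10 floating-point numbers. These numbers are called the features of the image.

The decoder takes the compressed features as input and reconstructs an image that is as close to the original as possible.

An autoencoder is an unsupervised learning algorithm: training needs only the images themselves, not labels.

The autoencoder built here is a fully connected, symmetric model. It is symmetric in the sense that the image is compressed and decompressed in exactly opposite ways.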
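
The shape of such a symmetric model can be sketched in plain NumPy. This is only a forward pass with random, untrained weights, not the Keras model trained below. The 784 inputs and the 10-dimensional code come from the text; the hidden widths 500, 500, 2000, the ReLU activation, and the mean squared error are assumptions for illustration.

```python
import numpy as np

# Layer widths: 784 and 10 come from the text; 500, 500, 2000 are assumed.
dims = [784, 500, 500, 2000, 10]
rng = np.random.default_rng(0)


def dense_stack(sizes):
    """Random (untrained) weights and zero biases for consecutive dense layers."""
    return [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(h, layers):
    """Apply each dense layer; ReLU on all but the last layer."""
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h


encoder_layers = dense_stack(dims)          # 784 -> 500 -> 500 -> 2000 -> 10
decoder_layers = dense_stack(dims[::-1])    # 10 -> 2000 -> 500 -> 500 -> 784

images = rng.random((4, 784))               # stand-in for 4 flattened 28x28 images
features = forward(images, encoder_layers)  # compressed representation, shape (4, 10)
reconstruction = forward(features, decoder_layers)  # shape (4, 784)
# Reconstruction error; mean squared error is an assumed, commonly used choice.
mse = np.mean((reconstruction - images) ** 2)
```

The decoder reuses the encoder's widths in reverse order, which is the symmetry described above. Training would adjust the weights so that the reconstruction error shrinks.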

Train the autoencoder for 300 epochs and save the model weights:

autoencoder.fit(x, x, batch_size=256, epochs=300)
autoencoder.save_weights('./results/ae_weights.h5')

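5. The clustering model

After training, the encoder part of the autoencoder compresses each image into 10 floating-point numbers. Because the input dimensionality is now reduced to 10, K-Means can be used to generate the cluster centers: 10 centers in a 10-dimensional feature space.

In addition, a custom clustering layer is built to convert the input features into cluster label probabilities.

The cluster label probabilities are computed with Student's t-distribution. As in the t-SNE algorithm, the t-distribution measures the similarity between an embedded point and a centroid.

The custom clustering layer works much like K-Means: its weights represent the cluster centers, and they are initialized from a trained K-Means model.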
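
As a rough NumPy counterpart of that layer, the soft assignment can be written as q_ij = (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha + 1) / 2), normalized over the clusters j. The function name `soft_assign` and the toy points below are assumptions for illustration.

```python
import numpy as np


def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment of embedded points z to cluster centers."""
    # Squared Euclidean distance between every point and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize so that each row is a probability distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)


centers = np.array([[0.0, 0.0], [4.0, 0.0]])         # two illustrative centers in 2-D
z = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])   # on center 0, halfway, on center 1
q = soft_assign(z, centers)
# q[0] = [17/18, 1/18], q[1] = [0.5, 0.5], q[2] = [1/18, 17/18]
```

A point sitting on a center gets most of the probability mass for that cluster, while a point halfway between two centers is split evenly.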

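Creating a custom layer in Keras mainly means implementing three methods:

[1] - build(input_shape) - defines the layer's weights; here, 10 clusters in a 10-D feature space, i.e. 10x10 weight variables.

[2] - call(x) - defines the layer's logic, i.e. the mapping from features to cluster labels.

[3] - compute_output_shape(input_shape) - specifies how input shapes are transformed into output shapes.

For example: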

from keras import backend as K
from keras.layers import Layer, InputSpec


class ClusteringLayer(Layer):
    """
    Clustering layer converts input sample (feature) to soft label.
    # Example
    model.add(ClusteringLayer(n_clusters=10))

    # Arguments
        n_clusters: number of clusters.
        weights: list of Numpy array with shape `(n_clusters, n_features)` which represents the initial cluster centers.
        alpha: degrees of freedom parameter in Student's t-distribution. Default to 1.0.
    # Input shape
        2D tensor with shape: `(n_samples, n_features)`.
    # Output shape
        2D tensor with shape: `(n_samples, n_clusters)`.
    """
    
    def __init__(self, n_clusters, weights=None, alpha=1.0, **kwargs):
        if 'input_shape' not in kwargs and 'input_dim' in kwargs:
            kwargs['input_shape'] = (kwargs.pop('input_dim'),)
        super(ClusteringLayer, self).__init__(**kwargs)
        self.n_clusters = n_clusters
        self.alpha = alpha
        self.initial_weights = weights
        self.input_spec = InputSpec(ndim=2)
    
    def build(self, input_shape):
        assert len(input_shape) == 2
        input_dim = input_shape[1]
        self.input_spec = InputSpec(dtype=K.floatx(), shape=(None, input_dim))
        # One trainable row per cluster center; pass shape by keyword.
        self.clusters = self.add_weight(shape=(self.n_clusters, input_dim),
                                        initializer='glorot_uniform',
                                        name='clusters')
        if self.initial_weights is not None:
            self.set_weights(self.initial_weights)
            del self.initial_weights
        self.built = True
    
    def call(self, inputs, **kwargs):
        """ Student's t-distribution, the same as used in the t-SNE algorithm.
                 q_ij = 1/(1+dist(x_i, µ_j)^2), then normalize it.
                 q_ij can be interpreted as the probability of assigning sample i to cluster j.
                 (i.e., a soft assignment)
        Arguments:
            inputs: the variable containing data, shape=(n_samples, n_features)
        Return:
            q: student's t-distribution, or soft labels for each sample. shape=(n_samples, n_clusters)
        """
        q = 1.0 / (1.0 + (K.sum(K.square(K.expand_dims(inputs, axis=1) - self.clusters), axis=2) / self.alpha))
        q **= (self.alpha + 1.0) / 2.0
        q = K.transpose(K.transpose(q) / K.sum(q, axis=1)) # Make sure each sample's 10 values add up to 1.
        return q
    
    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) == 2
        return input_shape[0], self.n_clusters
    
    def get_config(self):
        config = {'n_clusters': self.n_clusters}
        base_config = super(ClusteringLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

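Next, the clustering layer is stacked on top of the pre-trained encoder to form the clustering model.

For the clustering layer, K-Means is run on the feature vectors of all images, and the resulting cluster centers initialize the layer's weights.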

from keras.models import Model

clustering_layer = ClusteringLayer(n_clusters, name='clustering')(encoder.output)
model = Model(inputs=encoder.input, outputs=clustering_layer)
# Initialize cluster centers using k-means.
kmeans = KMeans(n_clusters=n_clusters, n_init=20)
y_pred = kmeans.fit_predict(encoder.predict(x))
model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])

6. Training the clustering model

6.1. Auxiliary target distribution and KL divergence loss

The next step is to improve the clustering and the feature representation at the same time. To do so, a centroid-based target probability distribution is defined, and the KL divergence between it and the model's clustering result is minimized.

The target distribution should have the following properties:

[1] - Strengthen predictions, i.e., improve cluster purity.

[2] - Put more emphasis on data points assigned with high confidence.

[3] - Prevent large clusters from distorting the hidden feature space.

The target distribution is computed by first raising q (the soft cluster assignments produced by the clustering layer) to the second power and then normalizing by the frequency per cluster.

def target_distribution(q):
    weight = q ** 2 / q.sum(0)
    return (weight.T / weight.sum(1)).T
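
Written out, the function computes p_ij = (q_ij^2 / f_j) / sum over j' of (q_ij'^2 / f_j'), where f_j = sum over i of q_ij is the soft frequency of cluster j. A small worked example, with made-up values for q, shows the sharpening effect:

```python
import numpy as np


def target_distribution(q):
    # Same computation as above: square q, divide by the soft cluster
    # frequency (column sums), then renormalize each row to sum to 1.
    weight = q ** 2 / q.sum(0)
    return (weight.T / weight.sum(1)).T


# Illustrative soft assignments for 2 samples and 2 clusters.
q = np.array([[0.7, 0.3],
              [0.4, 0.6]])
p = target_distribution(q)
# p is approximately [[0.817, 0.183],
#                     [0.267, 0.733]]
```

Each row of p leans further toward the cluster that q already favoured (0.7 becomes about 0.817, 0.6 becomes about 0.733), which is the "strengthen predictions" property listed above.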

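The auxiliary target distribution is needed in order to learn from the high-confidence assignments and so refine the clusters iteratively. After a fixed number of iterations the target distribution is updated, and the clustering model is trained to minimize the KL divergence loss between the target distribution and the clustering output. This training strategy can be seen as a form of self-training: as in self-training, we take an initial classifier and an unlabeled dataset, label the dataset with the classifier, and then train on its own high-confidence predictions.

The loss function, the KL divergence (Kullback-Leibler divergence), measures the difference between two distributions. Minimizing it brings the clustering output distribution as close as possible to the target distribution.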
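
For reference, the divergence between a target p and a clustering output q is KL(P || Q) = sum over i and j of p_ij * log(p_ij / q_ij). The NumPy sketch below averages over samples and clips values to avoid log(0); the function name `kl_divergence`, the clipping constant, and the sample values are assumptions for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-7):
    """Mean over samples of sum_j p_ij * log(p_ij / q_ij)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))


q = np.array([[0.7, 0.3], [0.4, 0.6]])   # clustering output (illustrative)
p = np.array([[0.9, 0.1], [0.2, 0.8]])   # sharper target (illustrative)
loss_before = kl_divergence(p, q)        # positive: the two distributions differ
loss_match = kl_divergence(p, p)         # zero: identical distributions
```

The loss is zero only when the clustering output matches the target exactly, so gradient steps on it pull q toward p.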

In the code below, the target distribution is updated every 140 training iterations:

import numpy as np
from keras.optimizers import SGD

model.compile(optimizer=SGD(0.01, 0.9), loss='kld')

batch_size = 256  # assumed: same batch size as the autoencoder pre-training
maxiter = 8000
update_interval = 140
index = 0
index_array = np.arange(x.shape[0])
for ite in range(int(maxiter)):
    if ite % update_interval == 0:
        q = model.predict(x, verbose=0)
        # update the auxiliary target distribution p
        p = target_distribution(q)
        # evaluate the clustering performance
        y_pred = q.argmax(1)
        if y is not None:
            acc = np.round(metrics.acc(y, y_pred), 5)

    # train on the next mini-batch; wrap around once the data is exhausted
    idx = index_array[index * batch_size: min((index+1) * batch_size, x.shape[0])]
    model.train_on_batch(x=x[idx], y=p[idx])
    index = index + 1 if (index + 1) * batch_size <= x.shape[0] else 0

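6.2. Evaluation metric

The metric reports a clustering accuracy of 96.2%, which is quite good given that the inputs are unlabeled images. It is worth looking at how this accuracy is derived.

The metric takes the cluster assignments from the unsupervised algorithm and the ground-truth labels, and finds the best matching between them.

The best mapping can be computed efficiently with the Hungarian algorithm, available as linear_sum_assignment in SciPy (older scikit-learn releases exposed it as linear_assignment).

import numpy as np
from scipy.optimize import linear_sum_assignment

y_true = y.astype(np.int64)
D = max(y_pred.max(), y_true.max()) + 1
w = np.zeros((D, D), dtype=np.int64)
# Confusion matrix: w[i, j] counts samples in cluster i whose true label is j.
for i in range(y_pred.size):
    w[y_pred[i], y_true[i]] += 1
# Negate w so that minimizing the cost maximizes the number of matched samples.
row_ind, col_ind = linear_sum_assignment(-w)
acc = w[row_ind, col_ind].sum() * 1.0 / y_pred.size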
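
To see what "best matching" means, the sketch below brute-forces every one-to-one mapping from cluster ids to labels on a tiny example. It is for illustration only: the function name `best_mapping_accuracy` and the toy labels are assumptions, and the Hungarian algorithm reaches the same optimum without enumerating all permutations.

```python
from itertools import permutations

import numpy as np


def best_mapping_accuracy(y_true, y_pred):
    """Try every one-to-one cluster -> label mapping and keep the most accurate one."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    # w[i, j] = number of samples placed in cluster i whose true label is j.
    w = np.zeros((n, n), dtype=np.int64)
    for c, t in zip(y_pred, y_true):
        w[c, t] += 1
    best_hits, best_perm = -1, None
    for perm in permutations(range(n)):       # perm[i] is the label given to cluster i
        hits = sum(w[i, perm[i]] for i in range(n))
        if hits > best_hits:
            best_hits, best_perm = hits, perm
    return best_hits / y_true.size, best_perm


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 2, 2, 0, 1])   # same grouping, different ids, one mistake
acc, mapping = best_mapping_accuracy(y_true, y_pred)
# acc is 5/6 and mapping is (2, 0, 1): cluster 0 -> label 2, 1 -> 0, 2 -> 1
```

Cluster ids are arbitrary, so comparing y_pred with y_true directly would score this clustering far too low; only after the best relabeling does the single real mistake show up.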

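The confusion matrix gives a more direct view.

With it, clusters can be matched quickly by hand; for example, cluster 1 corresponds to the true label 7, i.e. the handwritten digit 7.

The confusion matrix is produced as follows: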

import seaborn as sns
import sklearn.metrics
import matplotlib.pyplot as plt
sns.set(font_scale=3)
confusion_matrix = sklearn.metrics.confusion_matrix(y, y_pred)

plt.figure(figsize=(16, 14))
sns.heatmap(confusion_matrix, annot=True, fmt="d", annot_kws={"size": 20})
plt.title("Confusion matrix", fontsize=30)
plt.ylabel('True label', fontsize=25)
plt.xlabel('Clustering label', fontsize=25)
plt.show()

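7. Convolutional autoencoder

For image datasets, a convolutional autoencoder is worth trying instead of one built only from fully connected layers.

It is worth noting that, to reconstruct the image, the decoder can use either deconvolutional layers (Conv2DTranspose in Keras) or upsampling layers (UpSampling2D) to reduce artifacts.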
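
When designing such a model, it helps to check that the decoder brings the spatial size back to 28x28. The sketch below does only that size arithmetic, using the usual rules for 'same' and 'valid' padding. The three-layer layout, kernel sizes, and strides are hypothetical, since the article does not specify the architecture.

```python
import math


def conv_out(n, kernel, stride, padding):
    """Spatial size after a strided convolution ('same' or 'valid' padding)."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1


def conv_transpose_out(n, kernel, stride, padding):
    """Spatial size after a transposed convolution with default output padding."""
    if padding == "same":
        return n * stride
    return n * stride + max(kernel - stride, 0)


# Hypothetical encoder: (kernel, stride, padding) per layer, for illustration only.
encoder_spec = [(5, 2, "same"), (5, 2, "same"), (3, 2, "valid")]

size = 28
encoder_sizes = [size]
for kernel, stride, padding in encoder_spec:
    size = conv_out(size, kernel, stride, padding)
    encoder_sizes.append(size)          # ends up as [28, 14, 7, 3]

# The decoder mirrors the encoder with transposed convolutions.
decoder_sizes = [size]
for kernel, stride, padding in reversed(encoder_spec):
    size = conv_transpose_out(size, kernel, stride, padding)
    decoder_sizes.append(size)          # ends up as [3, 7, 14, 28]
```

An UpSampling2D layer with size 2 simply doubles the spatial size, so a decoder built from it needs its layer sizes chosen so that repeated doubling lands exactly on 28.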

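8. Summary and further reading

This article showed how to do unsupervised clustering with a Keras model. The pre-trained autoencoder plays an important role in dimensionality reduction and parameter initialization; a custom clustering layer is then trained against the target distribution to improve accuracy further.

Building Autoencoders in Keras - official Keras blog

Unsupervised Deep Embedding for Clustering Analysis

9. Full implementation

Keras-DEC.ipynb

Reposted from: https://www.aiuai.cn/aifarm777.html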
