EM Algorithm Implementation


Data Preparation

This post implements learning a Gaussian mixture model (GMM) with the EM algorithm. To keep things simple, the model is applied to clustering discrete points generated with sklearn.
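
The points come from sklearn's make_blobs, which samples isotropic Gaussian blobs; the same call appears in the full code later in the post:

from sklearn.datasets import make_blobs

# 400 2-D points drawn around 4 cluster centers
X, y = make_blobs(n_samples=100*4, centers=4, cluster_std=0.5, random_state=0)
print(X.shape)  # (400, 2)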


The EM Algorithm

I'll leave the derivation, proof, and convergence analysis of the EM algorithm for a later post.

My first impression of the EM algorithm's alternating estimation of the latent variables and the model parameters is that it resembles the joint optimization of landmarks and poses in SLAM.


Gaussian Mixture Model Parameter Estimation

The algorithm flow for the Gaussian mixture model is as follows:
(Figure: GMM-EM algorithm flow)
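
In case the figure does not render, here is the standard flow it depicts (cf. 《統計學習方法》), which matches the code below. Given observations $y_1, \dots, y_N$ and $K$ components with weights $\alpha_k$, means $\mu_k$, and covariances $\Sigma_k$, the E-step computes the responsibilities

$$\hat{\gamma}_{jk} = \frac{\alpha_k \, \phi(y_j \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \alpha_{k'} \, \phi(y_j \mid \mu_{k'}, \Sigma_{k'})}$$

and the M-step re-estimates the parameters

$$\mu_k = \frac{\sum_{j} \hat{\gamma}_{jk} \, y_j}{\sum_{j} \hat{\gamma}_{jk}}, \qquad \Sigma_k = \frac{\sum_{j} \hat{\gamma}_{jk} \, (y_j - \mu_k)(y_j - \mu_k)^\top}{\sum_{j} \hat{\gamma}_{jk}}, \qquad \alpha_k = \frac{\sum_{j} \hat{\gamma}_{jk}}{N},$$

repeating the two steps until convergence.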
The implementation code is as follows:

# @Author: phd
# @Date: 2019/11/11
# @Site: github.com/phdsky
# @Description: NULL

import time
import logging

import numpy as np
from scipy.stats import multivariate_normal

from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs  # samples_generator was removed in newer sklearn


class GMM(object):
    def __init__(self, K, X, init_method):
        self.K = K
        self.Y = X
        self.N, self.D = X.shape  # data shape

        self.init = init_method
        if self.init == 'random':
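            # Random init: means drawn uniformly from [0, 1)^D,
            # identity covariances, uniform mixing weights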
            self.mean = np.random.rand(self.K, self.D)
            self.cov = np.asarray([np.eye(self.D)] * K)
            self.alpha = np.asarray([1. / self.K] * self.K)

        elif self.init == 'kmeans':
            raise NotImplementedError("kmeans init is not implemented yet...")
        else:
            raise ValueError("WTF is this init type?")

    def calc_phi(self, Y, mean, cov):
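        # Multivariate Gaussian density phi(y | mean, cov), evaluated row-wise on Y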
        return multivariate_normal.pdf(x=Y, mean=mean, cov=cov)

    def E_step(self):
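        # E-step: gamma[n, k] is the responsibility of component k for sample n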
        gamma = np.zeros((self.N, self.K))

        # Weighted component density for every sample
        for k in range(self.K):
            gamma[:, k] = self.alpha[k] * self.calc_phi(self.Y, self.mean[k], self.cov[k])

        # Normalize each row so the responsibilities sum to one
        gamma /= np.sum(gamma, axis=1, keepdims=True)

        return gamma

    def M_step(self, gamma):
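        # M-step: re-estimate the mean, covariance and mixing weight of each component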
        # The covariance update uses the old mean value,
        # so update cov before mean
        for k in range(self.K):
            gamma_k = np.reshape(gamma[:, k], (self.N, 1))
            gamma_k_sum = np.sum(gamma_k)

            # cov
            y_mean = self.Y - self.mean[k]
            self.cov[k] = y_mean.T.dot(np.multiply(y_mean, gamma_k)) / gamma_k_sum

            # mean
            self.mean[k] = np.sum(np.multiply(gamma_k, self.Y), axis=0) / gamma_k_sum

            # alpha
            self.alpha[k] = gamma_k_sum / self.N

    def train(self, max_iteration):
        for i in range(max_iteration):
            print("Iteration: %d" % i)

            # Take E Step
            gamma = self.E_step()

            # Take M Step
            self.M_step(gamma=gamma)

    def predict(self):
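        # Hard-assign each sample to the component with the largest responsibility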
        gamma = self.E_step()

        colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']
        predictions = np.argmax(gamma, axis=1)
        for k in range(self.K):

            # plot mean point
            plt.scatter(self.mean[k][0], self.mean[k][1], c='black', edgecolors='none', marker='D')

            # plot points
            pred_ids = np.where(predictions == k)
            plt.scatter(self.Y[pred_ids[0], 0], self.Y[pred_ids[0], 1], c=colors[k], alpha=0.4, edgecolors='none', marker='s')

        plt.show()

if __name__ == "__main__":
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    clusters = 4
    X, y = make_blobs(n_samples=100*clusters, centers=clusters, cluster_std=0.5, random_state=0)
    # plt.scatter(X[:, 0], X[:, 1])
    # plt.show()

    # Available init method: random / kmeans
    gmm = GMM(K=clusters, X=X, init_method='random')

    gmm.train(max_iteration=200)

    gmm.predict()
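
The train loop above always runs the full max_iteration rounds. A common refinement is to stop once the observed-data log-likelihood plateaus; the sketch below (log_likelihood, train_until_converged, and tol are illustrative names, not part of the original code) shows one way this could be added to the GMM class:

    def log_likelihood(self):
        # log p(Y) = sum_n log sum_k alpha_k * phi(y_n | mean_k, cov_k)
        weighted = np.zeros((self.N, self.K))
        for k in range(self.K):
            weighted[:, k] = self.alpha[k] * self.calc_phi(self.Y, self.mean[k], self.cov[k])
        return np.sum(np.log(np.sum(weighted, axis=1)))

    def train_until_converged(self, max_iteration, tol=1e-6):
        prev_ll = -np.inf
        for i in range(max_iteration):
            self.M_step(gamma=self.E_step())
            ll = self.log_likelihood()
            if abs(ll - prev_ll) < tol:  # log-likelihood has plateaued
                break
            prev_ll = ll

EM guarantees the log-likelihood is non-decreasing across iterations, so this is a safe stopping rule.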
    

The clustering output is shown in the figure below:
(Figure: GMM clustering result)


Summary

  1. The key to the EM algorithm is defining the Q function, the expectation of the complete-data log-likelihood. The E-step computes the probability distribution of the latent variables given the observed variables and the current parameter estimates; the M-step maximizes the Q function, taking derivatives with respect to each parameter to obtain new parameter estimates.
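
Written out, the Q function is the expectation of the complete-data log-likelihood over the latent variables Z, taken under the current estimate $\theta^{(i)}$:

$$Q(\theta, \theta^{(i)}) = E_Z\left[\log P(Y, Z \mid \theta) \;\middle|\; Y, \theta^{(i)}\right]$$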

References

  1. 《統計學習方法》 (Statistical Learning Methods)