ML Primer 1.0 -- Hand-Written KNN

Original

2020-07-06 12:38


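Introduction to KNN

KNN, short for K Nearest Neighbors (also called the k-nearest neighbors algorithm), is a non-parametric statistical method used for classification and regression. KNN classifies with a vector space model: cases of the same class are highly similar to one another, so the likely class of an unlabeled case can be estimated by computing its similarity to cases whose class is already known.
KNN is a form of instance-based learning, or lazy learning: the function is only approximated locally, and all computation is deferred until classification time. The k-nearest neighbors algorithm is among the simplest of all machine learning algorithms.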

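Background

I am writing this post mainly because my school recently started offering an ML course. It records some of my thoughts and the pitfalls I ran into while learning, so that we can all discuss and learn together.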

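How the Algorithm Works

KNN is commonly used in supervised learning. As a classification algorithm, its main job is to tell apart different classes of the same kind of thing, for example distinguishing three species of iris, or grouping people by height.
How KNN works is also simple. We are given a dataset split into two parts: training samples a and test samples b. Each training sample in a has both a feature vector (feature data) and a label (class); the test samples in b have feature vectors only, and their classes are unknown. To classify a single sample from b, we compute some distance measure between its feature vector and the feature vector of every sample in a, pick the K training samples with the smallest distance, and let those K samples vote with their labels: the class that appears most often among them becomes the predicted label (class) of the test sample. PS: in general K <= 20, and Euclidean distance is used as the distance measure.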

Euclidean distance formula, where n is the number of features:
$d = \sqrt{\sum_{i=1}^{n}(X_i-Y_i)^2}$
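
To make the formula concrete, here is a minimal standard-library sketch. The name `euclidean_distance` and the two points are made up for illustration; they are not from the post's code.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the squared per-feature differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Illustrative 2-D points forming a 3-4-5 right triangle.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d)  # 5.0
```

The differences are 3 and 4, the squares sum to 25, and the square root is 5.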

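The explanation above uses a bit of jargon, and since I am not great at explaining things, feel free to read it selectively. Here is a very vivid explanation:
[Math] 一隻兔子幫你理解 kNN ("A Rabbit Helps You Understand kNN"), an article by 宏觀經濟算命椰 on Zhihu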
link.

Algorithm Structure

Step1: Load the dataset
Step2: Classify
- 2.1 Find the k nearest neighbors
- 2.2 Vote
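
Before the full implementation, the steps above can be sketched for a single query point using only the standard library. This is a compact illustration and not the post's classifier: `knn_predict` and the toy dataset are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # 2.1 Find the indices of the k training samples closest to the query.
    nearest = heapq.nsmallest(
        k,
        range(len(train_x)),
        key=lambda j: math.dist(query, train_x[j]),
    )
    # 2.2 Vote: the most frequent label among those neighbors wins.
    votes = Counter(train_y[j] for j in nearest)
    return votes.most_common(1)[0][0]

# Step 1: a made-up toy dataset with two well-separated classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
pred_near_b = knn_predict(train_x, train_y, (5.0, 5.1), k=3)
print(pred_near_a, pred_near_b)  # a b
```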

KNN Implementation

The code here is implemented in Python:
- Dataset: sklearn.datasets.load_iris (150 samples, 3 classes)

Func1(): Loaddata() reads the dataset and splits it into a training set (120) and a test set (30)

# Imports used by the functions in this post
import time

import numpy as np
import sklearn.datasets
import sklearn.model_selection

# Step1 load iris data
def Loaddata():
    '''
    :tempDataset: the returned Bunch object
    :X: 150 flowers' data (feature data of the flowers)
    :Y: 150 flowers' label (species labels of the flowers)
    :return: X1(train_set), Y1(train_label), X2(test_set), Y2(test_label)
    '''

    tempDataset = sklearn.datasets.load_iris()
    X = tempDataset.data
    Y = tempDataset.target
    # Step2 split train set & test set (80% / 20%, reshuffled on every run)
    X1, X2, Y1, Y2 = sklearn.model_selection.train_test_split(X, Y, test_size=0.2)
    return X1, Y1, X2, Y2
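
To see where the 120/30 figures come from, the following is a rough standard-library stand-in for a shuffled hold-out split. It is not scikit-learn's implementation; `shuffle_split` and its `seed` parameter are hypothetical names introduced only for this sketch.

```python
import random

def shuffle_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(150, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 120 30
```

Because the split is random, the score reported by the classifier can change between runs unless the shuffle is seeded.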

Func2(): euclideanDistance(x, y) computes the Euclidean distance between two individual samples

def euclideanDistance(x, y):
    '''
    Compute the Euclidean distance.
    :param x: the data of one flower
    :param y: the data of another flower
    :return: the Euclidean distance between the two flowers' data
    '''
    tempDistance = 0
    m = x.shape[0]
    for i in range(m):
        tempDifference = x[i] - y[i]
        tempDistance += tempDifference * tempDifference
    return tempDistance**0.5
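
A side note on the final square root: it does not change which neighbors are closest, because the square root is monotonically increasing. The sketch below checks this on made-up points; `squared_distance` is a hypothetical helper and not part of the post's code.

```python
def squared_distance(x, y):
    """Sum of squared per-feature differences (Euclidean distance without the sqrt)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

query = (0.0, 0.0)
points = [(3.0, 4.0), (1.0, 1.0), (0.0, 2.0)]  # illustrative data

# Rank the points by squared distance and by true distance.
by_squared = sorted(range(len(points)), key=lambda j: squared_distance(query, points[j]))
by_true = sorted(range(len(points)), key=lambda j: squared_distance(query, points[j]) ** 0.5)
print(by_squared)  # [1, 2, 0]
```

Both rankings agree, so a neighbor search could skip the square root if only the ordering matters.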

Func3(): stKnnClassifierTest(X1, Y1, X2, Y2, K = 5) is a hand-written KNN classifier, with no ready-made library classifier to lean on

# Step3 Classify
def stKnnClassifierTest(X1, Y1, X2, Y2, K = 5):
    '''
    :param X1: train_set
    :param Y1: train_label
    :param X2: test_set
    :param Y2: test_label
    :param K: the number of neighbors to use
    :return: no return
    '''
    tempStartTime = time.time()
    tempScore = 0
    test_Instances = Y2.shape[0]
    train_Instances = Y1.shape[0]
    print('the num of testInstances = {}'.format(test_Instances))
    print('the num of trainInstances = {}'.format(train_Instances))
    tempPredicts = np.zeros((test_Instances))

    for i in range(test_Instances):
        # Find K neighbors.
        # Slot 0 holds a sentinel, slots 1..K hold the K nearest found so far
        # (in ascending order of distance), and slot K + 1 is scratch space.
        tempNeighbors = np.zeros(K + 2, dtype=int)
        tempDistances = np.full(K + 2, np.inf)
        tempDistances[0] = -1

        for j in range(train_Instances):
            tempdis = euclideanDistance(X2[i], X1[j])
            tempIndex = K
            while True:
                if tempdis < tempDistances[tempIndex]:
                    # shift the larger entry one slot to the right
                    tempDistances[tempIndex + 1] = tempDistances[tempIndex]
                    tempNeighbors[tempIndex + 1] = tempNeighbors[tempIndex]
                    tempIndex -= 1
                else:
                    # insert the new neighbor at its sorted position
                    tempDistances[tempIndex + 1] = tempdis
                    tempNeighbors[tempIndex + 1] = j
                    break

        # Vote
        tempLabels = []
        for j in range(K):
            tempIndex = int(tempNeighbors[j + 1])
            tempLabels.append(int(Y1[tempIndex]))

        tempCounts = []
        for label in tempLabels:
            tempCounts.append(int(tempLabels.count(label)))
        tempPredicts[i] = tempLabels[np.argmax(tempCounts)]

    # the rate of correct classify
    tempCorrect = 0
    for i in range(test_Instances):
        if tempPredicts[i] == Y2[i]:
            tempCorrect += 1

    tempScore = tempCorrect / test_Instances

    tempEndTime = time.time()
    tempRunTime = tempEndTime - tempStartTime

    print(' ST KNN score: {}%, runtime = {}'.format(tempScore*100, tempRunTime))
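
The neighbor search inside the classifier is an insertion into a small sorted buffer guarded by a sentinel. The sketch below isolates that idea in plain Python so it can be traced by hand; `k_smallest_with_sentinel` and the sample distances are hypothetical and only mirror the loop above.

```python
def k_smallest_with_sentinel(distances, k):
    """Return (indices, values) of the k smallest distances, in ascending order."""
    # Slot 0: sentinel smaller than any real distance, so the scan always stops.
    # Slots 1..k: the k best so far. Slot k + 1: scratch space for overflow.
    best_d = [float("inf")] * (k + 2)
    best_i = [-1] * (k + 2)
    best_d[0] = -1.0
    for j, d in enumerate(distances):
        pos = k
        # Shift larger entries one slot to the right until d fits.
        while d < best_d[pos]:
            best_d[pos + 1] = best_d[pos]
            best_i[pos + 1] = best_i[pos]
            pos -= 1
        best_d[pos + 1] = d
        best_i[pos + 1] = j
    return best_i[1:k + 1], best_d[1:k + 1]

idx, dist = k_smallest_with_sentinel([4.0, 1.0, 3.0, 0.5, 2.0], k=3)
print(idx, dist)  # [3, 1, 4] [0.5, 1.0, 2.0]
```

Each new distance costs at most K + 1 comparisons, so the buffer avoids sorting all training distances for every test sample.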

Results

Full code on GitHub

link

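Pros and Cons

Pros:
1. Simple, easy to understand, and easy to implement; there are no parameters to estimate.
2. Suitable for classifying rare events.
3. Particularly well suited to multi-class problems (multi-modal, where objects carry multiple class labels).
Cons:
1. The main weakness in classification shows up when the samples are imbalanced. If one class is very large and the others are small, the K neighbors of a new sample may be dominated by the large class. The algorithm only looks at the "nearest" neighbors, so the samples of a large class are either far from the target sample or close to it, and ideally their sheer number should not change the result; with a plain majority vote, however, it can.
2. The other weakness is the computational cost: for every sample to be classified, the distance to all known samples must be computed before its K nearest neighbors can be found.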
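
One commonly used mitigation for the imbalance problem, which the code in this post does not implement, is to weight each neighbor's vote by the inverse of its distance. The sketch below is only an illustration under that assumption; `weighted_vote`, `majority_vote`, the `eps` parameter, and the neighbor list are all made up.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """Plain vote: every neighbor counts once. `neighbors` is a list of (distance, label)."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors, eps=1e-9):
    """Inverse-distance vote: closer neighbors count more."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        # eps avoids division by zero when a neighbor coincides with the query.
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Two very close samples of a small class, three farther samples of a large class.
neighbors = [(0.1, "rare"), (0.2, "rare"), (0.9, "common"), (1.0, "common"), (1.1, "common")]
print(majority_vote(neighbors), weighted_vote(neighbors))  # common rare
```

With K = 5 the plain vote follows the larger class, while the weighted vote follows the two much closer samples.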
