[Playing with Machine Learning in Python] KNN * Code * Part 1

Original

xceman1997

2020-02-20 20:56

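KNN is short for "k Nearest Neighbors", i.e. the k-nearest-neighbors classifier. The basic idea: for an unknown sample, compute the distance between it and every sample in the training set, pick the k closest samples, and let the class labels of those k samples vote; the class with the most votes is the classification result for the unknown sample. The choice of metric for measuring the distance between samples is the key.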
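The idea described above can be sketched in a few lines. This is only a minimal illustration, not the code this series builds up: the helper name `knn_vote` and the tiny two-feature training set are assumptions made up for the example, and Euclidean distance is assumed as the metric.

```python
from collections import Counter

import numpy as np


def knn_vote(sample, train_features, train_labels, k=3):
    """Hypothetical helper: classify `sample` by majority vote of its k nearest neighbors."""
    # Euclidean distance from the sample to every training row
    distances = np.sqrt(((train_features - sample) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the labels of those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# made-up training data: two points near the origin, two near (1, 1)
train_features = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_labels = ["A", "A", "B", "B"]

print(knn_vote(np.array([0.2, 0.1]), train_features, train_labels, k=3))
```

With k=3 the three nearest neighbors of (0.2, 0.1) carry the labels A, A and B, so the vote returns "A".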

1. Reading the samples' features and class labels from a text file

'''
kNN: k Nearest Neighbors
'''

import numpy as np

'''
function: load the feature matrix and the target labels from txt file (datingTestSet.txt)
input: the name of file to read
return:
1. the feature matrix
2. the target label
'''
def LoadFeatureMatrixAndLabels(fileInName):

    # load all the samples into memory; the with-block closes the file automatically
    with open(fileInName,'r') as fileIn:
        lines = fileIn.readlines()

    # load the feature matrix and label vector
    featureMatrix = np.zeros((len(lines),3),dtype=np.float64)
    labelList = list()
    index = 0
    for line in lines:
        items = line.strip().split('\t')
        # the first three numbers are the input features
        featureMatrix[index,:] = [float(item) for item in items[0:3]]
        # the last column is the label
        labelList.append(items[-1])
        index += 1

    return featureMatrix, labelList

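Each sample is stored in the text file as three feature values followed by a class label, separated by tabs. The code first loads every line of the file into memory, then creates a "number of samples * number of features" floating-point matrix initialized to 0.0. After that it parses each line (one sample) and fills the corresponding row of the matrix with the parsed values. This line uses a Python list comprehension: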

featureMatrix[index,:] = [float(item) for item in items[0:3]]

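A whole for loop is written in a single statement, and it runs at least as fast as the equivalent for loop written out the usual way. I am starting to appreciate what is nice about Python.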
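To make the comparison concrete, here is a small sketch that parses the same lines once with explicit loops and once with list comprehensions. The two input lines and the labels "classA" and "classB" are made up for illustration; they only follow the layout described above (three features, then a label, tab-separated) and are not taken from datingTestSet.txt.

```python
import numpy as np

# made-up lines in the described layout: three features, then a label
lines = ["1.0\t2.5\t0.3\tclassA\n", "4.0\t0.5\t1.7\tclassB\n"]

# explicit loop version
loop_rows = []
for line in lines:
    items = line.strip().split('\t')
    row = []
    for item in items[0:3]:
        row.append(float(item))
    loop_rows.append(row)

# the same parsing written as list comprehensions
comp_rows = [[float(item) for item in line.strip().split('\t')[0:3]] for line in lines]
labels = [line.strip().split('\t')[-1] for line in lines]

# both versions produce the same nested list, which converts directly to a matrix
feature_matrix = np.array(comp_rows, dtype=np.float64)
print(loop_rows == comp_rows)
print(feature_matrix.shape)
print(labels)
```

Both versions build the same rows; the comprehension simply packs the loop and the append into one expression.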

2. Feature normalization

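Feature normalization is an essential step for the vast majority of machine learning algorithms. The usual method is to take the maximum and minimum of each feature dimension and rescale the current feature value against them, mapping it to a number in [0,1]. If the feature values are noisy, the noise has to be removed beforehand, because a single extreme value changes the minimum or maximum and distorts the whole column.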

'''
function: auto-normalizing the feature matrix
    the formula is: newValue = (oldValue - min)/(max - min)
input: the feature matrix
return: the normalized feature matrix
'''
def AutoNormalizeFeatureMatrix(featureMatrix):

    # create the normalized feature matrix
    normFeatureMatrix = np.zeros(featureMatrix.shape)

    # normalizing the matrix
    lineNum = featureMatrix.shape[0]
    columnNum = featureMatrix.shape[1]
    for i in range(0,columnNum):
        minValue = featureMatrix[:,i].min()
        maxValue = featureMatrix[:,i].max()
        for j in range(0,lineNum):
            normFeatureMatrix[j,i] = (featureMatrix[j,i] - minValue) / (maxValue-minValue)

    return normFeatureMatrix

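The basic numpy data structure is the multidimensional array; a matrix is just a special case of it. Every numpy array has a shape attribute. shape is a tuple giving the size of each dimension: shape[0] is the number of rows, shape[1] is the number of columns, and so on. In a numpy matrix, a row is accessed with "featureMatrix[i,:]" and a column with "featureMatrix[:,i]". This piece of code is a plain double loop, much like C. The code in the original book does use matrix operations here, but when I wrote this I was not yet familiar with numpy and could not get the book's code to run, so I simply wrote it the C way.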
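For comparison, the same min-max formula can be written with whole-array operations. The sketch below is not the book's code; the function name `normalize_vectorized` and the demo matrix are made up for illustration. One deliberate difference from the double loop above is labeled as an assumption: a constant column (max equal to min) is mapped to 0.0 instead of dividing by zero.

```python
import numpy as np


def normalize_vectorized(feature_matrix):
    """Hypothetical vectorized min-max normalization, applied per column."""
    # axis=0 reduces over the rows, giving one min and one max per column
    min_values = feature_matrix.min(axis=0)
    max_values = feature_matrix.max(axis=0)
    ranges = max_values - min_values
    # assumption: replace a zero range by 1.0 so a constant column becomes 0.0
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # the per-column vectors are applied to every row of the matrix
    return (feature_matrix - min_values) / safe_ranges


demo = np.array([[1.0, 10.0, 5.0],
                 [2.0, 20.0, 5.0],
                 [3.0, 40.0, 5.0]])

print(demo.shape)        # (3, 3): 3 rows, 3 columns
print(demo[:, 1])        # the second column
print(normalize_vectorized(demo))
```

The first column becomes 0, 0.5, 1, the second 0, 1/3, 1, and the constant third column becomes all zeros.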

3. Computing the distance between samples

Distance can be measured in many ways; this code computes the Euclidean distance between a given sample (its feature vector) and every training sample.

'''
function: calculate the euclidean distance between the feature vector of input sample and
the feature matrix of the samples in training set
input:
1. the input feature vector
2. the feature matrix
return: the distance array
'''
def CalcEucDistance(featureVectorIn, featureMatrix):

    # extend the input feature vector as a feature matrix
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn,(lineNum,1))

    # calculate the Euclidean distance between two matrix
    diffMatrix = featureMatrixIn - featureMatrix
    sqDiffMatrix = diffMatrix ** 2
    distanceValueArray = sqDiffMatrix.sum(axis=1)
    distanceValueArray = distanceValueArray ** 0.5

    return distanceValueArray

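This uses some of numpy's more distinctive features. The approach is to first expand the input feature vector into a feature matrix (done by the tile function: the first argument is the thing to repeat, the second says how many times to repeat it along each dimension; here it is repeated lineNum times vertically and not expanded horizontally). Then the computation is carried out between the expanded matrix and the training-sample matrix. A problem that could be solved with vector operations is blown up into a matrix computation instead, and the efficiency... So Python's low efficiency comes partly from the implementation and execution speed of the language itself, and even more from the mindset of writing Python: when the programmer wants to be lazy, what can the CPU do about it?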
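As a side note on the tile step: numpy can subtract a vector from every row of a matrix directly through broadcasting, without building the expanded matrix first. The sketch below uses a made-up 3 x 3 matrix to show what `np.tile` produces and that both routes give the same distances; it is an illustration under those assumptions, not a replacement for the function above.

```python
import numpy as np

# made-up data: one input vector and three "training" rows
feature_vector = np.array([1.0, 2.0, 3.0])
feature_matrix = np.array([[1.0, 2.0, 3.0],
                           [4.0, 6.0, 3.0],
                           [1.0, 2.0, 5.0]])

# np.tile with reps (3, 1): repeat the vector 3 times down the rows, once across
tiled = np.tile(feature_vector, (feature_matrix.shape[0], 1))
dist_tiled = ((tiled - feature_matrix) ** 2).sum(axis=1) ** 0.5

# broadcasting: the 1-D vector is matched against each row automatically
dist_broadcast = np.sqrt(((feature_matrix - feature_vector) ** 2).sum(axis=1))

print(tiled.shape)
print(dist_tiled)
print(dist_broadcast)
```

Both computations give the distances 0, 5 and 2 for the three rows; the broadcasting version just skips the explicit copy of the vector.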

To be continued.

If you repost this, please cite the source: http://blog.csdn.net/xceman1997/article/details/44994001
