High-Dimensional K-means and PCA Dimensionality Reduction
After reading many blog posts, I found that the K-means examples online all use two-dimensional data, with no high-dimensional examples. So I made small modifications to a borrowed two-dimensional template and am sharing a Python implementation of high-dimensional K-means clustering, along with a small PCA implementation (it uses 3-D plotting; if you don't have the package installed, you can delete the 3-D figure).
High-Dimensional K-means Clustering
The assignment from the instructor:
Automatically determine the best cluster number K in k-means.
• Generate 1000 random N-Dimensional points;
• Try different K number;
• Compute SSE;
• Plot K-SSE figure;
• Choose the best K number (how to choose?).
• Try different N numbers: 2, 3, 5, 10;
• Write the code in Jupyter Notebook;
• Give me the screenshot of the code and results in Jupyter Notebook.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

N = 2  # dimensionality

def distance_fun(p1, p2, N):
    result = 0
    for i in range(0, N):
        result = result + (p1[i] - p2[i]) ** 2
    return np.sqrt(result)

def mean_fun(a):
    return np.mean(a, axis=0)

def farthest(center_arr, arr):
    # point with the largest total distance to all current centers
    f = arr[0]
    max_d = 0
    for e in arr:
        d = 0
        for i in range(len(center_arr)):
            d = d + distance_fun(center_arr[i], e, N)
        if d > max_d:
            max_d = d
            f = e
    return f

def closest(a, arr):
    # point in arr nearest to a
    c = arr[0]
    min_d = distance_fun(a, arr[0], N)
    for e in arr:
        d = distance_fun(a, e, N)
        if d < min_d:
            min_d = d
            c = e
    return c
if __name__ == "__main__":
    arr = np.random.randint(0, 10000, size=(1000, 1, N))[:, 0, :]  # 1000 random points with coordinates in 0-10000
    '''
    block1 = np.random.randint(0, 2000, size=(100, 1, N))[:, 0, :]  # random points generated per interval
    block2 = np.random.randint(2000, 4000, size=(100, 1, N))[:, 0, :]
    block3 = np.random.randint(4000, 6000, size=(100, 1, N))[:, 0, :]
    block4 = np.random.randint(6000, 8000, size=(100, 1, N))[:, 0, :]
    block5 = np.random.randint(8000, 10000, size=(100, 1, N))[:, 0, :]
    arr = np.vstack((block1, block2, block3, block4, block5))
    '''
    ## initialize the cluster centers and cluster containers
    K = 5
    r = np.random.randint(len(arr) - 1)
    center_arr = np.array([arr[r]])
    cla_arr = [[]]
    for i in range(K - 1):
        k = farthest(center_arr, arr)
        center_arr = np.concatenate([center_arr, np.array([k])])
        cla_arr.append([])
    ## iterative clustering
    n = 20
    cla_temp = cla_arr
    for i in range(n):
        for e in arr:  # assign each point to its nearest center
            ki = 0
            min_d = distance_fun(e, center_arr[ki], N)
            for j in range(1, len(center_arr)):
                if distance_fun(e, center_arr[j], N) < min_d:
                    min_d = distance_fun(e, center_arr[j], N)
                    ki = j
            cla_temp[ki].append(e)
        for k in range(len(center_arr)):  # move each center to the mean of its cluster
            if n - 1 == i:
                break
            center_arr[k] = mean_fun(cla_temp[k])
            cla_temp[k] = []
    if N >= 2:
        print(N, '-D data projected onto the first two dimensions')
        col = ['gold', 'blue', 'violet', 'cyan', 'red', 'black', 'lime', 'brown', 'silver']
        plt.figure(figsize=(10, 10))
        for i in range(K):
            plt.scatter(center_arr[i][0], center_arr[i][1], color=col[i])
            plt.scatter([e[0] for e in cla_temp[i]], [e[1] for e in cla_temp[i]], color=col[i])
        plt.show()
    if N >= 3:
        print(N, '-D data projected onto the first three dimensions')
        fig = plt.figure(figsize=(8, 8))
        ax = Axes3D(fig)
        for i in range(K):
            ax.scatter(center_arr[i][0], center_arr[i][1], center_arr[i][2], color=col[i])
            ax.scatter([e[0] for e in cla_temp[i]], [e[1] for e in cla_temp[i]], [e[2] for e in cla_temp[i]], color=col[i])
        plt.show()
    print(N, '-D cluster centers:')
    for i in range(K):
        print('Center of cluster', i + 1, ':')
        for j in range(0, N):
            print(center_arr[i][j])
First, a note on the code: N is the dimensionality of the generated random points, and K is the number of cluster centers.
I set up the random data in two ways: one generates completely random numbers over 0-10000, the other divides 0-10000 into five equal intervals and generates random numbers within each. Why do this? Because it makes it possible to check whether the clustering result is correct.
As shown in the figures:
This is the result for completely random data with K=5; the initial clustering looks correct.
The 3-D result for completely random data with K=5; again the initial clustering looks correct.
Next we generate the random numbers by interval, which makes it easy to judge visually how well the mean-update iteration works.
The interval-generated random data shows the algorithm's accuracy very clearly, indicating that the iteration converges well.
Now for higher dimensions. When N >= 4 the data can no longer be visualized directly; we can only project onto 2 or 3 of the N dimensions at a time. To verify the high-dimensional case directly, instead of plotting projection after projection, we simply check whether each cluster center's coordinates fall inside one of the random-number intervals. (This is where the interval-generated data becomes essential.)
All 5 cluster centers lie inside the expected intervals, so the result is again correct.
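The interval check can be sketched as follows. The center coordinates below are hypothetical stand-ins for the `center_arr` computed by the script, and `center_interval` is an illustrative helper, not part of the original code:

```python
import numpy as np

# Hypothetical cluster centers for N = 5 (stand-ins for the computed center_arr)
center_arr = np.array([
    [1000, 1100,  900, 1050,  980],
    [3000, 2900, 3100, 3050, 2950],
    [5000, 5100, 4900, 4950, 5050],
    [7000, 6900, 7100, 7050, 6950],
    [9000, 9100, 8900, 8950, 9050],
])
intervals = [(0, 2000), (2000, 4000), (4000, 6000), (6000, 8000), (8000, 10000)]

def center_interval(center, intervals):
    # Index of the interval containing every coordinate of the center,
    # or None if the coordinates straddle interval boundaries.
    for idx, (lo, hi) in enumerate(intervals):
        if all(lo <= c < hi for c in center):
            return idx
    return None

for i, c in enumerate(center_arr):
    print('center', i + 1, 'lies in interval', center_interval(c, intervals))
```

If every center maps to a distinct interval, the clustering recovered the five generating blocks.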
Update: the SSE implementation.
Inside the assignment loop, accumulate the sum of squared errors for each K in an array (SSE uses squared distances, so the distance is squared here):
sse[i] = sse[i] + distance_fun(e, center_arr[ki], N) ** 2
The K-SSE curve looks roughly like this:
The curve often has small bumps. I haven't fully worked out the mathematics, but a likely cause is that k-means only converges to a local optimum for each K, so the measured SSE need not decrease monotonically.
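To pick the best K from the K-SSE curve, one common heuristic is the "elbow": choose the K right after the largest drop in SSE. A minimal numpy-only sketch, separate from the article's implementation; the data, the `kmeans_sse` helper, and the largest-drop rule are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs standing in for the interval data
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(20, 1, (100, 2))])

def kmeans_sse(data, K, iters=20):
    # Lloyd's algorithm, returning the final sum of squared errors
    centers = data[rng.choice(len(data), K, replace=False)].copy()
    for _ in range(iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move every non-empty center to the mean of its points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = data[labels == k].mean(axis=0)
    return float(((data - centers[labels]) ** 2).sum())

sse = [kmeans_sse(data, K) for K in range(1, 7)]
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]
best_K = int(np.argmax(drops)) + 2  # the K right after the largest SSE drop
print('SSE per K:', [round(s) for s in sse])
print('chosen K:', best_K)
```

For cleanly separated blobs the largest drop lands at the true cluster count; on messier data, a relative-drop threshold is more robust than the single largest drop.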
PCA
The assignment:
• Generate 500,000 random points with 200-D
• Dimension reduction to keep 90% energy using PCA
• Report how many dimensions are kept
• Compute k-means (k=100)
• Compare brute force NN and kd-tree, and report their running time
• Python, Jupyter Notebook
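The code below does not cover the last comparison, so here is a minimal sketch using `scipy.spatial.cKDTree`. The sizes are much smaller than the assignment's 500,000 x 200 so it runs in seconds; note that kd-trees lose their advantage in very high dimensions, so on 200-D data brute force can actually win:

```python
import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10000, 3))   # database points (demo scale, not 500,000 x 200)
queries = rng.random((200, 3))    # query points

# Brute force: full distance matrix, then argmin per query
t0 = time.perf_counter()
dists = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
bf_idx = dists.argmin(axis=1)
t_bf = time.perf_counter() - t0

# kd-tree: build once, then query all points
t0 = time.perf_counter()
tree = cKDTree(points)
_, kd_idx = tree.query(queries)
t_kd = time.perf_counter() - t0

print('brute force: %.4f s, kd-tree: %.4f s' % (t_bf, t_kd))
print('same neighbours:', np.array_equal(bf_idx, kd_idx))
```

Both methods must return identical neighbour indices; only the running time differs.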
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

N = 200  # dimensionality

def distance_fun(p1, p2, N):
    result = 0
    for i in range(0, N):
        result = result + (p1[i] - p2[i]) ** 2
    return np.sqrt(result)

def mean_fun(a):
    return np.mean(a, axis=0)

def farthest(center_arr, arr):
    # point with the largest total distance to all current centers
    f = arr[0]
    max_d = 0
    for e in arr:
        d = 0
        for i in range(len(center_arr)):
            d = d + distance_fun(center_arr[i], e, N)
        if d > max_d:
            max_d = d
            f = e
    return f

def closest(a, arr):
    # point in arr nearest to a
    c = arr[0]
    min_d = distance_fun(a, arr[0], N)
    for e in arr:
        d = distance_fun(a, e, N)
        if d < min_d:
            min_d = d
            c = e
    return c
def pca(XMat):
    average = mean_fun(XMat)
    m, n = np.shape(XMat)
    avgs = np.tile(average, (m, 1))
    data_adjust = XMat - avgs
    covX = np.cov(data_adjust.T)  # covariance matrix
    featValue, featVec = np.linalg.eig(covX)  # eigenvalues and eigenvectors of the covariance matrix
    index = np.argsort(-featValue)  # sort eigenvalues from largest to smallest
    sumfeatvalue = sum(featValue)  # total energy is the sum of the eigenvalues
    sumt = 0
    k = 0
    while sumt < 0.9 * sumfeatvalue:  # keep enough components for 90% of the energy
        sumt += featValue[index[k]]
        k += 1
    selectVec = np.matrix(featVec.T[index[:k]])  # eigenvectors are columns of featVec, hence the transpose
    finalData = data_adjust * selectVec.T
    reconData = (finalData * selectVec) + average
    return finalData, reconData, k
def plotBestFit(data1, data2):
    dataArr1 = np.array(data1)
    dataArr2 = np.array(data2)
    m = np.shape(dataArr1)[0]
    axis_x1 = []
    axis_y1 = []
    axis_x2 = []
    axis_y2 = []
    for i in range(m):
        axis_x1.append(dataArr1[i, 0])
        axis_y1.append(dataArr1[i, 1])
        axis_x2.append(dataArr2[i, 0])
        axis_y2.append(dataArr2[i, 1])
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111)
    # ax.scatter(axis_x1, axis_y1, s=50, c='red', marker='s')
    ax.scatter(axis_x2, axis_y2, s=1, c='blue')
    plt.show()
if __name__ == "__main__":
    '''
    arr = np.random.randint(0, 10000, size=(1000, 1, N))[:, 0, :]
    XMat = arr
    '''
    block1 = np.random.randint(0, 2000, size=(100000, 1, N))[:, 0, :]  # random points generated per interval
    block2 = np.random.randint(2000, 4000, size=(100000, 1, N))[:, 0, :]
    block3 = np.random.randint(4000, 6000, size=(100000, 1, N))[:, 0, :]
    block4 = np.random.randint(6000, 8000, size=(100000, 1, N))[:, 0, :]
    block5 = np.random.randint(8000, 10000, size=(100000, 1, N))[:, 0, :]
    XMat = np.vstack((block1, block2, block3, block4, block5))
    finalData, reconMat, pcaN = pca(XMat)
    plotBestFit(finalData, reconMat)  # plot the first two dimensions as a sanity check
    print('Reduced to:', pcaN, 'dimensions')
Because the dataset is large, the computation takes on the order of tens of seconds; after reduction, the 200 dimensions drop to roughly 150. There are two ways to control PCA. One computes the number of dimensions to keep from the retained eigenvalue energy, like the 0.9 threshold in this assignment. The other specifies the target dimensionality directly; this is rarely used in practice, because with very high-dimensional data we cannot intuitively predict how many dimensions should be kept. The plotting code has two parts, the uncentered reduced data and the shifted (reconstructed) data; I commented out the former, and you can adjust it as needed.
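Both control modes can be sketched with an SVD-based PCA, which is mathematically equivalent to the eigendecomposition of the covariance matrix used above (the squared singular values are proportional to the eigenvalues). The variable names and data sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 20))  # illustrative data, far smaller than 500,000 x 200

Xc = X - X.mean(axis=0)                    # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
energy = S ** 2                            # proportional to the covariance eigenvalues

# Mode 1: keep enough components to retain 90% of the energy
ratio = np.cumsum(energy) / energy.sum()
k_energy = int(np.searchsorted(ratio, 0.9)) + 1

# Mode 2: specify the target dimensionality directly (rarely advisable in practice)
k_fixed = 5

Z = Xc @ Vt[:k_energy].T                   # data reduced to k_energy dimensions
Z_fixed = Xc @ Vt[:k_fixed].T              # data reduced to a fixed 5 dimensions
print('energy mode kept', k_energy, 'of', X.shape[1], 'dimensions')
```

SVD avoids forming the 200 x 200 covariance matrix explicitly and is numerically more stable than `np.linalg.eig` on near-degenerate data.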
Now, just for fun, let's reduce the 200 dimensions all the way down to two and look at the picture (showing only the shifted, reconstructed data):
We can see that the structure of the interval-generated random data is largely preserved, the result looks good, and running K-means afterwards also works well. But this does not mean we can arbitrarily specify the target dimensionality: real-world classification problems are far more complex than these five simple intervals, and choosing the dimension arbitrarily can cause unpredictable information loss.