原文鏈接:https://www.rpubs.com/dvallslanaquera/clustering
層次聚類 (HC)
在這個分析中,我們將看到如何創建層次聚類模型。目的是探索數據庫中是否存在相似性組,並查看它們的行爲。
例如,我們將使用Doubs數據庫,該數據庫基於從法國Doubs河中提取的魚類樣本的物理特徵。其目的是查看樣本的行爲以及如何對數據進行分組。
1- 數據準備
我們需要刪除帶有雙零或NA值的行,否則當我們嘗試創建樹狀圖時,它們將會出現問題。然後我們需要根據它們的距離對值進行規格化。這次我們將使用歐氏距離,但也有其他有用的距離方法。
library(cluster)
library(vegan)
library(ade4)
data(doubs)
spe <- doubs$fish[-8,]
env <- doubs$env[-8,]
spa <- doubs$xy[-8,]
spe.norm <- decostand(spe, "normalize")
spe.ch <- vegdist(spe.norm, "euc")
2- 聚類方法選擇
我們將選擇所有可用的方法,然後我們將選擇一個最佳的驗證分析。方法有單點法、完整法、平均法、質心法和Ward法(single, complete, average, centroid and Ward)。
par(mfrow = c(2, 3))
spe.ch.single <- hclust(spe.ch, method = "single")
plot(spe.ch.single, sub = "Single method")
spe.ch.complete <- hclust(spe.ch, method = "complete")
plot(spe.ch.complete, sub = "Complete method")
spe.ch.UPGMA <- hclust(spe.ch, method = "average")
plot(spe.ch.UPGMA, sub = "Average method")
spe.ch.centroid <- hclust(spe.ch, method = "centroid")
plot(spe.ch.centroid, sub = "Centroid method")
spe.ch.ward <- hclust(spe.ch, method = "ward.D")
plot(spe.ch.ward, sub = "Ward method")
這個樹狀圖的結構證明是有問題的。我們可以在樹狀圖上觀察到重疊,因此這種方法不再有效。
3- 選擇最佳方法
在質心法的情況下,我們可以看到過擬合。在其他情況下,我們必須計算:
- 同表型相關性(Cophenetic correlation)
- 高爾距離值(Gower distance value)
3.1. 同表型相關性
spe.ch.single.coph <- cophenetic(spe.ch.single) # Single
paste("Single cophenetic:", cor(spe.ch, spe.ch.single.coph))
## [1] "Single cophenetic: 0.599192957534406"
spe.ch.comp.coph <- cophenetic(spe.ch.complete) # Complete
paste("Complete cophenetic:", cor(spe.ch, spe.ch.comp.coph))
## [1] "Complete cophenetic: 0.765562801324477"
spe.ch.UPGMA.coph <- cophenetic(spe.ch.UPGMA) # Average
paste("UPGMA cophenetic:", cor(spe.ch, spe.ch.UPGMA.coph))
## [1] "UPGMA cophenetic: 0.860832629864453"
spe.ch.ward.coph <- cophenetic(spe.ch.ward) # Ward
paste("Ward cophenetic:", cor(spe.ch, spe.ch.ward.coph))
## [1] "Ward cophenetic: 0.759393401189188"
cor(spe.ch, spe.ch.ward.coph, method = "spearman")
## [1] 0.7661171
同表型相關的圖示(謝潑德圖 Shepard diagram)。越靠近對角線越好。
par(mfrow = c(2, 2))
plot(spe.ch, spe.ch.single.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Single linkage", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.single.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.single.coph), col = "red")
plot(spe.ch, spe.ch.comp.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
main = c("Complete linkage", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.comp.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.comp.coph), col = "red")
plot(spe.ch, spe.ch.UPGMA.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
main = c("UPGMA", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.UPGMA.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.UPGMA.coph), col = "red")
plot(spe.ch, spe.ch.ward.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)),
ylim = c(0,max(spe.ch.ward$height)),
main = c("Ward clustering", paste("Cophenetic correlation =",
round(cor(spe.ch, spe.ch.ward.coph), 3))))
abline(0, 1); lines(lowess(spe.ch, spe.ch.ward.coph), col = "red")
最佳擬合方法爲UPGMA,同位相關係數爲0.861,比較高。
高爾距離值(越低越好)
(gow.dist.single <- sum((spe.ch - spe.ch.single.coph) ^ 2))
## [1] 95.41391
(gow.dist.comp <- sum((spe.ch - spe.ch.comp.coph) ^ 2))
## [1] 40.48897
(gow.dist.UPGMA <- sum((spe.ch - spe.ch.UPGMA.coph) ^ 2))
## [1] 11.6746
(gow.dist.ward <- sum((spe.ch - spe.ch.ward.coph) ^ 2))
## [1] 8001.85
最佳方法仍然是UPGMA。
3- 最後聚類數目的選擇
爲了達到這個目的,我們需要 3 個不同的檢驗:
- a- Fussion 水平圖
- b- Silhouette 圖(輪廓係數圖)
- c- Mantel 值
a- Fussion 水平圖
dev.off()
## null device
## 1
plot(spe.ch.ward$height, nrow(spe) : 2, type = "S",
main = "Fusion levels - Chord - Ward",
ylab = "k (number of clusters)", xlab = "h (node height)", col = "grey")
text(spe.ch.ward$height, nrow(spe) : 2, nrow(spe) : 2, col = "red", cex = 0.8)
根據圖結果,最佳數目是 2 或 4。
b- Silhouette 圖
asw <- numeric(nrow(spe))
for(k in 2:(nrow(spe) - 1)){
sil <- silhouette(cutree(spe.ch.ward, k = k), spe.ch)
asw[k] <- summary(sil)$avg.width}
k.best <- which.max(asw)
plot(1: nrow(spe), asw, type="h",
main = "Silhouette-optimal number of clusters",
xlab = "k (number of groups)", ylab = "Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
col.axis = "red")
points(k.best, max(asw), pch = 16, col = "red", cex = 1.5)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n",
"with an average silhouette width of", max(asw), "\n")
## Silhouette-optimal number of clusters k = 2
## with an average silhouette width of 0.3658319
c- Mantel 值
grpdist <- function(X){
require(cluster)
gr <- as.data.frame(as.factor(X))
distgr <- daisy(gr, "gower")
distgr}
kt <- data.frame(k = 1:nrow(spe), r = 0)
for(i in 2:(nrow(spe) - 1)){
gr <- cutree(spe.ch.ward, i)
distgr <- grpdist(gr)
mt <- cor(spe.ch, distgr, method = "pearson")
kt[i, 2] <- mt}
k.best <- which.max(kt$r)
plot(kt$k, kt$r, type = "h", main = "Mantel-optimal number of clusters",
xlab = "k (number of groups)", ylab = "Pearson's correlation")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
col.axis = "red")
points(k.best, max(kt$r), pch = 16, col = "red", cex = 1.5)
cat("", "Mantel-optimal number of clusters k =", k.best, "\n",
"with a matrix linear correlation of", max(kt$r), "\n")
## Mantel-optimal number of clusters k = 4
## with a matrix linear correlation of 0.7154912
第一和第二種方法表明,聚類的最佳數量是k = 2,而 Mantel 值表明,數量必須是4。由於理論表明物種的數量爲4,我們將在k = 4時驗證我們的模型。
4- 最後模型驗證 k = 4
k <- 4
cutg <- cutree(spe.ch.ward, k = k)
sil <- silhouette(cutg, spe.ch)
rownames(sil) <- row.names(spe)
plot(sil, main = "Silhouette plot",
cex.names = 0.8, col = 2:(k + 1), nmax = 100)
這些組是一致的,但是第二組有兩個未分類的值。
5- 最後的圖
到目前爲止,我們已經通過UPGMA方法將我們的數據分組爲4個簇。現在我們將使用Francois Gillet(2012)創建的hcoplot函數來描述樹圖的行爲。
hcoplot <- function(tree, diss, k,
title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
require(gclus)
gr <- cutree(tree, k = k)
tor <- reorder.hclust(tree, diss)
plot(tor, hang = -1, xlab = paste(length(gr),"sites"), sub = paste(k, "clusters"),
main = title)
so <- gr[tor$order]
gro <- numeric(k)
for (i in 1:k)
{
gro[i] <- so[1]
if (i<k) so <- so[so != gro[i]]
}
rect.hclust(tor, k = k, border = gro + 1, cluster = gr)
legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}
hcoplot(spe.ch.ward, spe.ch, k = 4)
非層次聚類 (NHC)
這次我們將做一個k均值聚類模型。
1-數據標準化
之前已經做過。
2- 選擇聚類方法
set.seed(1)
spe.kmeans <- kmeans(spe.norm, centers = 4, nstart = 100)
我們創建了包含4組的模型,與之前的HC模型相同。這裏是每一組的中心,每一組的方差。該模型可以解釋總方差的66.7%。
3- 選擇聚類數和模型驗證
我們使用以下標準:
- Calinski & Harabasz 值
- Simple structure index (SSI)
- Sum of squared errors (SSE)
- Silhouette 圖
3.1. Calinski method + SSI + SSE
spe.KM.cascade <- cascadeKM(spe.norm, inf.gr = 2, sup.gr = 10, iter = 100, criterion = "ssi")
spe.KM.cascade$results
## 2 groups 3 groups 4 groups 5 groups 6 groups 7 groups 8 groups
## SSE 8.2149405 6.4768108 5.0719796 4.3015573 3.5856120 2.9523667 2.48405487
## ssi 0.1312111 0.1675852 0.1244603 0.1149501 0.1281785 0.1306085 0.07275788
## 9 groups 10 groups
## SSE 2.052189 1.75992916
## ssi 0.126707 0.07094594
plot(spe.KM.cascade, sortg = TRUE)
統計數字不能決定一切。通過SSE方法,最好的聚類數必須是2,通過SSI方法則必須是3。
3.2. Silhouette 圖
我們試着繪製 3 組的輪廓係數圖。
spe.kmeans <- kmeans(spe.norm, centers = 3, nstart = 100)
dissE <- daisy(spe.norm)
sk <- silhouette(spe.kmeans$cl, dissE)
plot(sk)
4- 最後的圖形和解釋
現在我們繼續比較前一個樹狀圖和之前的分類。
spebc.ward.g <- cutree(spe.ch.ward,k = 4)
table(spe.kmeans$cluster, spebc.ward.g)
## spebc.ward.g
## 1 2 3 4
## 1 0 6 0 0
## 2 11 1 0 0
## 3 0 0 8 3
它們只在一種情況的分類上有所不同。
clusplot(spe.norm, spe.kmeans$cluster, color = TRUE, shade = TRUE,
labels = 2, lines = 0)
組與組之間有一些重疊,但我們解釋了高達61.75%的變異性,這是一個很好的百分比。