[r <- Statistical Analysis] Hierarchical and Non-Hierarchical Clustering

Original post: https://www.rpubs.com/dvallslanaquera/clustering

Hierarchical Clustering (HC)

In this analysis we will see how to build a hierarchical clustering model. The goal is to explore whether groups of similar observations exist in the dataset and to examine how they behave.

As an example we will use the Doubs dataset, which is based on the physical characteristics of fish samples taken from the Doubs River in France. The aim is to see how the samples behave and how the data can be grouped.

1- Data preparation

We need to remove the rows that contain only zeros or NA values, otherwise they will cause problems when we build the dendrogram. Then we normalize the data and compute the distances between sites. This time we will use the Euclidean (chord) distance, but other distance measures can also be useful.

library(cluster)
library(vegan)
library(ade4)

data(doubs)
spe <- doubs$fish[-8, ]   # species abundances (site 8 contains no fish and is removed)
env <- doubs$env[-8, ]    # environmental variables
spa <- doubs$xy[-8, ]     # spatial coordinates

spe.norm <- decostand(spe, "normalize")  # chord normalization (each row scaled to unit length)
spe.ch <- vegdist(spe.norm, "euc")       # Euclidean distance on normalized data = chord distance
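
As an optional sanity check (an extra step, not part of the original script), we can confirm that site 8 is indeed the row without any fish and that the table contains no missing values:

which(rowSums(doubs$fish) == 0)   # should point to site 8, the empty row we drop
sum(is.na(doubs$fish))            # should be 0, i.e. no NA values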

2- Choosing the clustering method

We will fit every available method and then pick the best one with a validation analysis. The methods are single, complete, average (UPGMA), centroid and Ward linkage.

par(mfrow = c(2, 3)) 

spe.ch.single <- hclust(spe.ch, method = "single") 
plot(spe.ch.single, sub = "Single method")

spe.ch.complete <- hclust(spe.ch, method = "complete")
plot(spe.ch.complete, sub = "Complete method") 

spe.ch.UPGMA <- hclust(spe.ch, method = "average") 
plot(spe.ch.UPGMA, sub = "Average method")

spe.ch.centroid <- hclust(spe.ch, method = "centroid") 
plot(spe.ch.centroid, sub = "Centroid method")

spe.ch.ward <- hclust(spe.ch, method = "ward.D") 
plot(spe.ch.ward, sub = "Ward method")

The structure of the centroid dendrogram turns out to be problematic: we can observe reversals (branches that cross), so this method is not valid here.
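
A quick numeric confirmation of those reversals (an extra check, not part of the original script) is to test whether the centroid fusion heights ever decrease from one merge to the next:

sum(diff(spe.ch.centroid$height) < 0)   # > 0 means the dendrogram contains reversals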

3- Choosing the best method

Since the centroid dendrogram shows these reversals, it is discarded. For the remaining methods we must compute:

  • Cophenetic correlation
  • Gower distance value

3.1. Cophenetic correlation

The cophenetic correlation measures how faithfully the dendrogram preserves the original distance matrix; the higher the value, the better the fit.

spe.ch.single.coph <- cophenetic(spe.ch.single) # Single
paste("Single cophenetic:", cor(spe.ch, spe.ch.single.coph))
## [1] "Single cophenetic: 0.599192957534406"
spe.ch.comp.coph <- cophenetic(spe.ch.complete) # Complete
paste("Complete cophenetic:", cor(spe.ch, spe.ch.comp.coph))
## [1] "Complete cophenetic: 0.765562801324477"
spe.ch.UPGMA.coph <- cophenetic(spe.ch.UPGMA) # Average
paste("UPGMA cophenetic:", cor(spe.ch, spe.ch.UPGMA.coph))
## [1] "UPGMA cophenetic: 0.860832629864453"
spe.ch.ward.coph <- cophenetic(spe.ch.ward) # Ward 
paste("Ward cophenetic:", cor(spe.ch, spe.ch.ward.coph))
## [1] "Ward cophenetic: 0.759393401189188"
cor(spe.ch, spe.ch.ward.coph, method = "spearman")
## [1] 0.7661171

Graphical display of the cophenetic correlations (Shepard diagrams): the closer the points lie to the diagonal, the better the fit.

par(mfrow = c(2, 2))
plot(spe.ch, spe.ch.single.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Single linkage", paste("Cophenetic correlation =",
                                    round(cor(spe.ch, spe.ch.single.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.single.coph), col = "red")

plot(spe.ch, spe.ch.comp.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
     main = c("Complete linkage", paste("Cophenetic correlation =",
                                      round(cor(spe.ch, spe.ch.comp.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.comp.coph), col = "red")

plot(spe.ch, spe.ch.UPGMA.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), ylim = c(0,sqrt(2)),
     main = c("UPGMA", paste("Cophenetic correlation =",
                           round(cor(spe.ch, spe.ch.UPGMA.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.UPGMA.coph), col = "red")

plot(spe.ch, spe.ch.ward.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0,sqrt(2)), 
     ylim = c(0,max(spe.ch.ward$height)),
     main = c("Ward clustering", paste("Cophenetic correlation =",
                                     round(cor(spe.ch, spe.ch.ward.coph), 3))))
abline(0, 1);  lines(lowess(spe.ch, spe.ch.ward.coph), col = "red")

The best-fitting method is UPGMA, with a cophenetic correlation of 0.861, which is quite high.
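
For easier comparison (a small convenience, not in the original script), the four correlations can be gathered into one named vector:

round(c(single   = cor(spe.ch, spe.ch.single.coph),
        complete = cor(spe.ch, spe.ch.comp.coph),
        UPGMA    = cor(spe.ch, spe.ch.UPGMA.coph),
        ward     = cor(spe.ch, spe.ch.ward.coph)), 3)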

3.2. Gower distance value (the lower, the better)

(gow.dist.single <- sum((spe.ch - spe.ch.single.coph) ^ 2))
## [1] 95.41391
(gow.dist.comp <- sum((spe.ch - spe.ch.comp.coph) ^ 2))
## [1] 40.48897
(gow.dist.UPGMA <- sum((spe.ch - spe.ch.UPGMA.coph) ^ 2))
## [1] 11.6746
(gow.dist.ward <- sum((spe.ch - spe.ch.ward.coph) ^ 2))
## [1] 8001.85

The best method is still UPGMA.

4- Choosing the final number of clusters

To reach a decision we use 3 different criteria:

  • a- Fusion level plot
  • b- Silhouette plot (average silhouette width)
  • c- Mantel statistic

a- Fusion level plot

dev.off()
## null device 
##           1
plot(spe.ch.ward$height, nrow(spe) : 2, type = "S", 
     main = "Fusion levels - Chord - Ward", 
     ylab = "k (number of clusters)", xlab = "h (node height)", col = "grey")
text(spe.ch.ward$height, nrow(spe) : 2, nrow(spe) : 2, col = "red", cex = 0.8)

According to this plot, the best number of clusters is 2 or 4.

b- Silhouette plot

asw <- numeric(nrow(spe))
for(k in 2:(nrow(spe) - 1)){
  sil <- silhouette(cutree(spe.ch.ward, k = k), spe.ch)
  asw[k] <- summary(sil)$avg.width}
k.best <- which.max(asw)

plot(1: nrow(spe), asw, type="h", 
     main = "Silhouette-optimal number of clusters", 
     xlab = "k (number of groups)", ylab = "Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
     col.axis = "red")
points(k.best, max(asw), pch = 16, col = "red", cex = 1.5)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n", 
    "with an average silhouette width of", max(asw), "\n")
##  Silhouette-optimal number of clusters k = 2 
##  with an average silhouette width of 0.3658319

c- Mantel statistic

grpdist <- function(X){
  require(cluster)
  gr <- as.data.frame(as.factor(X))
  distgr <- daisy(gr, "gower")
  distgr}

kt <- data.frame(k = 1:nrow(spe), r = 0)
for(i in 2:(nrow(spe) - 1)){
  gr <- cutree(spe.ch.ward, i)
  distgr <- grpdist(gr)
  mt <- cor(spe.ch, distgr, method = "pearson")
  kt[i, 2] <- mt}
k.best <- which.max(kt$r)

plot(kt$k, kt$r, type = "h", main = "Mantel-optimal number of clusters", 
     xlab = "k (number of groups)", ylab = "Pearson's correlation")
axis(1, k.best, paste("optimum", k.best, sep = "\n"), col = "red", font = 2,
     col.axis = "red")
points(k.best, max(kt$r), pch = 16, col = "red", cex = 1.5)
cat("", "Mantel-optimal number of clusters k =", k.best, "\n", 
    "with a matrix linear correlation of", max(kt$r), "\n")
##  Mantel-optimal number of clusters k = 4 
##  with a matrix linear correlation of 0.7154912

The first two criteria suggest that the optimal number of clusters is k = 2, while the Mantel statistic indicates that it should be 4. Since theory suggests there are 4 species groups, we will validate our model with k = 4.
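
As a quick check (extra to the original script), the average silhouette widths already stored in asw can be compared directly for the two candidate values of k:

asw[c(2, 4)]   # average silhouette widths for k = 2 and k = 4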

5- Final model validation with k = 4

k <- 4 
cutg <- cutree(spe.ch.ward, k = k)
sil <- silhouette(cutg, spe.ch)
rownames(sil) <- row.names(spe)

plot(sil, main = "Silhouette plot", 
     cex.names = 0.8, col = 2:(k + 1), nmax = 100)

The groups are consistent, but the second group contains two poorly classified sites (negative silhouette widths).
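
One possible way to list those poorly classified sites (an extra step, not in the original script) is to filter the silhouette matrix for negative widths:

sil[sil[, "sil_width"] < 0, , drop = FALSE]   # sites with a negative silhouette width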

6- Final plot

So far we have grouped our data into 4 clusters by cutting the Ward dendrogram. Now we will use the hcoplot function written by Francois Gillet (2012) to display the reordered dendrogram with the clusters highlighted.

hcoplot <- function(tree, diss, k,
                    title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
  require(gclus)
  gr <- cutree(tree, k = k)            # cluster membership at k groups
  tor <- reorder.hclust(tree, diss)    # reorder leaves according to the dissimilarity matrix
  plot(tor, hang = -1, xlab = paste(length(gr), "sites"), sub = paste(k, "clusters"),
       main = title)
  so <- gr[tor$order]                  # group labels in plotting order
  gro <- numeric(k)
  for (i in 1:k)
  {
    gro[i] <- so[1]                    # leftmost group not yet recorded
    if (i < k) so <- so[so != gro[i]]
  }
  rect.hclust(tor, k = k, border = gro + 1, cluster = gr)   # boxes coloured in plotting order
  legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}

hcoplot(spe.ch.ward, spe.ch, k = 4)

Non-Hierarchical Clustering (NHC)

This time we will build a k-means clustering model.

1- Data standardization

This was already done above (spe.norm).

2- Building the k-means model

set.seed(1) 
spe.kmeans <- kmeans(spe.norm, centers = 4, nstart = 100)

We have built a model with 4 groups, the same number as in the previous HC model. The fitted object gives the centre of each group and the within-group variance; the model explains 66.7% of the total variance.
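
The quantities mentioned above can be read directly from the fitted object (this inspection step is not in the original script):

spe.kmeans$centers                        # centroid of each of the 4 groups
spe.kmeans$withinss                       # within-group sum of squares per group
spe.kmeans$betweenss / spe.kmeans$totss   # proportion of total variance explained (about 0.667)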

3- Choosing the number of clusters and model validation

We use the following criteria:

  • Calinski-Harabasz index
  • Simple structure index (SSI)
  • Sum of squared errors (SSE)
  • Silhouette plot

3.1. Calinski method + SSI + SSE

spe.KM.cascade <- cascadeKM(spe.norm, inf.gr = 2, sup.gr = 10, iter = 100, criterion = "ssi")
spe.KM.cascade$results 
##      2 groups  3 groups  4 groups  5 groups  6 groups  7 groups   8 groups
## SSE 8.2149405 6.4768108 5.0719796 4.3015573 3.5856120 2.9523667 2.48405487
## ssi 0.1312111 0.1675852 0.1244603 0.1149501 0.1281785 0.1306085 0.07275788
##     9 groups  10 groups
## SSE 2.052189 1.75992916
## ssi 0.126707 0.07094594
plot(spe.KM.cascade, sortg = TRUE) 

The statistics alone do not settle the question. According to the SSE criterion the best number of clusters would be 2, while according to the SSI criterion it would be 3.
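
As an optional shortcut (not in the original script), the SSI-optimal partition can also be read off programmatically from the results table printed above:

which.max(spe.KM.cascade$results["ssi", ])   # returns the "3 groups" column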

3.2. Silhouette plot

Let us draw the silhouette plot for 3 groups.

spe.kmeans <- kmeans(spe.norm, centers = 3, nstart = 100)
dissE <- daisy(spe.norm)                      # Euclidean dissimilarities between sites
sk <- silhouette(spe.kmeans$cluster, dissE)   # silhouette widths for the 3-group solution
plot(sk)

4- Final plot and interpretation

Now we compare this classification with the one obtained from the previous dendrogram.

spebc.ward.g <- cutree(spe.ch.ward,k = 4)
table(spe.kmeans$cluster, spebc.ward.g)
##    spebc.ward.g
##      1  2  3  4
##   1  0  6  0  0
##   2 11  1  0  0
##   3  0  0  8  3

Apart from Ward groups 3 and 4 being merged into a single k-means cluster, the two classifications differ for only one site.

clusplot(spe.norm, spe.kmeans$cluster, color = TRUE, shade = TRUE, 
         labels = 2, lines = 0)

There is some overlap between the groups, but the plot explains up to 61.75% of the point variability, which is a good percentage.
