Data Analysis: Principal Component Analysis Workflow (R)

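For the theory behind principal component analysis, see: http://blog.sina.com.cn/s/blog_14154cb430102xjcc.html
Principal component analysis (PCA) is a dimensionality reduction technique that condenses many variables into a few principal components which retain most of the information in the original variables.
The workflow consists of the following steps:
1. Data preprocessing: keep numeric variables and remove missing values.
2. Compute the principal components.
3. Decide how many principal components to keep.
4. Select and interpret the principal components.
5. Compute the principal component scores.
6. Visualize the results.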

Detailed workflow
1. Data preprocessing

# Load packages and data
> library(ggplot2)  # ggplot2 for plotting
> data("mtcars")   # use R's built-in mtcars dataset
> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
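
mtcars is already fully numeric and contains no missing values, so no cleaning is needed here. For other data, the preprocessing described in step 1 might look like the following minimal sketch. The names car.num and car.clean are illustrative and do not appear in the original workflow.

```r
data("mtcars")

# Keep only the numeric columns (PCA needs numeric input)
num_cols <- vapply(mtcars, is.numeric, logical(1))
car.num <- mtcars[, num_cols, drop = FALSE]

# Drop every row that contains at least one missing value
car.clean <- na.omit(car.num)

dim(car.clean)  # rows and columns that survive the cleaning
```

Dropping incomplete rows is only one option; imputing the missing values is another, but it is outside the scope of this walkthrough.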

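2. Computing the principal components
Base R ships with two PCA functions, princomp and prcomp. princomp works from an eigendecomposition of the covariance (or correlation) matrix, while prcomp uses a singular value decomposition of the centered data, and their result objects are laid out differently, so both are shown side by side. Note that the two calls below are not equivalent: princomp(..., cor=TRUE) analyzes the correlation matrix (standardized variables), whereas prcomp(mtcars) centers the data but does not scale it by default, so its results are dominated by the variables with the largest variances (disp and hp).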

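# Compute principal components - princomp (cor=TRUE: use the correlation matrix)
car.pr1 <- princomp(mtcars,cor=TRUE)
# Compute principal components - prcomp (centered, but not scaled by default)
car.pr2 <- prcomp(mtcars)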
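
The two calls above differ in more than the algorithm: princomp(mtcars, cor=TRUE) standardizes the variables, while prcomp(mtcars) does not. As a hedged sketch, asking prcomp to scale the variables should make the two functions agree on the component standard deviations. The object name car.pr2s is an assumption introduced here for illustration.

```r
data("mtcars")

car.pr1  <- princomp(mtcars, cor = TRUE)    # correlation-matrix PCA
car.pr2s <- prcomp(mtcars, scale. = TRUE)   # hypothetical name: prcomp on standardized variables

# Both should now report the same standard deviations per component
same_sdev <- all.equal(unname(car.pr1$sdev), car.pr2s$sdev)
same_sdev
```

The scores of the two objects can still differ in sign and by a constant factor, because princomp divides by n and prcomp by n - 1 when computing variances.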

3. Deciding how many principal components to keep

# Scree plot - princomp
screeplot(car.pr1,type="lines")


# Scree plot - prcomp
screeplot(car.pr2,type="lines")


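## Use summary() to inspect the variance explained by each component
# Standard deviation: standard deviation of each component
# Proportion of Variance: share of variance explained by each component
# Cumulative Proportion: cumulative share of variance explained
# Variance explained - princomp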
> summary(car.pr1)
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6
Standard deviation     2.5706809 1.6280258 0.79195787 0.51922773 0.47270615 0.45999578
Proportion of Variance 0.6007637 0.2409516 0.05701793 0.02450886 0.02031374 0.01923601
Cumulative Proportion  0.6007637 0.8417153 0.89873322 0.92324208 0.94355581 0.96279183
                           Comp.7     Comp.8      Comp.9     Comp.10     Comp.11
Standard deviation     0.36777981 0.35057301 0.277572792 0.228112781 0.148473587
Proportion of Variance 0.01229654 0.01117286 0.007004241 0.004730495 0.002004037
Cumulative Proportion  0.97508837 0.98626123 0.993265468 0.997995963 1.000000000

# Variance explained - prcomp
> summary(car.pr2)
Importance of components:
                           PC1      PC2     PC3     PC4     PC5     PC6    PC7   PC8    PC9
Standard deviation     136.533 38.14808 3.07102 1.30665 0.90649 0.66354 0.3086 0.286 0.2507
Proportion of Variance   0.927  0.07237 0.00047 0.00008 0.00004 0.00002 0.0000 0.000 0.0000
Cumulative Proportion    0.927  0.99937 0.99984 0.99992 0.99996 0.99998 1.0000 1.000 1.0000
                         PC10   PC11
Standard deviation     0.2107 0.1984
Proportion of Variance 0.0000 0.0000
Cumulative Proportion  1.0000 1.0000
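
The choice of how many components to keep can also be made programmatically. The sketch below applies two common heuristics to the princomp result: a cumulative-variance threshold and the eigenvalue-greater-than-one rule. The 80% threshold and the variable names k_cum and k_kaiser are assumptions made for illustration, not part of the original workflow.

```r
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

eig <- car.pr1$sdev^2                  # eigenvalues = squared standard deviations
cum_prop <- cumsum(eig) / sum(eig)     # cumulative proportion of variance

threshold <- 0.80                      # assumed cut-off, adjust as needed
k_cum <- unname(which(cum_prop >= threshold)[1])  # smallest k reaching the threshold

k_kaiser <- sum(eig > 1)               # components whose eigenvalue exceeds 1

c(k_cum = k_cum, k_kaiser = k_kaiser)
```

The eigenvalue rule only makes sense for a correlation-matrix PCA, where each standardized variable contributes a variance of 1.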

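Keep the first two principal components: together they account for about 84% of the variance with princomp (correlation matrix) and more than 99.9% with prcomp (unscaled data).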

# Extract the proportion of variance - princomp (from the eigenvalues of the correlation matrix)
> car.pv1 <- eigen(cor(mtcars))$values
> car.pv1 <- car.pv1/sum(car.pv1)
> car.pv1[1:2] # show the first two
[1] 0.6007637 0.2409516

# Extract the proportion of variance - prcomp; it can be taken directly from summary()
> car.pv2 <- summary(car.pr2)$importance
> car.pv2[2,1:2] # show the first two
    PC1     PC2 
0.92700 0.07237 

4. Selecting and interpreting the principal components (loadings matrix)

# Loadings matrix - princomp
> car.pr1$loadings[,1:2]
         Comp.1      Comp.2
mpg   0.3625305  0.01612440
cyl  -0.3739160  0.04374371
disp -0.3681852 -0.04932413
hp   -0.3300569  0.24878402
drat  0.2941514  0.27469408
wt   -0.3461033 -0.14303825
qsec  0.2004563 -0.46337482
vs    0.3065113 -0.23164699
am    0.2349429  0.42941765
gear  0.2069162  0.46234863
carb -0.2140177  0.41357106


# Loadings matrix - prcomp
> car.pr2$rotation[,1:2]
              PC1          PC2
mpg  -0.038118199  0.009184847
cyl   0.012035150 -0.003372487
disp  0.899568146  0.435372320
hp    0.434784387 -0.899307303
drat -0.002660077 -0.003900205
wt    0.006239405  0.004861023
qsec -0.006671270  0.025011743
vs   -0.002729474  0.002198425
am   -0.001962644 -0.005793760
gear -0.002604768 -0.011272462
carb  0.005766010 -0.027779208
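
To help read the loadings, the sketch below ranks the variables by absolute loading on each princomp component and converts the loadings into variable-component correlations. It assumes the correlation-matrix PCA from above; the names ld, top1, top2 and var_cor are illustrative.

```r
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

# Plain numeric matrix holding the first two eigenvectors
ld <- unclass(car.pr1$loadings)[, 1:2]

# Three variables with the largest absolute loading on each component
top1 <- names(sort(abs(ld[, "Comp.1"]), decreasing = TRUE))[1:3]
top2 <- names(sort(abs(ld[, "Comp.2"]), decreasing = TRUE))[1:3]

# For a correlation-matrix PCA, the correlation between a variable and a
# component is the loading times that component's standard deviation
var_cor <- sweep(ld, 2, car.pr1$sdev[1:2], FUN = "*")

top1
top2
round(var_cor, 2)
```

Signs of a whole column may flip between R versions or platforms, so the interpretation should rest on relative signs and magnitudes within a component.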

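5. Computing principal component scores

# Scores - princomp: either take the scores element of the result or use predict()
> car.pca1 <- car.pr1$scores[,1:2]  # take the first two columns of scores from the PCA result
> car.pca1 <- predict(car.pr1)[,1:2] # equivalent: predict() returns the same scores; first two columns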
> car.pca1
                           Comp.1     Comp.2
Mazda RX4            0.6572132031  1.7354457
Mazda RX4 Wag        0.6293955058  1.5500334
Datsun 710           2.7793970426 -0.1464566
Hornet 4 Drive       0.3117707086 -2.3630190
Hornet Sportabout   -1.9744889419 -0.7544022
Valiant              0.0561375337 -2.7859996
Duster 360          -3.0026742880  0.3348874
Merc 240D            2.0553287289 -1.4651808
Merc 230             2.2874083842 -1.9835265
Merc 280             0.5263812077 -0.1620126
Merc 280C            0.5092054932 -0.3238945
Merc 450SE          -2.2478104359 -0.6834740
Merc 450SL          -2.0478227622 -0.6832207
Merc 450SLC         -2.1485421615 -0.8017395
Cadillac Fleetwood  -3.8997903717 -0.8279481
Lincoln Continental -3.9541231097 -0.7333815
Chrysler Imperial   -3.5929719882 -0.4211349
Fiat 128             3.8562837567 -0.2967519
Honda Civic          4.2540325032  0.6884140
Toyota Corolla       4.2342207436 -0.2792875
Toyota Corona        1.9041678566 -2.1198383
Dodge Challenger    -2.1848507430 -1.0142171
AMC Javelin         -1.8633834347 -0.9064645
Camaro Z28          -2.8889945733  0.6808260
Pontiac Firebird    -2.2459189274 -0.8738121
Fiat X1-9            3.5739682964 -0.1212038
Porsche 914-2        2.6512550541  2.0463709
Lotus Europa         3.3857059882  1.3785993
Ford Pantera L      -1.3729574238  3.4999996
Ferrari Dino         0.0009899207  3.2190722
Maserati Bora       -2.6691258658  4.3796772
Volvo 142E           2.4205931001  0.2336399

# Scores - prcomp: the scores are stored in car.pr2$x; predict() returns the same matrix
> car.pca2 <- predict(car.pr2)[,1:2]
> car.pca2
                            PC1         PC2
Mazda RX4            -79.596425    2.132241
Mazda RX4 Wag        -79.598570    2.147487
Datsun 710          -133.894096   -5.057570
Hornet 4 Drive         8.516559   44.985630
Hornet Sportabout    128.686342   30.817402
Valiant              -23.220146   35.106518
Duster 360           159.309025  -32.259197
Merc 240D           -112.615805   39.702195
Merc 230            -103.534591    7.513104
Merc 280             -67.046877   -6.208536
Merc 280C            -66.997514   -6.206387
Merc 450SE            55.211672  -10.373509
Merc 450SL            55.173910  -10.361893
Merc 450SLC           55.251602  -10.370934
Cadillac Fleetwood   242.814893   52.501758
Lincoln Continental  236.369886   38.280788
Chrysler Imperial    224.737944   16.111941
Fiat 128            -172.363654    6.575522
Honda Civic         -181.066911   17.783639
Toyota Corolla      -179.697852    4.188212
Toyota Corona       -121.224099   -3.345362
Dodge Challenger      80.159386   34.983214
AMC Javelin           67.572431   28.894067
Camaro Z28           150.354631  -36.633575
Pontiac Firebird     164.652522   48.239880
Fiat X1-9           -171.897231    6.643746
Porsche 914-2       -123.804988    2.033356
Lotus Europa        -137.082789  -28.675647
Ford Pantera L       159.413222  -53.318347
Ferrari Dino         -64.762396  -62.954280
Maserati Bora        145.361703 -139.049149
Volvo 142E          -115.181783  -13.826313
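
The scores are simply the centered data projected onto the rotation matrix. The following sketch reproduces the prcomp scores by hand and checks them against the values stored in the result; centered and manual_scores are illustrative names.

```r
data("mtcars")
car.pr2 <- prcomp(mtcars)

# Center the data with the means stored in the prcomp object (no scaling was used)
centered <- scale(mtcars, center = car.pr2$center, scale = FALSE)

# Project the centered data onto the principal axes
manual_scores <- centered %*% car.pr2$rotation

# Compare with the stored scores and with predict()
ok_manual  <- all.equal(manual_scores, car.pr2$x, check.attributes = FALSE)
ok_predict <- all.equal(predict(car.pr2), car.pr2$x)
c(ok_manual, ok_predict)
```

The same projection with newdata passed to predict() is how new observations would be placed in the existing component space.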

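6. Visualizing the results

# Assemble the plotting data
set.seed(123) # arbitrary seed so that the random grouping is reproducible
type <- sample(1:5,nrow(mtcars),replace = TRUE) # mtcars has no grouping variable, so we randomly assign 5 groups
car.pdata1 <- data.frame(name=rownames(car.pca1),car.pca1)
car.pdata1$type <- factor(type)
car.pdata2 <- data.frame(name=rownames(car.pca2),car.pca2)
car.pdata2$type <- factor(type)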
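
Because the groups are drawn at random, some may end up with very few points, and an ellipse cannot be estimated for such groups. A quick check of the group sizes before plotting, as a sketch: the seed and the minimum of 4 points per group are assumptions, and the names group_sizes and small_groups are illustrative.

```r
data("mtcars")

set.seed(123)  # arbitrary seed, assumed here for reproducibility
type <- sample(1:5, nrow(mtcars), replace = TRUE)

# Count the observations in each of the 5 groups (empty groups show as 0)
group_sizes <- table(factor(type, levels = 1:5))

min_points <- 4  # assumed minimum needed for a meaningful ellipse
small_groups <- names(group_sizes)[group_sizes < min_points]

group_sizes
small_groups   # groups that may not get an ellipse
```

If small_groups is not empty, using fewer groups or a different seed avoids missing ellipses in the plots.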

Plot the principal components with per-group confidence ellipses - princomp

pca_plot1 <- ggplot(car.pdata1, aes(Comp.1, Comp.2 ,color = type,shape=type)) + 
  geom_point(size=2)+ 
  # confidence ellipse for each group
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
  theme(legend.title=element_blank())+ # no legend title
  # axis labels show the percentage of variance explained by each component
  labs(x= paste0("Comp.1(", round(car.pv1[1]*100,2), "%)"),
       y= paste0("Comp.2(", round(car.pv1[2]*100,2), "%)"),title = "Individuals-PCA1")
pca_plot1  

Plot the principal components with per-group confidence ellipses - prcomp

pca_plot2 <- ggplot(car.pdata2, aes(PC1, PC2 ,color = type,shape=type)) + 
  geom_point(size=2)+ 
  # confidence ellipse for each group
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
  theme(legend.title=element_blank())+ # no legend title
  # axis labels show the percentage of variance explained by each component
  labs(x= paste0("PC1(", round(car.pv2[2,1]*100,2), "%)"),
       y= paste0("PC2(", round(car.pv2[2,2]*100,2), "%)"),title = "Individuals-PCA2")
pca_plot2 

