A First Look at R: Implementing PCA


A Recap of PCA

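In an earlier post (老嫗能解PCA) I wrote down some of my own views on PCA. Today I will try to implement PCA in R. As a reminder of what PCA is: it analyzes the correlations among the features to find the principal components, picks a certain number of eigenvectors as a new basis, and takes the projection of each sample onto the space spanned by that basis as the new sample values. That is how the dimensionality gets reduced.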
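The recap above can be walked through by hand in a few lines of R. This is only a minimal sketch on synthetic data: the matrix X, its column names, and the choice of k = 2 are assumptions made for illustration and have nothing to do with the gene data used later.

```r
# Synthetic stand-in data (an assumption): 20 samples, 3 features,
# where feature b is strongly correlated with feature a.
set.seed(1)
n  <- 20
x1 <- rnorm(n)
X  <- cbind(a = x1,
            b = 0.8 * x1 + rnorm(n, sd = 0.3),
            c = rnorm(n))

# 1. Center each feature by subtracting its column mean.
Xc <- scale(X, center = TRUE, scale = FALSE)

# 2. Covariance matrix of the features.
S <- cov(Xc)

# 3. Eigendecomposition: the eigenvectors form the new basis and the
#    eigenvalues are the variances captured along each basis vector.
e <- eigen(S, symmetric = TRUE)

# 4. Keep the first k eigenvectors and project the samples onto them.
k <- 2
W <- e$vectors[, 1:k]
Z <- Xc %*% W            # 20 x 2 matrix of reduced samples

# Share of the total variance retained by the k components.
retained <- sum(e$values[1:k]) / sum(e$values)
```

The variance of each column of Z equals the corresponding eigenvalue, which is why the eigenvalues can be used to decide how many components to keep.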

Data Description

This time the data is real. It is described as follows:

The human body consists of about 70 trillion cells, each of which has DNA molecules called the genome (Figure 1). Here, the genome is only a storage unit for genetic information, which needs to be partially copied into a smaller unit called RNA for actual use (Figure 1). Each RNA molecule is much smaller than the genome and contains the information of only a single gene, while the genome holds the genetic information of every gene. The process of partially copying the genome into RNA is called transcription (Figure 1). After RNAs are transcribed from the genome, they are subsequently converted into polypeptides (or proteins), which are the actual machinery running cellular processes. This conversion is called translation to distinguish it from transcription (Figure 1).

Figure 1. Description of transcription process

A cell needs to transcribe each gene only when it is needed, so each gene needs to be selectively transcribed. Whether a gene is transcribed or not is controlled by a unit called a transcription factor, which is itself a protein. In general, multiple transcription factors are needed to transcribe a single gene, and these transcription factors are combined into a single protein complex along with other mediator proteins. By bending the DNA molecule into a U-shaped structure, the transcription factor complex exerts a physical force that moves the copying machinery forward in the transcription process (Figure 2). Therefore, the quantity of each gene's transcript is controlled by the quantity of the corresponding transcription factors.

Figure 2. Description of transcription factor's action

As described, the biology of the transcription process is well studied and the mechanism is quite straightforward. So the only remaining problem is determining which factors control which genes. Extracting this information with a regression model is a famous problem in bioinformatics. We provide you a genome-wide profile of RNA quantity to build a regression model. The data we provide is RNA quantity data extracted from 20 different people. Each person has a single target gene (TG) as the dependent variable and nine transcription factors (TF) as input variables.

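Put simply, this describes the process of DNA transcription. Biology really was my nightmare in high school, and so long after graduating, what little I did learn has been handed back to my teachers, which is honestly embarrassing... According to the description, transcription does not copy the DNA in full. It is selective, and the selection follows a mechanism that can be learned. The target gene (TG) is controlled by nine transcription factors (TF), and the sample consists of measurements taken from 20 people.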

Implementing PCA in R

Reading the Sample Data

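> gene_table <- read.table("GeneExpr.Table.txt", sep = "\t", header = TRUE)
> gene_table

GeneExpr.Table.txt is our sample file. The "header = TRUE" argument tells read.table to use the first line as the column names. With "header = FALSE" the first line is treated as data like every other line. (The shorter spellings "head = T" and "head = F" also work, because R matches partial argument names and T and F are aliases for TRUE and FALSE, but the full forms are safer.)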
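To see the effect of the header argument without the real file, here is a small sketch that writes a hypothetical two-column, tab-separated file to a temporary location and reads it back both ways. The file contents and the names TF1 and TF2 are made up for illustration.

```r
# Hypothetical tab-separated file: one header line, two data lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("TF1\tTF2", "0.5\t1.2", "0.7\t0.9"), tmp)

with_header    <- read.table(tmp, sep = "\t", header = TRUE)
without_header <- read.table(tmp, sep = "\t", header = FALSE)

nrow(with_header)       # 2: the first line became the column names
nrow(without_header)    # 3: the first line is kept as a data row
names(with_header)      # "TF1" "TF2"
names(without_header)   # "V1" "V2" (default names)

# Because the header text is mixed in with the numbers, the columns
# read with header = FALSE are no longer numeric.
is.numeric(without_header$V1)

unlink(tmp)             # remove the temporary file
```

Reading a file that has a header line with header = FALSE is an easy mistake to make, and the non-numeric columns it produces will make the PCA functions fail later on.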

Principal Component Analysis

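R actually has two commonly used functions for PCA, "princomp" and "prcomp". This is because PCA is generally implemented in one of two ways: by eigendecomposition of the correlation or covariance matrix (princomp), or by singular value decomposition, SVD (prcomp). In practice the two give results that do not differ much. One point deserves a reminder: prcomp has a "scale." argument that controls whether each variable is rescaled to unit variance before the analysis. It defaults to FALSE, while the separate "center" argument, which subtracts each variable's mean, defaults to TRUE. With both turned on, every value has its variable's mean subtracted and is then divided by that variable's standard deviation.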

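princomp, for its part, has a "cor" argument that decides whether the calculation uses the correlation matrix or the covariance matrix. As discussed in the PCA post, the sample matrix X has its mean subtracted before the eigenvalues are computed, so care is needed to make the two functions' arguments correspond.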
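The correspondence between the two routes can be checked numerically. The sketch below uses a synthetic 20 x 4 matrix (an assumption, not the gene data) and shows that eigendecomposition of the correlation matrix, SVD of the standardized data, and prcomp with scaling switched on all yield the same component standard deviations.

```r
# Synthetic stand-in data (an assumption): 20 samples, 4 variables.
set.seed(2)
X <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, paste0("TF", 1:4)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]    # make two columns correlated

# Standardize: subtract column means, divide by column standard deviations.
Xs <- scale(X, center = TRUE, scale = TRUE)

# Route 1: eigendecomposition of the correlation matrix.
eig        <- eigen(cor(X), symmetric = TRUE)
sdev_eigen <- sqrt(eig$values)

# Route 2: singular value decomposition of the standardized data.
sv       <- svd(Xs)
sdev_svd <- sv$d / sqrt(nrow(X) - 1)

# prcomp() with center = TRUE (its default) and scale. = TRUE.
sdev_prcomp <- prcomp(X, scale. = TRUE)$sdev

all.equal(sdev_eigen, sdev_svd)      # TRUE
all.equal(sdev_svd, sdev_prcomp)     # TRUE
```

The division by sqrt(n - 1) converts singular values into standard deviations, since the correlation matrix equals the cross-product of the standardized data divided by n - 1.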

> pca_princomp <- princomp(gene_table[, -10], cor = TRUE)
> pca_prcomp <- prcomp(gene_table[, -10], scale. = TRUE)

Here cor is TRUE in princomp, which means the correlation matrix is used, so prcomp has to apply scaling as well.

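At this point we find that the principal components from the two functions are similar and differ only in direction (the sign of a component is arbitrary). If we change the argument in just one of the two functions, however, the results diverge. For example, setting cor to FALSE in princomp means the covariance matrix is used instead, and we get a different result.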
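Since the gene table is not reproduced here, the following sketch illustrates the same comparison on synthetic data (an assumption). One column is deliberately put on a much larger scale so that the effect of switching from correlation to covariance is easy to see.

```r
# Synthetic stand-in data (an assumption): 20 samples, 4 variables.
set.seed(3)
X <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, paste0("TF", 1:4)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]    # two correlated columns
X[, 3] <- 10 * X[, 3]              # one variable on a much larger scale

p1 <- princomp(X, cor = TRUE)
p2 <- prcomp(X, scale. = TRUE)

# Matching arguments: the component standard deviations agree.
all.equal(unname(p1$sdev), p2$sdev)

# The directions agree up to sign, so compare absolute loadings.
all.equal(abs(unclass(p1$loadings)), abs(p2$rotation),
          check.attributes = FALSE)

# Mismatched arguments: with the covariance matrix, the large-scale
# variable TF3 dominates the first component.
p3 <- princomp(X, cor = FALSE)
round(abs(unclass(p3$loadings))[, 1], 2)
```

Note that even with matching arguments the scores of the two functions differ by a constant factor, because princomp divides by n when estimating variances while prcomp divides by n - 1.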

Now suppose "scale." and "cor" are both set to FALSE. We use the statement:

> biplot(pca_prcomp, col = c("black", "white"))

to look at the result in the first two dimensions.

Obtaining the Reduced Data

Having the principal components is not enough. We also need the data after dimensionality reduction.

> pca_data <- predict(pca_prcomp)

The predict function returns the data expressed in the space of the principal components after PCA.
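What predict returns can be verified by hand. The sketch below, on synthetic data (an assumption), shows that calling predict without new data gives the scores stored in the x element of the fit, and that those scores are just the centered and scaled data multiplied by the rotation matrix.

```r
# Synthetic stand-in data (an assumption): 20 samples, 4 variables.
set.seed(4)
X <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, paste0("TF", 1:4)))
pca <- prcomp(X, scale. = TRUE)

# predict() without newdata returns the scores stored in the fit.
scores_predict <- predict(pca)
scores_stored  <- pca$x

# The same scores computed by hand: standardize with the centers and
# scales kept in the fit, then project onto the rotation matrix.
scores_manual <- scale(X, center = pca$center, scale = pca$scale) %*%
  pca$rotation

all.equal(scores_predict, scores_stored)                            # TRUE
all.equal(scores_manual, scores_stored, check.attributes = FALSE)   # TRUE
```

The manual form is also what predict does when it is given new samples through its newdata argument: it applies the stored centering and scaling before projecting.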

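Compared with gene_table above, we can see that the values have changed.

If we want to reduce the data to 2 dimensions, we only need to take the first two columns: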

> plot(pca_data[,1:2])

The result looks similar to the earlier plot.
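Before settling on two dimensions it helps to check how much of the variance the first two components actually keep. This is a minimal sketch on synthetic data with nine variables (an assumption standing in for the nine transcription factors). The numbers it prints say nothing about the real gene data.

```r
# Synthetic stand-in data (an assumption): 20 samples, 9 variables.
set.seed(5)
X <- matrix(rnorm(20 * 9), nrow = 20,
            dimnames = list(NULL, paste0("TF", 1:9)))
pca <- prcomp(X, scale. = TRUE)

# Proportion of the total variance captured by each component,
# and the running total over the first k components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# The reduced data: keep only the first two score columns.
reduced <- pca$x[, 1:2]

cum_explained[2]   # share of variance retained in two dimensions
```

If the cumulative share at two components is low, the two-dimensional plot hides much of the structure, and keeping more columns of the scores may be worthwhile.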
