Analyzing 450K Methylation Array Data with the ChAMP Package (One-Stop Workflow)

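Back when we ran our study series on methylation arrays, see:

Portal to 450K methylation array data processing

we received an excellent one-stop tutorial submission. Through it I got to know the talented 六六 (Liuliu), as well as the author of the R package that the tutorial strongly recommended, see:

Analysis of 850K methylation array data

The title of that tutorial, however, did not draw much attention to the R package. As it happens, a new member of the 技能樹 (Skill Tree) alliance has also written up their own experience with it. For an introduction to the member, see:

My story with 生信技能樹 (Bioinformatics Skill Tree)

So let's study these excellent summary notes together!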

ChAMP PACKAGE

▛ A package for analyzing Illumina methylation data (EPIC and 450k) ▟

⊙ Data import from different formats

| .idat files

| a beta-valued matrix

⊙ Quality Control plots

⊙ Correction methods for Type-2 probes

| SWAN

| Peak Based Correction (PBC)

| BMIQ (the default choice)

⊙ Functional Normalization | from the minfi package

⊙ Detecting batch effects | the singular value decomposition (SVD) method; for correcting multiple batch effects, the ComBat method

⊙ Correcting for cell-type heterogeneity | RefbaseEWAS

⊙ Inferring copy number variation (CNV)

⊙ Differentially Methylated Regions (DMR)

| Lasso method

| Bumphunter

| DMRcate

⊙ Finding Differentially Methylated Blocks

⊙ Gene Set Enrichment Analysis (GSEA)

⊙ Inferring gene modules in user-specified gene networks that exhibit differential methylation between phenotypes (integrates the FEM package)

⊙ Other packages for analyzing methylation data

| IMA

| minfi

| methylumi

| RnBeads

| wateRmelon

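1. Installing the ChAMP package

# Install ChAMP from Bioconductor. biocLite is the installer for R < 3.5;
# on R >= 3.5 use BiocManager::install("ChAMP") instead.
source("https://bioconductor.org/biocLite.R")
biocLite("ChAMP")

# If some dependencies were not installed automatically, install them explicitly.
biocLite(c("minfi","ChAMPdata","Illumina450ProbeVariants.db","sva","IlluminaHumanMethylation450kmanifest","limma"))

# If loading still reports a missing package, install it by name
# ("YourErrorPackage" is a placeholder for the package named in the error).
biocLite("YourErrorPackage")

library("ChAMP")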


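If loading fails with an error like this:
Error: package or namespace load failed for 'ChAMP' in inDL(x, as.logical(local), as.logical(now), ...):
unable to load shared object 'D:/work/R-3.4.3/library/mvtnorm/libs/x64/mvtnorm.dll':
maximal number of DLLs reached...

Solution:
Set the environment variable R_MAX_NUM_DLLS. On any operating system, environment variables for R can be set in a .Renviron file.

At startup R looks for this file in the current working directory and then in the user's home directory, so save it in one of those places. The file needs only one line:

R_MAX_NUM_DLLS=500

Here 500 is the maximum number of DLLs that may be loaded. After saving the file, restart R, then enter the following command:

normalizePath("d:/Documents/.Renviron", mustWork = FALSE)

The first argument is the real path of the .Renviron file. Note that normalizePath() only resolves and prints the path so that you can check it; the setting itself takes effect when R reads the file at startup. After that, loading ChAMP again should work.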
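To confirm that the setting took effect after the restart, you can compare the environment variable with the number of DLLs currently loaded. This is a minimal sketch in base R; the variable names max_dlls and n_loaded are only illustrative.

```r
# Value of R_MAX_NUM_DLLS as seen by this R session (NA if it was never set)
max_dlls <- Sys.getenv("R_MAX_NUM_DLLS", unset = NA)

# Number of DLLs loaded right now; loading ChAMP and its dependencies adds many more
n_loaded <- length(getLoadedDLLs())

cat("R_MAX_NUM_DLLS:", max_dlls, "\n")
cat("DLLs loaded now:", n_loaded, "\n")
```

If the first line prints NA, R did not pick up the .Renviron file, which usually means the file is not in the working directory or the home directory.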

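2. Running the pipeline on the test data

The test data consist of two data sets: 450k (.idat files) and 850k (simulated EPIC data).

# Locate the 450K example .idat files shipped with ChAMPdata and load them
testDir <- system.file("extdata", package="ChAMPdata")
myLoad <- champ.load(testDir, arraytype="450K")

# Load the simulated EPIC (850k) data set
data(EPICSimData)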

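3. ChAMP Pipeline

- Glowing green lines mark the main analysis steps

- Grey lines are optional steps

- Black dots represent prepared methylation data

- Blue marks the preparation steps: Loading, Normalization, Quality Control checks

- Red marks the steps that produce analysis results: Differentially Methylated Positions (DMPs), Differentially Methylated Regions (DMRs), Differentially Methylated Blocks, EpiMod (a method for detecting differentially methylated gene modules, derived from the FEM package), Pathway Enrichment Results, etc.

- Yellow marks interactive GUI plotting

450K pipeline

Full Pipeline

Runs everything in one step, but may fail with errors:

champ.process(directory = testDir)

Step by step: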

myLoad <- champ.load(testDir)
# Or split the step above into champ.import(testDir) + champ.filter()
CpG.GUI()
champ.QC() # Alternatively: QC.GUI()
myNorm <- champ.norm()
champ.SVD()
# If Batch detected, run champ.runCombat() here.
myDMP <- champ.DMP()
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myBlock <- champ.Block()
Block.GUI()
myGSEA <- champ.GSEA()
myEpiMod <- champ.EpiMod()
myCNA <- champ.CNA()

# If DataSet is Blood samples, run champ.refbase() here.
myRefbase <- champ.refbase()

EPIC pipeline

data(EPICSimData)
CpG.GUI(arraytype="EPIC")
champ.QC() 
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()

myDMP <- champ.DMP(arraytype="EPIC")
DMP.GUI()
myDMR <- champ.DMR(arraytype="EPIC")
DMR.GUI(arraytype="EPIC")
myBlock <- champ.Block(arraytype="EPIC")
Block.GUI(arraytype="EPIC") 
myGSEA <- champ.GSEA(arraytype="EPIC")
myEpiMod <- champ.EpiMod(arraytype="EPIC")

A computer with 8 GB of memory can process about 200 samples at most. To run on multiple cores on a server, use:

library("doParallel")
detectCores()

ChAMP pipeline

1. Loading Data

The .idat files are the raw array files. They come with a pd file (Sample_Sheet.csv) that holds the phenotypes, sample IDs, and so on.

myLoad$pd

2. Filtering Data

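ChAMP provides the champ.filter() function, which accepts data in any of the formats (beta, M, Meth, UnMeth, intensity) and applies quality-control filtering. In recent versions of ChAMP, champ.load() already includes this functionality.
champ.filter() has an autoimpute parameter that controls whether the NA values left by filtering are imputed or kept.
If several matrices are passed in for filtering, their row names and column names must match. Otherwise champ.filter() assumes they come from different sources and stops.
Low-quality samples (those in which many probes have no signal) are filtered out, so the Sample_Name column in the pd file must match the column names of the data matrices.
Imputation requires the detection P matrix together with a beta or M matrix, and ProbeCutoff must not be 0. This parameter sets the NA ratio per probe that decides whether a probe is imputed.
To filter on bead counts, note that champ.import() returns the bead information. Usage: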

myImport <- champ.import(testDir)
myLoad <- champ.filter()

Section 1: Read PD Files Start: Reading CSV File
Section 2: Read IDAT files Start: Extract Mean value for Green and Red Channel Success
Your Red Green Channel contains 622399 probes.
Section 3: Use Annotation Start: Reading 450K Annotation, there are 613 control probes in Annotation, Generating Meth and UnMeth Matrix, 485512 Meth probes
Generating beta Matrix
Generating M Matrix
Generating intensity Matrix
Calculating Detect P value
Counting Beads

You may want to process champ.filter() next. This function is provided for user need to do filtering on some beta (or M) matrix, which contained most filtering system in champ.load except beadcount.

Section 1: Check Input Start: You have inputed beta,intensity for Analysis.
Checking Finished :filterDetP,filterBeads,filterMultiHit,filterSNPs,filterNoCG,filterXY would be done on beta,intensity.
You also provided :detP,beadcount .

Section 2: Filtering Start
The fraction of failed positions per sample
Failed CpG Fraction.
C1 0.0013429122
C2 0.0022162171
C3 0.0003563249
C4 0.0002842360
T1 0.0003831007
T2 0.0011946152
T3 0.0014953286
T4 0.0015447610
Filtering probes with a detection p-value above 0.01.
Removing 2728 probes.

Filtering BeadCount Start
Filtering probes with a beadcount <3 in at least 5% of samples.
Removing 9291 probes

Filtering NoCG Start
Only Keep CpGs, removing 2959 probes from the analysis.

Filtering SNPs Start
Using general 450K SNP list for filtering.
Filtering probes with SNPs as identified in Zhou's Nucleic Acids Research Paper 2016.
Removing 49231 probes from the analysis.

Filtering MultiHit Start
Filtering probes that align to multiple locations as identified in Nordlund et al
Removing 7003 probes from the analysis.

Filtering XY Start
Filtering probes located on X,Y chromosome, removing 9917 probes from the analysis.

Updating PD file

Fixing Outliers Start
Replacing all value smaller/equal to 0 with smallest positive value.
Replacing all value greater/equal to 1 with largest value below 1..
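The last two log lines describe how out-of-range beta values are pulled back inside the open interval (0, 1). The idea can be sketched in base R as follows. The helper fix_outliers and the toy matrix are assumptions made for illustration and are not ChAMP's own code.

```r
# Toy beta matrix with out-of-range values (rows = probes, columns = samples)
beta <- matrix(c(0, 0.2, 0.5, 1, 0.8, -0.01), nrow = 3,
               dimnames = list(paste0("cg", 1:3), c("S1", "S2")))

fix_outliers <- function(beta) {
  in_range <- beta > 0 & beta < 1
  lo <- min(beta[in_range], na.rm = TRUE)  # smallest positive value
  hi <- max(beta[in_range], na.rm = TRUE)  # largest value below 1
  beta[!is.na(beta) & beta <= 0] <- lo     # replace values <= 0
  beta[!is.na(beta) & beta >= 1] <- hi     # replace values >= 1
  beta
}

fixed <- fix_outliers(beta)
print(fixed)
```

Keeping every value strictly between 0 and 1 matters later, because the logit transform used for M-values is undefined at exactly 0 and 1.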

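The filtering steps are:

Detection p-value (threshold 0.01). This value is stored in the .idat files; champ.import() reads it and builds a data frame. Probes with a detection p-value above 0.01 are considered failed. Filtering runs in two stages: samples whose fraction of failed probes exceeds the threshold (0.1) are removed first, and probes are then filtered across the remaining samples. The SampleCutoff and ProbeCutoff parameters control these two thresholds.
ChAMP filters out probes with fewer than 3 beads (enabled through the filterBeads parameter) in at least 5% of samples per probe (controlled by the beadCutoff parameter).
Non-CpG probes are filtered out by default.
By default ChAMP filters out all SNP-related probes. Use the population parameter to choose a population; if none is chosen, the general recommended probe list provided by Zhou is used for filtering.
ChAMP filters out all multi-hit probes.
Probes on chromosomes X and Y are filtered out by default, controlled by the filterXY parameter. If you do not have the raw .IDAT data, use champ.filter() for filtering.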
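The two-stage detection p-value filter described above can be sketched on simulated data. This is a simplified illustration in base R, not champ.filter() itself; the simulated matrices and the choice ProbeCutoff = 0 are assumptions made for the example.

```r
set.seed(1)
n_probe <- 1000; n_sample <- 6

# Simulated detection p-values: almost all probes pass in samples S1..S5
detP <- matrix(runif(n_probe * n_sample, 0, 0.005), n_probe, n_sample,
               dimnames = list(paste0("cg", seq_len(n_probe)),
                               paste0("S", seq_len(n_sample))))
detP[, "S6"] <- runif(n_probe, 0, 0.05)   # one poor sample, most probes fail
detP[1:20, "S1"] <- 0.5                   # 20 probes fail in an otherwise good sample
beta <- matrix(runif(n_probe * n_sample), n_probe, n_sample,
               dimnames = dimnames(detP))

SampleCutoff <- 0.1   # maximum fraction of failed probes allowed per sample
ProbeCutoff  <- 0     # maximum fraction of failed samples allowed per probe
failed <- detP > 0.01

# Stage 1: drop samples with too many failed probes
keep_sample <- colMeans(failed) <= SampleCutoff
# Stage 2: among the remaining samples, drop probes that fail too often
keep_probe <- rowMeans(failed[, keep_sample, drop = FALSE]) <= ProbeCutoff

beta_filtered <- beta[keep_probe, keep_sample]
dim(beta_filtered)
```

Removing bad samples first keeps one poor sample from dragging thousands of otherwise good probes out of the analysis.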

Note:

champ.load() cannot filter a beta matrix on its own. Users who have no .IDAT data but do have a beta matrix and a Sample_Sheet.csv can filter with champ.filter() and then continue with the downstream functions.

The CpG.GUI() function shows how the CpG sites are distributed: across chromosomes, CpG islands, and TSS regions.

CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")

3. Further quality control and exploratory analysis

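Use the champ.QC() and QC.GUI() functions to check data quality.

champ.QC()

champ.QC() produces three plots:

mdsPlot (Multidimensional Scaling Plot): shows how similar the samples are, based on the 1000 most variable probes, with samples colored by group.

densityPlot: shows the beta-value distribution of each sample. A sample that deviates strongly from the rest suggests poor quality (for example, incomplete bisulfite conversion).

dendrogram: a clustering tree of all samples. In champ.QC(), Feature.sel="None" computes the distances between samples directly from the probe values, which is memory-hungry; an "SVD" method is also available.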
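The computation behind the MDS plot and the dendrogram can be sketched in base R on simulated data. The matrix, the group labels and the choice of Euclidean distance are assumptions for illustration; champ.QC() may compute distances differently.

```r
set.seed(2)
n_probe <- 5000
groups <- rep(c("C", "T"), each = 4)
beta <- matrix(rbeta(n_probe * 8, 2, 2), n_probe, 8,
               dimnames = list(NULL, paste0(groups, 1:4)))
# Shift 300 probes upward in the T group so the groups differ
beta[1:300, groups == "T"] <- pmin(beta[1:300, groups == "T"] + 0.3, 0.99)

# Keep the 1000 probes with the highest variance across samples
n_top <- 1000
probe_var <- apply(beta, 1, var)
top <- order(probe_var, decreasing = TRUE)[seq_len(n_top)]

# Distances between samples (columns), then classical MDS and clustering
d   <- dist(t(beta[top, ]))
mds <- cmdscale(d, k = 2)   # one row of 2-D coordinates per sample
hc  <- hclust(d)            # hierarchical clustering for the dendrogram

print(round(mds, 3))
```

Calling plot(mds) and plot(hc) would draw the two figures. Restricting the distance calculation to the most variable probes keeps memory use modest, which is the reason the full-probe option is described as expensive.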

QC.GUI() can draw these plots as well, but it uses a lot of memory. It provides five plots: mdsPlot, type-I&II densityPlot, sample beta distribution plot, dendrogram plot, and a heatmap of the top 1000 most variable CpGs.

QC.GUI(beta=myLoad$beta,arraytype="450K")

The type-I&II densityPlot helps check the normalization status of the two probe types.
The heatmap of top variable CpGs shows the 1000 most variable sites and their status.

4. Normalization

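Type-I and type-II probes use different chemistries and different designs, so their value distributions differ. Part of the difference measured between the two probe types may reflect biological differences caused by their uneven placement (for example, differences in CpG context). Most importantly, type-II probes exhibit a reduced dynamic range, so correcting the type-II probe bias is necessary. The champ.norm() function does this. Four normalization methods are available for type-II probes: BMIQ, SWAN, PBC and FunctionalNormalization. For 850k arrays, BMIQ works somewhat better. However, BMIQ does not perform well on low-quality samples or on control samples with strongly skewed methylation. The cores parameter sets the number of CPU cores to use, and PDFplot=TRUE saves the plots in resultsDir.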

myNorm <- champ.norm(beta=myLoad$beta,arraytype="450K",cores=5)

QC.GUI(myNorm,arraytype="450K")

5. SVD Plot

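The singular value decomposition (SVD) method is used to assess the main components of variation in a data set. These significant components may correlate with the biological factors of interest, or with technical factors such as batch effects or population effects. The more detailed the sample records, the better (for example dates of hybridization, season in which samples were collected, epidemiological information, etc.), because all of these factors can be included in the SVD analysis.

If the raw data were imported from .idat files, set RGEffect=TRUE in champ.SVD() so that the 18 internal control probes on the array (including those for bisulfite conversion efficiency) are included in the analysis as additional factors. champ.SVD() analyzes all covariates and phenotype data found in the pd file. You can use cbind() to merge your own covariates with myLoad$pd. Categorical and numeric variables are handled differently, however: categorical variables must be converted to "factor" or "character" type, and numeric variables to numeric type.

While running, champ.SVD() prints the covariates on screen. The result is a heatmap saved as SVDsummary.pdf, in which darker colors indicate more significant p-values. If technical factors turn out to have an effect, the data need to be re-normalized with a method such as ComBat to remove that variation, including variation related to the beadchip, position and/or plate.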

champ.SVD(beta=myNorm,pd=myLoad$pd)
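The idea behind the SVD heatmap can be sketched in base R: decompose the centered data, then test each leading component against each covariate. The simulated data, the injected Slide effect and the choice of tests (Kruskal-Wallis for categorical covariates, linear regression for numeric ones) are assumptions of this sketch, not a description of champ.SVD() internals.

```r
set.seed(3)
n_probe <- 2000; n_sample <- 12
pd <- data.frame(
  Sample_Group = factor(rep(c("C", "T"), each = 6)),
  Slide        = factor(rep(c("A", "B", "C"), times = 4)),
  Age          = round(runif(n_sample, 30, 70))
)
beta <- matrix(rnorm(n_probe * n_sample, 0.5, 0.05), n_probe, n_sample)

# Inject a strong batch effect: slide B is shifted in 400 probes
slide_B <- pd$Slide == "B"
beta[1:400, slide_B] <- beta[1:400, slide_B] + 0.2

# Center each probe, then decompose
centered <- beta - rowMeans(beta)
sv <- svd(centered)
var_explained <- sv$d^2 / sum(sv$d^2)

# p-value of each top component against each covariate
n_comp <- 3
pvals <- sapply(pd, function(covariate) {
  sapply(seq_len(n_comp), function(k) {
    pc <- sv$v[, k]   # sample scores on component k
    if (is.numeric(covariate)) {
      summary(lm(pc ~ covariate))$coefficients[2, 4]
    } else {
      kruskal.test(pc ~ covariate)$p.value
    }
  })
})
rownames(pvals) <- paste0("PC", seq_len(n_comp))
print(signif(pvals, 2))
```

In this simulation the first component tracks Slide rather than Sample_Group, which is the pattern that would prompt a batch correction.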

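The figure above was drawn from the bundled test data, which is too simple to show much. The figure below was drawn from the 656 samples of GSE40279, in which age is a numeric variable and all other covariates are categorical.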

6. Batch Effect Correction

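ComBat is a method from the sva package that has been integrated into ChAMP. The batchname=c("Slide") parameter names the factor to correct for. champ.runCombat() automatically takes Sample_Group as the covariate of interest; a parameter called variablename has since been added so that users can specify their own covariate. As long as the batchname given to champ.runCombat() is correct, the function performs the batch correction automatically. If ComBat is applied directly to beta values, the output may fall outside the range 0 to 1, so the values are logit-transformed before the computation. If you are correcting M-values, set logitTrans=FALSE. Sometimes the batch effect and the variation of interest are confounded; in that case, correcting the batch effect removes the variation of interest as well.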

myCombat <- champ.runCombat(beta=myNorm,pd=myLoad$pd,batchname=c("Slide"))
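Why the logit transform matters can be shown with a small base R experiment. The function center_batches below is a deliberately crude stand-in for ComBat (it only removes per-batch mean offsets) and, like the simulated data, is an assumption of this sketch.

```r
set.seed(4)
n_probe <- 500
batch <- factor(rep(c("Slide1", "Slide2"), each = 4))
beta <- matrix(rbeta(n_probe * 8, 0.5, 0.5), n_probe, 8)  # many values near 0 and 1
beta <- pmin(pmax(beta, 1e-6), 1 - 1e-6)                  # keep strictly inside (0, 1)

logit2     <- function(b) log2(b / (1 - b))   # beta -> M-value
inv_logit2 <- function(m) 2^m / (1 + 2^m)     # M-value -> beta

# Crude stand-in for ComBat: remove each batch's per-probe mean offset
center_batches <- function(x, batch) {
  grand <- rowMeans(x)
  for (b in levels(batch)) {
    idx <- batch == b
    x[, idx] <- x[, idx] - rowMeans(x[, idx, drop = FALSE]) + grand
  }
  x
}

direct <- center_batches(beta, batch)                      # adjusted on the beta scale
via_M  <- inv_logit2(center_batches(logit2(beta), batch))  # adjusted on the M scale

print(range(direct))  # can leave the interval [0, 1]
print(range(via_M))   # stays inside (0, 1)
```

Adjusting on the M scale and transforming back guarantees valid beta values, which is the purpose of the logit step described above.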

champ.SVD()

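7.1 Differential Methylation Probes (DMP & DMR & DMB)

The goal is to find which of the millions of CpGs change in disease, how those changes in turn alter genes, and how that ultimately leads to illness.

DMP stands for Differential Methylation Probe (a differentially methylated CpG site), DMR for Differential Methylation Region (a differentially methylated CpG region), and Block for Differential Methylation Block (a differentially methylated region on a larger scale).

champ.DMP() uses the linear models of the limma package to compute p-values for differential methylation. The latest champ.DMP() supports numeric variables such as age, and categorical variables with more than two phenotypes such as "tumor", "metastasis", "control". A numeric variable (such as age) is analyzed as a covariate in a linear regression model to find your covariate-related CpGs, say age-related CpGs. A categorical variable is compared group against group, for example "tumor-metastasis", "tumor-control" and "metastasis-control".

The result is a data frame of the differential probes with P-value, t-statistic and difference in mean methylation (reported as logFC, by analogy with the log fold-change in RNA-seq). It also contains the annotation of each probe, the mean beta value of each group, and the delta beta between the two groups (same meaning as logFC, kept because older versions of the package need it). Advanced users can take the output probes and p-values further with limma for DMR analysis.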

myDMP <- champ.DMP(beta = myNorm,pheno=myLoad$pd$Sample_Group)

head(myDMP[[1]])
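The kind of table that comes back can be imitated in base R on simulated data. This sketch uses an ordinary per-probe t-test in place of limma's moderated statistics, so it is a simplified stand-in; the data and the 0.2 shift are invented for illustration.

```r
set.seed(5)
n_probe <- 1000
pheno <- factor(rep(c("C", "T"), each = 6))
beta <- matrix(rnorm(n_probe * 12, 0.5, 0.03), n_probe, 12,
               dimnames = list(paste0("cg", seq_len(n_probe)), NULL))
beta[1:50, pheno == "T"] <- beta[1:50, pheno == "T"] + 0.2   # 50 true DMPs

# Per-probe two-sample t-test (stand-in for limma's moderated t-statistic)
res <- t(apply(beta, 1, function(x) {
  tt <- t.test(x[pheno == "T"], x[pheno == "C"])
  c(t = unname(tt$statistic), P.Value = tt$p.value,
    deltaBeta = mean(x[pheno == "T"]) - mean(x[pheno == "C"]))
}))
res <- as.data.frame(res)

# Benjamini-Hochberg correction across all probes
res$adj.P.Val <- p.adjust(res$P.Value, method = "BH")

dmp <- res[res$adj.P.Val < 0.05, ]
dmp <- dmp[order(dmp$P.Value), ]
head(dmp)
```

With thousands of probes tested at once, the adjusted p-value rather than the raw one is the column to threshold on.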

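champ.DMP() returns a list. Recent versions of ChAMP include an interactive GUI for inspecting the myDMP results. The user supplies the unmodified result of champ.DMP() (myDMP), the original beta matrix, and the covariates; DMP.GUI() detects automatically whether the covariate is numeric or categorical. For a categorical covariate such as case/control, DMP.GUI() plots the significantly different sites automatically.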

DMP.GUI(DMP=myDMP[[1]],beta=myNorm,pheno=myLoad$pd$Sample_Group)

7.2 Hydroxymethylation Analysis

Some users want to analyze hydroxymethylation. Example code follows:

myDMP <- champ.DMP(beta=myNorm, pheno=myLoad$pd$Sample_Group, compare.group=c("oxBS", "BS"))


hmc <- myDMP[[1]][myDMP[[1]]$deltaBeta>0,]

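8. Differential Methylation Regions

A DMR is a stretch in which a run of consecutive CpGs all show a clear difference. champ.DMR() computes and returns a data frame of the detected DMRs, with their length, clusters, and number of CpGs annotated. The function offers three algorithms: Bumphunter, ProbeLasso and DMRcate. Bumphunter is the more reliable, with an accuracy above 90%; ProbeLasso reaches about 75%; DMRcate was integrated later and has not been evaluated here. The Bumphunter algorithm first groups all probes into small clusters, then uses random permutations to assess the candidate DMRs.

myDMR <- champ.DMR(beta=myNorm,pheno=myLoad$pd$Sample_Group,method="Bumphunter")
# The result element is named after the method that was used
head(myDMR$BumphunterDMR)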
DMR.GUI(DMR=myDMR)
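The first Bumphunter step, grouping neighbouring probes into clusters, can be sketched in base R. The helper cluster_probes, the toy annotation and the maxGap value of 300 are assumptions for illustration and do not reproduce the bumphunter package.

```r
# Hypothetical probe annotation, sorted by position within each chromosome
probes <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"),
  pos = c(1000, 1200, 1450, 5000, 5100, 300, 900)
)
maxGap <- 300   # a gap larger than this starts a new cluster

cluster_probes <- function(chr, pos, maxGap) {
  # A new cluster starts at the first probe, at a chromosome change,
  # or when the distance to the previous probe exceeds maxGap
  new_cluster <- c(TRUE, chr[-1] != chr[-length(chr)] | diff(pos) > maxGap)
  cumsum(new_cluster)
}

probes$cluster <- cluster_probes(probes$chr, probes$pos, maxGap)
print(probes)
```

Candidate regions are then searched for within each cluster, and permutations of the sample labels give each candidate a p-value.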

9. Differential Methylation Blocks

In the Block-finder functionality, champ.Block() first computes small clusters (regions) across the whole genome. For each cluster it then computes the mean value and the mean position, which collapses each region into a single unit. When finding DMBs, only the single units from open sea are used for clustering. The Bumphunter algorithm is then used to find "DMRs" over these regions (the single units after the collapse). The ChAMP authors' previous paper and other scientists' work showed that Differentially Methylated Blocks may be a universal feature across various cancers.

myBlock <- champ.Block(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")
head(myBlock$Block)
Block.GUI(Block=myBlock,beta=myNorm,pheno=myLoad$pd$Sample_Group,runDMP=TRUE,compare.group=NULL,arraytype="450K")
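The collapse of each cluster into a single unit can be sketched in base R. The cluster labels, positions and beta values below are invented for illustration, and the sketch covers only the averaging step.

```r
clusters <- c(1, 1, 1, 2, 2, 3)                  # cluster label of each probe
pos      <- c(1000, 1200, 1400, 5000, 5200, 9000)
beta <- matrix(c(0.2, 0.3, 0.4, 0.8, 0.6, 0.5,
                 0.4, 0.5, 0.6, 0.6, 0.8, 0.1), ncol = 2,
               dimnames = list(NULL, c("S1", "S2")))

# Mean position of each cluster
unit_pos <- tapply(pos, clusters, mean)

# Mean beta per cluster and sample: rowsum() adds up the rows of each cluster,
# dividing by the cluster sizes turns the sums into means
unit_beta <- rowsum(beta, clusters) / as.vector(table(clusters))

print(unit_pos)
print(unit_beta)
```

After this step each row stands for one region, so the block search runs on far fewer units than there are probes.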

10. Gene Set Enrichment Analysis

This step looks for disease-associated sub-networks within pathway networks. After the previous steps you may already have some significant DMPs or DMRs, and you may want to know whether the genes involved in them are enriched for specific biological terms or pathways. Use champ.GSEA() for this analysis. champ.GSEA() automatically extracts the gene information, maps the CpG information to genes, and then conducts GSEA on each list.

There are two ways to do GSEA. In previous versions, ChAMP used pathway information downloaded from MSigDB, and Fisher's exact test was then used to calculate the enrichment status of each pathway. After the gene enrichment analysis, champ.GSEA() automatically returns the pathways with a P-value smaller than the adjPval cutoff.

However, as pointed out by Geeleher [citation], different genes contain different numbers of CpGs. A gene with 50 CpGs of which only one shows significant methylation, and a gene with 2 CpGs of which both are significantly methylated, should not be treated equally. The solution is to use the number of CpGs in each gene to correct the list of significant genes, as implemented in the gometh function of the missMethyl package. gometh uses the number of CpGs in each gene, in place of gene length, as the bias data. The idea is to fit a curve to the numbers of CpGs across the genes involved in the GSEA, then to use a probability weighting function to correct the GO p-values.

In champ.GSEA(), set method to "goseq" to use the goseq method, or set it to "fisher" to do ordinary gene set enrichment analysis.

myGSEA <- champ.GSEA(beta=myNorm,DMP=myDMP[[1]], DMR=myDMR, arraytype="450K",adjPval=0.05, method="fisher")
# myDMP and myDMR could (not must) be used directly.
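The Fisher's exact test behind the "fisher" option can be illustrated for a single pathway in base R. The gene universe, the pathway and the list of differentially methylated genes are all hypothetical.

```r
# Hypothetical gene universe, one pathway, and a list of significant genes
universe <- paste0("gene", 1:1000)
pathway  <- paste0("gene", 1:40)           # 40 genes annotated to the pathway
dm_genes <- c(paste0("gene", 1:15),        # 15 significant genes inside the pathway
              paste0("gene", 501:585))     # 85 significant genes outside it

in_pathway <- universe %in% pathway
is_dm      <- universe %in% dm_genes

# 2 x 2 contingency table: significant or not, in the pathway or not
tab <- table(DM = is_dm, Pathway = in_pathway)
print(tab)

# One-sided test: are significant genes over-represented in the pathway?
ft <- fisher.test(tab, alternative = "greater")
print(ft$p.value)
```

In practice the test is repeated for every pathway and the p-values are adjusted for multiple testing (for example with p.adjust) before comparing them with a cutoff such as adjPval.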

11. Differential Methylated Interaction Hotspots

champ.EpiMod() uses the FEM package to infer differentially methylated gene modules within a user-specified functional gene network. This network could be, for example, a protein-protein interaction (PPI) network. The champ.EpiMod() function can therefore be viewed as a functional supervised algorithm that uses a network of relations between genes (usually a PPI network) to identify subnetworks in which a significant number of genes are associated with a phenotype of interest (POI). The EpiMod algorithm can be run in two different modes: at the probe level, in which case the most differentially methylated probe is assigned to each gene, or at the gene level, in which case a DNAm value is assigned to each gene using an optimized procedure described in detail in Jiao Y, Widschwendter M, Teschendorff AE, Bioinformatics 2014. The FEM package was originally developed to infer differentially methylated gene modules that are also deregulated at the gene expression level; here only the EpiMod version is provided, which infers differentially methylated modules only. More advanced users may refer to the FEM package for more information.

myEpiMod <- champ.EpiMod(beta=myNorm,pheno=myLoad$pd$Sample_Group)

12. Cell Type Heterogeneity

Because DNA methylation is highly cell-type specific, many of the changes seen in DMPs/DMRs are driven by differences in cell composition. Several methods can correct for this. RefbaseEWAS uses a reference database of the cell types in a tissue to determine the cell proportions. ChAMP includes two reference databases for whole blood, one for 27K and the other for 450K. champ.refbase() returns a beta matrix corrected for cell-type heterogeneity, together with the cell-type proportions of each sample. Remember that champ.refbase() only works on blood sample data sets.

myRefBase <- champ.refbase(beta=myNorm,arraytype="450K")
# Our test data set is not whole blood. So it should not be run here.
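Estimating cell proportions from a reference can be illustrated with a small simulation in base R. The reference profiles are random numbers, the cell-type labels are hypothetical, and box-constrained least squares through optim() is used here in place of the constrained projection of reference-based methods, so this is only a sketch of the principle.

```r
set.seed(6)
n_probe <- 200
cell_types <- c("CD4T", "Bcell", "Gran")   # hypothetical labels

# Reference profiles: mean beta of each probe in each purified cell type
ref <- matrix(runif(n_probe * 3), n_probe, 3,
              dimnames = list(NULL, cell_types))

# Simulate one mixed sample with known proportions plus measurement noise
true_prop <- c(CD4T = 0.2, Bcell = 0.1, Gran = 0.7)
mixed <- as.vector(ref %*% true_prop) + rnorm(n_probe, sd = 0.01)

# Least squares with non-negative weights, then rescale the weights to sum to 1
sse <- function(w) sum((mixed - ref %*% w)^2)
fit <- optim(rep(1 / 3, 3), sse, method = "L-BFGS-B", lower = 0)
est_prop <- setNames(fit$par / sum(fit$par), cell_types)

print(round(est_prop, 3))
```

The estimated proportions can then be used as covariates, or regressed out of the beta matrix, so that differences in cell composition are not mistaken for differential methylation.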

References: https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html http://blog.csdn.net/joshua_hit/article/details/54982018 https://www.jianshu.com/p/6411e8acfab3
