Published in 2017 in the Journal of Clinical Oncology (impact factor 26.303), the paper is "CpG Methylation Signature Predicts Recurrence in Early-Stage Hepatocellular Carcinoma: Results From a Multicenter Study". Its highlights are probably the authors' own data and the use of two machine-learning algorithms:
- LASSO, Least Absolute Shrinkage and Selection Operator;
- SVM-RFE, Support Vector Machine-Recursive Feature Elimination;
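We previously covered a 2013 paper that used multi-omics data to explore drug sensitivity in breast cancer cell lines. It also relied on two machine-learning algorithms, LS-SVM and RF in that case, and is still a useful reference.
Study design
The authors' own 450K methylation array data were uploaded as GSE75041.
The project enrolled 576 patients with early-stage hepatocellular carcinoma (E-HCC), of which: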
- 66 tumor samples were analyzed using the Illumina Methylation 450k Beadchip.
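- An internal cohort (n = 141) and two external cohorts (n = 191 and n = 104) were used for validation.
In other words, a small cohort was first profiled on the 450K array to find CpG sites of interest. The cohort was then expanded and only those sites were measured, to show that the selected sites have clinical value. The overall study design is as follows: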
Patient sources:
- 347 E-HCC samples at the Sun Yat-sen University Cancer Center (SYSUCC)
- 295 samples at three independent centers as follows:
  - 191 samples from the First Affiliated Hospital of Sun Yat-sen University
  - 57 samples from Guangzhou Medical University Cancer Center (GZMUCC)
  - 47 samples from the First Affiliated Hospital of Anhui Medical University (AHMUFH)
The introduction, as expected, covers the clinical importance of E-HCC and of methylation signals.
And, following the usual practice, the authors also validated the result in The Cancer Genome Atlas (TCGA) database.
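Data processing
First, the 66 tumor samples were compared by recurrence status to find differentially methylated sites, yielding a list of 2,550 differential CpGs.
The LASSO algorithm was then used to identify a set of 30 CpGs.
The SVM-RFE algorithm was also applied and selected a set of 30 CpGs.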
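As a rough illustration of the two selection steps, the sketch below runs a LASSO regression and a linear-SVM recursive feature elimination on synthetic data with scikit-learn, then reports how the two selections overlap. The toy matrix, the probe names, `alpha=0.03`, and `step=10` are all assumptions made for the example. The post does not say which software or tuning the authors used, and fitting plain `Lasso` to 0/1 labels is a simplification.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the discovery data: 66 samples x 300 candidate CpGs.
# (The post starts from 2,550 differential CpGs; 300 keeps the toy example fast.)
n_samples, n_cpgs = 66, 300
X = rng.uniform(0.0, 1.0, size=(n_samples, n_cpgs))      # fake beta values
y = np.array([0, 1] * (n_samples // 2))                   # fake recurrence labels
# Shift the first 20 CpGs in the "recurrence" group so there is some signal.
X[y == 1, :20] = np.clip(X[y == 1, :20] + 0.3, 0.0, 1.0)
probe_ids = [f"cpg_{i:04d}" for i in range(n_cpgs)]       # hypothetical names

# LASSO: the L1 penalty drives most coefficients to exactly zero;
# the CpGs with non-zero coefficients form the selected set.
lasso = Lasso(alpha=0.03).fit(X, y)
lasso_set = {probe_ids[i] for i in np.flatnonzero(lasso.coef_)}

# SVM-RFE: fit a linear SVM, drop the lowest-weighted CpGs, and repeat
# until 30 CpGs remain.
svm_rfe = RFE(SVC(kernel="linear"), n_features_to_select=30, step=10).fit(X, y)
svm_set = {probe_ids[i] for i in np.flatnonzero(svm_rfe.support_)}

shared = lasso_set & svm_set      # CpGs chosen by both methods
combined = lasso_set | svm_set    # CpGs chosen by either method
print(len(lasso_set), len(svm_set), len(shared), len(combined))
```

The size of the union always equals the two set sizes minus the size of the intersection, which is the same bookkeeping as 30 + 30 − 14 = 46 in the paper's numbers.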
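The two algorithms share 14 CpG sites, as shown in the figure below:
Their union contains 46 sites (30 + 30 − 14), shown in the heatmap below:
A penalized Cox regression model was then applied, narrowing the selection down to 3 CpG sites: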
- cg20657849, SCAN domain containing 3 (SCAND3)
- cg19406367, Src homology 3-domain growth factor receptor-bound 2-like interacting protein 1 (SGIP1)
- cg19931348, peptidase inhibitor 3 (PI3)
The performance of the algorithm is shown below:
The authors also used these 3 CpG sites to build a risk model: risk score = (0.104 × methylation level of SGIP1) + (−1.125 × methylation level of SCAND3) + (−0.085 × methylation level of PI3).
They call it a methylation-based signature for patients with E-HCC (MSEH).
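The formula is a plain weighted sum, so it is easy to compute by hand or in code. Below is a minimal sketch using the three coefficients quoted above. The function name and the example methylation values are hypothetical, and the input must be on whatever scale the coefficients were fitted on, which the post does not specify.

```python
# Coefficients as quoted in the risk-score formula.
MSEH_COEFFICIENTS = {"SGIP1": 0.104, "SCAND3": -1.125, "PI3": -0.085}


def mseh_risk_score(methylation):
    """Weighted sum of the three methylation levels, keyed by gene name."""
    return sum(coef * methylation[gene] for gene, coef in MSEH_COEFFICIENTS.items())


# Hypothetical methylation levels for one patient (not real data).
patient = {"SGIP1": 0.5, "SCAND3": 0.2, "PI3": 0.4}
score = mseh_risk_score(patient)
print(round(score, 3))  # 0.052 - 0.225 - 0.034 = -0.207
```

Because SCAND3 has by far the largest coefficient in absolute value, its methylation level changes the score most per unit.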
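The next step is to check its predictive performance in the validation sets.
Survival analysis to validate the model
In each of the datasets introduced at the start, the authors ran a survival analysis and found that the three-site MSEH signature separates patients clearly and with statistical significance, as shown in the figure below: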
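For readers who want to reproduce this kind of comparison, the sketch below implements a two-group log-rank test in pure Python. This is an assumption about the type of test: the post only reports that the groups differ significantly and does not name the method or the cutoff used to define high-risk and low-risk groups. The follow-up times and event flags are invented for illustration.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    rows = [(t, e, "a") for t, e in zip(times_a, events_a)]
    rows += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0
    variance = 0.0
    # Walk through each distinct time at which at least one event occurred.
    for t in sorted({ti for ti, e, _ in rows if e}):
        n_a = sum(1 for ti, _, g in rows if ti >= t and g == "a")  # at risk in A
        n = sum(1 for ti, _, _ in rows if ti >= t)                  # at risk overall
        d_a = sum(1 for ti, e, g in rows if ti == t and e and g == "a")
        d = sum(1 for ti, e, _ in rows if ti == t and e)
        # Expected events in A if both groups shared the same hazard.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical follow-up in months; event = 1 means recurrence, 0 means censored.
high_risk_times, high_risk_events = [3, 5, 8, 11, 14, 20], [1, 1, 1, 1, 1, 0]
low_risk_times, low_risk_events = [12, 25, 30, 36, 40, 48], [1, 0, 1, 0, 0, 0]

chi2, p_value = logrank_test(high_risk_times, high_risk_events,
                             low_risk_times, low_risk_events)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

In practice a survival library would also draw the Kaplan-Meier curves; the point here is only to show what the reported P values measure.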
Because the authors already had three validation datasets, the TCGA validation was only placed in the supplement.
In addition, the predictive value of MSEH was validated further in the TCGA data. MSEH successfully discriminated 125 patients with TNM stage I into high-risk and low-risk groups in terms of both RFS and OS (P < .001, P = .043, respectively; Data Supplement).
Interested readers can easily download the TCGA liver cancer methylation matrix and use the three-site MSEH signature to run the validation themselves.
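A first step in such a check could look like the sketch below, which pulls the three probes out of a probes-by-samples table and scores every sample. The file layout (tab-separated, probe ID in the first column, one column per sample), the sample names, and the values are assumptions for illustration. A real download may be organised differently, and its values may not be on the scale the coefficients were fitted on.

```python
import csv
import io

# Probe IDs and coefficients as quoted in the post.
MSEH_PROBES = {
    "cg19406367": 0.104,   # SGIP1
    "cg20657849": -1.125,  # SCAND3
    "cg19931348": -0.085,  # PI3
}


def score_samples(tsv_file):
    """Return {sample_id: risk score} from a probes-by-samples table."""
    reader = csv.reader(tsv_file, delimiter="\t")
    samples = next(reader)[1:]          # header row: first cell labels the probe column
    values = {row[0]: row[1:] for row in reader if row and row[0] in MSEH_PROBES}
    missing = set(MSEH_PROBES) - set(values)
    if missing:
        raise KeyError(f"probes not found: {sorted(missing)}")
    scores = {}
    for i, sample in enumerate(samples):
        try:
            scores[sample] = sum(
                coef * float(values[probe][i]) for probe, coef in MSEH_PROBES.items()
            )
        except ValueError:
            continue                    # skip samples with a non-numeric entry such as "NA"
    return scores


# Tiny made-up table standing in for a downloaded matrix.
toy_matrix = io.StringIO(
    "probe\tS1\tS2\n"
    "cg20657849\t0.2\t0.8\n"
    "cg19406367\t0.5\t0.1\n"
    "cg19931348\t0.4\tNA\n"
    "cg00000029\t0.9\t0.9\n"
)
scores = score_samples(toy_matrix)
print(scores)
```

The resulting scores would then be split into high-risk and low-risk groups and compared against the recurrence and survival data that accompany the samples.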