Downloading TCGA data with a Python script: simple and open source

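Preface

TCGA's official download tool, gdc-client, is sometimes slow and often dies mid-download, so I decided to write a small downloader of my own. The script uses the TCGA API to download the data. Like gdc-client, it still needs a manifest file, which you have to download first.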

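Environment

Later in this post the program is also packaged as an EXE, in both a command-line and a GUI version, so that people without Python can use it too.

Environment: Python 3.6
Packages: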

  • os
  • pandas
  • requests
  • sys
  • argparse
  • signal

Code

# coding:utf-8
'''
This tool simplifies downloading TCGA data. It takes two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
The tool supports resuming. If the program is interrupted, restart it and it resumes from the last file it was downloading. Note that instead of placing each file in its own folder, this tool saves each file directly as a txt file named after the file's UUID in TCGA. Press Ctrl+C to terminate the program.
author: chenwi
date: 2018/07/10
mail: [email protected]
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal

print(__doc__)

requests.packages.urllib3.disable_warnings()


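def download(url, file_path):
    # Stream the response so large files are never held in memory at once.
    # Note: verify=False skips TLS certificate verification.
    r = requests.get(url, stream=True, verify=False)
    # Fail early on HTTP errors instead of saving an error page as data
    r.raise_for_status()
    # content-length can be absent; 0 means the total size is unknown
    total_size = int(r.headers.get('content-length', 0))
    temp_size = 0

    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                # Draw the progress bar only when the total size is known
                if total_size:
                    done = int(50 * temp_size / total_size)
                    sys.stdout.write("\r[%s%s] %d%%" % ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size))
                    sys.stdout.flush()
    print()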


def get_UUID_list(manifest_path):
    UUID_list = pd.read_table(manifest_path, sep='\t', encoding='utf-8')['id']
    UUID_list = list(UUID_list)
    return UUID_list


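def get_last_UUID(file_path):
    # Return the UUID of the most recently modified file in the save folder,
    # or None if the folder is empty.
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    # Sort by modification time so the newest file comes last
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    # Strip the 4-character '.txt' extension to recover the UUID
    return dir_list[-1][:-4]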


def get_lastUUID_index(UUID_list, last_UUID):
    for i, UUID in enumerate(UUID_list):
        if UUID == last_UUID:
            return i
    return 0


def quit(signum, frame):
    # Signal handler for Ctrl+C (SIGINT) and SIGTERM: exit cleanly
    print('You chose to stop me.')
    sys.exit(0)


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list = get_UUID_list(manifest_path)
    last_UUID = get_last_UUID(save_path)
    print("Last download file {}".format(last_UUID))
    last_UUID_index = get_lastUUID_index(UUID_list, last_UUID)

    for UUID in UUID_list[last_UUID_index:]:
        # Build the URL by plain concatenation; os.path.join is meant for file paths
        url = link + UUID
        file_path = os.path.join(save_path, UUID + '.txt')
        download(url, file_path)
        print(f'{UUID} has been downloaded')
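
The resume logic in the script picks the most recently modified file in the save folder, which can be thrown off if that folder contains unrelated files. As an alternative, here is a minimal sketch (not part of the original script) that skips every UUID whose file already exists and writes each download to a temporary .part file that is renamed only after it completes. The names pending_uuids and download_atomically, and the fetch callable, are hypothetical and introduced only for illustration.

```python
import os


def pending_uuids(uuid_list, save_dir, suffix=".txt"):
    """Return the UUIDs that do not yet have a finished file in save_dir."""
    finished = set(os.listdir(save_dir))
    return [u for u in uuid_list if u + suffix not in finished]


def download_atomically(uuid, save_dir, fetch, suffix=".txt"):
    """Write to a temporary '.part' file and rename it only on success.

    `fetch(uuid, path)` is a hypothetical callable that writes the data for
    `uuid` to `path`; the script's download(url, file_path) could be wrapped
    to fit this shape.
    """
    final_path = os.path.join(save_dir, uuid + suffix)
    part_path = final_path + ".part"
    fetch(uuid, part_path)
    # os.replace overwrites the target and is atomic on the same filesystem
    os.replace(part_path, final_path)
    return final_path
```

With this approach an interrupted download leaves only a .part file behind, so the UUID still counts as pending on the next run.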

Usage

Just run it from the command line:

python tcga_download.py -m manifest-xx.txt -s xxx

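Explanation:
manifest-xx.txt is the path to the manifest file you downloaded.
xxx is the folder where you want the downloaded files to be saved (preferably a newly created, empty folder).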
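
The script reads only the id column of the manifest. Manifests exported from the GDC portal are tab-separated and normally also carry an md5 column, which can be used to check the downloaded files. The following is a minimal sketch under that assumption, not part of the original tool; md5_of and find_bad_downloads are hypothetical helper names.

```python
import csv
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_downloads(manifest_path, save_dir, suffix=".txt"):
    """Return UUIDs whose file is missing or fails the manifest checksum.

    Assumes a tab-separated manifest with 'id' and 'md5' columns, and files
    saved as <UUID>.txt, which is how the script above names them.
    """
    bad = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            path = os.path.join(save_dir, row["id"] + suffix)
            if not os.path.isfile(path) or md5_of(path) != row["md5"]:
                bad.append(row["id"])
    return bad
```

Any UUIDs returned by find_bad_downloads are candidates for downloading again.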


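Packaging the program as an EXE

Finally, for anyone who does not have Python installed, I have packaged the tool as tcga_download.exe, which makes downloading TCGA data simple and convenient. It is a bit like gdc-client, but there is more satisfaction in having written it myself. Later I plan to make a Qt GUI version that only takes a few mouse clicks.
I have put tcga_download.exe on a network drive; download it if you need it.
Link: https://pan.baidu.com/s/1AGyZ5cAyPUK06zqiQGx-nQ Password: 3os4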


GUI download EXE

A point-and-click EXE that downloads with a few mouse clicks:
Download: https://github.com/chenwi/TCGAD
