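Preface
TCGA's official download tool, gdc-client, is sometimes slow and often crashes, so I decided to write a small downloader of my own. The result is a script that downloads TCGA data through the GDC API. Like gdc-client, it needs a manifest file that you download first.
Environment
Later in this post the program is also packaged as an EXE, in both command-line and GUI versions, so that people without Python can use it too.
Environment: Python 3.6
Packages: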
- os
- pandas
- requests
- sys
- argparse
- signal
Code
# coding:utf-8
'''
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
This tool supports resuming an interrupted download. After the program is interrupted, it can be restarted and will continue from the last downloaded file. Note that instead of the one-folder-per-file layout used previously, this tool saves each file directly as a .txt file. The file name is the UUID of the file in the original TCGA. If necessary, press Ctrl+C to terminate the program.
author: chenwi
date: 2018/07/10
mail: [email protected]
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
print(__doc__)
requests.packages.urllib3.disable_warnings()
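def download(url, file_path):
    # Stream the response so large files are never held fully in memory.
    # verify=False disables TLS certificate verification, as in the original script.
    r = requests.get(url, stream=True, verify=False)
    # Stop on HTTP errors instead of saving an error message as data.
    r.raise_for_status()
    # Content-Length may be missing; fall back to 0 and skip the progress bar.
    total_size = int(r.headers.get('content-length', 0))
    temp_size = 0
    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                if total_size:
                    done = int(50 * temp_size / total_size)
                    sys.stdout.write("\r[%s%s] %d%%" % ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size))
                    sys.stdout.flush()
    print()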
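def get_UUID_list(manifest_path):
    # The manifest is tab-separated; the 'id' column holds the file UUIDs.
    UUID_list = pd.read_table(manifest_path, sep='\t', encoding='utf-8')['id']
    UUID_list = list(UUID_list)
    return UUID_list
def get_last_UUID(file_path):
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    else:
        # The most recently modified file is the last one that was being downloaded.
        dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
        # Strip the '.txt' extension to recover the UUID.
        return dir_list[-1][:-4]
def get_lastUUID_index(UUID_list, last_UUID):
    for i, UUID in enumerate(UUID_list):
        if UUID == last_UUID:
            return i
    # No match (or nothing downloaded yet): start from the beginning.
    return 0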
def quit(signum, frame):
    # Ctrl+C quit
    print('You choose to stop me.')
    sys.exit()
if __name__ == '__main__':
    # Exit cleanly on Ctrl+C or a termination signal.
    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()
    link = r'https://api.gdc.cancer.gov/data/'
    # args
    manifest_path = args.M
    save_path = args.S
    print("Save file to {}".format(save_path))
    UUID_list = get_UUID_list(manifest_path)
    last_UUID = get_last_UUID(save_path)
    print("Last download file {}".format(last_UUID))
    # Restart from the most recent file, since it may be incomplete.
    last_UUID_index = get_lastUUID_index(UUID_list, last_UUID)
    for UUID in UUID_list[last_UUID_index:]:
        # Build the URL by concatenation; os.path.join would insert "\" on Windows.
        url = link + UUID
        file_path = os.path.join(save_path, UUID + '.txt')
        download(url, file_path)
        print(f'{UUID} has been downloaded')
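The script resumes by looking for the most recently modified file in the save folder, which only works if nothing else touches that folder. As a rough alternative sketch, the function below decides what is still missing by comparing file sizes instead. It assumes the manifest also provides a `size` value for each `id`, which is not something the script above relies on. The name `pending_uuids` and the `(uuid, expected_size)` pair format are made up for this illustration.

```python
import os


def pending_uuids(entries, save_dir):
    """Return the UUIDs whose local .txt file is missing or has the wrong size.

    entries: iterable of (uuid, expected_size) pairs, for example built from
    the manifest's `id` and `size` columns (assumed to be present).
    """
    todo = []
    for uuid, expected_size in entries:
        path = os.path.join(save_dir, uuid + ".txt")
        # A file counts as finished only if it exists with exactly the expected size.
        if os.path.isfile(path) and os.path.getsize(path) == int(expected_size):
            continue
        todo.append(uuid)
    return todo
```

With a check like this, an interrupted run could loop over `pending_uuids(...)` rather than slicing the UUID list from a guessed index, so a half-written file is fetched again and finished files are skipped regardless of their modification times.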
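Usage
Run the script from the command line:
python tcga_download.py -m manifest-xx.txt -s xxx
Explanation:
manifest-xx.txt is the path to the manifest file you downloaded
xxx is the folder where you want the downloaded files to be saved (preferably a newly created, empty folder)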
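Since the script depends on the manifest having an `id` column and on the save folder containing nothing but downloaded files, it can help to check both before starting. The sketch below is one possible way to do that with the standard library only. The function names `read_manifest_ids` and `prepare_save_dir` are hypothetical and are not part of tcga_download.py.

```python
import csv
import os


def read_manifest_ids(manifest_path):
    """Return the values of the `id` column of a tab-separated manifest."""
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Fail early with a clear message if the expected column is missing.
        if reader.fieldnames is None or "id" not in reader.fieldnames:
            raise ValueError("manifest has no 'id' column: " + manifest_path)
        return [row["id"] for row in reader]


def prepare_save_dir(save_dir):
    """Create the save folder if needed and report whether it is empty."""
    os.makedirs(save_dir, exist_ok=True)
    return len(os.listdir(save_dir)) == 0
```

If `prepare_save_dir` returns False, the folder already holds files, and the resume logic will treat the most recently modified one as the last download.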
Demo:
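Packaging the program as an EXE
Finally, people who do not have Python installed can use the packaged tool tcga_download.exe to download TCGA data. It is simple and convenient, and works a bit like gdc-client, haha, though writing it myself does feel more rewarding. Later I plan to build a Qt GUI version where a few mouse clicks are all it takes.
tcga_download.exe is on a cloud drive; download it yourself if you need it
Link: https://pan.baidu.com/s/1AGyZ5cAyPUK06zqiQGx-nQ Password: 3os4
Demo:
GUI download EXE
A little EXE that downloads with a few mouse clicks:
Download: https://github.com/chenwi/TCGAD
Demo: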