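Recently I have been using Caffe_Windows for CNN classification. Data collection was not my responsibility before, but today I decided to get that part working as well, so that later I can try recognition tasks of my own. The training dataset matters a great deal for a CNN, so fetching data is unavoidable.
Example dataset: wget -c https://storage.googleapis.com/openimages/2016_08/images_2016_08_v5.tar.gz
First, take a look at the dataset: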
# -*- coding: utf-8 -*-
import csv
import os
from urllib import request

# errors='ignore' drops any bytes that cannot be decoded as gb18030
file = open('./validation/images.csv', 'r', encoding='gb18030', errors='ignore')
imagereader = csv.DictReader(file)
for item in imagereader:
    print(item)
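I deliberately chose DictReader rather than reader here: it returns each row as a dict keyed by the column headers, which is easier to work with. Part of the output is as follows: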
So to download an image, we only need the URL stored in item['OriginalURL'].
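To make the difference from a plain reader concrete, here is a minimal sketch that runs DictReader over an in-memory CSV. The two sample rows, the example.com URLs and the helper name read_urls are assumptions made up for illustration; the real images.csv has more columns.

```python
import csv
import io

# Hypothetical two-row sample; only the OriginalURL column name comes from the post.
SAMPLE_CSV = (
    "ImageID,OriginalURL\n"
    "0001,https://example.com/photos/cat.jpg\n"
    "0002,https://example.com/photos/dog.jpg\n"
)


def read_urls(csv_text):
    """Return the OriginalURL value of every row, looked up by header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["OriginalURL"] for row in reader]


urls = read_urls(SAMPLE_CSV)
print(urls)
```

With csv.reader the same value would have to be fetched by position (row[1] in this sample), which breaks as soon as the column order changes.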
A first implementation:
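for item in imagereader:
    # print(item)
    filename = item['OriginalURL'].split('/')[-1]
    for url in item['OriginalURL'].split('\n'):
        print("Download:", url)
        renum = 3  # retries left
        # skip files that already exist; retry a failed download up to 3 times
        while not os.path.exists(filename) and renum > 0:
            try:
                # read the whole response before opening the file, so a failed
                # request does not leave an empty file that looks already downloaded
                with request.urlopen(url, timeout=3) as web:
                    data = web.read()
                with open(filename, 'wb') as img:
                    img.write(data)
                break
            except IOError as e:
                print(e)
                renum -= 1

This version also checks for existing files and sets a timeout. Testing showed that it is very slow.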
To improve efficiency, use multiple threads:
# -*- coding: utf-8 -*-
import csv
import os
from urllib import request
import threading


class CsvReaderImage(threading.Thread):
    def __init__(self, items):
        threading.Thread.__init__(self)
        self._items = items  # the CSV rows this thread is responsible for

    def run(self):
        # start() calls run() inside the new thread
        self.action()

    def action(self):
        for item in self._items:
            # print(item)
            filename = item['OriginalURL'].split('/')[-1]
            for url in item['OriginalURL'].split('\n'):
                print("Download:", url)
                renum = 3
                while not os.path.exists(filename) and renum > 0:
                    try:
                        with request.urlopen(url, timeout=3) as web:
                            data = web.read()
                        with open(filename, 'wb') as img:
                            img.write(data)
                        break
                    except IOError as e:
                        print(e)
                        renum -= 1


if __name__ == '__main__':
    with open('./validation/images.csv', 'r', encoding='gb18030', errors='ignore') as file:
        items = list(csv.DictReader(file))
    # give each thread its own slice of rows; calling action() directly would
    # run the downloads one after another in the main thread
    threads = [CsvReaderImage(items[i::3]) for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Download results:
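As a side note on the threaded version above, the same idea can probably be written more compactly with the standard library's concurrent.futures thread pool. This is only a sketch under assumptions: the helper names fetch and download_all, the dest_dir parameter and the pool size of 3 are mine, not part of the original script.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def fetch(url, dest_dir=".", timeout=3):
    """Download one URL into dest_dir; return the saved path, or None on failure."""
    filename = os.path.join(dest_dir, url.split("/")[-1])
    if os.path.exists(filename):
        return filename  # already on disk, nothing to do
    try:
        with request.urlopen(url, timeout=timeout) as web:
            data = web.read()
    except OSError as e:  # URLError and socket timeouts are OSError subclasses
        print(e)
        return None
    with open(filename, "wb") as img:
        img.write(data)
    return filename


def download_all(urls, worker=fetch, max_workers=3):
    """Call worker(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))
```

Usage would be something like download_all(item['OriginalURL'] for item in imagereader). Because the downloads are I/O-bound, threads spend most of their time waiting on the network, which is why a small pool can help despite the GIL.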
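Takeaways:
The basic functionality works, but several things are not yet handled well: avoiding repeated downloads needs some kind of cache, and interrupted downloads should be resumable. I will optimize and polish this when I find the time.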
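One possible direction for the two open points, sketched under assumptions: a plain text manifest of finished URLs acts as the cache, and an HTTP Range header asks the server only for the bytes that are still missing. The function names, the manifest format and the idea of a partial file are hypothetical, and resuming only works when the server honours Range requests (it answers with status 206; a 200 means the full file is being sent again).

```python
import os
from urllib import request


def load_done(manifest_path):
    """Read the set of URLs already downloaded (one URL per line)."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(manifest_path, url):
    """Append a finished URL to the manifest so later runs can skip it."""
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


def resume_request(url, part_path):
    """Build a Request for the bytes not yet present in part_path.

    Returns (request, offset); offset is the size of the partial file.
    """
    offset = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    req = request.Request(url)
    if offset > 0:
        req.add_header("Range", "bytes=%d-" % offset)
    return req, offset
```

The download loop would then skip any URL found in load_done(...), open the partial file in 'ab' mode when the offset is non-zero, and call mark_done(...) only after the whole file has been written.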