Preface: It has been a while since I last updated this blog, so here is a tidied-up version of technical notes I collected earlier. When processing data at scale, compute becomes the bottleneck, and various acceleration techniques are needed. For numeric work there are Cython, numba, and the like; but for workloads such as large-scale tokenization and other NLP text processing, those tools are awkward, and multithreading or multiprocessing is the practical way to speed things up.
Table of Contents
1. Cython numeric acceleration
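This section has no body in the draft, so here is a hedged sketch of what a Cython speedup looks like: the gain comes from adding C type declarations to a hot loop. The pure-Python baseline below runs as-is; the typed `.pyx` variant shown in comments is a hypothetical illustration that needs Cython and a build step to compile.

```python
# Pure-Python baseline: sums i*i over a range.
def sum_squares(n):
    s = 0
    for i in range(n):
        s += i * i
    return s

# A hypothetical Cython version (sum_squares.pyx) would add C types:
#
#     def sum_squares(long n):
#         cdef long i, s = 0
#         for i in range(n):
#             s += i * i
#         return s
#
# compiled with, e.g.:  cythonize -i sum_squares.pyx

print(sum_squares(1000))
```

With the loop variables typed as C longs, the compiled version avoids Python object overhead on every iteration, which is where the large speedups come from.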
2. multiprocessing subprocess acceleration, v1
- With multiple subprocesses on an 8-core machine (multiprocessing.cpu_count() == 8), the best case is roughly an 8x speedup.

import multiprocessing
from multiprocessing import Pool
import time

def do_anything(subprocesse_id):
    # stands in for arbitrary work, not just arithmetic; numeric code can use
    # numba or Cython, while this approach also covers non-numeric work
    result = 0
    for x in range(100):
        for j in range(50):
            for k in range(10):
                result = x + j + k + subprocesse_id
    return result

if __name__ == '__main__':
    cpu_count = multiprocessing.cpu_count()

    # plain serial loop
    start = time.time()
    for i in range(500):
        result = do_anything(i)
    end = time.time()
    print(end - start)

    # worker pool; collect return values with get() afterwards
    start = time.time()
    p = Pool(cpu_count + 1)  # concurrency of cpu_count + 1
    results = []
    for i in range(500):  # 500 tasks
        results.append(p.apply_async(do_anything, args=(i,)))
        # do NOT call get() here: it blocks, degenerating into the serial loop
    p.close()
    p.join()
    for r in results:
        res = r.get()
    end = time.time()
    print(end - start)
References:
- https://thief.one/2016/11/24/Multiprocessing子進程返回值/
- https://www.liaoxuefeng.com/wiki/897692888725344/923056295693632
3. multiprocessing subprocess acceleration, v2
- Use a shared-state Manager to hold variables shared by multiple subprocesses.

import multiprocessing
from multiprocessing import Manager

def worker(procnum, return_dict):
    '''worker function: write this process's result into the shared dict'''
    print(str(procnum) + ' represent!')
    return_dict[procnum] = procnum

if __name__ == '__main__':
    manager = Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i, return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    # the final dict collects the return values of all the processes
    print(return_dict)
4. numba numeric acceleration
- How it works:

numba uses the meta module to parse the Python function's AST and annotate each variable with type information, then calls llvmpy to generate machine code, together with a Python calling interface for that machine code. Numba can JIT-compile Python functions that operate on NumPy arrays into machine code, speeding them up by as much as a hundredfold. numba provides decorators that JIT-compile the decorated function into machine code and return a wrapper object callable from Python. For the JIT compiler to emit fast machine code, it needs to know the types of the function's parameters and return value; to let it handle arguments of any type, use autojit. (Note that this describes an early release of numba: current versions are built on llvmlite, infer types automatically, and have dropped autojit in favour of plain @jit.)
- Supported types:
print([obj for obj in nb.__dict__.values() if isinstance(obj, nb.minivect.minitypes.Type)])
[size_t, Py_uintptr_t, uint16, complex128, float, complex256, void, int , long double,
unsigned PY_LONG_LONG, uint32, complex256, complex64, object_, npy_intp, const char *,
double, unsigned short, float, object_, float, uint64, uint32, uint8, complex128, uint16,
int, int , uint8, complex64, int8, uint64, double, long double, int32, double, long double,
char, long, unsigned char, PY_LONG_LONG, int64, int16, unsigned long, int8, int16, int32,
unsigned int, short, int64, Py_ssize_t]
- Example:

from numba import jit
import time

# plain Python version
def foo1(x, y):
    tt = time.time()
    s = 0
    for i in range(x, y):
        s += i
    print('Time used: {} sec'.format(time.time() - tt))
    return s

print(foo1(1, 100000000))

# numba-accelerated version
@jit
def foo2(x, y):
    tt = time.time()
    s = 0
    for i in range(x, y):
        s += i
    print('Time used: {} sec'.format(time.time() - tt))
    return s

print(foo2(1, 100000000))
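The principle above mentions declaring parameter and return types for the JIT; in current numba that is done with a signature string passed to @jit (autojit no longer exists). A sketch that falls back to a no-op decorator when numba is not installed, so the example runs either way:

```python
import time

try:
    from numba import jit
except ImportError:
    # numba not installed: substitute a do-nothing decorator so the example still runs
    def jit(signature_or_function=None, **kwargs):
        def wrap(func):
            return func
        return wrap

# explicit signature: int64 return value, two int64 arguments
@jit('int64(int64, int64)', nopython=True)
def loop_sum(lo, hi):
    s = 0
    for i in range(lo, hi):
        s += i
    return s

start = time.time()
print(loop_sum(1, 10000000), time.time() - start)
```

Giving a signature makes numba compile eagerly at decoration time instead of on the first call, so the first timed call is not distorted by compilation cost.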
5. Multithreading for crawlers
Multithreaded crawling of Baidu Tieba. This only helps crawler-style workloads: a crawler spends most of its time blocked on network IO, during which the GIL is released, so multithreading works well.
See my blog post: https://blog.csdn.net/u010454729/article/details/49765929

#!/usr/bin/env python
# coding=utf-8
from multiprocessing.dummy import Pool as ThreadPool
import requests
import time

def getsource(url):
    html = requests.get(url)
    return html

urls = []
for i in range(1, 21):
    newpage = "http://tieba.baidu.com/p/3522395718?pn=" + str(i)
    urls.append(newpage)  # build the URL list

# single-threaded
time1 = time.time()
for i in urls:
    print(i)
    getsource(i)
time2 = time.time()
print("single-threaded: " + str(time2 - time1))

# thread pool
pool = ThreadPool(4)  # set to your core count; I ran this on a 4-core Dell under Ubuntu 14.04
time3 = time.time()
results = pool.map(getsource, urls)
pool.close()
pool.join()
time4 = time.time()
print("multithreaded: " + str(time4 - time3))
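The claim that long IO waits make threads effective can be checked without touching the network: the sketch below fakes the round-trip with time.sleep, which also releases the GIL (the URLs are placeholders):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fetch(url):
    time.sleep(0.05)  # stand-in for a network round-trip; sleep releases the GIL
    return url

urls = ['http://example.com/page/%d' % i for i in range(8)]

# serial: pays the full latency once per URL
t0 = time.time()
serial = [fetch(u) for u in urls]
serial_time = time.time() - t0

# threaded: the waits overlap, so total time is about one round-trip
t0 = time.time()
pool = ThreadPool(8)
parallel = pool.map(fetch, urls)
pool.close()
pool.join()
parallel_time = time.time() - t0

print(serial_time, parallel_time)
```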
A case where this kind of acceleration fails: reading and processing local files is dominated by GIL-holding Python bytecode rather than long IO waits, so the thread pool below yields no real speedup.
from multiprocessing.dummy import Pool as ThreadPool
import os

# worker function
def getWords(filename):
    res = []
    with open(filename, 'r') as f:
        for line in f:
            res.append(line.strip())
    return '\n'.join(res)

# build the file list
file_path = 'tmpdata/'
file_name_list = []
for home, dirs, files in os.walk(file_path):
    for file_name in files:
        file_name_list.append(os.path.join(home, file_name))

# run with a thread pool
pool = ThreadPool(4)
results = pool.map(getWords, file_name_list)

# inspect the first result
print(results[0])

# shut down the pool
pool.close()
pool.join()